Book chapter, 2022

ZeroBERTo: Leveraging Zero-Shot Text Classification by Topic Modeling

Abstract

Traditional text classification approaches often require a large amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods, which assume low data availability in natural language processing. Among them, zero-shot learning stands out: it consists of learning a classifier without any previously labeled data. The best results reported with this approach use language models such as Transformers, but suffer from two problems: high execution time and inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in F1 score on the FolhaUOL dataset.
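The pipeline described in the abstract (an unsupervised clustering step that compresses the unlabeled corpus, followed by zero-shot classification) can be sketched roughly as below. This is a minimal illustration of the general idea, not the authors' implementation: the embedding model, the KMeans clustering, the TF-IDF keyword summary per cluster, and the XNLI-based zero-shot classifier are all assumptions made for the sake of the example.

# Hypothetical sketch of the idea described in the abstract:
# (1) cluster the unlabeled documents to get a compressed representation,
# (2) run a zero-shot classifier on each cluster's summary instead of on
# every (possibly long) document. Model names, the clustering algorithm,
# and the keyword-based cluster summary are illustrative assumptions.
from collections import defaultdict

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

docs = [
    "O governo anunciou novas medidas econômicas para conter a inflação.",
    "O time venceu o campeonato estadual após uma final disputada.",
    "Cientistas descobrem novo exoplaneta potencialmente habitável.",
    # ... unlabeled corpus ...
]
candidate_labels = ["economia", "esporte", "ciência"]
n_clusters = 3  # assumption: chosen per dataset

# 1) Embed and cluster the unlabeled corpus (unsupervised step).
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = encoder.encode(docs, show_progress_bar=False)
cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

# 2) Build a compressed text representation per cluster
#    (here: the top TF-IDF terms of each cluster).
clusters = defaultdict(list)
for doc, cid in zip(docs, cluster_ids):
    clusters[cid].append(doc)

def top_terms(texts, k=10):
    vec = TfidfVectorizer(max_features=5000)
    tfidf = vec.fit_transform(texts)
    scores = np.asarray(tfidf.sum(axis=0)).ravel()
    terms = np.array(vec.get_feature_names_out())
    return " ".join(terms[np.argsort(scores)[::-1][:k]])

cluster_repr = {cid: top_terms(texts) for cid, texts in clusters.items()}

# 3) Zero-shot classify each short cluster summary once, then propagate
#    the predicted label to all documents in that cluster.
zsl = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")
cluster_label = {
    cid: zsl(text, candidate_labels=candidate_labels)["labels"][0]
    for cid, text in cluster_repr.items()
}
predictions = [cluster_label[cid] for cid in cluster_ids]
print(list(zip(docs, predictions)))

Because the zero-shot model is queried once per cluster summary rather than once per long document, a setup of this kind speaks to the two limitations the abstract attributes to Transformer-based zero-shot classifiers: execution time and input length.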
Main file: preprint_zeroberto.pdf (447.61 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03628242, version 1 (04-06-2022)

Identifiers

HAL Id: hal-03628242
DOI: 10.1007/978-3-030-98305-5_12

Cite

Alexandre Alcoforado, Thomas Palmeira Ferraz, Rodrigo Gerber, Enzo Bustos, André Seidel Oliveira, et al. ZeroBERTo: Leveraging Zero-Shot Text Classification by Topic Modeling. In: Vládia Pinheiro, Pablo Gamallo, Raquel Amaro, Carolina Scarton, Fernando Batista, Diego Silva, Catarina Magro, Hugo Pinto (Eds.), Computational Processing of the Portuguese Language: 15th International Conference, PROPOR 2022, Fortaleza, Brazil, March 21–23, 2022, Proceedings. Lecture Notes in Computer Science, vol. 13208, Springer International Publishing, pp. 125-136, 2022. ISBN 978-3-030-98304-8. ⟨10.1007/978-3-030-98305-5_12⟩. ⟨hal-03628242⟩

