Measuring the Quality of Semantic Data Augmentation for Sarcasm Detection
Abstract
Sarcasm is a form of figurative speech in which the intended meaning of a sentence differs from its literal meaning. Sarcastic expressions tend to confuse automatic NLP approaches in many application domains, which makes detecting them important. One challenge for machine learning approaches to sarcasm detection is the difficulty of acquiring ground-truth annotations. As a result, human-annotated datasets usually contain only a few thousand texts and are often imbalanced. In this paper, we propose two data augmentation pipelines for generating more sarcastic data. The first is SMERT-BERT, a modified SMERTI pipeline that uses RoBERTa as the language model for the text infilling module. The second is SWORD (Semantic text exchange by WORD-attribution), in which we modify the masking module of the SMERTI pipeline to use word-attribution values. Both approaches are combined with a SLOR (syntactic log-odds ratio) metric that filters the generated sarcastic data and keeps only the sentences with the best scores. Our experiments show that the SLOR filter makes a significant positive contribution to the augmentation process. In particular, we achieve the best results with the SMERT-BERT pipeline and a SLOR filter, improving the F-measure by 4.00% on the iSarcasm dataset compared to the baseline models.
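The abstract does not spell out how the SLOR score is computed. As a minimal sketch, assuming the commonly used definition (the sentence log-probability under a language model minus its unigram log-probability, normalized by sentence length), the filtering step could look like the following. The function names, the toy probabilities, the floor for unseen tokens, and the top-k selection rule are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def unigram_logprob(tokens: Sequence[str], unigram_probs: Dict[str, float],
                    floor: float = 1e-8) -> float:
    """Sum of log unigram probabilities; unseen tokens get a small floor (assumption)."""
    return sum(math.log(unigram_probs.get(tok, floor)) for tok in tokens)


def slor(tokens: Sequence[str], lm_logprob: float,
         unigram_probs: Dict[str, float]) -> float:
    """SLOR = (log p_LM(sentence) - log p_unigram(sentence)) / sentence length."""
    if not tokens:
        raise ValueError("SLOR is undefined for an empty sentence")
    return (lm_logprob - unigram_logprob(tokens, unigram_probs)) / len(tokens)


def select_top_k(candidates: List[Tuple[List[str], float]],
                 unigram_probs: Dict[str, float], k: int) -> List[List[str]]:
    """Keep the k candidates with the highest SLOR (hypothetical selection rule)."""
    ranked = sorted(candidates,
                    key=lambda c: slor(c[0], c[1], unigram_probs),
                    reverse=True)
    return [tokens for tokens, _ in ranked[:k]]


# Toy data: each candidate is (tokens, sentence log-probability from some LM).
# In practice the log-probability would come from a trained language model.
UNIGRAMS = {"great": 0.5, "weather": 0.25}
CANDIDATES = [
    (["great", "weather"], math.log(0.5)),
    (["weather", "great"], math.log(0.01)),
]
BEST = select_top_k(CANDIDATES, UNIGRAMS, k=1)
```

Subtracting the unigram term keeps sentences made of rare words from being penalized merely for their vocabulary, and dividing by length keeps longer generated sentences comparable to shorter ones.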
Domains
Computer Science [cs]
Origin: Publisher files authorized on an open archive
License: CC BY NC SA - Attribution - NonCommercial - ShareAlike