
Learning to represent and generate text using information measures

Abstract: Natural language processing (NLP) allows for the automatic understanding and generation of natural language. NLP has recently received growing interest from both industry and researchers, as deep learning (DL) has leveraged the staggering amount of available text (e.g. web, YouTube, social media) and reached human-like performance on several tasks (e.g. translation, text classification). Besides, information theory (IT) and DL have developed a long-lasting partnership. Indeed, IT has fueled the adoption of deep neural networks through famous principles such as Minimum Description Length (MDL), Information Bottleneck (IB), and the celebrated InfoMax principle. In all these principles, measures of information (e.g. entropy, mutual information (MI), divergences) are a core concept. In this thesis, we address the interplay between NLP and measures of information. Our contributions focus on two types of NLP problems: natural language understanding (NLU) and natural language generation (NLG). NLU aims at automatically understanding and extracting semantic information from an input text, whereas NLG aims at producing natural language that is both well-formed (i.e. grammatically correct, coherent) and informative. Building spoken conversational agents is a challenging task, and dealing with spoken conversational data remains a difficult and overlooked problem. Our first contributions therefore address NLU: we focus on learning better transcript representations that capture two important characteristics of spoken human conversations, namely their conversational and multi-modal dimensions. To do so, we rely on various measures of information and leverage the mutual information maximization principle. The second group of contributions addresses problems related to NLG. This thesis specifically focuses on two core problems.
First, we propose a new upper bound on mutual information to tackle the problem of controlled generation via the learning of disentangled representations (i.e. style transfer and conditional sentence generation). Second, we address the problem of automatic evaluation of generated texts by developing a new family of metrics based on various measures of information.
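To make the information measures named above concrete, here is a minimal sketch of Shannon entropy and mutual information for discrete distributions. This is purely illustrative and is not the thesis's machinery, which relies on neural estimators and bounds (e.g. an upper bound on MI) over continuous representations.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]                  # marginal p(x)
    py = [sum(col) for col in zip(*joint)]            # marginal p(y)
    hxy = entropy([p for row in joint for p in row])  # joint entropy
    return entropy(px) + entropy(py) - hxy

# A fair coin carries exactly 1 bit of entropy.
print(entropy([0.5, 0.5]))        # 1.0

# Independent binary variables share no information.
indep = [[0.25, 0.25], [0.25, 0.25]]
print(mutual_information(indep))  # 0.0

# Perfectly correlated binary variables share H(X) = 1 bit.
corr = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(corr))   # 1.0
```

The InfoMax-style objectives mentioned in the abstract maximize (estimates of) such an I(X;Y) between learned representations, while the controlled-generation contribution instead minimizes an upper bound on MI to disentangle attributes from content.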
Contributor: Abes Star
Submitted on : Wednesday, December 8, 2021 - 5:18:07 PM
Last modification on : Thursday, December 9, 2021 - 3:06:21 AM


Version validated by the jury (STAR)


  • HAL Id : tel-03471220, version 1



Pierre Colombo. Apprendre à représenter et à générer du texte en utilisant des mesures d'information. Document and Text Processing. Institut Polytechnique de Paris, 2021. English. ⟨NNT : 2021IPPAT033⟩. ⟨tel-03471220⟩


