
Models and resources for attention-based unsupervised word segmentation : an application to computational language documentation

Abstract : Computational Language Documentation (CLD) is a research field that proposes methodologies for speeding up language documentation, helping linguists to efficiently collect and process data from many dialects, some of which are expected to vanish before the end of this century (Austin and Sallabank, 2013). To achieve this, the proposed methods need to be robust to low-resource data processing, as corpora from documentation initiatives are small, and they must operate from speech, as many of these languages come from oral traditions and therefore lack a standard written form.
In this thesis we investigate the task of Unsupervised Word Segmentation (UWS) from speech. The goal of this approach is to segment utterances into smaller chunks corresponding to the words in that language, without access to any written transcription. Here we propose to ground the word segmentation process in aligned bilingual information. This is inspired by the possible availability of translations, often collected by linguists during documentation (Adda et al., 2016).
Thus, using bilingual corpora made of speech utterances and sentence-aligned translations, we propose the use of attention-based Neural Machine Translation (NMT) models in order to align and segment. Since speech processing is known for requiring considerable amounts of data, we split this approach into two steps. We first perform Speech Discretization (SD), transforming input utterances into sequences of discrete speech units. We then train NMT models, which output soft-alignment probability matrices between units and word translations. This attention-based soft-alignment is used for segmenting the units with respect to the bilingual alignment obtained, and the final segmentation is carried over to the speech signal.
Throughout this work, we investigate the use of different models for these two tasks.
For the SD task, we compare five different approaches: three Bayesian HMM-based models (Ondel et al., 2016, 2019; Yusuf et al., 2020), and two Vector Quantization (VQ) neural models (van den Oord et al., 2017; Baevski et al., 2020a). We find that the Bayesian SD models, in particular the SHMM (Ondel et al., 2019) and H-SHMM (Yusuf et al., 2020), are the most exploitable for direct application in text-based UWS in our documentation setting. For the alignment and segmentation task, we compare three attention-based NMT models: RNN (Bahdanau et al., 2015), 2D-CNN (Elbayad et al., 2018), and Transformer (Vaswani et al., 2017). We find that the attention mechanism is still exploitable in our limited setting (5,130 aligned sentences only), but that the soft-alignment probability matrices from novel NMT approaches (2D-CNN, Transformer) are inferior to the ones from the simpler RNN model.
Finally, our attention-based UWS approach is evaluated in topline conditions using the true phones (Boito et al., 2019a), and in realistic conditions using the output of SD models (Godard et al., 2018c). We use eight languages and fifty-six language pairs to verify the language-related impact of grounding segmentation in bilingual information (Boito et al., 2020b), and we present extensions for increasing the quality of the produced soft-alignment probability matrices (Boito et al., 2021).
Overall we find our method to be generalizable. In realistic settings and across different languages, attention-based UWS is competitive against the nonparametric Bayesian model (dpseg) from Goldwater et al. (2009). Moreover, ours has the advantage of retrieving bilingual annotation for the word segments it produces.
Lastly, in this work we also present two corpora for CLD studies (Godard et al., 2018a; Boito et al., 2018), and a dataset for low-resource speech processing with diverse language pairs (Boito et al., 2020a).
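
The segmentation step described in the abstract, where a soft-alignment probability matrix between discrete units and translation words is turned into word boundaries, can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `segment_from_attention`, the matrix orientation (one row per unit, one column per translation word), and the rule of placing a boundary wherever two consecutive units are most strongly aligned to different words are all assumptions made for illustration.

```python
def segment_from_attention(units, attention):
    """Group discrete units into segments using a soft-alignment matrix.

    units:     list of T discrete speech units (or phones).
    attention: T rows; row t holds the alignment probabilities between
               unit t and each of the W translation words (assumed layout).
    Returns (segments, labels): the unit groups and, for each group, the
    index of the translation word it is aligned to.
    """
    if len(units) != len(attention):
        raise ValueError("one attention row is required per unit")
    if not units:
        return [], []

    # Hard alignment: each unit is assigned to its most probable word.
    aligned = [max(range(len(row)), key=row.__getitem__) for row in attention]

    segments = [[units[0]]]
    labels = [aligned[0]]
    for unit, word_idx, prev_idx in zip(units[1:], aligned[1:], aligned[:-1]):
        if word_idx != prev_idx:
            # Assumed rule: a change of aligned word marks a boundary.
            segments.append([unit])
            labels.append(word_idx)
        else:
            segments[-1].append(unit)
    return segments, labels


# Toy example with five units and a two-word translation (made-up values).
toy_units = ["a", "b", "c", "d", "e"]
toy_attention = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.2, 0.8],
    [0.6, 0.4],
]
toy_segments, toy_labels = segment_from_attention(toy_units, toy_attention)
```

In this toy case the first two units attach to the first translation word, the next two to the second, and the last unit returns to the first, so three segments are produced. The returned word indices correspond to the bilingual annotation that the abstract mentions as a by-product of the method.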
Contributor : ABES STAR
Submitted on : Monday, November 15, 2021 - 4:21:12 PM
Last modification on : Wednesday, July 6, 2022 - 4:13:10 AM
Long-term archiving on : Wednesday, February 16, 2022 - 9:03:47 PM


Version validated by the jury (STAR)


  • HAL Id : tel-03429446, version 1



Marcely Zanon Boito. Models and resources for attention-based unsupervised word segmentation : an application to computational language documentation. Computation and Language [cs.CL]. Université Grenoble Alpes [2020-..], 2021. English. ⟨NNT : 2021GRALM022⟩. ⟨tel-03429446⟩


