
Informed Audio Source Separation with Deep Learning in Limited Data Settings

Abstract : Audio source separation is the task of estimating the individual signals of several sound sources when only their mixture can be observed. It has several applications in the context of music signals, such as re-mixing, up-mixing, or generating karaoke content. Furthermore, it serves as a pre-processing step for music information retrieval tasks such as automatic lyrics transcription. State-of-the-art performance for musical source separation is achieved by deep neural networks trained in a supervised way. For training, they require large and diverse datasets of music mixtures for which the target source signals are available in isolation. However, such datasets are difficult and costly to obtain because music recordings are subject to copyright restrictions and isolated instrument recordings may not always exist. In this dissertation, we explore the use of prior knowledge for deep-learning-based source separation in order to overcome data limitations. First, we focus on a supervised setting with only a small amount of available training data. Our goal is to investigate to what extent singing voice/accompaniment separation can be improved when the separation is informed by lyrics transcripts. To this end, we propose a general approach to informed source separation that jointly aligns the side information with the audio signal using an attention mechanism. We perform text-informed speech-music separation and joint phoneme alignment to evaluate the approach. Results show that text information improves the separation quality. At the same time, text can be accurately aligned with the speech signal even if it is highly corrupted. In order to adapt the approach to the more challenging task of text-informed singing voice separation, we propose DTW-attention, a combination of dynamic time warping and attention that encourages monotonic alignments of the lyrics with the audio signal. The result is a novel lyrics alignment method that requires a much smaller amount of training data than state-of-the-art methods while providing competitive performance. Furthermore, we find that exploiting aligned phonemes can improve singing voice separation, but precise alignment and accurate transcripts are required. Modifications of the input text result in modifications of the separated voice signal. For our experiments, we transcribed the lyrics of the MUSDB corpus and made them publicly available for research purposes. Finally, we consider a scenario where only mixtures but no isolated source signals are available for training. We propose a novel unsupervised deep learning approach to musical source separation. It exploits information about the sources' fundamental frequencies (F0), which can be estimated from the mixture. The method integrates domain knowledge in the form of differentiable parametric source models into the deep neural network. Experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms F0-informed learning-free methods based on non-negative matrix factorization as well as an F0-informed supervised deep learning baseline. Combining data-driven and knowledge-based components, the proposed method is extremely data-efficient and achieves good separation quality using less than three minutes of training data. It makes powerful deep-learning-based source separation usable in domains where labeled training data is expensive or non-existent.
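For readers unfamiliar with the monotonic alignment constraint mentioned in the abstract, the following is a minimal sketch of classical dynamic time warping on a small cost matrix. It is not the DTW-attention method of the thesis, whose details are not given on this page. The function name `dtw_path` and the reading of `cost[i][j]` as a dissimilarity between phoneme `i` and audio frame `j` are assumptions made purely for illustration.

```python
from math import inf


def dtw_path(cost):
    """Return (total_cost, path) for the cheapest monotonic path through `cost`.

    `cost[i][j]` is a hypothetical dissimilarity between phoneme i and
    audio frame j (an assumption for this sketch, not taken from the thesis).
    The path starts at (0, 0), ends at the last cell, and each step moves
    down, right, or diagonally, so both indices never decrease.
    """
    n, m = len(cost), len(cost[0])
    # acc[i][j] holds the minimal accumulated cost of reaching cell (i, j).
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = cost[i][j] + best_prev

    # Backtrack from the last cell to (0, 0) along cheapest predecessors.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path


if __name__ == "__main__":
    # Toy example: two phonemes, four audio frames.
    total, path = dtw_path([[0, 0, 1, 1], [1, 1, 0, 0]])
    print(total, path)
```

Because every step moves forward in at least one index, the resulting alignment can never assign a later phoneme to an earlier frame, which is the monotonicity property the abstract refers to.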
Contributor : Roland Badeau
Submitted on : Monday, December 20, 2021 - 7:41:19 AM
Last modification on : Thursday, January 6, 2022 - 3:11:53 AM




  • HAL Id : tel-03494790, version 1



Kilian Schulze-Forster. Informed Audio Source Separation with Deep Learning in Limited Data Settings. Signal and Image processing. Institut Polytechnique de Paris, 2021. English. ⟨tel-03494790⟩


