J. Chen and D. Wang, Long short-term memory for speaker generalization in supervised speech separation, The Journal of the Acoustical Society of America, vol.141, issue.6, pp.4705-4714, 2017.

A. Liutkus, J. Durrieu, L. Daudet, and G. Richard, An overview of informed audio source separation, 14th International Workshop on Image Analysis for Multimedia Interactive Services, pp.1-4, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00958661

K. Kinoshita, M. Delcroix, A. Ogawa, and T. Nakatani, Text-informed speech enhancement with deep neural networks, Sixteenth Annual Conference of the International Speech Communication Association, 2015.

A. Katsamanis, M. Black, P. G. Georgiou, L. Goldstein, and S. Narayanan, SailAlign: Robust long speech-text alignment, Proc. of Workshop on New Tools and Methods for Very-Large Scale Phonetics Research, 2011.

G. Bordel, M. Penagarikano, L. J. Rodríguez-Fuentes, A. Álvarez, and A. Varona, Probabilistic kernels for improved text-to-speech alignment in long audio tracks, IEEE Signal Processing Letters, vol.23, issue.1, pp.126-129, 2015.

K. Schulze-Forster, C. Doire, G. Richard, and R. Badeau, Weakly informed audio source separation, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02280472

L. Sun, J. Du, L. Dai, and C. Lee, Multiple-target deep learning for LSTM-RNN based speech enhancement, Hands-free Speech Communications and Microphone Arrays, pp.136-140, 2017.

S. R. Park and J. Lee, A fully convolutional neural network for speech enhancement, Proc. Interspeech, pp.1993-1997, 2017.

S. Pascual, A. Bonafonte, and J. Serrà, SEGAN: Speech enhancement generative adversarial network, Proc. Interspeech, pp.3642-3646, 2017.

L. Le Magoarou, A. Ozerov, and N. Q. K. Duong, Text-informed audio source separation using nonnegative matrix partial co-factorization, IEEE International Workshop on Machine Learning for Signal Processing, pp.1-6, 2013.

Z. Wang, Y. Zhao, and D. Wang, Phoneme-specific speech separation, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.146-150, 2016.

S. E. Chazan, S. Gannot, and J. Goldberger, A phoneme-based pre-training approach for deep neural network with application to speech enhancement, IEEE International Workshop on Acoustic Signal Enhancement, pp.1-5, 2016.

P. J. Moreno, C. Joerg, J.-M. Van Thong, and O. Glickman, A recursive algorithm for the forced alignment of very long audio segments, Fifth International Conference on Spoken Language Processing, 1998.

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, Montreal Forced Aligner: Trainable text-speech alignment using Kaldi, Proc. Interspeech, pp.498-502, 2017.

J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, Attention-based models for speech recognition, Advances in Neural Information Processing Systems, pp.577-585, 2015.

R. Nishikimi, E. Nakamura, S. Fukayama, M. Goto, and K. Yoshii, Automatic singing transcription based on encoder-decoder recurrent neural networks with a weakly-supervised attention mechanism, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.161-165, 2019.

M. Luong, H. Pham, and C. Manning, Effective approaches to attention-based neural machine translation, 2015.

T. K. Vintsyuk, Speech discrimination by dynamic programming, Cybernetics and Systems Analysis, vol.4, issue.1, pp.52-57, 1968.

Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, and R. Bittner, The MUSDB18 corpus for music separation, 2017.

J. S. Garofolo, TIMIT acoustic phonetic continuous speech corpus, Linguistic Data Consortium, 1993.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

E. Vincent, R. Gribonval, and C. Févotte, Performance measurement in blind audio source separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.14, pp.1462-1469, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00544230

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, IEEE International Conference on Acoustics, Speech and Signal Processing, vol.2, pp.749-752, 2001.

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, pp.2125-2136, 2011.