A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar et al., Singing voice separation with deep U-Net convolutional networks, Proceedings of the International Society for Music Information Retrieval Conference, pp.23-27, 2017.

N. Takahashi and Y. Mitsufuji, Multi-scale multi-band DenseNets for audio source separation, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp.21-25, 2017.

S. Park, T. Kim, K. Lee, and N. Kwak, Music source separation using stacked hourglass networks, Proceedings of the International Society for Music Information Retrieval Conference, pp.289-296, 2018.

Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, and R. Bittner, The MUSDB18 corpus for music separation, 2017.

F. Stöter, A. Liutkus, and N. Ito, The 2018 signal separation evaluation campaign, International Conference on Latent Variable Analysis and Signal Separation, pp.293-305, 2018.

E. Vincent, R. Gribonval, and C. Févotte, Performance measurement in blind audio source separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.14, issue.4, pp.1462-1469, 2006.

SiSEC MUS 2018 objective evaluation results, online resource, accessed 2019.

A. Liutkus, J. Durrieu, L. Daudet, and G. Richard, An overview of informed audio source separation, 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pp.1-4, 2013.

S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, Score-informed source separation for musical audio recordings: An overview, IEEE Signal Processing Magazine, vol.31, issue.3, pp.116-124, 2014.

T. Virtanen, A. Mesaros, and M. Ryynänen, Combining pitch-based inference and non-negative spectrogram factorization in separating vocals from polyphonic music, pp.17-22, 2008.

L. Le Magoarou, A. Ozerov, and N. Q. Duong, Text-informed audio source separation: Example-based approach using non-negative matrix partial co-factorization, Journal of Signal Processing Systems, vol.79, issue.2, pp.117-131, 2015.

B. Rivet, W. Wang, S. M. Naqvi, and J. A. Chambers, Audiovisual speech source separation: An overview of key methodologies, IEEE Signal Processing Magazine, vol.31, issue.3, pp.125-134, 2014.

K. Kinoshita, M. Delcroix, A. Ogawa, and T. Nakatani, Text-informed speech enhancement with deep neural networks, Sixteenth Annual Conference of the International Speech Communication Association, 2015.

S. Ewert and M. B. Sandler, Structured dropout for weak label and multi-instance learning and its application to score-informed source separation, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.2277-2281, 2017.

M. Miron, J. J. Mestres, and E. Gómez-Gutiérrez, Monaural score-informed source separation for classical music using convolutional neural networks, Proceedings of the International Society for Music Information Retrieval Conference, pp.55-62, 2017.

D. Stoller, S. Durand, and S. Ewert, End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model, 2019.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014.

M. Luong, H. Pham, and C. D. Manning, Effective approaches to attention-based neural machine translation, 2015.

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.4960-4964, 2016.

H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, Self-attention generative adversarial networks, 2018.

P. Huang, F. Liu, S. Shiang, J. Oh, and C. Dyer, Attention-based multimodal neural machine translation, Proceedings of the First Conference on Machine Translation, vol.2, pp.639-645, 2016.

J. Schlüter, Learning to pinpoint singing voice from weakly labeled examples, Proceedings of the International Society for Music Information Retrieval Conference, pp.44-50, 2016.

A. Kumar and B. Raj, Audio event detection using weakly labeled data, Proceedings of the 24th ACM International Conference on Multimedia, pp.1038-1047, 2016.

D. Stoller, S. Ewert, and S. Dixon, Jointly detecting and separating singing voice: A multi-task approach, International Conference on Latent Variable Analysis and Signal Separation, pp.329-339, 2018.

M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, vol.45, issue.11, pp.2673-2681, 1997.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.

S. I. Mimilakis, K. Drossos, J. F. Santos, G. Schuller, T. Virtanen et al., Monaural singing voice separation with skip-filtering connections and recurrent inference of time-frequency mask, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.721-725, 2018.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

J. Schlüter and T. Grill, Exploring data augmentation for improved singing voice detection with neural networks, Proceedings of the International Society for Music Information Retrieval Conference, pp.121-126, 2015.