J. Woodcock, W. J. Davies, T. J. Cox, and F. Melchior, Categorization of broadcast audio objects in complex auditory scenes, Journal of the Audio Engineering Society, vol.64, issue.6, pp.380-394, 2016.

W. Jiang, C. Cotton, S. F. Chang, D. Ellis, and A. Loui, Short-term audio-visual atoms for generic video concept classification, Proceedings of the 17th ACM International Conference on Multimedia, pp.5-14, 2009.

S. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa et al., Large-scale multimodal semantic concept detection for consumer video, Proceedings of the International Workshop on Multimedia Information Retrieval, pp.255-264, 2007.

Y. Jiang, Z. Wu, J. Wang, X. Xue, and S. Chang, Exploiting feature and class relationships in video categorization with regularized deep neural networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.40, pp.352-364, 2018.

Y. Jiang, S. Bhattacharya, S. Chang, and M. Shah, High-level event recognition in unconstrained videos, International Journal of Multimedia Information Retrieval, vol.2, issue.2, pp.73-101, 2013.

H. Izadinia, I. Saleemi, and M. Shah, Multimodal analysis for identification and segmentation of moving-sounding objects, IEEE Transactions on Multimedia, vol.15, issue.2, pp.378-390, 2013.

E. Kidron, Y. Schechner, and M. Elad, Pixels that sound, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol.1, pp.88-95, 2005.

A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson et al., Visually indicated sounds, Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp.2405-2413, 2016.

A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba, Ambient sound provides supervision for visual learning, Proc. of European Conference on Computer Vision, pp.801-816, 2016.

R. Arandjelović and A. Zisserman, Look, listen and learn, IEEE International Conference on Computer Vision, 2017.

R. Arandjelović and A. Zisserman, Objects that sound, Proceedings of the European Conference on Computer Vision (ECCV), pp.435-451, 2018.

G. Andrew, R. Arora, J. Bilmes, and K. Livescu, Deep canonical correlation analysis, Proc. of International Conference on Machine Learning, pp.1247-1255, 2013.

A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah et al., DCASE 2017 challenge setup: Tasks, datasets and baseline system, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), pp.85-92, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01627981

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier et al., The Kinetics human action video dataset, 2017.

A. S. Bregman, Auditory scene analysis: The perceptual organization of sound, 1994.

A. Mesaros, T. Heittola, O. Dikmen, and T. Virtanen, Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations, ICASSP, pp.151-155, 2015.

X. Zhuang, X. Zhou, M. A. Hasegawa-Johnson, and T. S. Huang, Real-world acoustic event detection, Pattern Recognition Letters, vol.31, issue.12, pp.1543-1551, 2010.

S. Adavanne, P. Pertilä, and T. Virtanen, Sound event detection using spatial features and convolutional recurrent neural network, ICASSP, pp.771-775, 2017.

V. Bisot, S. Essid, and G. Richard, Overlapping sound event detection with supervised nonnegative matrix factorization, ICASSP, pp.31-35, 2017.

A. Kumar and B. Raj, Audio event detection using weakly labeled data, Proceedings of the 2016 ACM on Multimedia Conference, pp.1038-1047, 2016.

J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence et al., Audio Set: An ontology and human-labeled dataset for audio events, ICASSP, pp.776-780, 2017.

Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, Surrey-CVSSP system for DCASE2017 challenge task4, DCASE2017 Challenge, Tech. Rep, 2017.

J. Salamon, B. McFee, and P. Li, DCASE 2017 submission: Multiple instance learning for sound event detection, DCASE2017 Challenge, Tech. Rep, 2017.

A. Kumar, M. Khadkevich, and C. Fügen, Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes, 2017.

Q. Kong, C. Yu, T. Iqbal, Y. Xu, W. Wang et al., Weakly labelled audioset classification with attention neural networks, 2019.

J. Cramer, H. Wu, J. Salamon, and J. P. Bello, Look, listen, and learn more: Design choices for deep audio embeddings, ICASSP, pp.3852-3856, 2019.

Y. Wang, J. Li, and F. Metze, A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling, ICASSP, pp.31-35, 2019.

Q. Kong, Y. Xu, I. Sobieraj, W. Wang, and M. D. Plumbley, Sound event detection and time-frequency segmentation from weakly labelled data, IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.27, issue.4, pp.777-787, 2019.

H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott et al., The sound of pixels, ECCV, 2018.

A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson et al., Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation, ACM Transactions on Graphics, vol.37, issue.4, Article 112, 2018.

R. Gao, R. Feris, and K. Grauman, Learning to separate object sounds by watching unlabeled video, ECCV, 2018.

C. Zhang, J. C. Platt, and P. A. Viola, Multiple instance boosting for object detection, Advances in Neural Information Processing Systems, pp.1417-1424, 2006.

H. Bilen, M. Pedersoli, and T. Tuytelaars, Weakly supervised object detection with posterior regularization, Proceedings BMVC 2014, pp.1-12, 2014.

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Is object localization for free? Weakly-supervised learning with convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.685-694, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01015140

V. Kantorov, M. Oquab, M. Cho, and I. Laptev, ContextLocNet: Context-aware deep network models for weakly supervised localization, European Conference on Computer Vision, pp.350-365, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01421772

H. Bilen and A. Vedaldi, Weakly supervised deep detection networks, CVPR, pp.2846-2854, 2016.

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Learning deep features for discriminative localization, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2921-2929, 2016.

R. G. Cinbis, J. Verbeek, and C. Schmid, Weakly supervised object localization with multi-fold multiple instance learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, pp.189-203, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01123482

H. Bilen, V. P. Namboodiri, and L. Van Gool, Object and action classification with latent window parameters, International Journal of Computer Vision, vol.106, issue.3, pp.237-251, 2014.

T. Deselaers, B. Alexe, and V. Ferrari, Localizing objects while learning their appearance, European Conference on Computer Vision, pp.452-466, 2010.

M. P. Kumar, B. Packer, and D. Koller, Self-paced learning for latent variable models, Advances in Neural Information Processing Systems, pp.1189-1197, 2010.

H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell, Weakly-supervised discovery of visual pattern configurations, Advances in Neural Information Processing Systems, pp.1637-1645, 2014.

A. Kolesnikov and C. H. Lampert, Seed, expand and constrain: Three principles for weakly-supervised image segmentation, European Conference on Computer Vision, pp.695-711, 2016.

G. Gkioxari, R. Girshick, and J. Malik, Contextual action recognition with R*CNN, Proceedings of the IEEE International Conference on Computer Vision, pp.1080-1088, 2015.

C. L. Zitnick and P. Dollár, Edge boxes: Locating object proposals from edges, ECCV, pp.391-405, 2014.

J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, Selective search for object recognition, International Journal of Computer Vision, vol.104, issue.2, pp.154-171, 2013.

R. Girshick, Fast R-CNN, ICCV, pp.1440-1448, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.37, pp.1904-1916, 2015.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems, pp.91-99, 2015.

K. He, G. Gkioxari, P. Dollár, and R. Girshick, Mask R-CNN, 2017 IEEE International Conference on Computer Vision (ICCV), pp.2980-2988, 2017.

J. C. van Gemert, M. Jain, E. Gati, and C. G. Snoek, APT: Action localization proposals from dense trajectories, Proc. of BMVC, vol.2, p.4, 2015.

D. Oneata, J. Revaud, J. Verbeek, and C. Schmid, Spatio-temporal object detection proposals, European Conference on Computer Vision, pp.737-752, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01021902

T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence, vol.89, issue.1-2, pp.31-71, 1997.

X. Wang, M. Yang, S. Zhu, and Y. Lin, Regionlets for generic object detection, 2013 IEEE International Conference on Computer Vision (ICCV), pp.17-24, 2013.

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, pp.580-587, 2014.

J. Hosang, R. Benenson, and B. Schiele, How good are detection proposals, really?, 25th British Machine Vision Conference, pp.1-12, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, pp.1097-1105, 2012.

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A large-scale hierarchical image database, IEEE Conference on Computer Vision and Pattern Recognition, pp.248-255, 2009.

G. Richard, S. Sundaram, and S. Narayanan, An overview on perceptually motivated audio indexing and classification, Proceedings of the IEEE, vol.101, issue.9, pp.1939-1954, 2013.
URL : https://hal.archives-ouvertes.fr/hal-02286490

S. B. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, Readings in speech recognition, pp.65-74, 1990.

S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen et al., CNN architectures for large-scale audio classification, ICASSP, pp.131-135, 2017.

S. Abu-El-Haija, N. Kothari, J. Lee, A. P. Natsev, G. Toderici et al., YouTube-8M: A large-scale video classification benchmark, 2016.

C. Yu, K. S. Barsim, Q. Kong, and B. Yang, Multi-level attention model for weakly supervised audio classification, 2018.

C. Févotte, N. Bertin, and J. Durrieu, Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis, Neural Computation, vol.21, issue.3, pp.793-830, 2009.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

S. Parekh, A. Ozerov, S. Essid, N. Duong, P. Pérez et al., Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision
URL : https://hal.archives-ouvertes.fr/hal-01914532

D. Lee, S. Lee, Y. Han, and K. Lee, Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input, DCASE2017 Challenge, Tech. Rep, 2017.

C. Févotte, E. Vincent, and A. Ozerov, Single-channel audio source separation with NMF: divergences, constraints and algorithms, Audio Source Separation, pp.1-24, 2018.

M. Spiertz and V. Gnann, Source-filter based clustering for monaural blind source separation, Proceedings of International Conference on Digital Audio Effects DAFx'09, 2009.

M. Spiertz and V. Gnann, NMF Mel Clustering Code.

D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, Detection and classification of acoustic scenes and events, IEEE Transactions on Multimedia, vol.17, issue.10, pp.1733-1746, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01123760

E. Vincent, R. Gribonval, and C. Févotte, Performance measurement in blind audio source separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.14, pp.1462-1469, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00544230

J. Lu, J. Yang, D. Batra, and D. Parikh, Hierarchical question-image co-attention for visual question answering, Advances in Neural Information Processing Systems, pp.289-297, 2016.

Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, Audio-visual event localization in unconstrained videos, ECCV, 2018.

He is a professor in Télécom Paris's Department of Images, Data, and Signals and the head of the Audio Data Analysis and Signal Processing team. His research interests are machine learning for audio and multimodal data analysis. He has been involved in various collaborative French and European research projects, among them Quaero, the Networks of Excellence FP6-Kspace, FP7-3DLife, FP7-REVERIE, and FP7-LASIE. He has published over 100 peer-reviewed conference and journal papers, with more than 100 distinct co-authors. On a regular basis, he serves as a reviewer for various machine-learning and signal-processing venues.

Sanjeel Parekh received the B.Tech (Hons.) degree in electronics and communication engineering from the LNM Institute of Information Technology, India, in 2014, and the M.S. degree in Sound and Music Computing from Universitat Pompeu Fabra (UPF).

Alexey Ozerov received the M.Sc. degree in mathematics from Saint Petersburg State University in 1999. He is currently with InterDigital, France. His research interests include various aspects of audio and image/video analysis and processing. Since 2016, he has been a Distinguished Member of the Technicolor Fellowship Network, and he is currently a Member of the IEEE Signal Processing Society Audio and Acoustic Signal Processing Technical Committee and an Associate Editor.

Ngoc Q. K. Duong received the B.S. degree from the Posts and Telecommunications Institute of Technology, Hanoi, Vietnam, in 2004, and the M.S. degree in electronic engineering from Paichai University. He is currently a Senior Scientist with InterDigital, France. He is the co-author of more than 45 scientific papers and about 30 patent submissions. His research interests include signal processing and machine learning applied to audio, image, and video. He was the recipient of several research awards, including the IEEE Signal Processing Society Young Author Best Paper Award.