Detail View
Abstract
We propose See2Hear (S2H), a framework that jointly learns audio-visual representations for object detection and sound source separation from videos. Existing methods do not fully exploit the synergy between the detection and separation tasks, often relying on disjointly pre-trained visual encoders. Our S2H integrates both tasks in an end-to-end trainable unified structure using transformer-based architectures. A naive combination of these approaches, however, results in suboptimal performance. To resolve this issue, we propose a dynamic filtering mechanism that selects relevant object queries from the object detector. We conduct extensive experiments to verify that our approach achieves state-of-the-art performance in audio source separation on MUSIC and MUSIC-21, while maintaining competitive object detection performance. Ablation studies confirm that the joint training of detection and separation is mutually beneficial for both tasks. © S. Kim, Y. Choi, D. Lee, S. Lee, E. Lyou, S. Kim, J. Noh, and J. Lee.
- Title
- JOINT OBJECT DETECTION AND SOUND SOURCE SEPARATION
- Authors
- Kim, Sunyoo; Choi, Yunjeong; Lee, Doyeon; Lee, Seoyoung; Lyou, Eunyi; Kim, Seungju; Noh, Junhyug; Lee, Joonseok
- Publication Date
- 2025-09
- Type
- Book chapter
- Journal
- Proceedings of the International Society for Music Information Retrieval Conference
- Volume
- 2025
- Pages
- 813–820