Detail View
- Kim, Taewan
- Kang, Jiwoo
Abstract
Face-and-voice association learning is one of the most challenging tasks in deep learning. In this paper, we propose a simple but powerful cross-modal feature embedding method for associating faces and voices. Previous work has studied cross-modal association tasks to establish the correlation between voice clips and facial images. These works address cross-modal discrimination but underestimate the heterogeneity between audio and visual features, resulting in many false positives and false negatives. To tackle this problem, the proposed method learns cross-modal feature embeddings by introducing an intermediate feature between the cross-modal features, encouraging the voice and face features of the same person to be embedded within a convex hull. Moreover, combining cross-modal attention mechanisms with the convex embedding technique effectively attenuates false positives and false negatives by minimizing inter-class discrepancies. We exhaustively evaluated our method on cross-modal verification, matching, and retrieval tasks on the large-scale VoxCeleb dataset. Extensive experimental results demonstrate that the proposed method achieves notable improvements over existing state-of-the-art methods.
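The core idea above, placing an intermediate feature between the two modality embeddings so that same-identity face and voice features fall inside a shared convex hull, can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the choice of a two-point convex combination, and the toy 4-D embeddings are all assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Project an embedding onto the unit hypersphere.
    return x / (np.linalg.norm(x) + eps)

def convex_bridge(face_emb, voice_emb, alpha=0.5):
    """Hypothetical 'bridge' feature: a convex combination of the face
    and voice embeddings of one identity. Any alpha in [0, 1] keeps the
    bridge inside the convex hull (here, the line segment) spanned by
    the two modality features."""
    assert 0.0 <= alpha <= 1.0
    return alpha * face_emb + (1.0 - alpha) * voice_emb

# Toy 4-D embeddings (illustrative values, not real model outputs).
face = l2_normalize(np.array([1.0, 0.2, 0.0, 0.5]))
voice = l2_normalize(np.array([0.8, 0.4, 0.1, 0.6]))
bridge = convex_bridge(face, voice, alpha=0.5)

# The bridge is at least as close to each modality as the modalities
# are to each other, which is the intuition behind pulling
# same-identity cross-modal features together.
d_fv = np.linalg.norm(face - voice)
d_fb = np.linalg.norm(face - bridge)
d_vb = np.linalg.norm(voice - bridge)
print(d_fb <= d_fv and d_vb <= d_fv)
```

In a training setting, such a bridge feature would typically enter a contrastive or triplet-style loss so that same-identity features are pulled toward it while other identities are pushed away; the loss design here is left out of the sketch.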
Keywords
- Title
- Face and voice cross-modal association with learning convex feature embedding
- Authors
- Kim, Taewan; Kang, Jiwoo
- Publication Date
- 2025-07
- Type
- Article
- Volume
- 31
- Issue
- 4