Face and voice cross-modal association with learning convex feature embedding
Citations (Web of Science): 0
Citations (Scopus): 0

Abstract

Face-and-voice association learning is one of the most challenging tasks in deep learning. In this paper, we propose a simple but powerful cross-modal feature embedding method for associating faces and voices. Previous work has studied cross-modal association tasks to establish the correlation between voice clips and facial images. These works address cross-modal discrimination but underestimate the importance of handling the heterogeneity of features across the audio and visual modalities, resulting in many false positives and false negatives. To tackle this problem, the proposed method learns cross-modal feature embeddings by introducing an intermediate feature between the cross-modal features, so that the voice and face features of the same person are embedded within a convex hull. Moreover, combining cross-modal attention with the convex embedding is highly effective at reducing false positives and false negatives by minimizing inter-class discrepancies. We exhaustively evaluate our method on cross-modal verification, matching, and retrieval tasks on the large-scale VoxCeleb dataset. Extensive experimental results demonstrate that the proposed method achieves notable improvements over existing state-of-the-art methods.
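As a rough illustration of the idea sketched in the abstract (not the authors' implementation), the following PyTorch-style snippet forms an intermediate feature as a convex combination of a matched face/voice embedding pair, pulls both modalities toward it, and pushes it away from other identities in the batch. All names and hyperparameters (convex_embedding_loss, alpha, margin) are hypothetical assumptions for this sketch.

# Hypothetical sketch of a convex-embedding loss for face-voice association.
# The intermediate point lies on the segment between the two modal embeddings,
# so it is inside their convex hull by construction.
import torch
import torch.nn.functional as F

def convex_embedding_loss(face_emb, voice_emb, alpha=0.5, margin=0.2):
    """face_emb, voice_emb: (B, D) embeddings of matched face/voice pairs."""
    face_emb = F.normalize(face_emb, dim=-1)
    voice_emb = F.normalize(voice_emb, dim=-1)

    # Intermediate feature: convex combination of the two modalities.
    mid = alpha * face_emb + (1.0 - alpha) * voice_emb

    # Positive term: pull each modality toward the shared intermediate feature.
    pos = (1.0 - F.cosine_similarity(face_emb, mid)) + \
          (1.0 - F.cosine_similarity(voice_emb, mid))

    # Negative term: hinge penalty when the intermediate feature of one identity
    # is too similar to the face embeddings of other identities in the batch.
    sim = mid @ face_emb.t()                                   # (B, B)
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = F.relu(sim[off_diag] - (1.0 - margin)).mean()

    return pos.mean() + neg

if __name__ == "__main__":
    f = torch.randn(8, 128)   # stand-in face embeddings for 8 identities
    v = torch.randn(8, 128)   # stand-in voice embeddings for the same identities
    print(convex_embedding_loss(f, v))

In practice, the face and voice encoders and the cross-modal attention module described in the paper would produce these embeddings; the snippet only illustrates how a convex combination can serve as a shared anchor for both modalities.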

Keywords

Face-voice association; Multi-modal matching; Cross-modal retrieval; Feature embedding; RECOGNITION; NETWORKS
Title
Face and voice cross-modal association with learning convex feature embedding
Authors
Kim, Taewan; Kang, Jiwoo
DOI
10.1007/s00530-025-01872-9
Publication date
2025-07
Type
Article
Journal
Multimedia Systems
Volume
31
Issue
4