Efficient deformable modeling network for multi-view 3D object detection

Lee, Han-Lim; Aisha, Qurat Ul Ain; Kim, Byung-Gyu

doi:10.1007/s00521-026-11881-y

Citations (SCOPUS): 0

Abstract

Multi-view 3D object detection is a critical component of camera-based autonomous driving systems. While Bird’s-Eye View (BEV) methods provide strong spatial reasoning, they often suffer from vertical information loss and high computational overhead. More recent sparse query-based approaches improve efficiency but still struggle to align 3D queries with image features and to maintain stable optimization during training. In this work, we present a novel deformable modeling framework that advances sparse query-based 3D object detection through enhanced geometric and motion-aware representation learning. Our approach introduces (i) a 4D query encoding that jointly models object position, scale, orientation, and velocity; (ii) structured denoising across all box parameters to improve early training stability; and (iii) distance-aware feature sampling that enhances multi-view feature alignment. We further employ a lightweight 2D detector for query initialization, eliminating the need for depth supervision. Importantly, all components operate independently of the image backbone, allowing seamless integration with both Convolutional Neural Network (CNN) and Transformer-based architectures. Experiments on the nuScenes validation set demonstrate that our method achieves the highest mean Average Precision (mAP) (45.5%) and the second-highest nuScenes Detection Score (NDS) (55.1%) among ResNet-50-based camera-only detectors, slightly outperforming Stream Position Embedding Transformation (StreamPETR) and closely matching Divided View Position Embedding (DVPE), despite using fewer input frames. Our approach also converges twice as fast and achieves leading performance on key localization, scale, and attribute metrics (mean Average Translation Error (mATE), mean Average Scale Error (mASE), and mean Average Attribute Error (mAAE)), validating its effectiveness and efficiency as a modular enhancement for modern 3D object detection systems.
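
The abstract reports both mAP and NDS, and NDS is a composite of mAP and five true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE). As a reading aid, the following minimal Python sketch shows how the standard nuScenes benchmark combines them. The function name and the example error values are hypothetical and chosen only for illustration; they are not results from this article.

```python
# Sketch of the standard nuScenes Detection Score (NDS) formula:
#   NDS = (1/10) * [5 * mAP + sum over TP metrics of (1 - min(1, error))]
# The function name and the example values below are illustrative assumptions.

TP_METRICS = ("mATE", "mASE", "mAOE", "mAVE", "mAAE")


def nuscenes_detection_score(m_ap, tp_errors):
    """Combine mAP (in [0, 1]) with the five true-positive error metrics."""
    # Each error is clamped to at most 1 and converted to a score in [0, 1].
    tp_scores = [1.0 - min(1.0, tp_errors[name]) for name in TP_METRICS]
    # mAP carries weight 5; each of the five TP scores carries weight 1.
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# Hypothetical error values, NOT taken from the article.
example_errors = {
    "mATE": 0.6,
    "mASE": 0.3,
    "mAOE": 0.4,
    "mAVE": 0.3,
    "mAAE": 0.2,
}
example_nds = nuscenes_detection_score(0.40, example_errors)
print(f"NDS = {example_nds:.3f}")
```

Because mAP carries half of the total weight, a detector can lead on mAP and still rank second on NDS if its orientation or velocity errors are higher than those of a competitor.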

Keywords

4D query denoising; Auxiliary 2D detector; Multi-view 3D Object Detection; Object Detection; Sparse query-based framework

Title: Efficient deformable modeling network for multi-view 3D object detection

Authors: Lee, Han-Lim; Aisha, Qurat Ul Ain; Kim, Byung-Gyu

DOI: 10.1007/s00521-026-11881-y

Publication date: 2026-03

Type: Article

Journal: Neural Computing and Applications

Volume: 38

Issue: 5