RMSF-ViT: Randomized Multi-scale Fusion Vision Transformer

Abstract

The Vision Transformer (ViT) has demonstrated remarkable performance in a wide range of computer vision tasks, such as image classification, object detection, and image generation. Unlike convolutional neural networks (CNNs), ViT benefits from a global receptive field, which enables more effective modeling of relationships between image patches. However, the lack of inductive biases makes ViT models difficult to train stably, especially on limited datasets. Without access to large-scale pre-trained weights, performance often degrades significantly. To address this issue, we propose a novel architecture called RMSF-ViT. It employs a progressive fusion strategy that incorporates fine-grained patch information beyond the fixed single patch size used in conventional ViT architectures. In addition, RMSF-ViT reduces the number of attention heads by half compared to vanilla ViT models. This design improves both performance and computational efficiency, as demonstrated on the CIFAR-10, CIFAR-100, Flowers, and Pets datasets.
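The abstract does not specify the paper's exact fusion mechanism. As a rough illustration of the general idea it describes — embedding the same image at more than one patch size and fusing the resulting tokens into a single sequence — here is a minimal NumPy sketch; the patch sizes (8 and 4), embedding dimension, random projection, and concatenation-style fusion are all illustrative assumptions, not the authors' method:

```python
import numpy as np

def patch_embed(img, patch, dim, rng):
    """Split img (H, W, C) into non-overlapping patches and linearly project each."""
    H, W, C = img.shape
    h, w = H // patch, W // patch
    patches = (img[:h * patch, :w * patch]
               .reshape(h, patch, w, patch, C)
               .transpose(0, 2, 1, 3, 4)      # group pixels by patch
               .reshape(h * w, patch * patch * C))
    proj = rng.standard_normal((patch * patch * C, dim)) * 0.02  # stand-in for a learned projection
    return patches @ proj                      # (num_patches, dim) token sequence

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))         # toy CIFAR-sized image
coarse = patch_embed(img, 8, 64, rng)          # 16 coarse tokens
fine = patch_embed(img, 4, 64, rng)            # 64 fine-grained tokens
tokens = np.concatenate([coarse, fine], axis=0)  # fused multi-scale sequence: (80, 64)
```

In a real model the projections would be learned, and the fused sequence would then feed the transformer encoder; the sketch only shows how two patch scales yield one token sequence.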

Keywords

Deep Learning; Image Classification; Multi-Scale Fusion; Multi-Scale Patch Embedding; Vision Transformer
Title
RMSF-ViT: Randomized Multi-scale Fusion Vision Transformer
Authors
Cho, Yu-jin; Lee, Ah Hyeon; Kim, Byung Gyu; Platoš, Jan
DOI
10.1007/978-981-95-3141-7_12
Publication date
2025-10
Type
Conference paper
Journal
Communications in Computer and Information Science, vol. 2675
Pages
125–137