DiffBlender: Composable and versatile multimodal text-to-image diffusion models
Citations

Web of Science: 1
Scopus: 2

Abstract

In this study, we aim to enhance the capabilities of diffusion-based text-to-image (T2I) generation models by integrating diverse modalities beyond textual descriptions within a unified framework. To this end, we categorize widely used conditional inputs into three modality types: structure, layout, and attribute. We propose a multimodal T2I diffusion model, DiffBlender, which is capable of processing all three modalities within a single architecture. Importantly, this is achieved without modifying the parameters of the pre-trained diffusion model, as only a small subset of components is updated. Our approach sets new benchmarks in multimodal generation through extensive quantitative and qualitative comparisons with existing conditional generation methods. We demonstrate that DiffBlender effectively integrates multiple sources of information and supports diverse applications in detailed image synthesis. The code and demo are available at https://github.com/sungnyun/diffblender.
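To illustrate the conditioning scheme the abstract describes (a frozen pre-trained diffusion backbone extended with a small set of trainable, per-modality components for structure, layout, and attribute inputs), the following PyTorch-style sketch shows one possible arrangement. All module names, dimensions, and the backbone call signature are illustrative assumptions, not DiffBlender's actual implementation; see the linked repository for the real code.

import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    # Small trainable module mapping one condition type to a feature embedding
    # (hypothetical; dimensions are placeholders).
    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(cond_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return self.proj(cond)

class MultimodalConditioner(nn.Module):
    # Wraps a pre-trained diffusion backbone whose weights stay frozen;
    # only the newly added per-modality adapters receive gradient updates.
    def __init__(self, backbone: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # keep pre-trained parameters unchanged
        self.adapters = nn.ModuleDict({
            "structure": ModalityAdapter(cond_dim=64, hidden_dim=hidden_dim),
            "layout":    ModalityAdapter(cond_dim=32, hidden_dim=hidden_dim),
            "attribute": ModalityAdapter(cond_dim=16, hidden_dim=hidden_dim),
        })

    def trainable_parameters(self):
        # Only the small adapter subset is optimized.
        return self.adapters.parameters()

    def forward(self, latents, timestep, text_emb, conditions: dict):
        # Any subset of modalities may be supplied, so the model stays
        # composable at inference time; missing modalities are simply omitted.
        extra = [self.adapters[name](c) for name, c in conditions.items()]
        cond_emb = torch.stack(extra, dim=1) if extra else None
        # The backbone signature below is an assumption for this sketch.
        return self.backbone(latents, timestep, text_emb, cond_emb)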

Keywords

Diffusion model; Multimodal; Synthesis; Text-to-image
Title
DiffBlender: Composable and versatile multimodal text-to-image diffusion models
Authors
Kim, Sungnyun; Lee, Junsoo; Hong, Kibeom; Kim, Daesik; Ahn, Namhyuk
DOI
10.1016/j.eswa.2025.129345
Publication date
2026-02
Type
Article
Journal
Expert Systems with Applications, 297