Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach

1 Harbin Institute of Technology
2 Tsinghua University
3 Wuhan University
4 Harbin Institute of Technology (Shenzhen)
🎉🎉🎉 TPAMI 2025
* Equal contribution    Corresponding author

Abstract

Image fusion aims to blend complementary information from diverse modalities, yet most current methods lack robustness in complex fusion scenarios and cannot flexibly accommodate user intent. We present DiTFuse, the first Diffusion-Transformer (DiT) framework for instruction-driven, dynamic fusion control. Guided by natural-language commands, DiTFuse flexibly blends multimodal content to match diverse preferences and scenarios. Training employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ideal reference images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared–visible, multi-focus, and multi-exposure fusion—as well as text-controlled refinement and downstream tasks—within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.

Contributions

Multimodal parallel architecture for image fusion

We propose a DiT-based framework with a parallel input structure that jointly fuses text and visual features for multi-modal fusion (IVIF, MFF, MEF). This design builds a robust backbone that reduces modality redundancy in both pre- and post-fusion stages.

Multi-objective hybrid self-supervised training

Our self-supervised framework combines three key elements: M3-based noisy pair generation for realistic priors, mean IVIF fusion to bridge modality gaps, and multi-prompt text-conditioned data for global control. This enables scalable multimodal alignment without ground-truth labels.

Instruction-driven end-to-end controllable fusion

DiTFuse is the first end-to-end framework that directly controls fusion results via natural-language instructions by aligning text and multi-modal features in the latent space. It supports fine-grained visual control, instruction-following segmentation, and strong zero-shot generalization to related tasks.

Training & Data Pipeline

Training and Inference Pipeline of DiTFuse

Training & Inference Pipeline. Textual control information is encoded by the Text Tokenizer, while image data is mapped into visual embeddings via VAE encoders. These conditional signals guide the DiT backbone during the denoising process. The left half illustrates the training stage with M3-based constraints; the right half shows inference, where the unified framework supports multiple fusion tasks and downstream applications.
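The conditioning flow described above can be illustrated with a minimal numpy sketch. This is not the DiTFuse implementation: the encoders, `denoise_step`, and all shapes are toy stand-ins, showing only the idea that text tokens and the two visual token streams are placed in one parallel sequence that conditions each denoising update.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared latent width (illustrative, not the real model size)

def toy_encoder(x, W):
    # Stand-in for a VAE encoder / text tokenizer: project rows into D dims.
    return x @ W

# Hypothetical inputs: two modality images flattened to token rows, plus a prompt.
ir_tokens  = toy_encoder(rng.random((4, 8)), rng.random((8, D)))
vis_tokens = toy_encoder(rng.random((4, 8)), rng.random((8, D)))
txt_tokens = toy_encoder(rng.random((2, 8)), rng.random((8, D)))

# Parallel input: all conditions share one token sequence for the backbone.
cond = np.concatenate([txt_tokens, ir_tokens, vis_tokens], axis=0)  # (10, D)

def denoise_step(z_t, cond, W_out, alpha=0.9):
    # One toy denoising update: pool the conditions into a global summary,
    # predict a correction, and blend it into the noisy latent.
    ctx = cond.mean(axis=0, keepdims=True)      # global conditioning summary
    eps_hat = (z_t + ctx) @ W_out               # stand-in noise prediction
    return alpha * z_t + (1 - alpha) * eps_hat  # move latent toward estimate

z = rng.normal(size=(4, D))  # noisy fused latent
W_out = rng.random((D, D)) * 0.01
for _ in range(5):
    z = denoise_step(z, cond, W_out)
```

In the real model the mean-pooling would be replaced by the DiT's attention over the joint token sequence; the sketch only fixes the data flow: encode, concatenate, condition each step.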

Multi-degradation Mask Image Modeling and Data Construction

Data Construction Pipeline. The upper part shows the Multi-degradation Mask Image Modeling (M3) process, which creates noisy pairs through mixed degradations such as masking, noise, and blur. The lower part builds a Ground Truth Pool by adjusting contrast and illumination or overlaying transparent masks for segmentation targets, enabling self-supervised learning without ideal reference images.
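The mixed-degradation idea can be sketched as follows. This is a simplified illustration, not the paper's M3 pipeline: patch size, mask ratio, noise level, and the box blur are all assumed values chosen for readability.

```python
import numpy as np

def m3_degrade(img, rng, mask_ratio=0.3, noise_sigma=0.05, blur_kernel=3):
    """Apply a random mix of masking, Gaussian noise, and box blur to a
    [0, 1] grayscale image, mimicking the noisy-pair construction."""
    out = img.copy()
    h, w = out.shape
    # 1) Patch masking: zero out a random fraction of 8x8 patches.
    ph, pw = h // 8, w // 8
    n_mask = int(mask_ratio * ph * pw)
    for i in rng.choice(ph * pw, size=n_mask, replace=False):
        r, c = divmod(i, pw)
        out[r * 8:(r + 1) * 8, c * 8:(c + 1) * 8] = 0.0
    # 2) Additive Gaussian noise.
    out = out + rng.normal(0.0, noise_sigma, out.shape)
    # 3) Simple box blur via a sliding mean over the padded image.
    k = blur_kernel
    pad = np.pad(out, k // 2, mode="edge")
    out = np.mean([pad[i:i + h, j:j + w] for i in range(k) for j in range(k)],
                  axis=0)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = rng.random((64, 64))
noisy = m3_degrade(clean, rng)  # (clean, noisy) forms one training pair
```

Pairing each clean source with such a degraded copy lets a restoration-style objective supervise the network without any ideal fused reference.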

Qualitative Results

Quantitative Comparison Results

We report quantitative comparisons on infrared–visible fusion (IVIF), multi-focus fusion (MFF), multi-exposure fusion (MEF), and multi-modal segmentation. In the original tables, the best and second-best results are marked in gold and with blue underlining, respectively.
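As a reminder of how the two reference-based metrics in Table I relate, the sketch below computes MSE and the standard PSNR formula for images with peak value `peak` (the exact normalization used for the reported table values is not specified here, so the numbers below are illustrative only).

```python
import numpy as np

def mse(a, b):
    # Mean squared error between two images of the same shape.
    return float(np.mean((a - b) ** 2))

def psnr(a, b, peak=1.0):
    # Peak signal-to-noise ratio in dB; higher means closer to the reference.
    m = mse(a, b)
    return float("inf") if m == 0 else 10.0 * np.log10(peak ** 2 / m)

ref = np.zeros((8, 8))
est = np.full((8, 8), 0.1)
# mse = 0.01, so psnr = 10 * log10(1 / 0.01) = 20 dB
```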

Table I. Quantitative comparison on the MSRS, M3FD, and TNO datasets.

Metrics within each dataset group, left to right: MSE↓ / PSNR↑ / MANIQA↑ / LIQE↑ / CLIP-IQA↑.

| Method | MSRS | M3FD | TNO |
| --- | --- | --- | --- |
| SwinFusion | 0.038 / 64.52 / 0.138 / 1.108 / 0.312 | 0.059 / 61.37 / 0.284 / 1.704 / 0.481 | 0.059 / 61.36 / 0.164 / 1.010 / 0.229 |
| SeAFusion | 0.038 / 64.33 / 0.144 / 1.138 / 0.355 | 0.060 / 61.12 / 0.288 / 1.641 / 0.466 | 0.058 / 61.63 / 0.187 / 1.013 / 0.267 |
| PMGI | 0.066 / 60.32 / 0.142 / 1.030 / 0.244 | 0.038 / 62.91 / 0.277 / 1.359 / 0.420 | 0.044 / 62.23 / 0.162 / 1.013 / 0.212 |
| DDBFusion | 0.021 / 66.93 / 0.138 / 1.102 / 0.285 | 0.032 / 63.81 / 0.274 / 1.445 / 0.440 | 0.039 / 62.97 / 0.199 / 1.019 / 0.244 |
| DDFM | 0.022 / 66.60 / 0.142 / 1.053 / 0.296 | 0.033 / 63.58 / 0.296 / 1.553 / 0.452 | 0.045 / 62.21 / 0.187 / 1.019 / 0.253 |
| DeFusion | 0.026 / 66.08 / 0.132 / 1.042 / 0.318 | 0.036 / 63.52 / 0.276 / 1.425 / 0.433 | 0.040 / 63.40 / 0.185 / 1.015 / 0.253 |
| U2Fusion | 0.022 / 66.46 / 0.154 / 1.096 / 0.327 | 0.033 / 63.61 / 0.282 / 1.423 / 0.506 | 0.038 / 63.08 / 0.201 / 1.014 / 0.256 |
| Text-DiFuse | 0.092 / 58.62 / 0.131 / 0.984 / 0.284 | 0.058 / 60.80 / 0.275 / 1.478 / 0.423 | 0.056 / 61.19 / 0.197 / 1.027 / 0.270 |
| Text-IF | 0.039 / 64.10 / 0.140 / 1.107 / 0.362 | 0.051 / 62.10 / 0.286 / 1.661 / 0.457 | 0.051 / 62.02 / 0.190 / 1.023 / 0.281 |
| DiTFuse | 0.021 / 66.63 / 0.162 / 1.240 / 0.392 | 0.032 / 63.81 / 0.299 / 1.718 / 0.498 | 0.036 / 63.50 / 0.209 / 1.019 / 0.297 |

Table II. Quantitative comparison on the MFIF, RealMFF, and SICE datasets. MFIF and RealMFF are multi-focus datasets, while SICE is a multi-exposure dataset.

Metrics for MFIF and RealMFF, left to right: SF↑ / AG↑ / LIQE↑ / MUSIQ↑ / CLIP-IQA↑; for SICE: EN↑ / SD↑ / LIQE↑ / MUSIQ↑ / CLIP-IQA↑.

| Method | MFIF | RealMFF | SICE |
| --- | --- | --- | --- |
| ZMMF | 16.77 / 5.863 / 2.476 / 49.32 / 0.483 | 14.86 / 5.703 / 3.037 / 53.23 / 0.566 | 7.286 / 0.268 / 3.774 / 69.30 / 0.667 |
| PMGI | 11.36 / 4.226 / 2.195 / 53.63 / 0.581 | 15.43 / 5.884 / 2.610 / 52.55 / 0.511 | 6.677 / 0.156 / 2.592 / 70.72 / 0.599 |
| SwinFusion | 20.16 / 6.941 / 2.994 / 58.53 / 0.628 | 18.32 / 6.577 / 2.943 / 55.72 / 0.543 | 7.133 / 0.271 / 3.571 / 67.50 / 0.665 |
| U2Fusion | 22.11 / 8.060 / 3.119 / 62.14 / 0.658 | 16.64 / 6.370 / 2.021 / 53.40 / 0.501 | 7.328 / 0.262 / 3.346 / 67.67 / 0.673 |
| DDBFusion | 15.51 / 5.374 / 2.795 / 57.33 / 0.623 | 14.19 / 5.302 / 2.683 / 53.31 / 0.534 | 7.453 / 0.268 / 3.684 / 69.30 / 0.686 |
| DeFusion | 12.60 / 4.679 / 2.856 / 58.95 / 0.619 | 12.41 / 4.700 / 2.421 / 51.81 / 0.517 | 7.315 / 0.240 / 3.567 / 68.97 / 0.650 |
| DiTFuse | 23.81 / 8.260 / 3.891 / 68.42 / 0.668 | 18.26 / 6.634 / 3.408 / 58.46 / 0.572 | 7.532 / 0.274 / 4.005 / 70.32 / 0.693 |

Table III. Quantitative comparison of DiTFuse and LISA on multi-modal segmentation.

| Class | DiTFuse | LISA (Fusion) | LISA (VIS) | LISA (IR) |
| --- | --- | --- | --- | --- |
| Building | 0.5271 | 0.5598 | 0.5654 | 0.5622 |
| Bus | 0.4087 | 0.2871 | 0.3142 | 0.2687 |
| Car | 0.6426 | 0.5829 | 0.5955 | 0.4621 |
| Motorcycle | 0.2903 | 0.1923 | 0.2790 | 0.0877 |
| Person | 0.3776 | 0.2238 | 0.2117 | 0.2684 |
| Pole | 0.2405 | 0.2193 | 0.2289 | 0.1184 |
| Road | 0.7436 | 0.7744 | 0.7640 | 0.7660 |
| Sidewalk | 0.2039 | 0.3143 | 0.3357 | 0.2673 |
| Sky | 0.8915 | 0.8869 | 0.8998 | 0.8674 |
| Truck | 0.3161 | 0.2695 | 0.2825 | 0.2037 |
| Vegetation | 0.5936 | 0.5137 | 0.5624 | 0.4659 |
| Overall | 0.4760 | 0.4386 | 0.4581 | 0.3943 |

BibTeX

@article{ditfuse2025,
  title={Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach},
  author={Jiayang Li and Chengjie Jiang and Junjun Jiang and Pengwei Liang and Jiayi Ma and Liqiang Nie},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025}
}