TL;DR: VIRAL directly aligns an MLLM's visual tokens with features from pretrained vision foundation models, preserving fine-grained visual details and boosting vision-centric reasoning beyond what text-only supervision provides.

Introducing VIsual Representation ALignment (VIRAL)

VIsual Representation ALignment (VIRAL) is a simple but effective regularization strategy that aligns the internal visual representations of MLLMs with those from pretrained vision foundation models (VFMs). By explicitly supervising intermediate visual tokens with rich VFM features, VIRAL preserves semantically meaningful visual information that would otherwise be discarded under text-only learning.

VIRAL teaser

VIRAL introduces an auxiliary regularization objective on the visual pathway to prevent MLLMs from discarding detailed attributes during training.

VIRAL, when trained with DINOv2 as the vision foundation model (VFM), consistently produces more accurate visually grounded responses and achieves substantial improvements over standard baselines across diverse vision encoders, including CLIP and SigLIPv2!

Pilot study

Comparisons diagram

(a) baseline visual instruction tuning, (b) re-injecting visual features from the post-projection layer, (c) re-injecting from the pre-projection layer, and (d) the proposed visual representation alignment.

(e) Layer-wise alignment between visual tokens in MLLMs and vision encoder features measured by CKNNA, with shaded regions highlighting middle layers that are especially important for visual understanding. (f) Benchmark performance corresponding to (a–d)

Do MLLMs undergo visual information loss?

MLLMs are trained with text-only supervision, causing their visual features to drift from the encoder’s rich representations. We measure this with CKNNA and observe a sharp drop in similarity to CLIP features across layers.
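For intuition, here is a toy mutual nearest-neighbor overlap score in the spirit of that measurement. It is only a simplified stand-in, not the exact CKNNA metric used in the paper, and the inputs (`feats_mllm`, `feats_enc`, one pooled vector per image for the same set of images) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def knn_indices(x: torch.Tensor, k: int) -> torch.Tensor:
    # Indices of each row's k nearest neighbors (self excluded) under cosine similarity.
    x = F.normalize(x, dim=-1)
    sim = x @ x.T                          # [N, N] pairwise similarities
    sim.fill_diagonal_(float("-inf"))      # drop self-matches
    return sim.topk(k, dim=-1).indices     # [N, k]

def mutual_knn_alignment(feats_mllm: torch.Tensor, feats_enc: torch.Tensor, k: int = 10) -> float:
    # Average fraction of nearest neighbors shared between the two feature spaces.
    nn_a, nn_b = knn_indices(feats_mllm, k), knn_indices(feats_enc, k)
    shared = sum(len(set(a.tolist()) & set(b.tolist())) for a, b in zip(nn_a, nn_b))
    return shared / (feats_mllm.shape[0] * k)
```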

Is preserving visual information beneficial?

Motivated by the observation that the internal visual representations of MLLMs diverge from the encoder's original features, we ask whether re-injecting those features could help. We add a residual connection that feeds the visual features given to the language model back into its mid-layers. This preserves alignment with the encoder and yields consistent performance gains on multimodal benchmarks.
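A minimal sketch of this probe, under assumed shapes and names (not the released implementation): the post-projection visual tokens that enter the language model are added back onto their positions in the hidden states of a middle decoder layer.

```python
import torch

def reinject_visual_features(hidden_states: torch.Tensor,    # [B, T, D] mid-layer hidden states
                             visual_tokens: torch.Tensor,     # [B, V, D] LLM-input visual features
                             visual_positions: torch.Tensor,  # [B, V] indices of visual tokens in the sequence
                             alpha: float = 1.0) -> torch.Tensor:
    # Residual re-injection: add the original visual tokens back at their positions.
    out = hidden_states.clone()
    batch_idx = torch.arange(hidden_states.size(0)).unsqueeze(-1)  # [B, 1], broadcasts over V
    out[batch_idx, visual_positions] = out[batch_idx, visual_positions] + alpha * visual_tokens
    return out
```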

Can we compensate for visual information loss with raw vision encoder features?

We also tested re-injecting raw, pre-projection vision encoder features into the mid-layers via a residual branch, but this degraded performance: the features are not language-aligned and disrupt the intermediate representations. To address this, we introduce a more principled strategy that explicitly aligns intermediate visual representations with frozen encoder features through a Visual Representation Alignment (VRA) loss.

This approach yields general improvements, though limitations remain: encoder features can also transmit their inherent biases, as observed in MMVP.

VIRAL

What should serve as the alignment target?

While aligning multimodal models with their own encoder features preserves some visual cues, it is fundamentally constrained by the encoder’s representational capacity. To overcome this limitation, we leverage stronger vision foundation models (VFMs) as teachers, providing richer, vision-centric targets.

Building on this insight, we introduce VIRAL, which aligns intermediate MLLM representations to features from pretrained VFMs, thereby preserving more informative visual semantics than those available from the input encoder alone.

VIRAL framework illustration

Building upon our findings in visual representation alignment, we align the visual-pathway representations of the MLLM with strong, informative representations from VFMs to improve its visual understanding.
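To make the idea concrete, below is a minimal sketch of an alignment regularizer of this kind. It assumes a trainable projection head that maps the layer-l visual hidden states into the VFM feature space, frozen DINOv2 patch features resampled to match the number of visual tokens, and a cosine-similarity objective; names and shapes are illustrative rather than the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    # Trainable head mapping MLLM visual hidden states into the VFM feature space.
    def __init__(self, llm_dim: int, vfm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(llm_dim, vfm_dim), nn.GELU(), nn.Linear(vfm_dim, vfm_dim))

    def forward(self, visual_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_hidden)                      # [B, V, vfm_dim]

def vra_loss(visual_hidden: torch.Tensor,   # [B, V, llm_dim] layer-l visual-token states
             vfm_feats: torch.Tensor,       # [B, V, vfm_dim] frozen VFM patch features
             head: AlignmentHead) -> torch.Tensor:
    pred = F.normalize(head(visual_hidden), dim=-1)
    target = F.normalize(vfm_feats.detach(), dim=-1)         # the VFM teacher stays frozen
    return (1.0 - (pred * target).sum(dim=-1)).mean()        # 1 - cosine similarity per token

# The regularizer is simply added to the usual language-modeling loss,
# e.g. total_loss = lm_loss + lambda_vra * vra_loss(...), with lambda_vra a weighting hyperparameter.
```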

Effect of visual representation alignment

VIRAL demonstrates its effectiveness on benchmarks covering vision-centric tasks (CV-Bench2D, What's Up, and MMVP), hallucination (POPE), and general multimodal understanding (MMStar, MME). Under identical training conditions, VIRAL consistently surpasses the baseline, with its strongest improvements on fine-grained visual understanding. These gains are achieved by leveraging DINOv2 as the VFM.

Generally applicable to MLLMs across diverse vision encoders.

Our method consistently boosts MLLM performance on vision-centric tasks by aligning intermediate visual features with a strong vision foundation model. Gains persist even with SigLIPv2 as the input encoder, showing that regularizing visual representations benefits MLLMs beyond compensating for the limits of contrastive pretraining.

| Language Model | Vision Encoder | VRA Loss | CV-Bench2D | MMVP | What's Up | POPE | MMStar | MME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vicuna-1.5-7B | CLIP | ✗ | 56.82% | 28.20% | 40.13% | 85.70% | 33.93% | 1650.21 |
| Vicuna-1.5-7B | CLIP | ✓ | 59.67% | 33.33% | 48.55% | 87.43% | 33.93% | 1694.52 |
| Vicuna-1.5-7B | SigLIPv2 | ✗ | 58.90% | 28.22% | 40.90% | 90.13% | 37.20% | 1738.96 |
| Vicuna-1.5-7B | SigLIPv2 | ✓ | 62.66% | 33.11% | 44.40% | 90.77% | 37.20% | 1835.62 |

Even with a much stronger vision encoder, adopting our regularization loss still leads to consistent improvements.

Generally applicable to MLLMs across model scales and LLM backbones.

Experiments with Qwen2.5-7B and a scaled-up Vicuna-1.5-13B show that VIRAL's improvements persist at larger model scales and generalize across different LLM backbones.

| Vision Encoder | Language Model | VRA Loss | CV-Bench2D | MMVP | What's Up | POPE | MMStar | MME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP | Qwen2.5-7B | ✗ | 58.97% | 33.47% | 59.08% | 85.88% | 39.20% | 1743.56 |
| CLIP | Qwen2.5-7B | ✓ | 60.50% | 36.07% | 63.57% | 84.92% | 39.67% | 1765.65 |
| CLIP | Vicuna-1.5-13B | ✗ | 57.51% | 39.33% | 44.44% | 87.12% | 34.47% | 1599.04 |
| CLIP | Vicuna-1.5-13B | ✓ | 58.97% | 45.33% | 62.26% | 87.79% | 37.00% | 1636.62 |

Across different LLM backbones, the regularization loss remains effective and yields consistent improvements.

Key Improvements

With VIRAL, performance on vision-centric tasks such as counting and spatial reasoning improves significantly, outperforming the LLaVA-1.5-7B baseline on challenging visual questions.

Qualitative examples

The left part presents PCA visualizations of intermediate representations, demonstrating that VIRAL yields more structured, semantically meaningful visual embeddings. The right part illustrates instance counting and spatial relation tasks, highlighting scenarios where VIRAL correctly answers questions while the baseline fails.

Ablation study on key design components

We identify which visual features and which layers are best for aligning MLLM representations. Among the VFMs tested, DINOv2 performs best, so we adopt it as the default alignment target. Layer-wise analysis shows that aligning at the 16th layer yields the best results, with single-layer alignment outperforming multi-layer targets. Finally, we compare direct feature alignment (cosine similarity) with a relation-based objective over self-similarity matrices; a brief sketch of the relation-based variant follows the ablation tables below.

Ablation on different VFMs

| VFM | Layer | Objective | CV-Bench2D | MMVP | What's Up | POPE | MME |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | – | – | 56.82% | 28.20% | 40.13% | 85.70% | 1650.21 |
| DINOv2 | 16 | Cos. Sim. | 59.67% | 33.33% | 48.55% | 88.32% | 1694.52 |
| CLIP | 16 | Cos. Sim. | 57.51% | 29.33% | 44.50% | 87.17% | 1548.49 |
| SAM | 16 | Cos. Sim. | 57.58% | 30.27% | 49.84% | 88.34% | 1648.77 |
| DAv2 | 16 | Cos. Sim. | 58.55% | 28.67% | 47.29% | 88.70% | 1682.42 |
| RADIO | 16 | Cos. Sim. | 57.59% | 31.80% | 47.35% | 88.52% | 1692.94 |

Ablation on single-layer targets

| VFM | Layer | Objective | CV-Bench2D | MMVP | What's Up | POPE | MME |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DINOv2 | 4 | Cos. Sim. | 58.55% | 30.67% | 45.05% | 87.68% | 1720.36 |
| DINOv2 | 8 | Cos. Sim. | 58.28% | 27.70% | 48.32% | 88.43% | 1662.67 |
| DINOv2 | 12 | Cos. Sim. | 57.77% | 28.59% | 48.19% | 88.27% | 1648.88 |
| DINOv2 | 16 | Cos. Sim. | 59.67% | 33.33% | 48.55% | 88.32% | 1694.52 |
| DINOv2 | 20 | Cos. Sim. | 55.22% | 27.41% | 48.04% | 88.39% | 1705.97 |
| DINOv2 | 24 | Cos. Sim. | 55.77% | 27.48% | 47.99% | 88.10% | 1740.55 |
| DINOv2 | 28 | Cos. Sim. | 54.87% | 27.19% | 47.82% | 88.56% | 1755.86 |
| DINOv2 | 32 | Cos. Sim. | 56.12% | 26.52% | 47.60% | 87.32% | 1678.69 |

Ablation on multi-layer targets

| VFM | Layers | Objective | CV-Bench2D | MMVP | What's Up | POPE | MME |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DINOv2 | 15–17 | Cos. Sim. | 59.32% | 28.00% | 47.17% | 87.61% | 1639.72 |
| DINOv2 | 14–18 | Cos. Sim. | 49.62% | 22.55% | 42.58% | 87.90% | 1444.32 |

Ablation on different objectives

| VFM | Layer | Objective | CV-Bench2D | MMVP | What's Up | POPE | MME |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DINOv2 | 16 | Relation | 58.83% | 26.60% | 49.05% | 87.58% | 1674.30 |
| DINOv2 | 16 | Cos. Sim. | 59.67% | 33.33% | 48.55% | 88.32% | 1694.52 |

Ablation results across VFMs, layers, and objectives. Consistent gains are observed with VIRAL regularization.
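For reference, here is a hedged sketch of the relation-based objective compared in the last ablation: instead of matching features directly, it matches the token-token self-similarity matrices of the MLLM visual states and the VFM features. The formulation and names are assumptions for illustration, not the official implementation.

```python
import torch
import torch.nn.functional as F

def relation_alignment_loss(visual_hidden: torch.Tensor,  # [B, V, D1] MLLM visual-token states
                            vfm_feats: torch.Tensor       # [B, V, D2] frozen VFM patch features
                            ) -> torch.Tensor:
    a = F.normalize(visual_hidden, dim=-1)
    b = F.normalize(vfm_feats.detach(), dim=-1)
    sim_a = a @ a.transpose(-1, -2)                        # [B, V, V] self-similarity of MLLM tokens
    sim_b = b @ b.transpose(-1, -2)                        # [B, V, V] self-similarity of VFM features
    return F.mse_loss(sim_a, sim_b)                        # match relational structure, not raw features
```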

Analysis

Attention analysis
Qualitative comparison of text-to-image attention maps (left) and the quantified spatial entropy of attention across layers and heads (right). Applying visual representation alignment encourages the model to attend to more contextually important content, yielding a more focused and structured attention pattern.
Training efficiency curves
Performance with and without $\mathcal{L}_\mathrm{VRA}$ evaluated every 1K steps on (a) POPE, (b) CV-Bench$^\mathrm{2D}$ and (c) MMVP. Improved early-stage performance with $\mathcal{L}_\mathrm{VRA}$ indicates that VIRAL facilitates faster convergence.
Permutation robustness bar chart
Number of correct predictions out of 788 spatial reasoning tasks in CV-Bench$^\mathrm{2D}$. Models with $\mathcal{L}_\mathrm{VRA}$ show larger performance drops under random permutation, indicating stronger sensitivity to spatial relationships.
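A rough sketch of how such a permutation check can be run (hypothetical interfaces; `model.answer` and the dataset format are assumed helpers, not part of any released API): shuffle the order of the visual tokens fed to the language model and count how many spatial questions are still answered correctly.

```python
import torch

def shuffle_visual_tokens(visual_tokens: torch.Tensor) -> torch.Tensor:
    # Randomly permute the [B, V, D] visual-token sequence along the token axis.
    perm = torch.randperm(visual_tokens.size(1))
    return visual_tokens[:, perm, :]

def count_correct(model, dataset, permute: bool = False) -> int:
    # `dataset` yields (image_tokens, question, answer); a large drop when
    # permute=True suggests the model relies on the spatial order of visual tokens.
    correct = 0
    for image_tokens, question, answer in dataset:
        tokens = shuffle_visual_tokens(image_tokens) if permute else image_tokens
        correct += int(model.answer(tokens, question) == answer)
    return correct
```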
PCA visualization comparison across VFMs

PCA visualizations of 16th-layer visual representations aligned with different VFMs: CLIP, DINOv2, SAM, DAv2, and RADIO.

Conclusion

In this work, we propose VIRAL, a simple yet effective training strategy that aligns the internal visual representations of MLLMs with those from powerful vision foundation models. Our approach preserves fine-grained visual semantics often discarded under text-only supervision, enabling more accurate spatial reasoning and object grounding. Extensive experiments across diverse benchmarks validate the effectiveness and generality of our method, demonstrating that visual representation alignment significantly enhances both performance and training efficiency in multimodal learning.

Citation

If you use this work or find it helpful, please consider citing: