TL;DR

Does the MLLM training method, like SFT versus RL, actually have an impact on the vision encoder's representations?

Yes—and the training recipe matters! Our research shows that training with RL (e.g., DPO) produces stronger, more precisely localized visual representations than SFT. This translates to superior performance not only on MLLM benchmarks (especially strongly vision-related VQA tasks) but also on classic vision tasks like ImageNet classification and segmentation. If your goal is to improve the vision encoder itself for MLLM development, DPO is the more effective path.


Research Motivation

  • The dominant MLLM research paradigm has focused primarily on the LLM backbone or the MLLM as a whole, leaving the vision encoder under-analyzed.
  • This oversight impedes a deeper understanding of how modern MLLM training strategies, such as Supervised Finetuning (SFT) and Reinforcement Learning (RL), impact the model.

How do SFT and RL affect MLLMs?

The impact of MLLM training strategies is first investigated on common VQA benchmarks. This contrasts with previous SFT-versus-RL studies, which performed their comparisons in specialized environments such as card games or robot action planning. Our analysis aims to answer: How do SFT and DPO affect MLLMs on diverse VQA tasks? Is DPO actually superior to SFT? Does this trend hold with model scaling?

Scaling the vision encoder in MLLMs: performance is reported as the SigLIP2 vision model scales from 86M to 1B parameters with a fixed Qwen2.5-3B language model.
Scaling the language model in MLLMs: performance is reported as the Qwen2.5 language model scales from 0.5B to 7B parameters with a fixed SigLIP2-So/16 vision encoder.

Findings:

  • Performance improves with the size of the vision encoder, underscoring the importance of visual representation capacity within MLLMs, even though the LM size is also a critical factor.
  • DPO achieves superior performance compared to SFT, particularly on strongly vision-related tasks, motivating an in-depth analysis of how these learning strategies impact the visual representation in MLLMs (a minimal sketch of the DPO objective follows below).
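
For readers unfamiliar with DPO, here is a minimal sketch of the preference objective behind this comparison, written in plain PyTorch. The function and variable names, and the default beta of 0.1, are illustrative assumptions rather than the paper's training code.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each *_logps tensor holds the summed log-probability of a full response
    (chosen or rejected) under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much more the policy prefers each response
    # than the frozen reference model does.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # Encourage a positive margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

When the vision encoder is left unfrozen during preference training, the gradient of this loss flows through the language model into the visual representations as well, which is exactly the signal examined in the next section.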

How does MLLM training affect visual representations?

Now, the focus shifts to an in-depth analysis of the vision encoder within MLLMs. As illustrated in our experimental setup, this is achieved by isolating the encoder and evaluating its standalone performance on classic vision tasks.
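
As a rough illustration of this isolation step, the sketch below fits a linear classification probe on frozen features from an encoder taken out of a trained MLLM. The pooling choice, optimizer settings, and ImageNet-style data loader are assumptions for illustration, not the paper's exact evaluation protocol.

import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(vision_encoder: nn.Module, train_loader, feat_dim: int,
                 num_classes: int, lr: float = 1e-3, device: str = "cuda") -> nn.Linear:
    """Fit a linear classifier on frozen features from an MLLM's vision encoder."""
    vision_encoder.eval().to(device)
    for p in vision_encoder.parameters():
        p.requires_grad_(False)                         # the encoder stays frozen

    probe = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.AdamW(probe.parameters(), lr=lr)

    for images, labels in train_loader:                 # e.g., ImageNet-1k batches
        images, labels = images.to(device), labels.to(device)
        with torch.no_grad():
            feats = vision_encoder(images).mean(dim=1)  # average-pool patch tokens
        loss = F.cross_entropy(probe(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return probe

A similar frozen-encoder setup, with a segmentation head in place of the linear classifier, covers the localization-oriented probes. The questions we study are: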

1. Does MLLM training really reshape visual representations?
2. Do DPO and SFT provide distinct gradient signals to the vision encoder?
3. Do these distinct gradient signals translate to differences in the vision encoder's localization capabilities?
4. What is the impact of MLLM training on vision & language alignment?

Findings:

  • MLLM training not only adapts the language model but also reshapes the visual representations.
  • DPO steers the vision encoder toward a more fine-grained analysis of visual information, improving its object localization capabilities.
  • The vision encoder benefits from a larger LLM, which provides more informative backward signals to the visual representations within an MLLM.

What’s next: Unlocking vision model potential via RL

Building on our finding that DPO benefits visual representation learning in MLLMs, we reframe this process as a simple recipe for evolving vision models for MLLMs: Preference-Instructed Vision Optimization (PIVOT). Here, we assess the effectiveness of PIVOT-enhanced representations within MLLMs, following prior evaluation protocols such as Cambrian and Web-SSL. Note that PIVOT is not proposed as a new method, but rather as an underexplored training regime that enables the development of better MLLMs than those built on the original vision encoders.
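
In practice, the recipe amounts to preference training of the full MLLM with the vision tower left trainable, then reusing the evolved encoder on its own. The outline below is a sketch of this regime under stated assumptions (a preference loader yielding images with chosen/rejected responses, a hypothetical logps helper, and the dpo_loss function sketched earlier); it is not a reference implementation.

import torch

def pivot(mllm, ref_mllm, preference_loader, optimizer, beta: float = 0.1):
    """PIVOT as a training regime: DPO over the full MLLM with the vision
    encoder left trainable, then hand back the evolved encoder."""
    for batch in preference_loader:
        # Response log-probabilities under the trainable policy MLLM
        # (logps is a hypothetical helper returning summed token log-probs).
        pol_c = mllm.logps(batch.images, batch.chosen)
        pol_r = mllm.logps(batch.images, batch.rejected)
        # ...and under the frozen reference MLLM.
        with torch.no_grad():
            ref_c = ref_mllm.logps(batch.images, batch.chosen)
            ref_r = ref_mllm.logps(batch.images, batch.rejected)

        # dpo_loss: the preference objective sketched earlier on this page.
        loss = dpo_loss(pol_c, pol_r, ref_c, ref_r, beta=beta)
        optimizer.zero_grad()
        loss.backward()            # gradients also reach the vision encoder
        optimizer.step()

    # Reuse only the vision tower in a new MLLM or on a standalone vision task.
    return mllm.vision_encoder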


The results reveal a remarkable impact of PIVOT when the enhanced encoders are used within MLLMs; a vision model trained with PIVOT not only outperforms its original counterpart but also surpasses a substantially larger model (e.g., SigLIP2-So/16+PIVOT > SigLIP2-g/16) and even a subsequent-generation encoder (e.g., SigLIP1-So/14+PIVOT > SigLIP2-So/16).

Notably, this enhancement is achieved with just 18 hours of training on 8 H100 GPUs, fewer than 1% of the accelerators used in standard vision pre-training (SigLIP2, for instance, was trained on up to 2K TPUv5e chips).

When comparing vision encoders trained with different strategies within MLLMs, we find that a PIVOT-enhanced vision encoder provides a greater advantage in MLLM applications than one trained with SFT.


PIVOT also benefits a diverse set of vision encoders, including CLIP, MAE, DINOv2, and ImageNet-SupViT. Interestingly, this improvement holds not only for vision-only self-supervised models such as MAE and MoCo, but also for vision–language supervised models like CLIP.

Finding:

  • Existing vision models possess substantial potential for improvement within MLLMs, which can be unlocked by PIVOT.

Citation

@article{song2025rlseebetter,
  title   = {RL makes MLLMs see better than SFT},
  author  = {Junha Song and Sangdoo Yun and Dongyoon Han and Jaegul Choo and Byeongho Heo},
  journal = {arXiv preprint arXiv:2510.16333},
  year    = {2025}
}

Correspondence

Please reach out to Junha Song with any questions.

sb020518@kaist.ac.kr

Acknowledgements

Special thanks to the NAVER AI teams for their generous support.

This project page was developed with reference to the awesome project Web-SSL. We sincerely appreciate the creators for their inspiring work.