MM-SeR Project Page

TL;DR

Image captioning is a critical component of assistive applications, yet MLLM-based captioners are hard to deploy because of their substantial computational cost.
A compact 125M-parameter language model achieves captioning performance comparable to much larger MLLM captioners, suggesting that factual image captioning depends little on complex LLM reasoning.
MM-SeR introduces multimodal self-refinement, where a coarse caption guides the model toward salient visual regions to produce a more reliable refined description.

Overview of MM-SeR's multimodal self-refinement process

Motivation & Scope

Examples of applications where image captioning is important

We study image captioning as a key enabler for applications that repeatedly convert images or video frames into textual scene descriptions.

Our scope is not to build a general-purpose MLLM, but to develop a compact captioning specialist that can support practical vision-language systems with lower deployment cost.

Exploring Lightweight Captioning

We ask whether image captioning truly requires the full reasoning capacity of large MLLMs, or whether a compact language model can serve as a strong captioning specialist. Replacing LLaVA-1.5's LLaMA-7B backbone with OPT-125M and evaluating on ShareGPT4V and DCI, we find that the resulting lightweight model remains highly competitive with much larger MLLM captioners.
These results suggest that the complex capabilities of LLMs are less critical for tasks that focus on enumerating factual visual details.

Detailed Captioning (ShareGPT4V & DCI)

Bar chart comparing the captioners below on three metrics: CIDEr, BERTScore, and CAPT. Bars are normalized within each metric using Table 2 values from the ShareGPT4V and DCI setting.

Captioner	Params
LLaVA-1.5	7.3B
EyesWideShut	7.6B
Cambrian	10.5B
SmallCap	450M
Tag2Text	900M
Our simple model	450M

MM-SeR: Multimodal Self-Refinement

A lightweight captioner can produce strong captions, but a single pass can still miss details or carry small visual errors. MM-SeR improves reliability by letting the model use its first caption as a guide for a second, more visually grounded refinement step.

The core idea is simple: generate a coarse caption, use it to revisit the image, and produce a refined caption.
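The two-pass flow can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names (`self_refine`, `toy_captioner`, `toy_refiner`) are hypothetical, the image is represented as a toy set of visible objects, and the toy refiner is a hand-written rule, not the SeR-Connector or any part of the authors' implementation.

```python
from typing import Callable, Set, Tuple

# Toy stand-in for an image: the set of objects actually visible in it.
Image = Set[str]


def self_refine(
    image: Image,
    captioner: Callable[[Image], str],
    refiner: Callable[[Image, str], str],
) -> Tuple[str, str]:
    """Run one coarse pass, then one refinement pass conditioned on it."""
    coarse = captioner(image)         # first pass: image only
    refined = refiner(image, coarse)  # second pass: image + coarse caption
    return coarse, refined


def toy_captioner(image: Image) -> str:
    # Hypothetical captioner that is deliberately wrong about the animal,
    # mimicking a small visual error in a single-pass caption.
    return "a dog on a sofa"


def toy_refiner(image: Image, coarse: str) -> str:
    # Hypothetical refiner: re-check the coarse caption against the image
    # and correct only the word that conflicts with it.
    words = coarse.split()
    fixed = ["cat" if w == "dog" and "cat" in image else w for w in words]
    return " ".join(fixed)


coarse, refined = self_refine({"cat", "sofa"}, toy_captioner, toy_refiner)
print(coarse)
print(refined)
```

The point of the sketch is the data flow: the refinement step receives both the image and the first caption, so it can make a localized correction instead of starting over.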

Overview of the MM-SeR multimodal self-refinement workflow

MM-SeR uses the initial caption to identify visually relevant regions and combines it with multi-layer vision features through the SeR-Connector, allowing the model to look at what matters and look in detail during refinement.
MM-SeR is, to our knowledge, the first framework to realize self-refinement within multimodal models by explicitly grounding the refinement process in visual evidence.
For refinement training, an LLM creates pseudo-initial captions \(\hat{c}_k\) by applying small entity, attribute, or relation perturbations to the ground-truth caption \(c_k\), encouraging localized correction rather than full regeneration.
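To make the shape of a refinement training pair concrete, here is a small Python sketch. It is an assumption-heavy stand-in: the page states that an LLM writes the perturbations, whereas this example swaps a single word using a hand-written table. The names `SWAPS`, `perturb_caption`, and `make_training_pair` are hypothetical.

```python
import random
from typing import Tuple

# Hypothetical swap table standing in for the LLM that writes perturbations.
SWAPS = {
    "entity": {"dog": "cat", "car": "bus"},
    "attribute": {"red": "blue", "small": "large"},
    "relation": {"on": "under", "left": "right"},
}


def perturb_caption(caption: str, kind: str, rng: random.Random) -> str:
    """Change exactly one word of the given kind, if the caption has one."""
    words = caption.split()
    candidates = [i for i, w in enumerate(words) if w in SWAPS[kind]]
    if not candidates:
        return caption  # nothing of this kind to perturb
    i = rng.choice(candidates)
    words[i] = SWAPS[kind][words[i]]
    return " ".join(words)


def make_training_pair(caption: str, rng: random.Random) -> Tuple[str, str]:
    """Return (pseudo-initial caption, ground-truth caption)."""
    kind = rng.choice(sorted(SWAPS))
    return perturb_caption(caption, kind, rng), caption


rng = random.Random(0)
pseudo, target = make_training_pair("a small red dog sits on a car", rng)
print(pseudo)
print(target)
```

Because the pseudo-initial caption differs from the ground truth in only one place, a model trained to map the first to the second is pushed toward local edits rather than rewriting the whole caption.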

Experiments

We analyze MM-SeR across diverse benchmarks and settings, covering quantitative captioning scores, qualitative behavior, downstream video question answering, and iterative refinement.

Long-Range VideoQA

Captioner	Params	Acc.	Time
LLaVA-1.5	7.3B	51.1	29m 20s
Tag2Text	900M	47.1	7m 14s
Our simple model	450M	49.3	4m 53s
+ MM-SeR	500M	50.8	5m 10s

Long-range VideoQA requires captions that preserve relevant visual details across many frames. In the LLoVi-style pipeline, frame-level captions are aggregated and passed to Qwen2.5-14B for answer generation, while we replace only the captioner to isolate the effect of caption quality.
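The caption-then-answer pipeline described above can be sketched as follows. This is a rough illustration, not the LLoVi code or the authors' setup: `answer_video_question`, the prompt wording, and the stub callables are all hypothetical, and a real run would plug in an actual captioner and an LLM such as Qwen2.5-14B.

```python
from typing import Callable, Sequence


def answer_video_question(
    frames: Sequence[str],
    question: str,
    captioner: Callable[[str], str],
    llm: Callable[[str], str],
) -> str:
    """Caption every frame, aggregate the captions, and query the LLM once."""
    captions = [captioner(frame) for frame in frames]
    context = "\n".join(f"[frame {i}] {c}" for i, c in enumerate(captions))
    # Hypothetical prompt format; only the captioner changes between runs.
    prompt = f"Frame captions:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)


def stub_captioner(frame: str) -> str:
    # Stand-in for any captioner under comparison.
    return f"caption of {frame}"


def echo_llm(prompt: str) -> str:
    # Stand-in LLM that returns its prompt, so the aggregated input is visible.
    return prompt


result = answer_video_question(["f0", "f1"], "What happens?", stub_captioner, echo_llm)
print(result)
```

Keeping the LLM and the prompt fixed while passing a different `captioner` is what isolates the effect of caption quality on the final answer.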

With MM-SeR, our lightweight specialist reaches an accuracy of 50.8, close to the 51.1 of the LLaVA-1.5 pipeline, with a reported time of 5m 10s instead of 29m 20s.

Iterative Refinement

ShareGPT4V & DCI

Bar chart of CAPT and GPT-eval scores for captioning specialists trained with OPT-125M and OPT-1.3B, shown for the initial caption and after one, two, and three refinement steps (Ref. x1, Ref. x2, Ref. x3).

We test iterative self-refinement by feeding the refined caption back into MM-SeR for up to three refinement steps, while keeping the vision encoder fixed and varying the LM used to train the captioning specialist.
The captioning specialist trained with OPT-125M shows little benefit from multi-step refinement beyond the first pass, whereas the specialist trained with OPT-1.3B improves with additional refinement iterations.
This trend aligns with recent findings in LLM-based self-refinement, where larger models tend to benefit more from repeated refinement; future work could explore scalable strategies such as dynamically adjusting the iteration count.
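The iterative setting amounts to a short loop in which each refined caption becomes the input to the next step. The sketch below assumes a hypothetical `refiner(image, caption)` callable and a fixed step count of three, matching the setup described above. The toy refiner is only there to make the loop observable.

```python
from typing import Callable, List


def iterative_refine(
    image: str,
    initial_caption: str,
    refiner: Callable[[str, str], str],
    steps: int = 3,
) -> List[str]:
    """Return the initial caption followed by the caption after each step."""
    captions = [initial_caption]
    for _ in range(steps):
        # Feed the latest caption back in together with the same image.
        captions.append(refiner(image, captions[-1]))
    return captions


def toy_refiner(image: str, caption: str) -> str:
    # Hypothetical refiner that appends a marker, so each step is visible.
    return caption + "+"


history = iterative_refine("image-0", "a cat", toy_refiner)
print(history)  # ['a cat', 'a cat+', 'a cat++', 'a cat+++']
```

Returning the full history makes it easy to score the initial caption and each of Ref. x1 to Ref. x3 separately.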

Citation

@inproceedings{song2026mmser,
  title     = {MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning},
  author    = {Song, Junha and Jo, Yongsik and Min, So Yeon and Xie, Quanting and Kim, Taehwan and Bisk, Yonatan and Choo, Jaegul},
  booktitle = {CVPR},
  year      = {2026}
}