MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning

CVPR 2026 Main

¹KAIST    ²UNIST    ³Carnegie Mellon University

TL;DR

  1. Image captioning is a critical component for assistive applications, yet MLLM-based captioners face deployment constraints due to substantial computational cost.
  2. A compact 125M-parameter language model achieves captioning performance comparable to much larger MLLM captioners, suggesting that factual image captioning does not depend heavily on complex LLM reasoning.
  3. MM-SeR introduces multimodal self-refinement, where a coarse caption guides the model toward salient visual regions to produce a more reliable refined description.

Figure: Overview of MM-SeR's multimodal self-refinement process

Motivation & Scope

Figure: Examples of applications where image captioning is important

We study image captioning as a key enabler for applications that repeatedly convert images or video frames into textual scene descriptions.

Our goal is not to build a general-purpose MLLM, but to develop a compact captioning specialist that can support practical vision-language systems at lower deployment cost.

Exploring Lightweight Captioning

  • We ask whether image captioning truly requires the full reasoning capacity of large MLLMs, or whether a compact language model can serve as a strong captioning specialist. Replacing LLaVA-1.5's LLaMA-7B backbone with OPT-125M and evaluating on ShareGPT4V and DCI, we find that the resulting lightweight model remains highly competitive with much larger MLLM captioners.
  • These results suggest that the complex capabilities of LLMs are less critical for tasks that focus on enumerating factual visual details.

Figure: Detailed captioning on ShareGPT4V and DCI (metrics: CIDEr, BERT Score, CAPT)

Bars are normalized within each metric using Table 2 values from the ShareGPT4V and DCI setting.
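
The page does not say how the bars are normalized, so the following is only a sketch under one common assumption: each metric is divided by the best score any model reaches on it. The function name, model names, metric names, and numbers are all hypothetical placeholders and are not taken from Table 2.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # model name -> metric name -> raw score


def normalize_per_metric(scores: Scores) -> Scores:
    """Scale every metric so that its best-scoring model maps to 1.0.

    Assumption: max-normalization. The page may use a different scheme.
    """
    metrics = {m for per_model in scores.values() for m in per_model}
    normalized: Scores = {model: {} for model in scores}
    for metric in metrics:
        best = max(pm[metric] for pm in scores.values() if metric in pm)
        for model, pm in scores.items():
            if metric in pm:
                # Guard against a metric whose best score is zero.
                normalized[model][metric] = pm[metric] / best if best else 0.0
    return normalized


# Placeholder numbers for illustration only; NOT values from the paper.
example: Scores = {
    "model_a": {"metric_x": 50.0, "metric_y": 0.8},
    "model_b": {"metric_x": 25.0, "metric_y": 0.4},
}
normalized = normalize_per_metric(example)
print(normalized)
```

Normalizing within each metric puts scores with very different ranges on a shared bar height, at the cost of hiding the absolute values.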

MM-SeR: Multimodal Self-Refinement

A lightweight captioner can produce strong captions, but a single pass can still miss details or carry small visual errors. MM-SeR improves reliability by letting the model use its first caption as a guide for a second, more visually grounded refinement step.

The core idea is simple: generate a coarse caption, use it to revisit the image, and produce a refined caption.
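
The control flow of this idea can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the `Captioner` interface, `self_refine`, and `toy_captioner` are hypothetical names, and the toy model only stands in for a real captioner so the loop can run.

```python
from typing import Callable, List, Optional

# Hypothetical interface (not the authors' API): a captioner receives an image
# and an optional prior caption, and returns a caption string.
Captioner = Callable[[object, Optional[str]], str]


def self_refine(image: object, captioner: Captioner, num_rounds: int = 1) -> List[str]:
    """Return the coarse caption followed by each refined caption."""
    if num_rounds < 0:
        raise ValueError("num_rounds must be non-negative")
    # Pass 1: coarse caption from the image alone.
    captions = [captioner(image, None)]
    for _ in range(num_rounds):
        # Later passes: revisit the image, conditioned on the previous caption.
        captions.append(captioner(image, captions[-1]))
    return captions


def toy_captioner(image: object, prior: Optional[str]) -> str:
    """Stand-in model used only to make the sketch runnable."""
    if prior is None:
        return "a dog on grass"
    return prior + " (refined)"


history = self_refine("image.jpg", toy_captioner, num_rounds=2)
print(history)
```

Setting `num_rounds=1` gives the single coarse-then-refined pass described above; larger values correspond to repeating the refinement step.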

Figure: Overview of the MM-SeR multimodal self-refinement workflow

  • MM-SeR uses the initial caption to identify visually relevant regions and combines it with multi-layer vision features through the SeR-Connector, allowing the model to look at what matters and look in detail during refinement.
  • MM-SeR is, to our knowledge, the first framework to realize self-refinement within multimodal models by explicitly grounding the refinement process in visual evidence.
  • For refinement training, an LLM creates pseudo-initial captions \(\hat{c}_k\) by applying small entity, attribute, or relation perturbations to the ground-truth caption \(c_k\), encouraging localized correction rather than full regeneration.
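
The construction of refinement training pairs \((\hat{c}_k, c_k)\) can be illustrated with a toy example. The page states that an LLM produces the perturbations; the sketch below swaps in a fixed lookup table instead, purely so it runs without a model. `SWAPS`, `perturb_caption`, and `build_refinement_pairs` are hypothetical names.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical swap table standing in for the LLM that picks the perturbation.
SWAPS: Dict[str, str] = {
    "dog": "cat",            # entity perturbation
    "red": "blue",           # attribute perturbation
    "left of": "right of",   # relation perturbation
}


def perturb_caption(caption: str, swaps: Dict[str, str] = SWAPS) -> str:
    """Apply the first matching swap once, leaving the rest of the caption intact."""
    for source, target in swaps.items():
        pattern = r"\b" + re.escape(source) + r"\b"
        perturbed, count = re.subn(pattern, target, caption, count=1)
        if count:
            return perturbed
    return caption  # nothing matched, so the caption is returned unchanged


def build_refinement_pairs(captions: List[str]) -> List[Tuple[str, str]]:
    """Pair each pseudo-initial caption with its ground-truth target."""
    return [(perturb_caption(c), c) for c in captions]


pairs = build_refinement_pairs(["a red ball left of a dog", "a red car"])
for pseudo_initial, ground_truth in pairs:
    print(pseudo_initial, "->", ground_truth)
```

Because only one small span differs between the two captions in a pair, a model trained on such pairs is pushed to correct that span rather than rewrite the whole caption.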

Experiments

We analyze MM-SeR across diverse benchmarks and settings, covering quantitative captioning scores, qualitative behavior, downstream video question answering, and iterative refinement.

  1. Detailed Captioning
  2. Qualitative Results
  3. Long-Range VideoQA
  4. Iterative Refinement

Citation

@inproceedings{song2026mmser,
  title     = {MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning},
  author    = {Song, Junha and Jo, Yongsik and Min, So Yeon and Xie, Quanting and Kim, Taehwan and Bisk, Yonatan and Choo, Jaegul},
  booktitle = {CVPR},
  year      = {2026}
}