Detailed Captioning (ShareGPT4V & DCI)
MM-SeR consistently improves CIDEr, CAPT, CLAIR, and GPT evals, demonstrating the effectiveness of our framework in improving caption quality.
We study image captioning as a key enabler for applications that repeatedly convert images or video frames into textual scene descriptions.
Our goal is not to build a general-purpose MLLM but to develop a compact captioning specialist that can support practical vision-language systems at lower deployment cost.
Bars are normalized within each metric using Table 2 values from the ShareGPT4V and DCI setting.
A lightweight captioner can produce strong captions, but a single pass can still miss details or carry small visual errors. MM-SeR improves reliability by letting the model use its first caption as a guide for a second, more visually grounded refinement step.
The core idea is simple: generate a coarse caption, use it to revisit the image, and produce a refined caption, allowing the model to look at what matters and look in detail during refinement.
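The coarse-then-refine loop described above can be sketched in a few lines. This is only an illustration of the control flow, not the MM-SeR implementation: `self_refine`, `draft_fn`, and `refine_fn` are hypothetical names, and the two callables stand in for whatever captioning model produces the first caption and the refined one.

```python
from typing import Any, Callable, List

# Hypothetical interfaces (assumptions, not the paper's API):
#   draft_fn(image) -> coarse caption
#   refine_fn(image, previous_caption) -> refined caption
DraftFn = Callable[[Any], str]
RefineFn = Callable[[Any, str], str]


def self_refine(image: Any, draft_fn: DraftFn, refine_fn: RefineFn,
                rounds: int = 1) -> List[str]:
    """Return every caption produced, from the coarse draft to the last refinement."""
    # First pass: a coarse caption from the image alone.
    history = [draft_fn(image)]
    # Later passes: revisit the image with the previous caption as a guide.
    for _ in range(rounds):
        history.append(refine_fn(image, history[-1]))
    return history
```

With `rounds=1` this is the two-pass scheme; larger values correspond to the iterative refinement setting. Returning the full history rather than only the final caption makes it easy to compare the draft against its refinement.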
We analyze MM-SeR across diverse benchmarks and settings, covering quantitative captioning scores, qualitative behavior, downstream video question answering, and iterative refinement.
| Captioner | Params | Acc. | Time |
|---|---|---|---|
| LLaVA-1.5 | 7.3B | 51.1 | 29m 20s |
| Tag2Text | 900M | 47.1 | 7m 14s |
| Our simple model | 450M | 49.3 | 4m 53s |
| + MM-SeR | 500M | 50.8 | 5m 10s |
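Reading the rows as ratios makes the trade-off easier to see. The sketch below only does arithmetic on the values in the table above; the helper names `to_seconds` and `compare` are ours, and the Time column is treated as a plain duration without assuming what exactly it measures.

```python
def to_seconds(text: str) -> int:
    """Parse a duration written like '29m 20s' into seconds."""
    minutes, seconds = text.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


# Rows copied from the table: (captioner, params in billions, accuracy, time).
ROWS = [
    ("LLaVA-1.5", 7.3, 51.1, "29m 20s"),
    ("Tag2Text", 0.9, 47.1, "7m 14s"),
    ("Our simple model", 0.45, 49.3, "4m 53s"),
    ("+ MM-SeR", 0.5, 50.8, "5m 10s"),
]


def compare(name: str, baseline: str) -> dict:
    """Compare one row against a baseline row."""
    rows = {row[0]: row for row in ROWS}
    _, params, acc, time = rows[name]
    _, base_params, base_acc, base_time = rows[baseline]
    return {
        "param_ratio": base_params / params,                 # baseline size / model size
        "time_ratio": to_seconds(base_time) / to_seconds(time),  # baseline time / model time
        "acc_diff": round(acc - base_acc, 1),                # model accuracy - baseline accuracy
    }
```

For example, `compare("+ MM-SeR", "LLaVA-1.5")` gives a parameter ratio of 14.6, a time ratio of about 5.7, and an accuracy difference of -0.3, while `compare("+ MM-SeR", "Our simple model")` shows the refinement step adding 1.5 accuracy points for 17 extra seconds.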
Long-range VideoQA requires captions that preserve relevant visual details across many frames. In the LLoVi-style pipeline, frame-level captions are aggregated and passed to Qwen2.5-14B for answer generation, while we replace only the captioner to isolate the effect of caption quality.
With MM-SeR, our lightweight specialist (500M parameters) reaches 50.8 accuracy in 5m 10s, comparable to the 51.1 obtained in 29m 20s by the pipeline that uses LLaVA-1.5 (7.3B) as its captioner.
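The caption-then-answer pipeline can be sketched as follows, to show why swapping the captioner isolates caption quality. This is a minimal illustration under our own assumptions: `answer_from_captions`, `caption_fn`, and `llm_fn` are hypothetical names, and the prompt layout is invented for the example rather than taken from LLoVi or the paper.

```python
from typing import Any, Callable, Sequence


def answer_from_captions(frames: Sequence[Any],
                         caption_fn: Callable[[Any], str],
                         llm_fn: Callable[[str], str],
                         question: str) -> str:
    """Caption each frame, aggregate the captions, and let a text-only LLM answer."""
    # Step 1: frame-level captions. This is the only component that changes
    # when comparing captioners.
    captions = [caption_fn(frame) for frame in frames]
    # Step 2: aggregate the captions into one text context (assumed format).
    context = "\n".join(f"[frame {i}] {c}" for i, c in enumerate(captions))
    # Step 3: the LLM sees only text, so answer quality depends on the captions.
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return llm_fn(prompt)
```

Because `llm_fn` and the prompt stay fixed, any change in accuracy between two runs comes from `caption_fn` alone.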
@inproceedings{song2026mmser,
title = {MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning},
author = {Song, Junha and Jo, Yongsik and Min, So Yeon and Xie, Quanting and Kim, Taehwan and Bisk, Yonatan and Choo, Jaegul},
booktitle = {CVPR},
year = {2026}
}