Light-Omni

Reflex over Reasoning in Agentic Video Understanding with Long-Term Memory

Chang Nie¹ Jiaju Wei¹ Junlan Feng²* Chaoyou Fu¹ Caifeng Shan¹*

¹Nanjing University ²China Mobile Research

* Corresponding authors

Introduction

Agentic video understanding equips multimodal models with long-term memory to process continuous, long-horizon streams. Existing video agents often rely on detective-style iterative reasoning for search control and evidence aggregation, which introduces high latency and computational cost.

Light-Omni reframes this process as reflexive video understanding. Instead of repeatedly planning and searching, it maintains dual contextual states that provide global context and generate semantically aligned retrieval embeddings in a single forward pass.

Comparison between iterative video agents and Light-Omni — Light-Omni replaces costly iterative reasoning with global context and semantically aligned retrieval for low-latency video understanding.

The Light-Omni Framework

Light-Omni builds a multimodal long-term memory system with identity profiles, semantic memory, and episodic memory. Sleep-time memory consolidation constructs a compact global state from episodic memory, preserving recent details while summarizing long-range observations.
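As a rough illustration of the consolidation idea described above, the sketch below keeps the most recent episodic entries verbatim and collapses older ones into chunk-level summaries. It is not the authors' implementation: the class names, the `keep_recent` and `chunk_size` parameters, and the `naive_summarize` placeholder are all assumptions made for this example, and a real system would presumably summarize with a model rather than string formatting.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Episode:
    timestamp: float  # seconds from the start of the stream
    text: str         # caption-like description of one observation


@dataclass
class LongTermMemory:
    # Three stores named after the components listed in the text.
    identity_profiles: Dict[str, str] = field(default_factory=dict)
    semantic: List[str] = field(default_factory=list)
    episodic: List[Episode] = field(default_factory=list)


def naive_summarize(episodes: List[Episode]) -> str:
    """Hypothetical placeholder: compress a chunk into one line."""
    start, end = episodes[0].timestamp, episodes[-1].timestamp
    return f"[{start:.0f}s-{end:.0f}s] {len(episodes)} events, first: {episodes[0].text}"


def consolidate(
    memory: LongTermMemory,
    keep_recent: int = 3,   # assumed: number of recent episodes kept verbatim
    chunk_size: int = 4,    # assumed: older episodes summarized per chunk
    summarize: Callable[[List[Episode]], str] = naive_summarize,
) -> List[str]:
    """Build a compact global state: summaries of old chunks, then recent details."""
    episodes = sorted(memory.episodic, key=lambda e: e.timestamp)
    cut = max(len(episodes) - keep_recent, 0)
    older, recent = episodes[:cut], episodes[cut:]
    state = [
        summarize(older[i:i + chunk_size])
        for i in range(0, len(older), chunk_size)
    ]
    state += [f"[{e.timestamp:.0f}s] {e.text}" for e in recent]
    return state


# Toy usage: ten observations, one per second.
demo = LongTermMemory(episodic=[Episode(float(i), f"event {i}") for i in range(10)])
global_state = consolidate(demo)
```

With these toy settings, ten episodes shrink to five state entries: two summaries covering the first seven observations, followed by the three most recent observations unchanged.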

Conditioned on this global state, Light-Omni derives a latent state that directly controls autonomous actions and produces retrieval embeddings. This coupled design narrows the semantic gap between user queries and memory distributions without explicit query rewriting or multi-step reasoning.
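To make the contrast with multi-step search concrete, the sketch below shows retrieval as a single similarity lookup once a retrieval embedding is available. In Light-Omni that embedding is described as coming from the model's latent state; here it is simply a hand-written toy vector. The use of cosine similarity, the `retrieve` function, and the example memory entries are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_embedding: Sequence[float],
    memory: List[Tuple[str, Sequence[float]]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """One-shot lookup: score every memory entry once and keep the best top_k."""
    scored = [(text, cosine(query_embedding, emb)) for text, emb in memory]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy memory with hand-written 3-d embeddings (hypothetical values).
toy_memory = [
    ("Alice puts keys on the shelf", [0.9, 0.1, 0.0]),
    ("Bob cooks pasta", [0.0, 1.0, 0.0]),
    ("Alice leaves the house", [0.7, 0.0, 0.7]),
]

# Stand-in for an embedding produced in a single forward pass.
hits = retrieve([1.0, 0.0, 0.0], toy_memory, top_k=2)
```

The point of the sketch is the control flow: no query rewriting and no loop of plan, search, and re-plan, just one embedding and one ranked lookup. How well this works in practice depends on how closely the embedding space matches the memory distribution, which is the gap the coupled design aims to narrow.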

Light-Omni framework — Dual contextual states enable global video context, reflexive action control, and aligned evidence retrieval.

Results

+2.4% average accuracy over M3-Agent

12.1x speedup over M3-Agent

2.6x GPU memory efficiency improvement

Light-Omni performance results — Light-Omni achieves strong long-video performance while maintaining near-constant latency as video duration increases.

Citation

Please cite Light-Omni if this project is useful for your research.

@inproceedings{nie2026lightomni,
  title={Light-Omni: Reflex over Reasoning in Agentic Video Understanding with Long-Term Memory},
  author={Nie, Chang and Wei, Jiaju and Feng, Junlan and Fu, Chaoyou and Shan, Caifeng},
  year={2026},
  url={http://arxiv.org/abs/xxxx.xxxx}
}