Video Understanding
AURA: Always-On Understanding and Real-Time Assistance via Video Streams
Paper
• 2604.04184
• Published • 50
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
Paper
• 2403.09626
• Published • 15
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
Paper
• 2506.01300
• Published
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Paper
• 2503.21782
• Published
VideoNSA: Native Sparse Attention Scales Video Understanding
Paper
• 2510.02295
• Published • 10
VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
Paper
• 2603.22285
• Published • 49
AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
Paper
• 2603.28696
• Published • 6
Dense Video Understanding with Gated Residual Tokenization
Paper
• 2509.14199
• Published • 3
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
Paper
• 2512.04025
• Published • 4
EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use
Paper
• 2602.15329
• Published • 1
Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory
Paper
• 2602.18434
• Published
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
Paper
• 2501.13468
• Published
Video Inference for Human Mesh Recovery with Vision Transformer
Paper
• 2507.08981
• Published • 1
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Paper
• 2501.12375
• Published • 23
V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval
Paper
• 2512.12284
• Published
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Paper
• 2505.21334
• Published • 21
Accurate and Fast Compressed Video Captioning
Paper
• 2309.12867
• Published
Unleashing Hour-Scale Video Training for Long Video-Language Understanding
Paper
• 2506.05332
• Published • 3
Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning
Paper
• 2601.21037
• Published • 15
Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
Paper
• 2603.02872
• Published • 1
Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
Paper
• 2510.14624
• Published • 2
Rethinking Chain-of-Thought Reasoning for Videos
Paper
• 2512.09616
• Published • 19
Inference Compute-Optimal Video Vision Language Models
Paper
• 2505.18855
• Published
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Paper
• 2402.03161
• Published • 16
Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
Paper
• 2510.15440
• Published
When Thinking Drifts: Evidential Grounding for Robust Video Reasoning
Paper
• 2510.06077
• Published
VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
Paper
• 2412.02186
• Published • 23
VIDEOP2R: Video Understanding from Perception to Reasoning
Paper
• 2511.11113
• Published • 112
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
Paper
• 2505.16175
• Published • 42
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding
Paper
• 2507.15028
• Published • 21
Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding
Paper
• 2508.20478
• Published • 18
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
Paper
• 2505.14321
• Published • 11
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
Paper
• 2512.05774
• Published • 7
LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning
Paper
• 2509.24786
• Published • 7
Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method
Paper
• 2501.00584
• Published
DATE: Dynamic Absolute Time Enhancement for Long Video Understanding
Paper
• 2509.09263
• Published
Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos
Paper
• 2408.14023
• Published
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
Paper
• 2604.05015
• Published • 234
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Paper
• 2501.00599
• Published • 46
Vidi2: Large Multimodal Models for Video Understanding and Creation
Paper
• 2511.19529
• Published • 2
Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning
Paper
• 2602.01649
• Published
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper
• 2503.04130
• Published • 96
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Paper
• 2406.04325
• Published • 74
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper
• 2412.10360
• Published • 147
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
Paper
• 2508.21496
• Published • 55
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
Paper
• 2406.14515
• Published • 33
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Paper
• 2501.08326
• Published • 34
Vidi: Large Multimodal Models for Video Understanding and Editing
Paper
• 2504.15681
• Published • 14
SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation
Paper
• 2505.08665
• Published • 5
Visual Context Window Extension: A New Perspective for Long Video Understanding
Paper
• 2409.20018
• Published • 11
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Paper
• 2411.02327
• Published • 11
VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro
Paper
• 2504.09282
• Published
VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
Paper
• 2603.14659
• Published • 6
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
Paper
• 2509.21113
• Published • 6
Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
Paper
• 2511.23478
• Published
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Paper
• 2510.20579
• Published • 56
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Paper
• 2501.05510
• Published • 44
Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
Paper
• 2512.00891
• Published • 16
A Simple Baseline for Streaming Video Understanding
Paper
• 2604.02317
• Published • 72
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Paper
• 2503.00540
• Published • 3
StreamChat: Chatting with Streaming Video
Paper
• 2412.08646
• Published • 18
Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models
Paper
• 2603.11896
• Published • 10
CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management
Paper
• 2603.19571
• Published • 2
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
Paper
• 2411.03628
• Published • 2
VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
Paper
• 2512.22226
• Published
Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
Paper
• 2510.17364
• Published
Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding
Paper
• 2504.13915
• Published
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper
• 2406.11816
• Published • 26
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
Paper
• 2408.16730
• Published
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
Paper
• 2504.17343
• Published • 13
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
Paper
• 2601.14724
• Published • 75
Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video
Paper
• 2510.14560
• Published