Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, and Limin Wang
🤗 Model & Data | 🖥️ Demo | 📑 Paper | 🌐 Blog
- 2025/02/25: 🔥🔥🔥 We release our training data, training code based on LLaVA for VideoChat-Flash, and training code based on XTuner for fine-tuning InternVideo2.5.
- 2025/02/12: 🎉🎉🎉 Our VideoChat-Flash-7B@448 has achieved first place on the latest video detail caption benchmark, AuroraCap.
- 2025/01/15: We provide evaluation code for the QA & grounding benchmarks.
- 2025/01/12: 🔥🔥🔥 Release of VideoChat2-Flash, a powerful MLLM built on a video encoder (InternVideo) and an LLM (Qwen).
  - We offer five models: VideoChat2-Flash-2B@224 (small LLM), VideoChat2-Flash-7B@224, VideoChat2-Flash-7B@448 (overall best), VideoChat-Flash-Qwen2_5-7B-1M (super-long video input), and VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B (stronger short-term temporal understanding).
- Dataset and evaluation code for single-hop and multi-hop needle-in-a-haystack.
- 🚀 State-of-the-art performance in short and long video understanding, with temporal localization capabilities comparable to expert models.
- 🔭 Supports ultra-long video inputs, achieving a groundbreaking needle-in-a-haystack evaluation accuracy of 99.1% on 10,000 frames, and capable of processing videos up to three hours long.
- ⚡ Highly efficient model architecture with exceptional inference speed, encoding each video frame into just 16 tokens, making it 5–10 times faster than the previous model.
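As a rough back-of-the-envelope illustration of why this per-frame token budget enables ultra-long inputs (using only the numbers quoted above, and ignoring text-prompt and separator tokens):

```python
# Back-of-the-envelope token budget, using the figures quoted above.
tokens_per_frame = 16          # each frame is compressed to 16 tokens
frames = 10_000                # needle-in-a-haystack setting above
video_tokens = tokens_per_frame * frames
print(video_tokens)            # 160,000 tokens -- well within a 1M-token context window
```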
Refer to the hf README for instructions on running inference with our model.
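Below is a minimal inference sketch, assuming the Hugging Face checkpoint is loaded with `trust_remote_code=True` and exposes a chat-style helper; the model ID, method name, and arguments shown here are illustrative, so consult the hf README for the actual interface.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model ID; check the hf README / model card for the exact repo name.
model_path = "OpenGVLab/VideoChat-Flash-Qwen2_5-7B_res448"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()

# Hypothetical chat-style call; the real method name and signature are defined
# by the remote code shipped with the checkpoint -- see the hf README.
# response = model.chat(tokenizer, "video.mp4", "Describe this video in detail.")
# print(response)
```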
See the evaluation code.
See the training code based on LLaVA for VideoChat-Flash and the training code based on XTuner for fine-tuning InternVideo2.5.
📊 NIAH
If you find this project useful in your research, please consider citing:
@article{li2024videochat,
title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and Qiao, Yu and Wang, Yali and Wang, Limin},
journal={arXiv preprint arXiv:2501.00574},
year={2024}
}
Thanks to the following open-source projects: InternVideo, UMT, Qwen, LLaVA-VL, lmms-eval, Ask-Anything, ToMe, LongVLM, FastV, LLaVolta, PyramidDrop, and LongVA; their implementations provided valuable references for our project.