# Flash-VStream Model Card

## Model details

We propose Flash-VStream, a video-language model that simulates the human memory mechanism. The model can process extremely long video streams in real time and respond to user queries simultaneously.
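
As a minimal sketch, the checkpoint can be fetched from the Hugging Face Hub with `huggingface_hub`; the actual inference pipeline (frame encoding, memory update, and answer generation) lives in the Flash-VStream repository and is not reproduced here. The local directory path below is an illustrative assumption, not part of the official documentation.

```python
# Minimal sketch: download the Flash-VStream-7b checkpoint from the Hugging Face Hub.
# The inference code itself (frame encoding, memory mechanism, answer generation)
# is provided by the Flash-VStream repository; this only fetches the weights.
from huggingface_hub import snapshot_download

# "./flash-vstream-7b" is an illustrative local path, chosen for this example.
local_dir = snapshot_download(
    repo_id="IVGSZ/Flash-VStream-7b",
    local_dir="./flash-vstream-7b",
)
print(f"Checkpoint downloaded to {local_dir}")
```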

## Training data

This model is trained on image data from the LLaVA-1.5 dataset and on video data from the WebVid and ActivityNet datasets, following LLaMA-VID. The training data includes:

- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.
- 232K video-caption pairs sampled from the WebVid 2.5M dataset.
- 98K videos from ActivityNet with QA pairs from Video-ChatGPT.

## License

This project is licensed under the LLAMA 2 License.
