xjtupanda
/

Idefics3-200K-video-finetune

Video-Text-to-Text

image-text-to-text

Inference Endpoints

Model card Files Files and versions Community

T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

💻 GitHub | 📑 Paper

Model Summary

This is a part of the project T2Vid.
The video-LLM is fine-tuned from the image-LLM Idefics3-8B-Llama3.

License

Model License

The model is built on top of the pre-trained model: HuggingFaceM4/Idefics3-8B-Llama3. We release the fine-tuned Idefics3 checkpoints under the Apache 2.0 license.
The code in this repo is released under the Apache-2.0 License.

Statement

As an LLM, Idefics3-8B-Llama3 generates contents by learning a large mount of texts, but it cannot comprehend, express personal opinions or make value judgement. Anything generated by Idefics3-8B-Llama3 does not represent the views and positions of the model developers
We will not be liable for any problems arising from the use of the Idefics3-8B-Llama3 open Source model, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model.

Training dataset

100K video instruction data from Video-ChatGPT
100K video caption data from ShareGemini

Downloads last month: 16

Safetensors

Model size

8.46B params

Tensor type

BF16

·

Inference API

Video-Text-to-Text

Inference API (serverless) does not yet support transformers models for this pipeline type.

Model tree for xjtupanda/Idefics3-200K-video-finetune

Base model

HuggingFaceM4/Idefics3-8B-Llama3

Finetuned

(11)

this model

Datasets used to train xjtupanda/Idefics3-200K-video-finetune

Collection including xjtupanda/Idefics3-200K-video-finetune

T2Vid

T2Vid is a data augmentation method that enriches the instruction diversity of video data. In this collection, you will find related data and weights. • 5 items • Updated 28 days ago