---
license: apache-2.0
datasets:
- chenjoya/Live-CC-5M
- chenjoya/Live-WhisperX-526K
- lmms-lab/LLaVA-Video-178K
language:
- en
base_model:
- Qwen/Qwen2-VL-7B
tags:
- qwen_vl
- video
- real-time
- multimodal
- LLM
---

# <span style="color: #7FFF7F;">LiveCC-7B-Instruct GGUF Models</span>

## <span style="color: #7FFF7F;">Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)</span>

Our latest quantization method introduces **precision-adaptive quantization** for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on **Llama-3-8B**. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.

### **Benchmark Context**
All tests were conducted on **Llama-3-8B-Instruct** using:
- Standard perplexity evaluation pipeline
- 2048-token context window
- Same prompt set across all quantizations

### **Method**
- **Dynamic Precision Allocation** (see the sketch below):
  - First/Last 25% of layers → IQ4_XS (selected layers)
  - Middle 50% → IQ2_XXS/IQ3_S (increases efficiency)
- **Critical Component Protection**:
  - Embeddings/output layers use Q5_K
  - Reduces error propagation by 38% vs standard 1-2 bit quantization

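The allocation rule above can be pictured with a small, purely hypothetical sketch (this is not the actual DynamicGate code; the function names and fallback quant types are illustrative only):

```python
# Hypothetical sketch of the precision-allocation rule described above.
# Illustrative only: helper names and type choices are not from the real tool.
def assign_layer_quant(layer_idx: int, num_layers: int) -> str:
    """First/last 25% of layers keep a higher-precision type; the middle 50% goes lower."""
    quarter = num_layers // 4
    if layer_idx < quarter or layer_idx >= num_layers - quarter:
        return "IQ4_XS"
    return "IQ2_XXS"

def assign_tensor_quant(tensor_name: str, layer_quant: str) -> str:
    """Critical components (embeddings / output head) are protected at Q5_K."""
    if "embed" in tensor_name or "output" in tensor_name:
        return "Q5_K"
    return layer_quant

if __name__ == "__main__":
    num_layers = 32
    for i in (0, 7, 8, 16, 23, 24, 31):
        print(f"layer {i:2d} -> {assign_layer_quant(i, num_layers)}")
```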
### **Quantization Performance Comparison (Llama-3-8B)**

| Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
|--------------|--------------|------------------|---------|----------|---------|--------|-----------|----------|
| IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
| IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
| IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
| IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
| IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |

**Key**:
- PPL = Perplexity (lower is better)
- Δ PPL = Percentage change from standard to DynamicGate
- Speed = Inference time (CPU AVX2, 2048-token context)
- Size differences reflect mixed quantization overhead

**Key Improvements:**
- 🔥 **IQ1_M** shows a massive 43.9% perplexity reduction (27.46 → 15.41)
- 🚀 **IQ2_S** cuts perplexity by 36.9% while adding only 0.2GB
- ⚡ **IQ1_S** maintains 39.7% better accuracy despite 1-bit quantization

**Tradeoffs:**
- All variants have modest size increases (0.1-0.3GB)
- Inference speeds remain comparable (<5% difference)

### **When to Use These Models**

📌 **Fitting models into GPU VRAM**

✔ **Memory-constrained deployments**

✔ **CPU and edge devices** where 1-2 bit errors can be tolerated

✔ **Research** into ultra-low-bit quantization

## **Choosing the Right Model Format**

Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.

### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides a **similar dynamic range** to FP32 but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device's specs, or see the check below).
- Ideal for **high-performance inference** with a **reduced memory footprint** compared to FP32.

📌 **Use BF16 if:**
✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
✔ You want **higher precision** while saving memory.
✔ You plan to **requantize** the model into another format.

📌 **Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.

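Not sure whether your GPU reports native BF16 support? A minimal PyTorch check (PyTorch is already required by the inference code later in this card):

```python
import torch

# Minimal sketch: ask the current CUDA device whether it advertises BF16 support.
# CPUs and non-CUDA backends are not covered here.
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA device detected; BF16 GPU acceleration is unavailable.")
```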
---

### **F16 (Float 16) – More widely supported than BF16**
- A 16-bit floating-point format with **high precision**, but a smaller range of values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16, but generally sufficient for inference.

📌 **Use F16 if:**
✔ Your hardware supports **FP16** but **not BF16**.
✔ You need a **balance between speed, memory usage, and accuracy**.
✔ You are running on a **GPU** or another device optimized for FP16 computations.

📌 **Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ You have memory limitations.

---

### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, but require more memory.

📌 **Use Quantized Models if:**
✔ You are running inference on a **CPU** and need an optimized model (see the example below).
✔ Your device has **low VRAM** and cannot load full-precision models.
✔ You want to reduce the **memory footprint** while keeping reasonable accuracy.

📌 **Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).

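For example, a Q4_K file can be run on CPU with the third-party `llama-cpp-python` bindings (a minimal sketch, assuming the package is installed, the GGUF has been downloaded locally, and your llama.cpp build supports the Qwen2-VL text architecture; this exercises only the language side of the model):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Minimal sketch: load a Q4_K GGUF on CPU and run a short text completion.
# The model path is a local placeholder; point it at the file you downloaded.
llm = Llama(
    model_path="./LiveCC-7B-Instruct-q4_k.gguf",
    n_ctx=2048,    # context window for this session
    n_threads=8,   # CPU threads to use
)

out = llm("Describe what a video LLM does in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```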
---

### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.

- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.

- **IQ3_S**: Small block size for **maximum memory efficiency**.
  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.

- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.

- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.

- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.

---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|------------|---------------|----------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
| **Q4_K** | Medium-Low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, lower accuracy |
| **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |

---

## **Included Files & Details**

### `LiveCC-7B-Instruct-bf16.gguf`
- Model weights preserved in **BF16**.
- Use this if you want to **requantize** the model into a different format.
- Best if your device supports **BF16 acceleration**.

### `LiveCC-7B-Instruct-f16.gguf`
- Model weights stored in **F16**.
- Use if your device supports **FP16**, especially if BF16 is not available.

### `LiveCC-7B-Instruct-bf16-q8_0.gguf`
- **Output & embeddings** remain in **BF16**.
- All other layers quantized to **Q8_0**.
- Use if your device supports **BF16** and you want a quantized version.

### `LiveCC-7B-Instruct-f16-q8_0.gguf`
- **Output & embeddings** remain in **F16**.
- All other layers quantized to **Q8_0**.

### `LiveCC-7B-Instruct-q4_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q4_K**.
- Good for **CPU inference** with limited memory.

### `LiveCC-7B-Instruct-q4_k_s.gguf`
- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
- Best for **very low-memory setups**.

### `LiveCC-7B-Instruct-q6_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.

### `LiveCC-7B-Instruct-q8_0.gguf`
- Fully **Q8** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.

### `LiveCC-7B-Instruct-iq3_xs.gguf`
- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
- Best for **ultra-low-memory devices**.

### `LiveCC-7B-Instruct-iq3_m.gguf`
- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
- Suitable for **low-memory devices**.

### `LiveCC-7B-Instruct-q4_0.gguf`
- Pure **Q4_0** quantization, optimized for **ARM devices**.
- Best for **low-memory environments**.
- Prefer IQ4_NL for better accuracy.

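If you only need one of the files above, `huggingface_hub` can fetch it directly (a minimal sketch; the repo ID below is a placeholder for the repository that hosts these GGUF files):

```python
from huggingface_hub import hf_hub_download

# Minimal sketch: download a single GGUF file from the Hub.
# "<this-gguf-repo-id>" is a placeholder; use the ID of the repo hosting these files.
local_path = hf_hub_download(
    repo_id="<this-gguf-repo-id>",
    filename="LiveCC-7B-Instruct-q4_k.gguf",
)
print("Downloaded to:", local_path)
```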
# <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>
❤ **Please click "Like" if you find this useful!**
Help me test my **AI-Powered Network Monitor Assistant** with **quantum-ready security checks**:
👉 [Free Network Monitor](https://freenetworkmonitor.click/dashboard)

💬 **How to test**:
1. Click the **chat icon** (bottom right on any page)
2. Choose an **AI assistant type**:
   - `TurboLLM` (GPT-4-mini)
   - `FreeLLM` (Open-source)
   - `TestLLM` (Experimental CPU-only)

### **What I’m Testing**
I’m pushing the limits of **small open-source models for AI network monitoring**, specifically:
- **Function calling** against live network services
- **How small can a model go** while still handling:
  - Automated **Nmap scans**
  - **Quantum-readiness checks**
  - **Metasploit integration**

🟡 **TestLLM** – Current experimental model (llama.cpp on 6 CPU threads):
- ✅ **Zero-configuration setup**
- ⏳ 30s load time (slow inference but **no API costs**)
- 🔧 **Help wanted!** If you’re into **edge-device AI**, let’s collaborate!

### **Other Assistants**
🟢 **TurboLLM** – Uses **gpt-4-mini** for:
- **Real-time network diagnostics**
- **Automated penetration testing** (Nmap/Metasploit)
- 🔑 Get more tokens by [downloading our Free Network Monitor Agent](https://freenetworkmonitor.click/download)

🔵 **HugLLM** – Open-source models (≈8B params):
- **2x more tokens** than TurboLLM
- **AI-powered log analysis**
- 🌐 Runs on the Hugging Face Inference API

### 💡 **Example AI Commands to Test**:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a quick Nmap vulnerability test"`

# LiveCC-7B-Instruct

## Introduction

We introduce LiveCC, the first video LLM capable of real-time commentary, trained with a novel video-ASR streaming method, achieving SOTA results on both streaming and offline benchmarks.

- Project Page: https://showlab.github.io/livecc

> [!Important]
> This is the SFT model. The base model is at [LiveCC-7B-Base](https://huggingface.co/chenjoya/LiveCC-7B-Base).

## Training with Streaming Frame-Words Paradigm

![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/TzPHBCYAUFsLa3b7CUUr-.png)

## Quickstart

### Gradio Demo

Please refer to https://github.com/showlab/livecc:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/k2vCcWNBEAYy7BUOgXYMA.png)

### Hands-on

Like qwen-vl-utils, we offer a toolkit to help you handle various types of visual input more conveniently, **especially for video streaming inputs**. You can install it with the following command:

```bash
pip install qwen-vl-utils livecc-utils liger_kernel
```
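The snippets below load the model with `attn_implementation='flash_attention_2'`, which requires the separate `flash-attn` package. A quick sketch to check whether it is available before running them (if it is not, either install it or drop that argument when loading the model):

```python
import importlib.util

# Quick check: the loading code below requests attn_implementation='flash_attention_2',
# which needs the flash-attn package to be installed (GPU only).
if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn not found: install it, or remove the "
          "attn_implementation='flash_attention_2' argument when loading the model.")
else:
    print("flash-attn is available.")
```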
Here is a code snippet showing how to do **real-time video commentary** with `transformers` and the above utils:

```python
import functools, torch, os, tqdm
from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl
apply_liger_kernel_to_qwen2_vl() # important: our model is trained with this; keep consistency
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, LogitsProcessor, logging
from livecc_utils import prepare_multiturn_multimodal_inputs_for_generation, get_smart_resized_clip, get_smart_resized_video_reader
from qwen_vl_utils import process_vision_info

class LiveCCDemoInfer:
    fps = 2
    initial_fps_frames = 6
    streaming_fps_frames = 2
    initial_time_interval = initial_fps_frames / fps
    streaming_time_interval = streaming_fps_frames / fps
    frame_time_interval = 1 / fps

    def __init__(self, model_path: str = None, device_id: int = 0):
        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
            model_path, torch_dtype="auto",
            device_map=f'cuda:{device_id}',
            attn_implementation='flash_attention_2'
        )
        self.processor = AutoProcessor.from_pretrained(model_path, use_fast=False)
        self.model.prepare_inputs_for_generation = functools.partial(prepare_multiturn_multimodal_inputs_for_generation, self.model)
        message = {
            "role": "user",
            "content": [
                {"type": "text", "text": 'livecc'},
            ]
        }
        texts = self.processor.apply_chat_template([message], tokenize=False)
        self.system_prompt_offset = texts.index('<|im_start|>user')
        self._cached_video_readers_with_hw = {}

    def live_cc(
        self,
        query: str,
        state: dict,
        max_pixels: int = 384 * 28 * 28,
        default_query: str = 'Please describe the video.',
        do_sample: bool = True,
        repetition_penalty: float = 1.05,
        **kwargs,
    ):
        """
        state: dict, (maybe) with keys:
            video_path: str, video path
            video_timestamp: float, current video timestamp
            last_timestamp: float, last processed video timestamp
            last_video_pts_index: int, last processed video frame index
            video_pts: np.ndarray, video pts
            last_history: list, last processed history
            past_key_values: llm past_key_values
            past_ids: past generated ids
        """
        # 1. preparation: video_reader, and last processing info
        video_timestamp, last_timestamp = state.get('video_timestamp', 0), state.get('last_timestamp', -1 / self.fps)
        video_path = state['video_path']
        if video_path not in self._cached_video_readers_with_hw:
            self._cached_video_readers_with_hw[video_path] = get_smart_resized_video_reader(video_path, max_pixels)
            video_reader = self._cached_video_readers_with_hw[video_path][0]
            video_reader.get_frame_timestamp(0)
            state['video_pts'] = torch.from_numpy(video_reader._frame_pts[:, 1])
            state['last_video_pts_index'] = -1
        video_pts = state['video_pts']
        if last_timestamp + self.frame_time_interval > video_pts[-1]:
            state['video_end'] = True
            return
        video_reader, resized_height, resized_width = self._cached_video_readers_with_hw[video_path]
        last_video_pts_index = state['last_video_pts_index']

        # 2. which frames will be processed
        initialized = last_timestamp >= 0
        if not initialized:
            video_timestamp = max(video_timestamp, self.initial_time_interval)
        if video_timestamp <= last_timestamp + self.frame_time_interval:
            return
        timestamps = torch.arange(last_timestamp + self.frame_time_interval, video_timestamp, self.frame_time_interval) # add compensation

        # 3. fetch frames in required timestamps
        clip, clip_timestamps, clip_idxs = get_smart_resized_clip(video_reader, resized_height, resized_width, timestamps, video_pts, video_pts_index_from=last_video_pts_index+1)
        state['last_video_pts_index'] = clip_idxs[-1]
        state['last_timestamp'] = clip_timestamps[-1]

        # 4. organize to interleave frames
        interleave_clips, interleave_timestamps = [], []
        if not initialized:
            interleave_clips.append(clip[:self.initial_fps_frames])
            interleave_timestamps.append(clip_timestamps[:self.initial_fps_frames])
            clip = clip[self.initial_fps_frames:]
            clip_timestamps = clip_timestamps[self.initial_fps_frames:]
        if len(clip) > 0:
            interleave_clips.extend(list(clip.split(self.streaming_fps_frames)))
            interleave_timestamps.extend(list(clip_timestamps.split(self.streaming_fps_frames)))

        # 5. make conversation and send to model
        for clip, timestamps in zip(interleave_clips, interleave_timestamps):
            start_timestamp, stop_timestamp = timestamps[0].item(), timestamps[-1].item() + self.frame_time_interval
            message = {
                "role": "user",
                "content": [
                    {"type": "text", "text": f'Time={start_timestamp:.1f}-{stop_timestamp:.1f}s'},
                    {"type": "video", "video": clip}
                ]
            }
            if not query and not state.get('query', None):
                query = default_query
                print(f'No query provided, use default_query={default_query}')
            if query and state.get('query', None) != query:
                message['content'].append({"type": "text", "text": query})
                state['query'] = query
            texts = self.processor.apply_chat_template([message], tokenize=False, add_generation_prompt=True, return_tensors='pt')
            past_ids = state.get('past_ids', None)
            if past_ids is not None:
                texts = '<|im_end|>\n' + texts[self.system_prompt_offset:]
            inputs = self.processor(
                text=texts,
                images=None,
                videos=[clip],
                return_tensors="pt",
                return_attention_mask=False
            )
            inputs.to('cuda')
            if past_ids is not None:
                inputs['input_ids'] = torch.cat([past_ids, inputs.input_ids], dim=1)
            outputs = self.model.generate(
                **inputs, past_key_values=state.get('past_key_values', None),
                return_dict_in_generate=True, do_sample=do_sample,
                repetition_penalty=repetition_penalty,
            )
            state['past_key_values'] = outputs.past_key_values
            state['past_ids'] = outputs.sequences[:, :-1]
            yield (start_timestamp, stop_timestamp), self.processor.decode(outputs.sequences[0, inputs.input_ids.size(1):], skip_special_tokens=True), state

model_path = 'chenjoya/LiveCC-7B-Instruct'
# download a test video at: https://github.com/showlab/livecc/blob/main/demo/sources/howto_fix_laptop_mute_1080p.mp4
video_path = "demo/sources/howto_fix_laptop_mute_1080p.mp4"
query = "Please describe the video."

infer = LiveCCDemoInfer(model_path=model_path)
state = {'video_path': video_path}
commentaries = []
for t in range(31):
    state['video_timestamp'] = t
    for (start_t, stop_t), response, state in infer.live_cc(
        query=query, state=state,
        max_pixels=384 * 28 * 28, repetition_penalty=1.05,
        streaming_eos_base_threshold=0.0, streaming_eos_threshold_step=0
    ):
        print(f'{start_t}s-{stop_t}s: {response}')
        commentaries.append([start_t, stop_t, response])
    if state.get('video_end', False):
        break
```
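Since each yielded commentary comes with its `(start, stop)` time range, the collected `commentaries` list can be saved directly for later inspection or caption alignment (a minimal sketch; the filename is arbitrary):

```python
import json

# Minimal sketch: persist the streamed commentaries produced by the loop above.
# Each entry is [start_seconds, stop_seconds, commentary_text].
with open("livecc_commentaries.json", "w") as f:
    json.dump(commentaries, f, indent=2)
```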
Here is a code snippet showing how to do **common video (multi-turn) QA** with `transformers` and the above utils:

```python
import functools, torch
from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl
apply_liger_kernel_to_qwen2_vl() # important: our model is trained with this; keep consistency
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, LogitsProcessor, logging
from livecc_utils import prepare_multiturn_multimodal_inputs_for_generation, get_smart_resized_clip, get_smart_resized_video_reader
from qwen_vl_utils import process_vision_info

class LiveCCDemoInfer:
    fps = 2
    initial_fps_frames = 6
    streaming_fps_frames = 2
    initial_time_interval = initial_fps_frames / fps
    streaming_time_interval = streaming_fps_frames / fps
    frame_time_interval = 1 / fps

    def __init__(self, model_path: str = None, device: str = 'cuda'):
        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
            model_path, torch_dtype="auto",
            device_map=device,
            attn_implementation='flash_attention_2'
        )
        self.processor = AutoProcessor.from_pretrained(model_path, use_fast=False)
        self.streaming_eos_token_id = self.processor.tokenizer(' ...').input_ids[-1]
        self.model.prepare_inputs_for_generation = functools.partial(prepare_multiturn_multimodal_inputs_for_generation, self.model)
        message = {
            "role": "user",
            "content": [
                {"type": "text", "text": 'livecc'},
            ]
        }
        texts = self.processor.apply_chat_template([message], tokenize=False)
        self.system_prompt_offset = texts.index('<|im_start|>user')

    def video_qa(
        self,
        message: str,
        state: dict,
        do_sample: bool = True,
        repetition_penalty: float = 1.05,
        **kwargs,
    ):
        """
        state: dict, (maybe) with keys:
            video_path: str, video path
            video_timestamp: float, current video timestamp
            last_timestamp: float, last processed video timestamp
            last_video_pts_index: int, last processed video frame index
            video_pts: np.ndarray, video pts
            last_history: list, last processed history
            past_key_values: llm past_key_values
            past_ids: past generated ids
        """
        video_path = state.get('video_path', None)
        conversation = []
        past_ids = state.get('past_ids', None)
        content = [{"type": "text", "text": message}]
        if past_ids is None and video_path: # only use once
            content.insert(0, {"type": "video", "video": video_path})
        conversation.append({"role": "user", "content": content})
        image_inputs, video_inputs = process_vision_info(conversation)
        texts = self.processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True, return_tensors='pt')
        if past_ids is not None:
            texts = '<|im_end|>\n' + texts[self.system_prompt_offset:]
        inputs = self.processor(
            text=texts,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            return_attention_mask=False
        )
        inputs.to(self.model.device)
        if past_ids is not None:
            inputs['input_ids'] = torch.cat([past_ids, inputs.input_ids], dim=1)
        outputs = self.model.generate(
            **inputs, past_key_values=state.get('past_key_values', None),
            return_dict_in_generate=True, do_sample=do_sample,
            repetition_penalty=repetition_penalty,
            max_new_tokens=512,
        )
        state['past_key_values'] = outputs.past_key_values
        state['past_ids'] = outputs.sequences[:, :-1]
        response = self.processor.decode(outputs.sequences[0, inputs.input_ids.size(1):], skip_special_tokens=True)
        return response, state

model_path = 'chenjoya/LiveCC-7B-Instruct'
# download a test video at: https://github.com/showlab/livecc/blob/main/demo/sources/howto_fix_laptop_mute_1080p.mp4
video_path = "demo/sources/howto_fix_laptop_mute_1080p.mp4"

infer = LiveCCDemoInfer(model_path=model_path)
state = {'video_path': video_path}
# first round
query1 = 'What is the video?'
response1, state = infer.video_qa(message=query1, state=state)
print(f'Q1: {query1}\nA1: {response1}')
# second round
query2 = 'How do you know that?'
response2, state = infer.video_qa(message=query2, state=state)
print(f'Q2: {query2}\nA2: {response2}')
```

## Performance

![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/pEXEBYlUa8jBYTPwSm5lq.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/Kr4q1zfVjAkxsENpcRDg0.png)

## Limitations

- This model is fine-tuned from LiveCC-7B-Base, which itself starts from Qwen2-VL-7B-Base, so it may share the limitations mentioned at https://huggingface.co/Qwen/Qwen2-VL-7B.
- When performing real-time video commentary, the output may collapse, e.g., into repeated patterns. If you encounter this, try adjusting repetition_penalty, streaming_eos_base_threshold, and streaming_eos_threshold_step.
- This model has a context window of only 32768 tokens. Using more visual tokens per frame (e.g., 768 * 28 * 28) gives better performance but shortens the working duration.

These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.

## Citation

If you find our work helpful, feel free to cite us.

```bibtex
@article{livecc,
    author = {Joya Chen and Ziyun Zeng and Yiqi Lin and Wei Li and Zejun Ma and Mike Zheng Shou},
    title = {LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale},
    journal = {arXiv preprint arXiv:2504.16030},
    year = {2025},
}
```