Mungert committed · Commit 9bd0c64 · verified · 1 Parent(s): 496d80f

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +572 -0

README.md ADDED
@@ -0,0 +1,572 @@
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - chenjoya/Live-CC-5M
5
+ - chenjoya/Live-WhisperX-526K
6
+ - lmms-lab/LLaVA-Video-178K
7
+ language:
8
+ - en
9
+ base_model:
10
+ - Qwen/Qwen2-VL-7B
11
+ tags:
12
+ - qwen_vl
13
+ - video
14
+ - real-time
15
+ - multimodal
16
+ - LLM
17
+ ---
18
+
19
+ # <span style="color: #7FFF7F;">LiveCC-7B-Instruct GGUF Models</span>
20
+
21
+ ## <span style="color: #7FFF7F;">Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)</span>
22
+
23
+ Our latest quantization method introduces **precision-adaptive quantization** for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on **Llama-3-8B**. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
24
+
25
+ ### **Benchmark Context**
26
+ All tests conducted on **Llama-3-8B-Instruct** using:
27
+ - Standard perplexity evaluation pipeline
28
+ - 2048-token context window
29
+ - Same prompt set across all quantizations
30
+
31
+ ### **Method**
32
+ - **Dynamic Precision Allocation**:
33
+ - First/Last 25% of layers → IQ4_XS (selected layers)
34
+ - Middle 50% → IQ2_XXS/IQ3_S (increases efficiency; a layer-assignment sketch follows below)
35
+ - **Critical Component Protection**:
36
+ - Embeddings/output layers use Q5_K
37
+ - Reduces error propagation by 38% vs standard 1-2 bit quantization
38
+
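To make the layer-wise assignment above concrete, here is a minimal sketch of how such a rule could be expressed in Python. It is illustrative only: the function name and exact thresholds are assumptions, not the published recipe used to produce these GGUF files.

```python
def choose_quant_type(layer_idx: int, n_layers: int) -> str:
    """Illustrative precision-adaptive rule: protect the first/last 25% of
    transformer blocks with a higher-precision type, compress the middle 50% harder."""
    quarter = n_layers // 4
    if layer_idx < quarter or layer_idx >= n_layers - quarter:
        return "IQ4_XS"   # boundary layers keep more precision
    return "IQ2_XXS"      # middle layers take the aggressive quantization

# Example for a 32-block model (embeddings/output would separately stay at Q5_K).
print([choose_quant_type(i, 32) for i in range(32)])
```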
39
+ ### **Quantization Performance Comparison (Llama-3-8B)**
40
+
41
+ | Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
42
+ |--------------|--------------|------------------|---------|----------|---------|--------|-----------|----------|
43
+ | IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
44
+ | IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
45
+ | IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
46
+ | IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
47
+ | IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
48
+
49
+ **Key**:
50
+ - PPL = Perplexity (lower is better)
51
+ - Δ PPL = percentage change from standard to DynamicGate (see the formula below)
52
+ - Speed = inference time (CPU AVX2, 2048-token context)
53
+ - Size differences reflect mixed quantization overhead
54
+
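For clarity, the Δ PPL column is the relative change of the DynamicGate perplexity versus the standard quantization:

$$\Delta\,\mathrm{PPL} = \frac{\mathrm{PPL}_{\mathrm{DynamicGate}} - \mathrm{PPL}_{\mathrm{Standard}}}{\mathrm{PPL}_{\mathrm{Standard}}} \times 100\%$$

For example, IQ1_M gives (15.41 − 27.46) / 27.46 ≈ −43.9%, matching the table.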
55
+ **Key Improvements:**
56
+ - 🔥 **IQ1_M** shows massive 43.9% perplexity reduction (27.46 → 15.41)
57
+ - 🚀 **IQ2_S** cuts perplexity by 36.9% while adding only 0.2GB
58
+ - ⚡ **IQ1_S** maintains 39.7% better accuracy despite 1-bit quantization
59
+
60
+ **Tradeoffs:**
61
+ - All variants have modest size increases (0.1-0.3GB)
62
+ - Inference speeds remain comparable (<5% difference)
63
+
64
+
65
+ ### **When to Use These Models**
66
+ ✔ **Fitting models into GPU VRAM** (see the size estimate below)
67
+
68
+ ✔ **Memory-constrained deployments**
69
+
70
+ ✔ **CPU and edge devices** where occasional 1-2 bit errors can be tolerated
71
+
72
+ ✔ **Research** into ultra-low-bit quantization
73
+
74
+
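As a quick way to judge whether a given quant will fit your VRAM, you can estimate file size from bits per weight. This is only a back-of-the-envelope sketch (the ~10% overhead for scales and metadata is an assumption), so always check the actual file sizes listed in the repository.

```python
def approx_gguf_size_gb(n_params_billion: float, bits_per_weight: float,
                        overhead: float = 1.10) -> float:
    """Rough GGUF size estimate: params * bits / 8, plus ~10% assumed overhead."""
    return n_params_billion * bits_per_weight / 8 * overhead

# A ~7.6B-parameter model at ~4.5 bits/weight lands around 4-5 GB on disk/VRAM.
print(f"{approx_gguf_size_gb(7.6, 4.5):.1f} GB")
```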
75
+ ## **Choosing the Right Model Format**
76
+
77
+ Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.
78
+
79
+ ### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
80
+ - A 16-bit floating-point format designed for **faster computation** while retaining good precision.
81
+ - Provides **similar dynamic range** as FP32 but with **lower memory usage**.
82
+ - Recommended if your hardware supports **BF16 acceleration** (check your device's specs).
83
+ - Ideal for **high-performance inference** with **reduced memory footprint** compared to FP32.
84
+
85
+ 📌 **Use BF16 if:**
86
+ ✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
87
+ ✔ You want **higher precision** while saving memory.
88
+ ✔ You plan to **requantize** the model into another format.
89
+
90
+ 📌 **Avoid BF16 if:**
91
+ ❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
92
+ ❌ You need compatibility with older devices that lack BF16 optimization.
93
+
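If you are unsure whether your GPU exposes native BF16, a quick check with PyTorch (assuming a CUDA build) looks like this:

```python
import torch

# True on Ampere-class GPUs (A100, RTX 30xx) and newer; False means BF16 will be
# emulated or unavailable, so prefer F16 or a quantized format instead.
if torch.cuda.is_available():
    print("BF16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA device visible; consider a quantized CPU format such as Q4_K.")
```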
94
+ ---
95
+
96
+ ### **F16 (Float 16) – More widely supported than BF16**
97
+ - A 16-bit floating-point format offering **high precision** but a narrower range of values than BF16.
98
+ - Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
99
+ - Slightly narrower dynamic range than BF16, but generally sufficient for inference.
100
+
101
+ 📌 **Use F16 if:**
102
+ ✔ Your hardware supports **FP16** but **not BF16**.
103
+ ✔ You need a **balance between speed, memory usage, and accuracy**.
104
+ ✔ You are running on a **GPU** or another device optimized for FP16 computations.
105
+
106
+ 📌 **Avoid F16 if:**
107
+ ❌ Your device lacks **native FP16 support** (it may run slower than expected).
108
+ ❌ You have memory limitations.
109
+
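The same decision applies when loading the original (non-GGUF) checkpoint with `transformers`: use BF16 where supported and fall back to FP16 otherwise. A minimal sketch, assuming a single CUDA GPU:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Prefer BF16 on hardware that supports it natively, otherwise fall back to FP16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "chenjoya/LiveCC-7B-Instruct",  # full-precision checkpoint, not one of the GGUF files
    torch_dtype=dtype,
    device_map="cuda",
)
print(model.dtype)
```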
110
+ ---
111
+
112
+ ### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
113
+ Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
114
+ - **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.
115
+ - **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, require more memory.
116
+
117
+ 📌 **Use Quantized Models if:**
118
+ ✔ You are running inference on a **CPU** and need an optimized model.
119
+ ✔ Your device has **low VRAM** and cannot load full-precision models.
120
+ ✔ You want to reduce **memory footprint** while keeping reasonable accuracy.
121
+
122
+ 📌 **Avoid Quantized Models if:**
123
+ ❌ You need **maximum accuracy** (full-precision models are better for this).
124
+ ❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
125
+
126
+ ---
127
+
128
+ ### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
129
+ These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.
130
+
131
+ - **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
132
+ - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
133
+ - **Trade-off**: Lower accuracy compared to higher-bit quantizations.
134
+
135
+ - **IQ3_S**: Small block size for **maximum memory efficiency**.
136
+ - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.
137
+
138
+ - **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
139
+ - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.
140
+
141
+ - **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
142
+ - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.
143
+
144
+ - **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
145
+ - **Use case**: Best for **ARM-based devices** or **low-memory environments**.
146
+
147
+ ---
148
+
149
+ ### **Summary Table: Model Format Selection**
150
+
151
+ | Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
152
+ |--------------|------------|---------------|----------------------|---------------|
153
+ | **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
154
+ | **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
155
+ | **Q4_K** | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
156
+ | **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
157
+ | **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
158
+ | **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency and low accuracy |
159
+ | **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize it for ARM devices |
160
+
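To actually run one of the quantized files from the table above, a minimal text-only sketch with `llama-cpp-python` is shown below. This exercises only the language model; feeding video frames through llama.cpp needs the separate vision projector and a recent build, which is not covered here, and the local filename is an assumption.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="LiveCC-7B-Instruct-q4_k.gguf",  # adjust to the variant you downloaded
    n_ctx=4096,      # well below the model's 32768 maximum, to keep memory modest
    n_threads=8,     # tune to your CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In one sentence, what is real-time video commentary?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```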
161
+ ---
162
+
163
+ ## **Included Files & Details**
164
+
165
+ ### `LiveCC-7B-Instruct-bf16.gguf`
166
+ - Model weights preserved in **BF16**.
167
+ - Use this if you want to **requantize** the model into a different format.
168
+ - Best if your device supports **BF16 acceleration**.
169
+
170
+ ### `LiveCC-7B-Instruct-f16.gguf`
171
+ - Model weights stored in **F16**.
172
+ - Use if your device supports **FP16**, especially if BF16 is not available.
173
+
174
+ ### `LiveCC-7B-Instruct-bf16-q8_0.gguf`
175
+ - **Output & embeddings** remain in **BF16**.
176
+ - All other layers quantized to **Q8_0**.
177
+ - Use if your device supports **BF16** and you want a quantized version.
178
+
179
+ ### `LiveCC-7B-Instruct-f16-q8_0.gguf`
180
+ - **Output & embeddings** remain in **F16**.
181
+ - All other layers quantized to **Q8_0**.
182
+
183
+ ### `LiveCC-7B-Instruct-q4_k.gguf`
184
+ - **Output & embeddings** quantized to **Q8_0**.
185
+ - All other layers quantized to **Q4_K**.
186
+ - Good for **CPU inference** with limited memory.
187
+
188
+ ### `LiveCC-7B-Instruct-q4_k_s.gguf`
189
+ - Smallest **Q4_K** variant, using less memory at the cost of accuracy.
190
+ - Best for **very low-memory setups**.
191
+
192
+ ### `LiveCC-7B-Instruct-q6_k.gguf`
193
+ - **Output & embeddings** quantized to **Q8_0**.
194
+ - All other layers quantized to **Q6_K**.
195
+
196
+ ### `LiveCC-7B-Instruct-q8_0.gguf`
197
+ - Fully **Q8** quantized model for better accuracy.
198
+ - Requires **more memory** but offers higher precision.
199
+
200
+ ### `LiveCC-7B-Instruct-iq3_xs.gguf`
201
+ - **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
202
+ - Best for **ultra-low-memory devices**.
203
+
204
+ ### `LiveCC-7B-Instruct-iq3_m.gguf`
205
+ - **IQ3_M** quantization, offering a **medium block size** for better accuracy.
206
+ - Suitable for **low-memory devices**.
207
+
208
+ ### `LiveCC-7B-Instruct-q4_0.gguf`
209
+ - Pure **Q4_0** quantization, optimized for **ARM devices**.
210
+ - Best for **low-memory environments**.
211
+ - Prefer IQ4_NL for better accuracy.
212
+
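To fetch a single variant from the list above rather than cloning the whole repository, `huggingface_hub` works well. A small sketch; the `repo_id` below is an assumption for the repository hosting this card, so adjust it (and the filename) as needed:

```python
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="Mungert/LiveCC-7B-Instruct-GGUF",   # assumed repo id, change if different
    filename="LiveCC-7B-Instruct-q4_k.gguf",     # pick any variant listed above
)
print(local_path)
```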
213
+ # <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>
214
+ ❤ **Please click "Like" if you find this useful!**
215
+ Help me test my **AI-Powered Network Monitor Assistant** with **quantum-ready security checks**:
216
+ 👉 [Free Network Monitor](https://freenetworkmonitor.click/dashboard)
217
+
218
+ 💬 **How to test**:
219
+ 1. Click the **chat icon** (bottom right on any page)
220
+ 2. Choose an **AI assistant type**:
221
+ - `TurboLLM` (GPT-4-mini)
222
+ - `HugLLM` (Open-source)
223
+ - `TestLLM` (Experimental CPU-only)
224
+
225
+ ### **What I’m Testing**
226
+ I’m pushing the limits of **small open-source models for AI network monitoring**, specifically:
227
+ - **Function calling** against live network services
228
+ - **How small can a model go** while still handling:
229
+ - Automated **Nmap scans**
230
+ - **Quantum-readiness checks**
231
+ - **Metasploit integration**
232
+
233
+ 🟡 **TestLLM** – Current experimental model (llama.cpp on 6 CPU threads):
234
+ - ✅ **Zero-configuration setup**
235
+ - ⏳ 30s load time (slow inference but **no API costs**)
236
+ - 🔧 **Help wanted!** If you’re into **edge-device AI**, let’s collaborate!
237
+
238
+ ### **Other Assistants**
239
+ 🟢 **TurboLLM** – Uses **gpt-4-mini** for:
240
+ - **Real-time network diagnostics**
241
+ - **Automated penetration testing** (Nmap/Metasploit)
242
+ - 🔑 Get more tokens by [downloading our Free Network Monitor Agent](https://freenetworkmonitor.click/download)
243
+
244
+ 🔵 **HugLLM** – Open-source models (≈8B params):
245
+ - **2x more tokens** than TurboLLM
246
+ - **AI-powered log analysis**
247
+ - 🌐 Runs on Hugging Face Inference API
248
+
249
+ ### 💡 **Example AI Commands to Test**:
250
+ 1. `"Give me info on my website's SSL certificate"`
251
+ 2. `"Check if my server is using quantum-safe encryption for communication"`
252
+ 3. `"Run a quick Nmap vulnerability test"`
253
+
254
+
255
+ # LiveCC-7B-Instruct
256
+
257
+ ## Introduction
258
+
259
+ We introduce LiveCC, the first video LLM capable of real-time commentary. It is trained with a novel video-ASR streaming method and achieves SOTA results on both streaming and offline benchmarks.
260
+
261
+ - Project Page: https://showlab.github.io/livecc
262
+
263
+ > [!Important]
264
+ > This is the SFT model. The base model is at [LiveCC-7B-Base](https://huggingface.co/chenjoya/LiveCC-7B-Base).
265
+
266
+ ## Training with Streaming Frame-Words Paradigm
267
+
268
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/T-Zs50VlFT2tE7RdV49TE.png)
269
+
270
+ ## Quickstart
271
+
272
+ ### Gradio Demo
273
+
274
+ Please refer to https://github.com/showlab/livecc:
275
+
276
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/HUvadZRIhrT5vd332XBO3.png)
277
+
278
+ ### Hands-on
279
+
280
+ Similar to qwen-vl-utils, we offer a toolkit to help you handle various types of visual input more conveniently, **especially video streaming inputs**. You can install it using the following command:
281
+
282
+ ```bash
283
+ pip install qwen-vl-utils livecc-utils liger_kernel
284
+ ```
285
+
286
+ The following snippet shows how to do **real-time video commentary** with `transformers` and the utilities above:
287
+
288
+ ```python
289
+ import functools, torch, os, tqdm
290
+ from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl
291
+ apply_liger_kernel_to_qwen2_vl() # important: the model was trained with Liger kernels applied, so keep this for consistency
292
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, LogitsProcessor, logging
293
+ from livecc_utils import prepare_multiturn_multimodal_inputs_for_generation, get_smart_resized_clip, get_smart_resized_video_reader
294
+ from qwen_vl_utils import process_vision_info
295
+
296
+ class LiveCCDemoInfer:
297
+ fps = 2
298
+ initial_fps_frames = 6
299
+ streaming_fps_frames = 2
300
+ initial_time_interval = initial_fps_frames / fps
301
+ streaming_time_interval = streaming_fps_frames / fps
302
+ frame_time_interval = 1 / fps
303
+ def __init__(self, model_path: str = None, device_id: int = 0):
304
+ self.model = Qwen2VLForConditionalGeneration.from_pretrained(
305
+ model_path, torch_dtype="auto",
306
+ device_map=f'cuda:{device_id}',
307
+ attn_implementation='flash_attention_2'
308
+ )
309
+ self.processor = AutoProcessor.from_pretrained(model_path, use_fast=False)
310
+ self.model.prepare_inputs_for_generation = functools.partial(prepare_multiturn_multimodal_inputs_for_generation, self.model)
311
+ message = {
312
+ "role": "user",
313
+ "content": [
314
+ {"type": "text", "text": 'livecc'},
315
+ ]
316
+ }
317
+ texts = self.processor.apply_chat_template([message], tokenize=False)
318
+ self.system_prompt_offset = texts.index('<|im_start|>user')
319
+ self._cached_video_readers_with_hw = {}
320
+
321
+
322
+ def live_cc(
323
+ self,
324
+ query: str,
325
+ state: dict,
326
+ max_pixels: int = 384 * 28 * 28,
327
+ default_query: str = 'Please describe the video.',
328
+ do_sample: bool = True,
329
+ repetition_penalty: float = 1.05,
330
+ **kwargs,
331
+ ):
332
+ """
333
+ state: dict, (maybe) with keys:
334
+ video_path: str, video path
335
+ video_timestamp: float, current video timestamp
336
+ last_timestamp: float, last processed video timestamp
337
+ last_video_pts_index: int, last processed video frame index
338
+ video_pts: np.ndarray, video pts
339
+ last_history: list, last processed history
340
+ past_key_values: llm past_key_values
341
+ past_ids: past generated ids
342
+ """
343
+ # 1. preparation: video_reader, and last processing info
344
+ video_timestamp, last_timestamp = state.get('video_timestamp', 0), state.get('last_timestamp', -1 / self.fps)
345
+ video_path = state['video_path']
346
+ if video_path not in self._cached_video_readers_with_hw:
347
+ self._cached_video_readers_with_hw[video_path] = get_smart_resized_video_reader(video_path, max_pixels)
348
+ video_reader = self._cached_video_readers_with_hw[video_path][0]
349
+ video_reader.get_frame_timestamp(0)
350
+ state['video_pts'] = torch.from_numpy(video_reader._frame_pts[:, 1])
351
+ state['last_video_pts_index'] = -1
352
+ video_pts = state['video_pts']
353
+ if last_timestamp + self.frame_time_interval > video_pts[-1]:
354
+ state['video_end'] = True
355
+ return
356
+ video_reader, resized_height, resized_width = self._cached_video_readers_with_hw[video_path]
357
+ last_video_pts_index = state['last_video_pts_index']
358
+
359
+ # 2. which frames will be processed
360
+ initialized = last_timestamp >= 0
361
+ if not initialized:
362
+ video_timestamp = max(video_timestamp, self.initial_time_interval)
363
+ if video_timestamp <= last_timestamp + self.frame_time_interval:
364
+ return
365
+ timestamps = torch.arange(last_timestamp + self.frame_time_interval, video_timestamp, self.frame_time_interval) # add compensation
366
+
367
+ # 3. fetch frames in required timestamps
368
+ clip, clip_timestamps, clip_idxs = get_smart_resized_clip(video_reader, resized_height, resized_width, timestamps, video_pts, video_pts_index_from=last_video_pts_index+1)
369
+ state['last_video_pts_index'] = clip_idxs[-1]
370
+ state['last_timestamp'] = clip_timestamps[-1]
371
+
372
+ # 4. organize to interleave frames
373
+ interleave_clips, interleave_timestamps = [], []
374
+ if not initialized:
375
+ interleave_clips.append(clip[:self.initial_fps_frames])
376
+ interleave_timestamps.append(clip_timestamps[:self.initial_fps_frames])
377
+ clip = clip[self.initial_fps_frames:]
378
+ clip_timestamps = clip_timestamps[self.initial_fps_frames:]
379
+ if len(clip) > 0:
380
+ interleave_clips.extend(list(clip.split(self.streaming_fps_frames)))
381
+ interleave_timestamps.extend(list(clip_timestamps.split(self.streaming_fps_frames)))
382
+
383
+ # 5. make conversation and send to model
384
+ for clip, timestamps in zip(interleave_clips, interleave_timestamps):
385
+ start_timestamp, stop_timestamp = timestamps[0].item(), timestamps[-1].item() + self.frame_time_interval
386
+ message = {
387
+ "role": "user",
388
+ "content": [
389
+ {"type": "text", "text": f'Time={start_timestamp:.1f}-{stop_timestamp:.1f}s'},
390
+ {"type": "video", "video": clip}
391
+ ]
392
+ }
393
+ if not query and not state.get('query', None):
394
+ query = default_query
395
+ print(f'No query provided, use default_query={default_query}')
396
+ if query and state.get('query', None) != query:
397
+ message['content'].append({"type": "text", "text": query})
398
+ state['query'] = query
399
+ texts = self.processor.apply_chat_template([message], tokenize=False, add_generation_prompt=True, return_tensors='pt')
400
+ past_ids = state.get('past_ids', None)
401
+ if past_ids is not None:
402
+ texts = '<|im_end|>\n' + texts[self.system_prompt_offset:]
403
+ inputs = self.processor(
404
+ text=texts,
405
+ images=None,
406
+ videos=[clip],
407
+ return_tensors="pt",
408
+ return_attention_mask=False
409
+ )
410
+ inputs.to('cuda')
411
+ if past_ids is not None:
412
+ inputs['input_ids'] = torch.cat([past_ids, inputs.input_ids], dim=1)
413
+ outputs = self.model.generate(
414
+ **inputs, past_key_values=state.get('past_key_values', None),
415
+ return_dict_in_generate=True, do_sample=do_sample,
416
+ repetition_penalty=repetition_penalty,
417
+ )
418
+ state['past_key_values'] = outputs.past_key_values
419
+ state['past_ids'] = outputs.sequences[:, :-1]
420
+ yield (start_timestamp, stop_timestamp), self.processor.decode(outputs.sequences[0, inputs.input_ids.size(1):], skip_special_tokens=True), state
421
+
422
+ model_path = 'chenjoya/LiveCC-7B-Instruct'
423
+ # download a test video at: https://github.com/showlab/livecc/blob/main/demo/sources/howto_fix_laptop_mute_1080p.mp4
424
+ video_path = "demo/sources/howto_fix_laptop_mute_1080p.mp4"
425
+ query = "Please describe the video."
426
+
427
+ infer = LiveCCDemoInfer(model_path=model_path)
428
+ state = {'video_path': video_path}
429
+ commentaries = []
430
+ t = 0
431
+ for t in range(31):
432
+ state['video_timestamp'] = t
433
+ for (start_t, stop_t), response, state in infer.live_cc(
434
+ query=query, state=state,
435
+ max_pixels = 384 * 28 * 28, repetition_penalty=1.05,
436
+ streaming_eos_base_threshold=0.0, streaming_eos_threshold_step=0
437
+ ):
438
+ print(f'{start_t}s-{stop_t}s: {response}')
439
+ commentaries.append([start_t, stop_t, response])
440
+ if state.get('video_end', False):
441
+ break
442
+ t += 1
443
+ ```
444
+
445
+ The following snippet shows how to do **common (multi-turn) video QA** with `transformers` and the utilities above:
446
+ ```python
447
+ import functools, torch
448
+ from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl
449
+ apply_liger_kernel_to_qwen2_vl() # important: the model was trained with Liger kernels applied, so keep this for consistency
450
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, LogitsProcessor, logging
451
+ from livecc_utils import prepare_multiturn_multimodal_inputs_for_generation, get_smart_resized_clip, get_smart_resized_video_reader
452
+ from qwen_vl_utils import process_vision_info
453
+
454
+ class LiveCCDemoInfer:
455
+ fps = 2
456
+ initial_fps_frames = 6
457
+ streaming_fps_frames = 2
458
+ initial_time_interval = initial_fps_frames / fps
459
+ streaming_time_interval = streaming_fps_frames / fps
460
+ frame_time_interval = 1 / fps
461
+
462
+ def __init__(self, model_path: str = None, device: str = 'cuda'):
463
+ self.model = Qwen2VLForConditionalGeneration.from_pretrained(
464
+ model_path, torch_dtype="auto",
465
+ device_map=device,
466
+ attn_implementation='flash_attention_2'
467
+ )
468
+ self.processor = AutoProcessor.from_pretrained(model_path, use_fast=False)
469
+ self.streaming_eos_token_id = self.processor.tokenizer(' ...').input_ids[-1]
470
+ self.model.prepare_inputs_for_generation = functools.partial(prepare_multiturn_multimodal_inputs_for_generation, self.model)
471
+ message = {
472
+ "role": "user",
473
+ "content": [
474
+ {"type": "text", "text": 'livecc'},
475
+ ]
476
+ }
477
+ texts = self.processor.apply_chat_template([message], tokenize=False)
478
+ self.system_prompt_offset = texts.index('<|im_start|>user')
479
+
480
+ def video_qa(
481
+ self,
482
+ message: str,
483
+ state: dict,
484
+ do_sample: bool = True,
485
+ repetition_penalty: float = 1.05,
486
+ **kwargs,
487
+ ):
488
+ """
489
+ state: dict, (maybe) with keys:
490
+ video_path: str, video path
491
+ video_timestamp: float, current video timestamp
492
+ last_timestamp: float, last processed video timestamp
493
+ last_video_pts_index: int, last processed video frame index
494
+ video_pts: np.ndarray, video pts
495
+ last_history: list, last processed history
496
+ past_key_values: llm past_key_values
497
+ past_ids: past generated ids
498
+ """
499
+ video_path = state.get('video_path', None)
500
+ conversation = []
501
+ past_ids = state.get('past_ids', None)
502
+ content = [{"type": "text", "text": message}]
503
+ if past_ids is None and video_path: # only use once
504
+ content.insert(0, {"type": "video", "video": video_path})
505
+ conversation.append({"role": "user", "content": content})
506
+ image_inputs, video_inputs = process_vision_info(conversation)
507
+ texts = self.processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True, return_tensors='pt')
508
+ if past_ids is not None:
509
+ texts = '<|im_end|>\n' + texts[self.system_prompt_offset:]
510
+ inputs = self.processor(
511
+ text=texts,
512
+ images=image_inputs,
513
+ videos=video_inputs,
514
+ return_tensors="pt",
515
+ return_attention_mask=False
516
+ )
517
+ inputs.to(self.model.device)
518
+ if past_ids is not None:
519
+ inputs['input_ids'] = torch.cat([past_ids, inputs.input_ids], dim=1)
520
+ outputs = self.model.generate(
521
+ **inputs, past_key_values=state.get('past_key_values', None),
522
+ return_dict_in_generate=True, do_sample=do_sample,
523
+ repetition_penalty=repetition_penalty,
524
+ max_new_tokens=512,
525
+ )
526
+ state['past_key_values'] = outputs.past_key_values
527
+ state['past_ids'] = outputs.sequences[:, :-1]
528
+ response = self.processor.decode(outputs.sequences[0, inputs.input_ids.size(1):], skip_special_tokens=True)
529
+ return response, state
530
+
531
+ model_path = 'chenjoya/LiveCC-7B-Instruct'
532
+ # download a test video at: https://github.com/showlab/livecc/blob/main/demo/sources/howto_fix_laptop_mute_1080p.mp4
533
+ video_path = "demo/sources/howto_fix_laptop_mute_1080p.mp4"
534
+
535
+ infer = LiveCCDemoInfer(model_path=model_path)
536
+ state = {'video_path': video_path}
537
+ # first round
538
+ query1 = 'What is the video?'
539
+ response1, state = infer.video_qa(message=query1, state=state)
540
+ print(f'Q1: {query1}\nA1: {response1}')
541
+ # second round
542
+ query2 = 'How do you know that?'
543
+ response2, state = infer.video_qa(message=query2, state=state)
544
+ print(f'Q2: {query2}\nA2: {response2}')
545
+ ```
546
+
547
+ ## Performance
548
+
549
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/cqoiqYjOePj1vANakNCTL.png)
550
+
551
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/W2f-UExEbDuUCGsH8omMe.png)
552
+
553
+ ## Limitations
554
+
555
+ - This model is fine-tuned from LiveCC-7B-Base, which itself starts from Qwen2-VL-7B-Base, so it may share the limitations described at https://huggingface.co/Qwen/Qwen2-VL-7B.
556
+ - When performing real-time video commentary, the output may collapse into repeated patterns. If you encounter this, try adjusting `repetition_penalty`, `streaming_eos_base_threshold`, and `streaming_eos_threshold_step` (see the example below).
557
+ - This model has a context window of only 32768 tokens. Using more visual tokens per frame (e.g., 768 * 28 * 28) yields better performance but shortens the maximum working duration.
558
+
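If you do hit the repetition collapse mentioned above, these are the knobs to experiment with. The sketch reuses `infer`, `query`, and `state` from the real-time commentary snippet in the Quickstart; the specific values are illustrative starting points, not tuned recommendations:

```python
# Reuses infer, query, and state from the Quickstart real-time commentary example.
state['video_timestamp'] = 10  # illustrative timestamp (seconds) to process up to
for (start_t, stop_t), response, state in infer.live_cc(
    query=query,
    state=state,
    repetition_penalty=1.15,            # raise gradually if the commentary starts looping
    streaming_eos_base_threshold=0.6,   # illustrative value
    streaming_eos_threshold_step=0.05,  # illustrative value
):
    print(f"{start_t:.1f}s-{stop_t:.1f}s: {response}")
```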
559
+ These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
560
+
561
+ ## Citation
562
+
563
+ If you find our work helpful, feel free to cite us.
564
+
565
+ ```
566
+ @article{livecc,
567
+ author = {Joya Chen and Ziyun Zeng and Yiqi Lin and Wei Li and Zejun Ma and Mike Zheng Shou},
568
+ title = {LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale},
569
+ journal = {arXiv preprint arXiv:2504.16030},
570
+ year = {2025},
571
+ }
572
+ ```