Mungert committed · Commit 9bd0c64 · verified · 1 Parent(s): 496d80f

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +572 -0

README.md ADDED
@@ -0,0 +1,572 @@
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - chenjoya/Live-CC-5M
5
+ - chenjoya/Live-WhisperX-526K
6
+ - lmms-lab/LLaVA-Video-178K
7
+ language:
8
+ - en
9
+ base_model:
10
+ - Qwen/Qwen2-VL-7B
11
+ tags:
12
+ - qwen_vl
13
+ - video
14
+ - real-time
15
+ - multimodal
16
+ - LLM
17
+ ---
18
+
19
+ # <span style="color: #7FFF7F;">LiveCC-7B-Instruct GGUF Models</span>
20
+
21
+ ## <span style="color: #7FFF7F;">Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)</span>
22
+
23
+ Our latest quantization method introduces **precision-adaptive quantization** for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on **Llama-3-8B**. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
24
+
25
+ ### **Benchmark Context**
26
+ All tests conducted on **Llama-3-8B-Instruct** using:
27
+ - Standard perplexity evaluation pipeline
28
+ - 2048-token context window
29
+ - Same prompt set across all quantizations
30
+
31
+ ### **Method**
32
+ - **Dynamic Precision Allocation**:
33
+ - First/Last 25% of layers → IQ4_XS (selected layers)
34
+ - Middle 50% → IQ2_XXS/IQ3_S (increases efficiency; a layer-assignment sketch follows below)
35
+ - **Critical Component Protection**:
36
+ - Embeddings/output layers use Q5_K
37
+ - Reduces error propagation by 38% vs standard 1-2 bit quantization
38
+
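To make the layer-wise assignment above concrete, here is a minimal sketch of how such a rule could be expressed in Python. It is illustrative only: the function name and exact thresholds are assumptions, not the published recipe used to produce these GGUF files.

```python
def choose_quant_type(layer_idx: int, n_layers: int) -> str:
    """Illustrative precision-adaptive rule: protect the first/last 25% of
    transformer blocks with a higher-precision type, compress the middle 50% harder."""
    quarter = n_layers // 4
    if layer_idx < quarter or layer_idx >= n_layers - quarter:
        return "IQ4_XS"   # boundary layers keep more precision
    return "IQ2_XXS"      # middle layers take the aggressive quantization

# Example for a 32-block model (embeddings/output would separately stay at Q5_K).
print([choose_quant_type(i, 32) for i in range(32)])
```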
39
+ ### **Quantization Performance Comparison (Llama-3-8B)**
40
+
41
+ | Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
42
+ |--------------|--------------|------------------|---------|----------|---------|--------|-----------|----------|
43
+ | IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
44
+ | IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
45
+ | IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
46
+ | IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
47
+ | IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
48
+
49
+ **Key**:
50
+ - PPL = Perplexity (lower is better)
51
+ - Δ PPL = percentage change from standard to DynamicGate (see the formula below)
52
+ - Speed = inference time (CPU AVX2, 2048-token context)
53
+ - Size differences reflect mixed quantization overhead
54
+
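For clarity, the Δ PPL column is the relative change of the DynamicGate perplexity versus the standard quantization:

$$\Delta\,\mathrm{PPL} = \frac{\mathrm{PPL}_{\mathrm{DynamicGate}} - \mathrm{PPL}_{\mathrm{Standard}}}{\mathrm{PPL}_{\mathrm{Standard}}} \times 100\%$$

For example, IQ1_M gives (15.41 − 27.46) / 27.46 ≈ −43.9%, matching the table.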
55
+ **Key Improvements:**
56
+ - 🔥 **IQ1_M** shows massive 43.9% perplexity reduction (27.46 → 15.41)
57
+ - 🚀 **IQ2_S** cuts perplexity by 36.9% while adding only 0.2GB
58
+ - ⚡ **IQ1_S** maintains 39.7% better accuracy despite 1-bit quantization
59
+
60
+ **Tradeoffs:**
61
+ - All variants have modest size increases (0.1-0.3GB)
62
+ - Inference speeds remain comparable (<5% difference)
63
+
64
+
65
+ ### **When to Use These Models**
66
+ ✔ **Fitting models into GPU VRAM** (see the size estimate below)
67
+
68
+ ✔ **Memory-constrained deployments**
69
+
70
+ ✔ **CPU and edge devices** where occasional 1-2 bit errors can be tolerated
71
+
72
+ ✔ **Research** into ultra-low-bit quantization
73
+
74
+
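As a quick way to judge whether a given quant will fit your VRAM, you can estimate file size from bits per weight. This is only a back-of-the-envelope sketch (the ~10% overhead for scales and metadata is an assumption), so always check the actual file sizes listed in the repository.

```python
def approx_gguf_size_gb(n_params_billion: float, bits_per_weight: float,
                        overhead: float = 1.10) -> float:
    """Rough GGUF size estimate: params * bits / 8, plus ~10% assumed overhead."""
    return n_params_billion * bits_per_weight / 8 * overhead

# A ~7.6B-parameter model at ~4.5 bits/weight lands around 4-5 GB on disk/VRAM.
print(f"{approx_gguf_size_gb(7.6, 4.5):.1f} GB")
```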
75
+ ## **Choosing the Right Model Format**
76
+
77
+ Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.
78
+
79
+ ### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
80
+ - A 16-bit floating-point format designed for **faster computation** while retaining good precision.
81
+ - Provides **similar dynamic range** as FP32 but with **lower memory usage**.
82
+ - Recommended if your hardware supports **BF16 acceleration** (check your device's specs).
83
+ - Ideal for **high-performance inference** with **reduced memory footprint** compared to FP32.
84
+
85
+ 📌 **Use BF16 if:**
86
+ ✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
87
+ ✔ You want **higher precision** while saving memory.
88
+ ✔ You plan to **requantize** the model into another format.
89
+
90
+ 📌 **Avoid BF16 if:**
91
+ ❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
92
+ ❌ You need compatibility with older devices that lack BF16 optimization.
93
+
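If you are unsure whether your GPU exposes native BF16, a quick check with PyTorch (assuming a CUDA build) looks like this:

```python
import torch

# True on Ampere-class GPUs (A100, RTX 30xx) and newer; False means BF16 will be
# emulated or unavailable, so prefer F16 or a quantized format instead.
if torch.cuda.is_available():
    print("BF16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA device visible; consider a quantized CPU format such as Q4_K.")
```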
94
+ ---
95
+
96
+ ### **F16 (Float 16) – More widely supported than BF16**
97
+ - A 16-bit floating-point format offering **high precision** but a narrower range of values than BF16.
98
+ - Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
99
+ - Slightly narrower dynamic range than BF16, but generally sufficient for inference.
100
+
101
+ 📌 **Use F16 if:**
102
+ ✔ Your hardware supports **FP16** but **not BF16**.
103
+ ✔ You need a **balance between speed, memory usage, and accuracy**.
104
+ ✔ You are running on a **GPU** or another device optimized for FP16 computations.
105
+
106
+ 📌 **Avoid F16 if:**
107
+ ❌ Your device lacks **native FP16 support** (it may run slower than expected).
108
+ ❌ You have memory limitations.
109
+
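The same decision applies when loading the original (non-GGUF) checkpoint with `transformers`: use BF16 where supported and fall back to FP16 otherwise. A minimal sketch, assuming a single CUDA GPU:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Prefer BF16 on hardware that supports it natively, otherwise fall back to FP16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "chenjoya/LiveCC-7B-Instruct",  # full-precision checkpoint, not one of the GGUF files
    torch_dtype=dtype,
    device_map="cuda",
)
print(model.dtype)
```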
110
+ ---
111
+
112
+ ### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
113
+ Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
114
+ - **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.
115
+ - **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, require more memory.
116
+
117
+ 📌 **Use Quantized Models if:**
118
+ ✔ You are running inference on a **CPU** and need an optimized model.
119
+ ✔ Your device has **low VRAM** and cannot load full-precision models.
120
+ ✔ You want to reduce **memory footprint** while keeping reasonable accuracy.
121
+
122
+ 📌 **Avoid Quantized Models if:**
123
+ ❌ You need **maximum accuracy** (full-precision models are better for this).
124
+ ❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
125
+
126
+ ---
127
+
128
+ ### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
129
+ These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.
130
+
131
+ - **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
132
+ - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
133
+ - **Trade-off**: Lower accuracy compared to higher-bit quantizations.
134
+
135
+ - **IQ3_S**: Small block size for **maximum memory efficiency**.
136
+ - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.
137
+
138
+ - **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
139
+ - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.
140
+
141
+ - **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
142
+ - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.
143
+
144
+ - **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
145
+ - **Use case**: Best for **ARM-based devices** or **low-memory environments**.
146
+
147
+ ---
148
+
149
+ ### **Summary Table: Model Format Selection**
150
+
151
+ | Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
152
+ |--------------|------------|---------------|----------------------|---------------|
153
+ | **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
154
+ | **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
155
+ | **Q4_K** | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
156
+ | **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
157
+ | **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
158
+ | **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency and low accuracy |
159
+ | **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize it for ARM devices |
160
+
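To actually run one of the quantized files from the table above, a minimal text-only sketch with `llama-cpp-python` is shown below. This exercises only the language model; feeding video frames through llama.cpp needs the separate vision projector and a recent build, which is not covered here, and the local filename is an assumption.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="LiveCC-7B-Instruct-q4_k.gguf",  # adjust to the variant you downloaded
    n_ctx=4096,      # well below the model's 32768 maximum, to keep memory modest
    n_threads=8,     # tune to your CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In one sentence, what is real-time video commentary?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```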
161
+ ---
162
+
163
+ ## **Included Files & Details**
164
+
165
+ ### `LiveCC-7B-Instruct-bf16.gguf`
166
+ - Model weights preserved in **BF16**.
167
+ - Use this if you want to **requantize** the model into a different format.
168
+ - Best if your device supports **BF16 acceleration**.
169
+
170
+ ### `LiveCC-7B-Instruct-f16.gguf`
171
+ - Model weights stored in **F16**.
172
+ - Use if your device supports **FP16**, especially if BF16 is not available.
173
+
174
+ ### `LiveCC-7B-Instruct-bf16-q8_0.gguf`
175
+ - **Output & embeddings** remain in **BF16**.
176
+ - All other layers quantized to **Q8_0**.
177
+ - Use if your device supports **BF16** and you want a quantized version.
178
+
179
+ ### `LiveCC-7B-Instruct-f16-q8_0.gguf`
180
+ - **Output & embeddings** remain in **F16**.
181
+ - All other layers quantized to **Q8_0**.
182
+
183
+ ### `LiveCC-7B-Instruct-q4_k.gguf`
184
+ - **Output & embeddings** quantized to **Q8_0**.
185
+ - All other layers quantized to **Q4_K**.
186
+ - Good for **CPU inference** with limited memory.
187
+
188
+ ### `LiveCC-7B-Instruct-q4_k_s.gguf`
189
+ - Smallest **Q4_K** variant, using less memory at the cost of accuracy.
190
+ - Best for **very low-memory setups**.
191
+
192
+ ### `LiveCC-7B-Instruct-q6_k.gguf`
193
+ - **Output & embeddings** quantized to **Q8_0**.
194
+ - All other layers quantized to **Q6_K**.
195
+
196
+ ### `LiveCC-7B-Instruct-q8_0.gguf`
197
+ - Fully **Q8** quantized model for better accuracy.
198
+ - Requires **more memory** but offers higher precision.
199
+
200
+ ### `LiveCC-7B-Instruct-iq3_xs.gguf`
201
+ - **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
202
+ - Best for **ultra-low-memory devices**.
203
+
204
+ ### `LiveCC-7B-Instruct-iq3_m.gguf`
205
+ - **IQ3_M** quantization, offering a **medium block size** for better accuracy.
206
+ - Suitable for **low-memory devices**.
207
+
208
+ ### `LiveCC-7B-Instruct-q4_0.gguf`
209
+ - Pure **Q4_0** quantization, optimized for **ARM devices**.
210
+ - Best for **low-memory environments**.
211
+ - Prefer IQ4_NL for better accuracy.
212
+
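To fetch a single variant from the list above rather than cloning the whole repository, `huggingface_hub` works well. A small sketch; the `repo_id` below is an assumption for the repository hosting this card, so adjust it (and the filename) as needed:

```python
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="Mungert/LiveCC-7B-Instruct-GGUF",   # assumed repo id, change if different
    filename="LiveCC-7B-Instruct-q4_k.gguf",     # pick any variant listed above
)
print(local_path)
```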
213
+ # <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>
214
+ ❤ **Please click "Like" if you find this useful!**
215
+ Help me test my **AI-Powered Network Monitor Assistant** with **quantum-ready security checks**:
216
+ 👉 [Free Network Monitor](https://freenetworkmonitor.click/dashboard)
217
+
218
+ 💬 **How to test**:
219
+ 1. Click the **chat icon** (bottom right on any page)
220
+ 2. Choose an **AI assistant type**:
221
+ - `TurboLLM` (GPT-4-mini)
222
+ - `HugLLM` (Open-source)
223
+ - `TestLLM` (Experimental CPU-only)
224
+
225
+ ### **What I’m Testing**
226
+ I’m pushing the limits of **small open-source models for AI network monitoring**, specifically:
227
+ - **Function calling** against live network services
228
+ - **How small can a model go** while still handling:
229
+ - Automated **Nmap scans**
230
+ - **Quantum-readiness checks**
231
+ - **Metasploit integration**
232
+
233
+ 🟡 **TestLLM** – Current experimental model (llama.cpp on 6 CPU threads):
234
+ - ✅ **Zero-configuration setup**
235
+ - ⏳ 30s load time (slow inference but **no API costs**)
236
+ - 🔧 **Help wanted!** If you’re into **edge-device AI**, let’s collaborate!
237
+
238
+ ### **Other Assistants**
239
+ 🟢 **TurboLLM** – Uses **gpt-4-mini** for:
240
+ - **Real-time network diagnostics**
241
+ - **Automated penetration testing** (Nmap/Metasploit)
242
+ - 🔑 Get more tokens by [downloading our Free Network Monitor Agent](https://freenetworkmonitor.click/download)
243
+
244
+ 🔵 **HugLLM** – Open-source models (≈8B params):
245
+ - **2x more tokens** than TurboLLM
246
+ - **AI-powered log analysis**
247
+ - 🌐 Runs on Hugging Face Inference API
248
+
249
+ ### 💡 **Example AI Commands to Test**:
250
+ 1. `"Give me info on my website's SSL certificate"`
251
+ 2. `"Check if my server is using quantum-safe encryption for communication"`
252
+ 3. `"Run a quick Nmap vulnerability test"`
253
+
254
+
255
+ # LiveCC-7B-Instruct
256
+
257
+ ## Introduction
258
+
259
+ We introduce LiveCC, the first video LLM capable of real-time commentary. It is trained with a novel video-ASR streaming method and achieves SOTA results on both streaming and offline benchmarks.
260
+
261
+ - Project Page: https://showlab.github.io/livecc
262
+
263
+ > [!Important]
264
+ > This is the SFT model. The base model is at [LiveCC-7B-Base](https://huggingface.co/chenjoya/LiveCC-7B-Base).
265
+
266
+ ## Training with Streaming Frame-Words Paradigm
267
+
268
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/T-Zs50VlFT2tE7RdV49TE.png)
269
+
270
+ ## Quickstart
271
+
272
+ ### Gradio Demo
273
+
274
+ Please refer to https://github.com/showlab/livecc:
275
+
276
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/HUvadZRIhrT5vd332XBO3.png)
277
+
278
+ ### Hands-on
279
+
280
+ Similar to qwen-vl-utils, we offer a toolkit to help you handle various types of visual input more conveniently, **especially video streaming inputs**. You can install it using the following command:
281
+
282
+ ```bash
283
+ pip install qwen-vl-utils livecc-utils liger_kernel
284
+ ```
285
+
286
+ The following snippet shows how to do **real-time video commentary** with `transformers` and the utilities above:
287
+
288
+ ```python
289
+ import functools, torch, os, tqdm
290
+ from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl
291
+ apply_liger_kernel_to_qwen2_vl() # important: the model was trained with Liger kernels applied, so keep this for consistency
292
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, LogitsProcessor, logging
293
+ from livecc_utils import prepare_multiturn_multimodal_inputs_for_generation, get_smart_resized_clip, get_smart_resized_video_reader
294
+ from qwen_vl_utils import process_vision_info
295
+
296
+ class LiveCCDemoInfer:
297
+ fps = 2
298
+ initial_fps_frames = 6
299
+ streaming_fps_frames = 2
300
+ initial_time_interval = initial_fps_frames / fps
301
+ streaming_time_interval = streaming_fps_frames / fps
302
+ frame_time_interval = 1 / fps
303
+ def __init__(self, model_path: str = None, device_id: int = 0):
304
+ self.model = Qwen2VLForConditionalGeneration.from_pretrained(
305
+ model_path, torch_dtype="auto",
306
+ device_map=f'cuda:{device_id}',
307
+ attn_implementation='flash_attention_2'
308
+ )
309
+ self.processor = AutoProcessor.from_pretrained(model_path, use_fast=False)
310
+ self.model.prepare_inputs_for_generation = functools.partial(prepare_multiturn_multimodal_inputs_for_generation, self.model)
311
+ message = {
312
+ "role": "user",
313
+ "content": [
314
+ {"type": "text", "text": 'livecc'},
315
+ ]
316
+ }
317
+ texts = self.processor.apply_chat_template([message], tokenize=False)
318
+ self.system_prompt_offset = texts.index('<|im_start|>user')
319
+ self._cached_video_readers_with_hw = {}
320
+
321
+
322
+ def live_cc(
323
+ self,
324
+ query: str,
325
+ state: dict,
326
+ max_pixels: int = 384 * 28 * 28,
327
+ default_query: str = 'Please describe the video.',
328
+ do_sample: bool = True,
329
+ repetition_penalty: float = 1.05,
330
+ **kwargs,
331
+ ):
332
+ """
333
+ state: dict, (maybe) with keys:
334
+ video_path: str, video path
335
+ video_timestamp: float, current video timestamp
336
+ last_timestamp: float, last processed video timestamp
337
+ last_video_pts_index: int, last processed video frame index
338
+ video_pts: np.ndarray, video pts
339
+ last_history: list, last processed history
340
+ past_key_values: llm past_key_values
341
+ past_ids: past generated ids
342
+ """
343
+ # 1. preparation: video_reader, and last processing info
344
+ video_timestamp, last_timestamp = state.get('video_timestamp', 0), state.get('last_timestamp', -1 / self.fps)
345
+ video_path = state['video_path']
346
+ if video_path not in self._cached_video_readers_with_hw:
347
+ self._cached_video_readers_with_hw[video_path] = get_smart_resized_video_reader(video_path, max_pixels)
348
+ video_reader = self._cached_video_readers_with_hw[video_path][0]
349
+ video_reader.get_frame_timestamp(0)
350
+ state['video_pts'] = torch.from_numpy(video_reader._frame_pts[:, 1])
351
+ state['last_video_pts_index'] = -1
352
+ video_pts = state['video_pts']
353
+ if last_timestamp + self.frame_time_interval > video_pts[-1]:
354
+ state['video_end'] = True
355
+ return
356
+ video_reader, resized_height, resized_width = self._cached_video_readers_with_hw[video_path]
357
+ last_video_pts_index = state['last_video_pts_index']
358
+
359
+ # 2. which frames will be processed
360
+ initialized = last_timestamp >= 0
361
+ if not initialized:
362
+ video_timestamp = max(video_timestamp, self.initial_time_interval)
363
+ if video_timestamp <= last_timestamp + self.frame_time_interval:
364
+ return
365
+ timestamps = torch.arange(last_timestamp + self.frame_time_interval, video_timestamp, self.frame_time_interval) # add compensation
366
+
367
+ # 3. fetch frames in required timestamps
368
+ clip, clip_timestamps, clip_idxs = get_smart_resized_clip(video_reader, resized_height, resized_width, timestamps, video_pts, video_pts_index_from=last_video_pts_index+1)
369
+ state['last_video_pts_index'] = clip_idxs[-1]
370
+ state['last_timestamp'] = clip_timestamps[-1]
371
+
372
+ # 4. organize to interleave frames
373
+ interleave_clips, interleave_timestamps = [], []
374
+ if not initialized:
375
+ interleave_clips.append(clip[:self.initial_fps_frames])
376
+ interleave_timestamps.append(clip_timestamps[:self.initial_fps_frames])
377
+ clip = clip[self.initial_fps_frames:]
378
+ clip_timestamps = clip_timestamps[self.initial_fps_frames:]
379
+ if len(clip) > 0:
380
+ interleave_clips.extend(list(clip.split(self.streaming_fps_frames)))
381
+ interleave_timestamps.extend(list(clip_timestamps.split(self.streaming_fps_frames)))
382
+
383
+ # 5. make conversation and send to model
384
+ for clip, timestamps in zip(interleave_clips, interleave_timestamps):
385
+ start_timestamp, stop_timestamp = timestamps[0].item(), timestamps[-1].item() + self.frame_time_interval
386
+ message = {
387
+ "role": "user",
388
+ "content": [
389
+ {"type": "text", "text": f'Time={start_timestamp:.1f}-{stop_timestamp:.1f}s'},
390
+ {"type": "video", "video": clip}
391
+ ]
392
+ }
393
+ if not query and not state.get('query', None):
394
+ query = default_query
395
+ print(f'No query provided, use default_query={default_query}')
396
+ if query and state.get('query', None) != query:
397
+ message['content'].append({"type": "text", "text": query})
398
+ state['query'] = query
399
+ texts = self.processor.apply_chat_template([message], tokenize=False, add_generation_prompt=True, return_tensors='pt')
400
+ past_ids = state.get('past_ids', None)
401
+ if past_ids is not None:
402
+ texts = '<|im_end|>\n' + texts[self.system_prompt_offset:]
403
+ inputs = self.processor(
404
+ text=texts,
405
+ images=None,
406
+ videos=[clip],
407
+ return_tensors="pt",
408
+ return_attention_mask=False
409
+ )
410
+ inputs.to('cuda')
411
+ if past_ids is not None:
412
+ inputs['input_ids'] = torch.cat([past_ids, inputs.input_ids], dim=1)
413
+ outputs = self.model.generate(
414
+ **inputs, past_key_values=state.get('past_key_values', None),
415
+ return_dict_in_generate=True, do_sample=do_sample,
416
+ repetition_penalty=repetition_penalty,
417
+ )
418
+ state['past_key_values'] = outputs.past_key_values
419
+ state['past_ids'] = outputs.sequences[:, :-1]
420
+ yield (start_timestamp, stop_timestamp), self.processor.decode(outputs.sequences[0, inputs.input_ids.size(1):], skip_special_tokens=True), state
421
+
422
+ model_path = 'chenjoya/LiveCC-7B-Instruct'
423
+ # download a test video at: https://github.com/showlab/livecc/blob/main/demo/sources/howto_fix_laptop_mute_1080p.mp4
424
+ video_path = "demo/sources/howto_fix_laptop_mute_1080p.mp4"
425
+ query = "Please describe the video."
426
+
427
+ infer = LiveCCDemoInfer(model_path=model_path)
428
+ state = {'video_path': video_path}
429
+ commentaries = []
430
+ t = 0
431
+ for t in range(31):
432
+ state['video_timestamp'] = t
433
+ for (start_t, stop_t), response, state in infer.live_cc(
434
+ query=query, state=state,
435
+ max_pixels = 384 * 28 * 28, repetition_penalty=1.05,
436
+ streaming_eos_base_threshold=0.0, streaming_eos_threshold_step=0
437
+ ):
438
+ print(f'{start_t}s-{stop_t}s: {response}')
439
+ commentaries.append([start_t, stop_t, response])
440
+ if state.get('video_end', False):
441
+ break
442
+ t += 1
443
+ ```
444
+
445
+ The following snippet shows how to do **common (multi-turn) video QA** with `transformers` and the utilities above:
446
+ ```python
447
+ import functools, torch
448
+ from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl
449
+ apply_liger_kernel_to_qwen2_vl() # important: the model was trained with Liger kernels applied, so keep this for consistency
450
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, LogitsProcessor, logging
451
+ from livecc_utils import prepare_multiturn_multimodal_inputs_for_generation, get_smart_resized_clip, get_smart_resized_video_reader
452
+ from qwen_vl_utils import process_vision_info
453
+
454
+ class LiveCCDemoInfer:
455
+ fps = 2
456
+ initial_fps_frames = 6
457
+ streaming_fps_frames = 2
458
+ initial_time_interval = initial_fps_frames / fps
459
+ streaming_time_interval = streaming_fps_frames / fps
460
+ frame_time_interval = 1 / fps
461
+
462
+ def __init__(self, model_path: str = None, device: str = 'cuda'):
463
+ self.model = Qwen2VLForConditionalGeneration.from_pretrained(
464
+ model_path, torch_dtype="auto",
465
+ device_map=device,
466
+ attn_implementation='flash_attention_2'
467
+ )
468
+ self.processor = AutoProcessor.from_pretrained(model_path, use_fast=False)
469
+ self.streaming_eos_token_id = self.processor.tokenizer(' ...').input_ids[-1]
470
+ self.model.prepare_inputs_for_generation = functools.partial(prepare_multiturn_multimodal_inputs_for_generation, self.model)
471
+ message = {
472
+ "role": "user",
473
+ "content": [
474
+ {"type": "text", "text": 'livecc'},
475
+ ]
476
+ }
477
+ texts = self.processor.apply_chat_template([message], tokenize=False)
478
+ self.system_prompt_offset = texts.index('<|im_start|>user')
479
+
480
+ def video_qa(
481
+ self,
482
+ message: str,
483
+ state: dict,
484
+ do_sample: bool = True,
485
+ repetition_penalty: float = 1.05,
486
+ **kwargs,
487
+ ):
488
+ """
489
+ state: dict, (maybe) with keys:
490
+ video_path: str, video path
491
+ video_timestamp: float, current video timestamp
492
+ last_timestamp: float, last processed video timestamp
493
+ last_video_pts_index: int, last processed video frame index
494
+ video_pts: np.ndarray, video pts
495
+ last_history: list, last processed history
496
+ past_key_values: llm past_key_values
497
+ past_ids: past generated ids
498
+ """
499
+ video_path = state.get('video_path', None)
500
+ conversation = []
501
+ past_ids = state.get('past_ids', None)
502
+ content = [{"type": "text", "text": message}]
503
+ if past_ids is None and video_path: # only use once
504
+ content.insert(0, {"type": "video", "video": video_path})
505
+ conversation.append({"role": "user", "content": content})
506
+ image_inputs, video_inputs = process_vision_info(conversation)
507
+ texts = self.processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True, return_tensors='pt')
508
+ if past_ids is not None:
509
+ texts = '<|im_end|>\n' + texts[self.system_prompt_offset:]
510
+ inputs = self.processor(
511
+ text=texts,
512
+ images=image_inputs,
513
+ videos=video_inputs,
514
+ return_tensors="pt",
515
+ return_attention_mask=False
516
+ )
517
+ inputs.to(self.model.device)
518
+ if past_ids is not None:
519
+ inputs['input_ids'] = torch.cat([past_ids, inputs.input_ids], dim=1)
520
+ outputs = self.model.generate(
521
+ **inputs, past_key_values=state.get('past_key_values', None),
522
+ return_dict_in_generate=True, do_sample=do_sample,
523
+ repetition_penalty=repetition_penalty,
524
+ max_new_tokens=512,
525
+ )
526
+ state['past_key_values'] = outputs.past_key_values
527
+ state['past_ids'] = outputs.sequences[:, :-1]
528
+ response = self.processor.decode(outputs.sequences[0, inputs.input_ids.size(1):], skip_special_tokens=True)
529
+ return response, state
530
+
531
+ model_path = 'chenjoya/LiveCC-7B-Instruct'
532
+ # download a test video at: https://github.com/showlab/livecc/blob/main/demo/sources/howto_fix_laptop_mute_1080p.mp4
533
+ video_path = "demo/sources/howto_fix_laptop_mute_1080p.mp4"
534
+
535
+ infer = LiveCCDemoInfer(model_path=model_path)
536
+ state = {'video_path': video_path}
537
+ # first round
538
+ query1 = 'What is the video?'
539
+ response1, state = infer.video_qa(message=query1, state=state)
540
+ print(f'Q1: {query1}\nA1: {response1}')
541
+ # second round
542
+ query2 = 'How do you know that?'
543
+ response2, state = infer.video_qa(message=query2, state=state)
544
+ print(f'Q2: {query2}\nA2: {response2}')
545
+ ```
546
+
547
+ ## Performance
548
+
549
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/cqoiqYjOePj1vANakNCTL.png)
550
+
551
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/W2f-UExEbDuUCGsH8omMe.png)
552
+
553
+ ## Limitations
554
+
555
+ - This model is fine-tuned from LiveCC-7B-Base, which itself starts from Qwen2-VL-7B-Base, so it may share the limitations described at https://huggingface.co/Qwen/Qwen2-VL-7B.
556
+ - When performing real-time video commentary, the output may collapse into repeated patterns. If you encounter this, try adjusting `repetition_penalty`, `streaming_eos_base_threshold`, and `streaming_eos_threshold_step` (see the example below).
557
+ - This model has a context window of only 32768 tokens. Using more visual tokens per frame (e.g., 768 * 28 * 28) yields better performance but shortens the maximum working duration.
558
+
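If you do hit the repetition collapse mentioned above, these are the knobs to experiment with. The sketch reuses `infer`, `query`, and `state` from the real-time commentary snippet in the Quickstart; the specific values are illustrative starting points, not tuned recommendations:

```python
# Reuses infer, query, and state from the Quickstart real-time commentary example.
state['video_timestamp'] = 10  # illustrative timestamp (seconds) to process up to
for (start_t, stop_t), response, state in infer.live_cc(
    query=query,
    state=state,
    repetition_penalty=1.15,            # raise gradually if the commentary starts looping
    streaming_eos_base_threshold=0.6,   # illustrative value
    streaming_eos_threshold_step=0.05,  # illustrative value
):
    print(f"{start_t:.1f}s-{stop_t:.1f}s: {response}")
```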
559
+ These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
560
+
561
+ ## Citation
562
+
563
+ If you find our work helpful, feel free to cite us.
564
+
565
+ ```
566
+ @article{livecc,
567
+ author = {Joya Chen and Ziyun Zeng and Yiqi Lin and Wei Li and Zejun Ma and Mike Zheng Shou},
568
+ title = {LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale},
569
+ journal = {arXiv preprint arXiv:2504.16030},
570
+ year = {2025},
571
+ }
572
+ ```