🚀 Optimized Models: torchao & Pruna Quantization
Collection
Quantized models using torchao & Pruna for efficient inference and deployment.
6 items
This is a quantized version of GLM‑4.1V‑9B‑Thinking, a powerful 9B‑parameter vision‑language model built around a "thinking" paradigm with reinforced reasoning. Quantization significantly reduces memory usage and speeds up inference on consumer-grade GPUs while preserving the model's strong performance on multimodal reasoning tasks.
Method: torchao quantization
Weight precision: int8
Activation precision: int8 (dynamic)
Technique: symmetric mapping
Impact: significant reduction in model size with minimal loss in reasoning, coding, and general instruction-following capabilities.
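To make "symmetric mapping" concrete, here is a minimal pure-Python sketch of symmetric int8 quantization: each value is scaled by `max(|w|) / 127` and rounded, with a zero-point of 0. This is only an illustration of the idea, not the actual torchao kernel, which operates on tensors and applies the activation quantization dynamically at runtime.

```python
def quantize_symmetric_int8(weights):
    # Symmetric mapping: scale = max(|w|) / 127, zero-point fixed at 0.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    # Round to nearest integer and clamp to the signed int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate float values from the int8 codes.
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_symmetric_int8(weights)
# q = [50, -127, 3, 100]
approx = dequantize_int8(q, scale)
```

Because the mapping is symmetric around zero, no zero-point offset needs to be stored per tensor, which keeps the int8 kernels simple; the trade-off is that one quantization step may be wasted when the weight distribution is skewed.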
Perfect for: