Update README.md
README.md
CHANGED
@@ -26,10 +26,11 @@ Preliminary trials show that converting the entire model to pure Int4 (AWQ/GPTQ)

Variant Overview

-| Variant | Characteristics
-|
-| **
-| **
+| Variant     | Characteristics                                                                        | File Size | Recommended Scenario                                                                        |
+|-------------|----------------------------------------------------------------------------------------|-----------|---------------------------------------------------------------------------------------------|
+| **Lite**    | Only the most critical layers upgraded to Int8; size close to pure Int4                | 355 GB    | Resource-constrained, lightweight server deployments                                       |
+| **Compact** | More layers upgraded to Int8, for higher output quality than Lite                      | 414 GB    | VRAM-sufficient deployments focused on answer quality (e.g., 8 × A100)                     |
+| **Medium**  | Compact plus fully-Int8 attention layers; high quality with reduced long-context loss  | 445 GB    | VRAM-rich deployments needing both top answer quality and high concurrency (e.g., 8 × H20) |

Choose the variant that best matches your hardware and quality requirements.
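To relate the file sizes above to the hardware named in the "Recommended Scenario" column, here is a minimal back-of-the-envelope sketch. The per-GPU capacities (80 GB for A100, 96 GB for H20), the 0.9 usable-memory fraction, and the `fits` helper are illustrative assumptions, not anything defined by this repository.

```python
# Back-of-the-envelope check (illustrative only): do a variant's weight files fit in the
# aggregate VRAM of a node, leaving some headroom for KV cache and activations?
# The per-GPU capacities and the 0.9 usable fraction are assumptions, not repo facts.

VARIANT_SIZE_GB = {"Lite": 355, "Compact": 414, "Medium": 445}  # file sizes from the table

def fits(variant: str, num_gpus: int, gb_per_gpu: float, usable_fraction: float = 0.9) -> bool:
    """True if the variant's weight files fit within the usable aggregate GPU memory."""
    usable_gb = num_gpus * gb_per_gpu * usable_fraction
    return VARIANT_SIZE_GB[variant] <= usable_gb

if __name__ == "__main__":
    nodes = {"8 x A100 80GB": (8, 80), "8 x H20 96GB": (8, 96)}  # assumed capacities
    for variant in VARIANT_SIZE_GB:
        summary = ", ".join(f"{name}: {fits(variant, n, gb)}" for name, (n, gb) in nodes.items())
        print(f"{variant}: {summary}")
```

Weights are only a lower bound: KV cache and activations need additional headroom, and that headroom grows with context length and concurrency.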