---
inference: false
library_name: transformers
language:
  - en
  - fr
  - de
  - es
  - it
  - pt
  - ja
  - ko
  - zh
  - ar
  - el
  - fa
  - pl
  - id
  - cs
  - he
  - hi
  - nl
  - ro
  - ru
  - tr
  - uk
  - vi
license: cc-by-nc-4.0
extra_gated_prompt: "By submitting this form, you agree to the [License Agreement](https://cohere.com/c4ai-cc-by-nc-license) and acknowledge that the information you provide will be collected, used, and shared in accordance with Cohere’s [Privacy Policy](https://cohere.com/privacy). You’ll receive email updates about C4AI and Cohere research, events, products and services. You can unsubscribe at any time."
extra_gated_fields:
  Name: text
  Affiliation: text
  Country: country
  I agree to use this model for non-commercial use ONLY: checkbox
tags:
  - quantized
  - 4bit
  - 8bit
  - multi-gpu
  - nlp
  - conversational-ai
  - rag
  - tool-use
  - code-generation
  - enterprise
model_name: C4AI Command A - Quantized
base_model: CohereForAI/c4ai-command-a-03-2025
model_size: 111B
context_length: 256K
developers:
  - Cohere
  - Cohere For AI
contact: [email protected]
---

# C4AI Command A - Quantized Models

This repository contains quantized versions of the **C4AI Command A** model, an open weights research release by [Cohere](https://cohere.com/) and [Cohere For AI](https://cohere.for.ai/). The original model is a 111 billion parameter language model optimized for enterprise use cases, excelling in agentic, multilingual, and retrieval-augmented generation (RAG) tasks while being deployable on minimal hardware (e.g., two GPUs). Here, we provide multiple quantized variants to further reduce memory footprint and enhance deployment flexibility across various hardware setups, including multi-GPU environments.

For details on the original model, refer to the [official model card](#model-card-for-c4ai-command-a) below.

---

## Quantized Models

We have quantized the original `CohereForAI/c4ai-command-a-03-2025` model using the `bitsandbytes` library with various configurations to balance performance, memory efficiency, and accuracy. Below are the available quantized versions:

| Quantization Type         | Description                                                                 | Compute Dtype | Double Quantization | Notes                                      |
|---------------------------|-----------------------------------------------------------------------------|---------------|---------------------|--------------------------------------------|
| `4bit_nf4_double`         | 4-bit quantization with `nf4` (Normal Float 4)                              | `bfloat16`    | Yes                 | Lowest memory use among the `nf4` variants |
| `4bit_fp4`                | 4-bit quantization with `fp4` (4-bit floating point)                        | `bfloat16`    | No                  | Lightweight, typically less precise than `nf4` |
| `8bit_standard`           | Standard 8-bit quantization                                                 | `bfloat16`    | N/A                 | Balanced memory and accuracy               |
| `8bit_mixed`              | 8-bit quantization with mixed precision and CPU offloading capability       | `float16`     | N/A                 | Flexible for constrained environments      |
| `4bit_nf4_no_double`      | 4-bit quantization with `nf4`, no double quantization                       | `bfloat16`    | No                  | Slightly more memory than `4bit_nf4_double` |

These models are optimized for multi-GPU deployment using the `accelerate` library, ensuring efficient distribution across available GPUs. Each variant is hosted in its own sub-repository:

- `Tonic/c4ai-command-a-03-2025-4bit_nf4_double`
- `Tonic/c4ai-command-a-03-2025-4bit_fp4`
- `Tonic/c4ai-command-a-03-2025-8bit_standard`
- `Tonic/c4ai-command-a-03-2025-8bit_mixed`
- `Tonic/c4ai-command-a-03-2025-4bit_nf4_no_double`

---

## Usage

To use a quantized model, install the required dependencies and load the desired variant as shown below. Multi-GPU support is enabled via `accelerate`.

### Installation

```bash
pip install transformers bitsandbytes accelerate torch huggingface_hub
```

### Example: Loading and Generating Text

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Specify the quantized model ID
model_id = "Tonic/c4ai-command-a-03-2025-4bit_nf4_double"  # Replace with desired variant
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" relies on accelerate to spread the layers across the visible GPUs,
# so no explicit Accelerator object or prepare() call is needed for inference.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)

# Format message with chat template and move the inputs to the model's input device
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Generate text
gen_tokens = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
)
gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)
```

### Notes
- **Device Mapping**: `device_map="auto"` ensures the model is distributed across all available GPUs.
- **Compute Dtype**: Adjust `torch_dtype` (e.g., `torch.bfloat16` or `torch.float16`) based on your hardware and the quantization type.
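- **Memory**: Quantized models significantly reduce VRAM requirements compared to the original 111B parameter model. At 4 bits per weight the parameters alone still occupy roughly 56 GB, so plan for multiple GPUs or a single large-memory accelerator rather than one consumer-grade card.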
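
As a rough sanity check on the memory savings, the weight-only footprint can be estimated from the parameter count and the bits stored per weight. The sketch below assumes decimal gigabytes and ignores quantization constants, activations, and the KV cache, so real usage will be higher.

```python
PARAMS = 111e9  # parameter count stated in this model card


def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Weight-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_gb(bits):.1f} GB")
# prints:
#  bf16: ~222.0 GB
# 8-bit: ~111.0 GB
# 4-bit: ~55.5 GB
```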

---

## Quantization Details

The quantization process leverages `bitsandbytes` with the following configurations:
- **4-bit Variants**: Use `nf4` or `fp4` quantization types, with optional double quantization, which also quantizes the quantization constants to save additional memory.
- **8-bit Variants**: Offer standard or mixed precision options, with the latter supporting CPU offloading for additional flexibility.
- **Multi-GPU Optimization**: The `accelerate` library handles model sharding and distribution, allowing deployment on systems with multiple GPUs.

The exact quantization script is not linked from this card.
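
The variants listed in this card could map onto `BitsAndBytesConfig` arguments roughly as sketched below. This is an assumption inferred from the variant names and the table, not the script used to produce these checkpoints. In particular, `bitsandbytes` has no 8-bit compute-dtype option, so the 8-bit dtypes in the table would be passed as `torch_dtype` at load time, and the CPU-offload flag for `8bit_mixed` is a guess. The helper names `bnb_kwargs` and `build_config` are hypothetical.

```python
def bnb_kwargs(variant: str) -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig, keyed by variant name.

    The presets are assumptions inferred from the variant names in this card.
    """
    four_bit = {"load_in_4bit": True, "bnb_4bit_compute_dtype": "bfloat16"}
    presets = {
        "4bit_nf4_double": {**four_bit, "bnb_4bit_quant_type": "nf4",
                            "bnb_4bit_use_double_quant": True},
        "4bit_nf4_no_double": {**four_bit, "bnb_4bit_quant_type": "nf4",
                               "bnb_4bit_use_double_quant": False},
        "4bit_fp4": {**four_bit, "bnb_4bit_quant_type": "fp4",
                     "bnb_4bit_use_double_quant": False},
        "8bit_standard": {"load_in_8bit": True},
        # Assumed: "CPU offloading capability" means fp32 offload of non-quantized modules.
        "8bit_mixed": {"load_in_8bit": True, "llm_int8_enable_fp32_cpu_offload": True},
    }
    if variant not in presets:
        raise KeyError(f"unknown variant: {variant!r}")
    return dict(presets[variant])


def build_config(variant: str):
    """Build the config object; imported lazily so the presets work without transformers."""
    from transformers import BitsAndBytesConfig
    return BitsAndBytesConfig(**bnb_kwargs(variant))
```

Under these assumptions, quantizing the base model on load would look like `AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-a-03-2025", quantization_config=build_config("4bit_nf4_double"), device_map="auto")`.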

---

## Model Card for C4AI Command A

Below is the original model card for `C4AI Command A`, adapted for this repository.

---

### Model Summary

C4AI Command A is an open weights research release of a 111 billion parameter model optimized for demanding enterprises that require fast, secure, and high-quality AI. Compared to other leading proprietary and open-weights models, Command A delivers maximum performance with minimum hardware costs, excelling on business-critical agentic and multilingual tasks while being deployable on just two GPUs.

- **Developed by**: [Cohere](https://cohere.com/) and [Cohere For AI](https://cohere.for.ai/)
- **Point of Contact**: Cohere For AI: [cohere.for.ai](https://cohere.for.ai/)
- **License**: [CC-BY-NC](https://cohere.com/c4ai-cc-by-nc-license), requires adhering to [C4AI's Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy)
- **Model**: `c4ai-command-a-03-2025`
- **Model Size**: 111 billion parameters
- **Context Length**: 256K

**Try C4AI Command A**

You can try the original model before downloading weights in the hosted [Hugging Face Space](https://cohereforai-c4ai-command.hf.space/models/command-a-03-2025).

---

### Model Details

- **Input**: Text only
- **Output**: Text only
- **Model Architecture**: Auto-regressive language model with an optimized transformer architecture, featuring sliding window attention (window size 4096) with RoPE, and a global attention layer without positional embeddings.
- **Languages**: Supports 23 languages including English, French, Spanish, German, Japanese, Chinese, Arabic, and more (see full list in the original model card).
- **Context Length**: 256K
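
To illustrate the difference between the sliding-window and global attention layers, the sketch below builds both masks in plain Python. It is not the model's implementation: the convention that a token sees itself plus the previous `window - 1` tokens is an assumption, and a window of 3 stands in for 4096 so the pattern is visible.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


def causal_mask(seq_len: int) -> list[list[bool]]:
    """Global (full causal) attention: every earlier position is visible."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


for row in sliding_window_mask(6, 3):
    print("".join("x" if ok else "." for ok in row))
# prints:
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx
```

In the sliding-window layers each row has at most `window` visible positions, which keeps per-token attention cost bounded at long context, while the global layers use the full causal mask.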

---

### Chat Capabilities

Command A is configured as a conversational model by default with two safety modes: **contextual** (default, fewer constraints) and **strict** (avoids sensitive topics). See [Command A prompt format docs](https://docs.cohere.com/docs/command-a-hf) for details.

---

### RAG Capabilities

Command A excels in Retrieval Augmented Generation (RAG) tasks. Use the `apply_chat_template` method with document snippets for RAG functionality. Example:

```python
conversation = [{"role": "user", "content": "What has Man always dreamed of?"}]
documents = [
    {"heading": "The Moon", "body": "Man has always dreamed of destroying the moon..."},
    {"heading": "Love", "body": "Man's dream has always been to find love..."}
]
input_ids = tokenizer.apply_chat_template(conversation, documents=documents, tokenize=True, add_generation_prompt=True, return_tensors="pt")
```
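
A minimal sketch of carrying the RAG prompt through to a generated answer follows. It assumes `model` and `tokenizer` are loaded as in the usage example and that `apply_chat_template` returns a tensor of input IDs; the function name `answer_with_documents` and the default `max_new_tokens` are hypothetical.

```python
def answer_with_documents(model, tokenizer, question, documents, max_new_tokens=256):
    """Generate a grounded answer and return only the newly generated text."""
    conversation = [{"role": "user", "content": question}]
    input_ids = tokenizer.apply_chat_template(
        conversation,
        documents=documents,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    # Greedy decoding keeps the answer close to the retrieved snippets.
    gen_tokens = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the model's continuation is decoded.
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(gen_tokens[0][prompt_len:], skip_special_tokens=True)
```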

---

### Tool Use Capabilities

Command A supports conversational tool use with JSON schema-based tool descriptions. See the [tool use example](#tool-use-example-click-to-expand) in the original model card for implementation details.
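
As a rough illustration of a JSON schema-based tool description, the sketch below defines one tool and passes it through the `tools` argument of `apply_chat_template`. The tool name, its parameters, and the helper `build_tool_prompt` are hypothetical examples, not part of the model or of this repository.

```python
import json

# Hypothetical tool, described with a JSON-schema "parameters" object.
tools = [
    {
        "type": "function",
        "function": {
            "name": "query_daily_sales_report",
            "description": "Retrieve the sales report for a given day.",
            "parameters": {
                "type": "object",
                "properties": {
                    "day": {
                        "type": "string",
                        "description": "Day to query, formatted as YYYY-MM-DD.",
                    }
                },
                "required": ["day"],
            },
        },
    }
]

conversation = [{"role": "user", "content": "How did sales go on 2023-09-29?"}]


def build_tool_prompt(tokenizer) -> str:
    """Render the conversation plus tool descriptions into a prompt string."""
    return tokenizer.apply_chat_template(
        conversation, tools=tools, tokenize=False, add_generation_prompt=True
    )


# The tool description must be JSON-serializable to be rendered into the prompt.
print(json.dumps(tools[0]["function"]["parameters"], indent=2))
```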

---

### Code Capabilities

The model performs well on enterprise-relevant code tasks (e.g., SQL generation, code translation). Use low temperature or greedy decoding for optimal code generation.
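
The decoding advice above can be expressed as keyword arguments for `model.generate()`. This is a small sketch; the helper name `code_generation_kwargs` and the default token budget are hypothetical, and the temperature of 0.3 is taken from the usage example in this card.

```python
def code_generation_kwargs(greedy: bool = True, max_new_tokens: int = 256) -> dict:
    """Keyword arguments for model.generate() suited to code tasks."""
    if greedy:
        # do_sample=False picks the highest-probability token at every step.
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    # Sampling with a low temperature stays close to the greedy choice.
    return {"do_sample": True, "temperature": 0.3, "max_new_tokens": max_new_tokens}
```

For example, `model.generate(input_ids, **code_generation_kwargs())` runs greedy decoding, while `code_generation_kwargs(greedy=False)` switches to low-temperature sampling.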

---

## Terms of Use

This model is released under a [CC-BY-NC](https://cohere.com/c4ai-cc-by-nc-license) license for non-commercial use only, adhering to [C4AI's Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy). For commercial inquiries, contact [Cohere’s Sales team](https://cohere.com/contact-sales).

---

## Contact

For issues or questions, reach out to `[email protected]`.

---