Update README.md
README.md CHANGED
@@ -45,8 +45,6 @@ Once set up, you can proceed to run the model by running the snippet below:
 from mlx_lm import load, generate
 from transformers import AutoTokenizer
 
-model, tokenizer = load("HalleyAI/gpt-oss-20b-6bit-gs32")
-
 model, tokenizer = load("HalleyAI/gpt-oss-20b-6bit-gs32")
 print(generate(
     model, tokenizer,
@@ -57,7 +55,7 @@ print(generate(
 
 ## Performance (Apple Silicon, real-world)
 
-LM Studio and CLI (MLX,
+LM Studio and CLI (MLX, Q6 gs32): ~63–72 tok/s, TTFB ~0.3–0.4 s (2k-token responses)
 - tested on M1 Max 32 GB (short runs show lower t/s due to startup overhead)
 
 Throughput varies with Mac model, context, and sampler settings.
@@ -70,7 +68,7 @@ We report perplexity (PPL) on a small internal text corpus using the same tokeni
 </thead>
 <tbody>
 <tr><td>MLX Q8 (reference)</td><td>2.4986</td></tr>
-<tr><td>MLX
+<tr><td>MLX Q6 (gs=32)</td><td>2.4858 (-0.51% vs Q8)</td></tr>
 </tbody>
 </table>
 Note: This is a small, domain-specific eval for quick sanity; not a benchmark suite.
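For context on the snippet this commit cleans up, below is a self-contained version of the usage example, assuming the standard mlx_lm `load`/`generate` API. The prompt, `max_tokens`, and the chat-template step are illustrative assumptions, not part of the repository's README; note that the tokenizer returned by `load` already wraps the Hugging Face tokenizer, so the separate `AutoTokenizer` import in the snippet is usually not required.

```python
# Illustrative sketch (not taken from the commit): load the 6-bit gs=32 model and generate.
from mlx_lm import load, generate

model, tokenizer = load("HalleyAI/gpt-oss-20b-6bit-gs32")

# gpt-oss is a chat model, so applying the chat template is the usual route.
messages = [{"role": "user", "content": "Explain group-size 32 quantization in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```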
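The throughput figures in the Performance section can be sanity-checked with a simple wall-clock measurement. A minimal sketch, assuming the same `load`/`generate` API as above; the prompt and `max_tokens` are placeholders, and the tokenizer-based token count is only approximate:

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("HalleyAI/gpt-oss-20b-6bit-gs32")

messages = [{"role": "user", "content": "Write a detailed overview of MLX quantization."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
elapsed = time.perf_counter() - start

# Rough count of generated tokens; good enough for a tok/s estimate.
n_tokens = len(tokenizer.encode(text))
print(f"~{n_tokens / elapsed:.1f} tok/s over {elapsed:.1f} s")
```

Recent mlx_lm versions also report prompt and generation token rates when `verbose=True` is passed to `generate`, which may be the simpler route if available in your version.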
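The commit does not include the evaluation script behind the perplexity table. As a rough illustration of how a next-token perplexity figure can be computed with mlx_lm and MLX's cross-entropy loss (the corpus, chunking, and context length used for the table are not shown; this sketch scores a single short string in one forward pass):

```python
import math
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("HalleyAI/gpt-oss-20b-6bit-gs32")

def perplexity(text: str) -> float:
    """Mean next-token perplexity of `text` under the model (single pass, no chunking)."""
    tokens = mx.array(tokenizer.encode(text))[None]   # shape (1, seq_len)
    logits = model(tokens[:, :-1])                     # predict token t+1 from tokens up to t
    nll = nn.losses.cross_entropy(
        logits.reshape(-1, logits.shape[-1]).astype(mx.float32),
        tokens[:, 1:].reshape(-1),
        reduction="mean",
    )
    return math.exp(nll.item())

print(perplexity("The quick brown fox jumps over the lazy dog."))
```

A real corpus would be split into fixed-length windows and the per-token losses averaged across all windows before exponentiating.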