sebastavar committed
Commit 4350e8a · verified · 1 Parent(s): 6792d12

Update README.md

Files changed (1)
  1. README.md +2 -4
README.md CHANGED
@@ -45,8 +45,6 @@ Once set up, you can proceed to run the model by running the snippet below:
  from mlx_lm import load, generate
  from transformers import AutoTokenizer
 
- model, tokenizer = load("HalleyAI/gpt-oss-20b-6bit-gs32")
-
  model, tokenizer = load("HalleyAI/gpt-oss-20b-6bit-gs32")
  print(generate(
  model, tokenizer,
@@ -57,7 +55,7 @@ print(generate(
 
  ## Performance (Apple Silicon, real-world)
 
- LM Studio and CLI (MLX, Q4 gs32): ~63–72 tok/s, TTFB ~0.3–0.4 s (2k-token responses)
+ LM Studio and CLI (MLX, Q6 gs32): ~63–72 tok/s, TTFB ~0.3–0.4 s (2k-token responses)
  - tested on M1 Max 32 GB (short runs show lower t/s due to startup overhead)
 
  Throughput varies with Mac model, context, and sampler settings.
@@ -70,7 +68,7 @@ We report perplexity (PPL) on a small internal text corpus using the same tokeni
  </thead>
  <tbody>
  <tr><td>MLX Q8 (reference)</td><td>2.4986</td></tr>
- <tr><td>MLX Q4 (gs=32)</td><td> 2.4858 (~-0.51% vs Q8)</td></tr>
+ <tr><td>MLX Q6 (gs=32)</td><td> 2.4858 (-0.51% vs Q8)</td></tr>
  </tbody>
  </table>
  Note: This is a small, domain-specific eval for quick sanity; not a benchmark suite.
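
For context, the diff's first hunk truncates the README snippet at the `print(generate(` call. A minimal, self-contained version might look like the sketch below; the prompt text and `max_tokens` value are illustrative placeholders, not values from the README.

```python
from mlx_lm import load, generate

# Download (on first use) and load the 6-bit, group-size-32 quantized checkpoint.
model, tokenizer = load("HalleyAI/gpt-oss-20b-6bit-gs32")

# Prompt and max_tokens are placeholder values for illustration only.
print(generate(
    model, tokenizer,
    prompt="Summarize what group-size-32 quantization changes about a model.",
    max_tokens=256,
))
```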
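
The -0.51% figure in the perplexity row is the relative change of the Q6 (gs=32) PPL against the Q8 reference. A quick check with the two values from the table:

```python
# Relative PPL change of MLX Q6 (gs=32) vs. the MLX Q8 reference,
# using the values reported in the table above.
ppl_q8 = 2.4986
ppl_q6 = 2.4858
rel_change = (ppl_q6 - ppl_q8) / ppl_q8
print(f"{rel_change:+.2%}")  # -0.51% (lower PPL than Q8)
```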