## Model versions
| Model | Parameters | RAM used (inference) |
|---|---|---|
| stok-0.1 | 3,798 | 6 MB |
| stok-0.2 | 4M | 542 MB |
| stok-0.3 | 962K | 136 MB |
| stok-0.3-large | 28.86M | 4 GB |
| stok-0.3-125m | 125.06M | 17.5 GB |
## Description
stok is a family of models designed to run well at smaller parameter counts and to maintain speed as model size grows. stok-sub-1 will contain every version of the stok model released prior to stok-1. The goal of the stok models is to have models that, regardless of size, can be run incredibly fast on CPUs (including incredibly old ones). Currently, stok can only contextualize single prompts and will not understand them beyond a single word. So far, each new version (0.1, 0.2, and 0.3) has brought a new capability to the model: 0.2 gave the model the ability to end its thought, and 0.3 allowed the model to (usually) keep its token prediction within the context of the prompt. While the model definitely still needs some help, it's only at version 0.3; there's a lot of work to go (like a new, less RAM-intensive inference engine).
## How to run
First, when using Python (more inference engines are coming soon), you will need the run_stok.py file. The code for using it will look something like this:
```python
from run_stok import load_model, run_model

# you can replace stok-0.3.json with whichever stok model you want
load_model("stok-0.3.json")

# run_model streams the response back in chunks
response = run_model("Hello!", max_tokens=100, repetition_penalty=2)
for chunk in response:
    print(chunk, end="")
```
This showcases how to use all currently functioning parameters, although `max_tokens` and `repetition_penalty` are both technically optional, as shown in the sketch below.
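If you're happy with the defaults, both can simply be left out. Here's a minimal sketch of the same call with the optional parameters omitted:

```python
from run_stok import load_model, run_model

load_model("stok-0.3.json")

# max_tokens and repetition_penalty are optional;
# omitting them falls back to the engine's defaults
for chunk in run_model("Hello!"):
    print(chunk, end="")
```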
If you'd rather use stokfile (a tool for just testing out the model), here's how:
```bash
python3 stokfile.py -m stok-0.3.json
```
If you want to see the speed of the output, just add `-speed` to the end, like so:
```bash
python3 stokfile.py -m stok-0.3.json -speed
```
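The `-speed` flag is the built-in way to get throughput numbers, but if you're on the Python API you can approximate tokens per second yourself. The sketch below is just an assumption about how to do that; in particular, it treats each chunk yielded by `run_model` as one token, which may not match exactly how stokfile counts.

```python
import time

from run_stok import load_model, run_model

load_model("stok-0.3.json")

start = time.perf_counter()
tokens = 0
for chunk in run_model("Hello!", max_tokens=100):
    tokens += 1  # assumes one chunk == one token
    print(chunk, end="")
elapsed = time.perf_counter() - start

print(f"\n{tokens} tokens in {elapsed:.4f}s ({tokens / elapsed:,.0f} t/s)")
```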
## Benchmark (SLMB)
| Model | Score | Med. speed |
|---|---|---|
| stok-0.1 | 1/16 | 361,703 t/s |
| stok-0.2 | 4/16 | 3,887 t/s |
| stok-0.3 | 5/16 | 254,902 t/s |
| stok-0.3-large | 8/16 | 149,526 t/s |
| stok-0.3-125m | 8/16 | 122,625 t/s |
| TinyLlama-v0 (F32) | 0/16 | 1,695 t/s |
| Gemma-3-270m-it (F16) | 12/16 | 46 t/s |
| H2O Danube3 500M Chat (F32) | 8/16 | 21 t/s |
| Llama 3.2 1B Instruct (F16) | 14/16 | 14 t/s |
Every test was run on an AMD Ryzen 7 2700X CPU with 64GB of DDR4 RAM.
## The SLMB (Small Language Model Benchmark) v1

### Quick description

This is a very, very simple model test, created to test the capabilities of much smaller LLMs. (The answers are included, though they aren't actually needed.)
### The Benchmark

#### Category 1: elementary math - x/4

- what is 2+2 (4)
- what is 12+5 (17)
- what is 4/2 (2)
- what is 3*3 (9)
#### Category 2: math with large numbers - x/4

- what is 500+200 (700)
- what is 10000+1000 (11000)
- what is 100*100 (10000)
- what is 12*5000 (60000)
#### Category 3: input variation - x/5

- what is 1+1 (2)
- what is 1 + 1 (2)
- what is 1+ 1 (2)
- what is a dog (any answer that matches at least a very basic description of a dog)
- What is a dog? (any answer that matches at least a very basic description of a dog)
#### Category 4: basic logic - x/2 (2 points for correct, 0 for wrong)

- I have three friends (Jeremy, Tyler, and Gabe) Friend #1 is Jeremy, Friend #3 is Tyler, who is friend #2? (Gabe)
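Since the answers are included, the auto-gradable questions are easy to script. Below is a minimal, hypothetical harness using the run_stok API from above; the naive substring grading (and the omission of the free-form "what is a dog" prompts, which need a human judge) is my own assumption about how to score it, not an official SLMB script.

```python
from run_stok import load_model, run_model

# (prompt, expected answer) pairs for the auto-gradable SLMB questions;
# the two "what is a dog" prompts are left out since they need human judging
QUESTIONS = [
    ("what is 2+2", "4"),
    ("what is 12+5", "17"),
    ("what is 4/2", "2"),
    ("what is 3*3", "9"),
    ("what is 500+200", "700"),
    ("what is 10000+1000", "11000"),
    ("what is 100*100", "10000"),
    ("what is 12*5000", "60000"),
    ("what is 1+1", "2"),
    ("what is 1 + 1", "2"),
    ("what is 1+ 1", "2"),
]

load_model("stok-0.3.json")

score = 0
for prompt, expected in QUESTIONS:
    output = "".join(run_model(prompt, max_tokens=100))
    if expected in output:  # naive substring grading -- an assumption
        score += 1
    print(f"{prompt!r} -> {output.strip()!r}")

print(f"Score: {score}/{len(QUESTIONS)} auto-graded questions")
```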
## Conclusion
While stok is definitely (in my opinion) pretty impressive -- especially given its performance at such small sizes -- it still has a lot of room to grow. (The benchmark may also include more tests in the future.)