Update README with KL3M tokenizer paper citation - README.md
README.md (CHANGED)
@@ -10,6 +10,7 @@ tags:
- financial
- enterprise
- slm
date: '2024-02-20T00:00:00.000Z'
pipeline_tag: text-generation
widget:
@@ -18,57 +19,47 @@ widget:
- do_sample: True
---

# kl3m-002-520m

**This model was part of our scale-up efforts to build `kl3m-003-3.7b`, another Mixtral-architecture model.
making this model public for historical reference and research, but you should probably consider using other models
for production purposes.**

kl3m-520m is a (very) small language model (SLM)
developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
kl3m-520m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
with a focus on low toxicity and high efficiency.

Given its small size and lack of instruction-aligned training data, kl3m-520m is best suited for use either in
SLM fine-tuning or as part of training larger models without using unethical data or models.

The model was originally trained between November 2023 and January 2024 on a 12xRTX4090 node in DDP. A similar model is
being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.

## Source

[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)

## Training Data

While the original training data collection and training infrastructure relies on software that was not donated by
273 Ventures, ALEA Institute is open-sourcing an improved dataset, including both replication and an API.

[https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)

Data is available upon request at this time via S3 under a Requester Pays model. We are actively working on a
zero-cost distribution model as soon as we can obtain additional support.

This model, the original `kl3m-002-520m` model, was trained on a US-only subset of the Kelvin Legal DataPack that
we believe is 100% public domain material. However, so as to enforce maximum transparency to all
downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.

## Model Details

### Summary
- **Architecture**: Mixtral (`num_local_experts=4, num_experts_per_tok=2`)
- **
- **
- **Language(s)**: Primarily English
- **
- **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- **Hardware Requirements**: Runs real-time in fp32 on CPU/M1+

##

## Key Features
@@ -77,15 +68,9 @@ N/A
- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

##

- Contract provision drafting
- Structured JSON information extraction
- Foundation for downstream optimization
- Base model for domain-specific fine-tuning

## Getting Started

```python
import json
@@ -110,12 +95,13 @@ print(
```json
[
"Under this rule, the operator of a vessel in the Gulf reef fish fishery ",
"Under this proposed rule, the Department is proposing to amend the regulations in
"Under this proposed rule, CBP would need to collect information from all entities to perform the necessary"
]
```

```python
text = "Governing Law."
print(
@@ -137,7 +123,21 @@ print(
]
```

The model implements several techniques during training:
@@ -146,6 +146,82 @@ The model implements several techniques during training:
- Randomized padding
- Traditional fixed-attention mechanisms

## License

This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.
@@ -164,14 +240,4 @@ The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai).

Special thanks to 273 Ventures for developing and donating this model to the open-source community through the Alea Institute.

## Citation

Tokenizer, dataset, and model publications are pending.

## Contact

For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or
create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-model-research).

![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)
- financial
- enterprise
- slm
- mixtral
date: '2024-02-20T00:00:00.000Z'
pipeline_tag: text-generation
widget:
- do_sample: True
---

# kl3m-002-520m

**This model was part of our scale-up efforts to build `kl3m-003-3.7b`, another Mixtral-architecture model. We are
making this model public for historical reference and research, but you should probably consider using other models
for production purposes.**

kl3m-002-520m is a (very) small language model (SLM) trained on clean, legally-permissible data. Originally
developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
kl3m-002-520m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
with a focus on low toxicity and high efficiency.

Given its small size and lack of instruction-aligned training data, kl3m-002-520m is best suited for use either in
SLM fine-tuning or as part of training larger models without using unethical data or models.

## Model Details

- **Architecture**: Mixtral (`num_local_experts=4, num_experts_per_tok=2`)
- **Size**: 520 million parameters
- **Hidden Size**: 1024
- **Layers**: 16
- **Attention Heads**: 16
- **Key-Value Heads**: 8
- **Intermediate Size**: 2048
- **Max Sequence Length**: 1,024 tokens (`sliding_window=256`)
- **Tokenizer**: [kl3m-001-32k](https://huggingface.co/alea-institute/kl3m-001-32k) BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling)
- **Language(s)**: Primarily English
- **Training Objective**: Next token prediction
- **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- **Hardware Requirements**: Runs real-time in fp32 on CPU/M1+

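The values in this summary can be sanity-checked against the published repository. Below is a minimal sketch, not part of the original card, that loads the configuration and tokenizer with the standard `transformers` APIs; the repository id `alea-institute/kl3m-002-520m` is taken from the citation on this card, and the attribute names are the standard `MixtralConfig` fields.

```python
from transformers import AutoConfig, AutoTokenizer

# Repository id taken from the citation on this card.
repo_id = "alea-institute/kl3m-002-520m"

config = AutoConfig.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# These standard MixtralConfig fields should match the summary above.
print(config.model_type)            # "mixtral"
print(config.num_local_experts)     # 4
print(config.num_experts_per_tok)   # 2
print(config.hidden_size)           # 1024
print(config.num_hidden_layers)     # 16
print(config.num_attention_heads)   # 16
print(config.num_key_value_heads)   # 8
print(config.intermediate_size)     # 2048
print(config.sliding_window)        # 256
print(len(tokenizer))               # 32768 vocabulary entries
```
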
## Use Cases

kl3m-002-520m is particularly effective for:

- Basic regulatory question answering
- Contract provision drafting
- Structured JSON information extraction
- Foundation for downstream optimization
- Base model for domain-specific fine-tuning

## Key Features

- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

## Usage

Basic usage for text generation:

```python
import json
# ...
```

```json
[
"Under this rule, the operator of a vessel in the Gulf reef fish fishery ",
"Under this proposed rule, the Department is proposing to amend the regulations in §§ 51.2 ",
"Under this proposed rule, CBP would need to collect information from all entities to perform the necessary"
]
```

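The snippet above is truncated in this excerpt. As a self-contained sketch of the same kind of call, assuming only the standard `transformers` causal-LM API and the repository id used elsewhere on this card; the prompt and sampling values below are illustrative placeholders, not the card's original settings.

```python
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id as used elsewhere on this card.
repo_id = "alea-institute/kl3m-002-520m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Illustrative prompt in the regulatory register the card describes.
text = "Under this proposed rule, "
inputs = tokenizer(text, return_tensors="pt")

# Sample a few short continuations; the settings here are placeholders.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.5,
    max_new_tokens=32,
    num_return_sequences=3,
    pad_token_id=tokenizer.pad_token_id,
)

# Print the continuations as a JSON list, similar to the sample output above.
print(json.dumps(tokenizer.batch_decode(outputs, skip_special_tokens=True), indent=2))
```
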
### Contract Example

```python
text = "Governing Law."
print(
# ...
]
```

### Generation Parameters

The model supports various parameters to control the generation process:

- `temperature`: Controls randomness (lower = more deterministic)
- `top_p`: Nucleus sampling parameter (lower = more focused)
- `top_k`: Limits vocabulary selection to the top k tokens
- `max_new_tokens`: Maximum number of tokens to generate
- `do_sample`: Whether to use sampling vs. greedy decoding
- `num_return_sequences`: Number of different sequences to generate

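For illustration only, these options map directly onto keyword arguments of the `transformers` text-generation pipeline; the prompt and all values below are placeholders rather than recommended settings.

```python
from transformers import pipeline

# Repository id as above; all generation settings here are placeholder values.
generator = pipeline("text-generation", model="alea-institute/kl3m-002-520m")

results = generator(
    "Under this proposed rule, ",
    do_sample=True,            # sample rather than decode greedily
    temperature=0.7,           # lower values are more deterministic
    top_p=0.9,                 # nucleus sampling threshold
    top_k=50,                  # only consider the 50 most likely next tokens
    max_new_tokens=64,         # cap on the number of generated tokens
    num_return_sequences=2,    # return two alternative continuations
)

for result in results:
    print(result["generated_text"])
```
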
## Training

The model was originally trained between November 2023 and January 2024 on a 12xRTX4090 node in DDP. A similar model is
being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.

The model implements several techniques during training:

- Randomized padding
- Traditional fixed-attention mechanisms

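The card does not describe how randomized padding was implemented, and the KL3M training code is not included here. As a loose illustration of the general idea only, a hypothetical collator might pad each batch to a randomly chosen length so the model never sees padding at a fixed position:

```python
import random

import torch


def randomized_padding_collate(batch_token_ids, pad_token_id, max_length=1024):
    """Hypothetical collator illustrating randomized padding.

    Each batch is padded to a randomly chosen target length instead of a fixed
    one. This is NOT the KL3M training code, which is not released with this card.
    """
    longest = max(len(ids) for ids in batch_token_ids)
    # Choose a target length at or above the longest sequence in the batch.
    target = random.randint(longest, max(max_length, longest))
    input_ids = [ids + [pad_token_id] * (target - len(ids)) for ids in batch_token_ids]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in batch_token_ids]
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
    }
```
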
### Training Data

While the original training data collection and training infrastructure relied on software that was not donated by
273 Ventures, the ALEA Institute is open-sourcing an improved dataset, including both replication and an API.

[https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)

Data is available upon request at this time via S3 under a Requester Pays model. We are actively working on a
zero-cost distribution model as soon as we can obtain additional support.

This model, the original `kl3m-002-520m` model, was trained on a US-only subset of the Kelvin Legal DataPack that
we believe is 100% public domain material. However, to ensure maximum transparency to all
downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.

## Intended Usage

This model is intended for use in:

- Legal and regulatory document processing systems
- Contract drafting assistance
- Financial and enterprise document workflows
- Educational contexts for learning about domain-specific language models
- Research on small, efficient language models with Mixture of Experts architecture

## Special Tokens

kl3m-002-520m uses the following special tokens:

- `<s>` (ID: 0): Beginning of sequence token (BOS)
- `</s>` (ID: 1): End of sequence token (EOS)
- `<pad>` (ID: 2): Padding token

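These assignments can be verified programmatically. A small sketch, assuming the tokenizer is published with the standard `transformers` special-token attributes:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-520m")

# Expected per the list above: <s> -> 0, </s> -> 1, <pad> -> 2.
print(tokenizer.bos_token, tokenizer.bos_token_id)
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)
```
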
## Limitations

- Limited to a 1,024 token context window with a 256 token sliding window
- As a small language model (520M parameters), it has limited general knowledge
- Not instruction-tuned or aligned with human preferences
- May generate plausible-sounding but incorrect legal or regulatory text
- Not a substitute for professional legal advice or domain expertise
- Performance is optimized for legal and financial domains; general performance may be lower

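Because of the 1,024-token limit, longer documents should be truncated or chunked before being passed to the model. A minimal sketch using the tokenizer's built-in truncation, with the maximum length taken from the limit above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-520m")

long_document = "..."  # placeholder for text longer than the context window

# Truncate to the 1,024-token maximum before passing the input to the model.
inputs = tokenizer(long_document, truncation=True, max_length=1024, return_tensors="pt")
print(inputs["input_ids"].shape)
```
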
## Ethical Considerations

- This model should not be used to generate legal advice without human expert review
- The model may reflect biases present in the training data despite efforts to use clean data
- Generated text should be reviewed by qualified professionals before use in formal legal contexts
- While trained on ethically sourced data, users should verify outputs for accuracy and appropriateness

## Source

[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)

## References

- [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
- Additional tokenizer, dataset, and model publications are pending.

## Citation

```bibtex
@misc{kl3m-002-520m,
  author = {ALEA Institute},
  title = {kl3m-002-520m: A Small Language Model for Legal and Regulatory Text},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/alea-institute/kl3m-002-520m}}
}

@article{bommarito2025kl3m,
  title={KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2503.17247},
  year={2025}
}
```

## License

This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

Special thanks to 273 Ventures for developing and donating this model to the open-source community through the Alea Institute.

![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)