alea-institute committed on
Commit 11a0fe7 · verified · 1 Parent(s): 8829ee9

Update README with KL3M tokenizer paper citation - README.md

Files changed (1):
  1. README.md +120 -54
README.md CHANGED
@@ -10,6 +10,7 @@ tags:
  - financial
  - enterprise
  - slm
  date: '2024-02-20T00:00:00.000Z'
  pipeline_tag: text-generation
  widget:
@@ -18,57 +19,47 @@ widget:
  - do_sample: True
  ---

- # kl3m-002-520m (Draft) Model

- **This model was part of our scale-up efforts to build `kl3m-003-3.7b`, another Mixtral-architecture model. We are
  making this model public for historical reference and research, but you should probably consider using other models
  for production purposes.**

- kl3m-520m is a (very) small language model (SLM) model trained on clean, legally-permissible data. Originally
  developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
- kl3m-520m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
  for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
  with a focus on low toxicity and high efficiency.

- Given its small size and lack of instruction-aligned training data, kl3m-520m is best suited for use either in
  SLM fine-tuning or as part of training larger models without using unethical data or models.

- The model was originally trained between November 2023 and January 2024 on a 12xRTX4090 node in DDP. A similar model is
- being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.
-
- ## Source
-
- [https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)
-
-
- ## Training Data
- While the original training data collection and training infrastructure relies on software that was not donated by
- 273 Ventures, ALEA Institute is open-sourcing an improved dataset, including both replication and an API.
-
- [https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)
-
- Data is available upon request at this time via S3 under a Requester Pays model. We are actively working on a
- zero-cost distribution model as soon as we can obtain additional support.
-
- This model, the original `kl3m-002-520m` model, was trained on a US-only subset of the Kelvin Legal DataPack that
- we believe is 100% public domain material. However, so as to enforce maximum transparency to all
- downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.
-
  ## Model Details

- ### Summary
  - **Architecture**: Mixtral (`num_local_experts=4, num_experts_per_tok=2`)
- - **Parameters**: 520 million
- - **Context Window**: 1,024 tokens (`sliding_window=256`)
  - **Language(s)**: Primarily English
- - **Tokenizer**: kl3m-001-32k BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling)
  - **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
  - **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
  - **Hardware Requirements**: Runs real-time in fp32 on CPU/M1+

- ## Performance Metrics

- N/A

  ## Key Features

@@ -77,15 +68,9 @@ N/A
  - **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
  - **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

- ## Use Cases

- - Basic regulatory question answering
- - Contract provision drafting
- - Structured JSON information extraction
- - Foundation for downstream optimization
- - Base model for domain-specific fine-tuning
-
- ## Getting Started

  ```python
  import json
@@ -110,12 +95,13 @@ print(
  ```json
  [
  "Under this rule, the operator of a vessel in the Gulf reef fish fishery ",
- "Under this proposed rule, the Department is proposing to amend the regulations in \u00a7\u00a7\u200951.2 ",
  "Under this proposed rule, CBP would need to collect information from all entities to perform the necessary"
  ]
  ```

- ## Contract Example
  ```python
  text = "Governing Law."
  print(
@@ -137,7 +123,21 @@ print(
  ]
  ```

- ## Technical Implementation

  The model implements several techniques during training:

@@ -146,6 +146,82 @@ The model implements several techniques during training:
  - Randomized padding
  - Traditional fixed-attention mechanisms

  ## License

  This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.
@@ -164,14 +240,4 @@ The KL3M model family is now maintained by the [ALEA Institute](https://aleainst

  Special thanks to 273 Ventures for developing and donating this model to the open-source community through the Alea Institute.

-
- ## Citation
-
- Tokenizer, dataset, and model publications are pending.
-
- ## Contact
-
- For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or
- create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-model-research).
-
- ![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)

  - financial
  - enterprise
  - slm
+ - mixtral
  date: '2024-02-20T00:00:00.000Z'
  pipeline_tag: text-generation
  widget:

  - do_sample: True
  ---

+ # kl3m-002-520m

+ **This model was part of our scale-up efforts to build `kl3m-003-3.7b`, another Mixtral-architecture model. We are
  making this model public for historical reference and research, but you should probably consider using other models
  for production purposes.**

+ kl3m-002-520m is a (very) small language model (SLM) trained on clean, legally-permissible data. Originally
  developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
+ kl3m-002-520m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
  for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
  with a focus on low toxicity and high efficiency.

+ Given its small size and lack of instruction-aligned training data, kl3m-002-520m is best suited for use either in
  SLM fine-tuning or as part of training larger models without using unethical data or models.

  ## Model Details

  - **Architecture**: Mixtral (`num_local_experts=4, num_experts_per_tok=2`)
+ - **Size**: 520 million parameters
+ - **Hidden Size**: 1024
+ - **Layers**: 16
+ - **Attention Heads**: 16
+ - **Key-Value Heads**: 8
+ - **Intermediate Size**: 2048
+ - **Max Sequence Length**: 1,024 tokens (`sliding_window=256`)
+ - **Tokenizer**: [kl3m-001-32k](https://huggingface.co/alea-institute/kl3m-001-32k) BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling)
  - **Language(s)**: Primarily English
+ - **Training Objective**: Next token prediction
  - **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
  - **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
  - **Hardware Requirements**: Runs real-time in fp32 on CPU/M1+
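
Because these details are stored in the published configuration and tokenizer files, they can be checked directly. A minimal sketch, assuming the standard `transformers` auto classes resolve this repository (attribute names follow the Hugging Face Mixtral configuration):

```python
from transformers import AutoConfig, AutoTokenizer

# Read the published configuration; the expected values in the comments
# mirror the Model Details list above.
config = AutoConfig.from_pretrained("alea-institute/kl3m-002-520m")

print(config.num_local_experts)        # 4 experts per MoE layer
print(config.num_experts_per_tok)      # 2 experts routed per token
print(config.hidden_size)              # 1024
print(config.num_hidden_layers)        # 16
print(config.num_attention_heads)      # 16
print(config.num_key_value_heads)      # 8
print(config.intermediate_size)        # 2048
print(config.max_position_embeddings)  # 1024
print(config.sliding_window)           # 256

# The kl3m-001-32k tokenizer should report a 32,768-entry vocabulary.
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-520m")
print(len(tokenizer))                  # 32768
```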

+ ## Use Cases

+ kl3m-002-520m is particularly effective for:
+
+ - Basic regulatory question answering
+ - Contract provision drafting
+ - Structured JSON information extraction
+ - Foundation for downstream optimization
+ - Base model for domain-specific fine-tuning

  ## Key Features

  - **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
  - **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

+ ## Usage

+ Basic usage for text generation:

  ```python
  import json

  ```json
  [
  "Under this rule, the operator of a vessel in the Gulf reef fish fishery ",
+ "Under this proposed rule, the Department is proposing to amend the regulations in §§ 51.2 ",
  "Under this proposed rule, CBP would need to collect information from all entities to perform the necessary"
  ]
  ```

+ ### Contract Example
+
  ```python
  text = "Governing Law."
  print(

  ]
  ```

+ ### Generation Parameters
+
+ The model supports various parameters to control the generation process:
+
+ - `temperature`: Controls randomness (lower = more deterministic)
+ - `top_p`: Nucleus sampling parameter (lower = more focused)
+ - `top_k`: Limits vocabulary selection to top k tokens
+ - `max_new_tokens`: Maximum number of tokens to generate
+ - `do_sample`: Whether to use sampling vs. greedy decoding
+ - `num_return_sequences`: Number of different sequences to generate
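
As an illustrative (not prescriptive) sketch of wiring these parameters together through the `transformers` pipeline API, with placeholder values rather than recommended settings and a prompt similar to the example above:

```python
from transformers import pipeline

# Build a text-generation pipeline; fp32 on CPU is sufficient for a model
# of this size, per the Model Details above.
generator = pipeline("text-generation", model="alea-institute/kl3m-002-520m")

outputs = generator(
    "Under this proposed rule, ",
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.7,          # lower values make output more deterministic
    top_p=0.9,                # nucleus sampling cutoff
    top_k=50,                 # restrict sampling to the 50 most likely tokens
    max_new_tokens=64,        # cap on the number of generated tokens
    num_return_sequences=3,   # return three independent completions
)

for output in outputs:
    print(output["generated_text"])
```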
+
+ ## Training
+
+ The model was originally trained between November 2023 and January 2024 on a 12xRTX4090 node in DDP. A similar model is
+ being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.

  The model implements several techniques during training:

  - Randomized padding
  - Traditional fixed-attention mechanisms
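
The exact training code is not reproduced here, but as a rough sketch of one of these ideas, randomized padding can be illustrated with a hypothetical collate function that pads each sequence to a random length instead of always padding to the batch maximum; the actual KL3M collator may differ:

```python
import random
from typing import List

import torch

PAD_TOKEN_ID = 2   # <pad>, per the Special Tokens section below
MAX_LENGTH = 1024  # model's maximum sequence length


def randomized_padding_collate(batch: List[List[int]]) -> torch.Tensor:
    """Illustrative only: pad each example to a randomly chosen length
    between its own length and MAX_LENGTH, then stack into a batch."""
    padded = []
    for ids in batch:
        ids = ids[:MAX_LENGTH]
        target_length = random.randint(len(ids), MAX_LENGTH)
        padded.append(ids + [PAD_TOKEN_ID] * (target_length - len(ids)))
    # Pad to a common length so the batch can be stacked into one tensor.
    batch_length = max(len(ids) for ids in padded)
    padded = [ids + [PAD_TOKEN_ID] * (batch_length - len(ids)) for ids in padded]
    return torch.tensor(padded, dtype=torch.long)
```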

+ ### Training Data
+
+ While the original training data collection and training infrastructure rely on software that was not donated by
+ 273 Ventures, ALEA Institute is open-sourcing an improved dataset, including both replication and an API.
+
+ [https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)
+
+ Data is available upon request at this time via S3 under a Requester Pays model. We are actively working on a
+ zero-cost distribution model as soon as we can obtain additional support.
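
For reference, reading from a Requester Pays bucket with `boto3` looks roughly like the following; the bucket and key below are placeholders rather than the actual dataset location, and the requesting AWS account pays the transfer costs:

```python
import boto3

# Hypothetical example only: substitute the bucket and key provided upon request.
s3 = boto3.client("s3")
s3.download_file(
    Bucket="example-kl3m-data-bucket",        # placeholder bucket name
    Key="example/path/to/document.jsonl",     # placeholder object key
    Filename="document.jsonl",
    ExtraArgs={"RequestPayer": "requester"},  # bill the requester, not the bucket owner
)
```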
+
+ This model, the original `kl3m-002-520m` model, was trained on a US-only subset of the Kelvin Legal DataPack that
+ we believe is 100% public domain material. However, so as to enforce maximum transparency to all
+ downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.
+
+ ## Intended Usage
+
+ This model is intended for use in:
+
+ - Legal and regulatory document processing systems
+ - Contract drafting assistance
+ - Financial and enterprise document workflows
+ - Educational contexts for learning about domain-specific language models
+ - Research on small, efficient language models with Mixture of Experts architecture
+
+ ## Special Tokens
+
+ kl3m-002-520m uses the following special tokens:
+
+ - `<s>` (ID: 0): Beginning of sequence token (BOS)
+ - `</s>` (ID: 1): End of sequence token (EOS)
+ - `<pad>` (ID: 2): Padding token
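
A quick sanity check of these IDs, assuming the tokenizer files are bundled with this repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-520m")

# Expected values per the list above: <s>/0, </s>/1, <pad>/2.
print(tokenizer.bos_token, tokenizer.bos_token_id)
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)
```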
+
+ ## Limitations
+
+ - Limited to a 1,024 token context window with a 256 token sliding window
+ - As a small language model (520M parameters), it has limited general knowledge
+ - Not instruction-tuned or aligned with human preferences
+ - May generate plausible-sounding but incorrect legal or regulatory text
+ - Not a substitute for professional legal advice or domain expertise
+ - Performance is optimized for legal and financial domains; general performance may be lower
+
+ ## Ethical Considerations
+
+ - This model should not be used to generate legal advice without human expert review
+ - The model may reflect biases present in the training data despite efforts to use clean data
+ - Generated text should be reviewed by qualified professionals before use in formal legal contexts
+ - While trained on ethically sourced data, users should verify outputs for accuracy and appropriateness
+
+ ## Source
+
+ [https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)
+
+ ## References
+
+ - [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
+ - Additional tokenizer, dataset, and model publications are pending.
+
+ ## Citation
+
+ ```bibtex
+ @misc{kl3m-002-520m,
+   author = {ALEA Institute},
+   title = {kl3m-002-520m: A Small Language Model for Legal and Regulatory Text},
+   year = {2024},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/alea-institute/kl3m-002-520m}}
+ }
+
+ @article{bommarito2025kl3m,
+   title={KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications},
+   author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
+   journal={arXiv preprint arXiv:2503.17247},
+   year={2025}
+ }
+ ```
+
  ## License

  This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

  Special thanks to 273 Ventures for developing and donating this model to the open-source community through the Alea Institute.

+ ![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)