|
--- |
|
language: |
|
- en |
|
library_name: transformers |
|
license: cc-by-4.0 |
|
tags: |
|
- kl3m |
|
- kl3m-002 |
|
- legal |
|
- financial |
|
- enterprise |
|
- slm |
|
- mixtral |
|
date: '2024-02-20T00:00:00.000Z' |
|
pipeline_tag: text-generation |
|
widget:
  - text: "Medical devices are regulated by"
inference:
  parameters:
    temperature: 0.3
    do_sample: true
|
--- |
|
|
|
# kl3m-002-520m |
|
|
|
**This model was part of our scale-up efforts to build `kl3m-003-3.7b`, another Mixtral-architecture model. We are
making this model public for historical reference and research, but you should probably consider using other models
for production purposes.**
|
|
|
kl3m-002-520m is a (very) small language model (SLM) trained on clean, legally permissible data. Originally
developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
kl3m-002-520m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
with a focus on low toxicity and high efficiency.
|
|
|
Given its small size and lack of instruction-aligned training data, kl3m-002-520m is best suited either for
SLM fine-tuning or as part of training larger models without using unethically sourced data or models.
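
For fine-tuning, the model loads through the standard Hugging Face causal language modeling interface. The snippet
below is a minimal sketch of that starting point only; the training loop, data, and hyperparameters are left to
standard tooling and are not prescribed by this card.

```python
# Minimal sketch: load kl3m-002-520m as a base for further fine-tuning.
# The fine-tuning loop itself (e.g. transformers.Trainer) is not shown.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("alea-institute/kl3m-002-520m")
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-520m")
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # roughly 520M
```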
|
|
|
## Model Details |
|
|
|
- **Architecture**: Mixtral (`num_local_experts=4, num_experts_per_tok=2`) |
|
- **Size**: 520 million parameters |
|
- **Hidden Size**: 1024 |
|
- **Layers**: 16 |
|
- **Attention Heads**: 16 |
|
- **Key-Value Heads**: 8 |
|
- **Intermediate Size**: 2048 |
|
- **Max Sequence Length**: 1,024 tokens (`sliding_window=256`) |
|
- **Tokenizer**: [kl3m-001-32k](https://huggingface.co/alea-institute/kl3m-001-32k) BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling) |
|
- **Language(s)**: Primarily English |
|
- **Training Objective**: Next token prediction |
|
- **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai) |
|
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
|
- **Hardware Requirements**: Runs in real time in fp32 on CPU or Apple Silicon (M1 or later)
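
As a quick sanity check, the architecture values listed above can be read back from the released configuration. The
field names below follow the Hugging Face `MixtralConfig` schema; this is a sketch, not part of the original card.

```python
# Sketch: confirm the Model Details values from the published config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("alea-institute/kl3m-002-520m")
print(config.model_type)           # expected: mixtral
print(config.hidden_size)          # expected: 1024
print(config.num_hidden_layers)    # expected: 16
print(config.num_attention_heads)  # expected: 16
print(config.num_key_value_heads)  # expected: 8
print(config.intermediate_size)    # expected: 2048
print(config.num_local_experts)    # expected: 4
print(config.num_experts_per_tok)  # expected: 2
print(config.sliding_window)       # expected: 256
```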
|
|
|
## Use Cases |
|
|
|
kl3m-002-520m is particularly effective for: |
|
|
|
- Basic regulatory question answering |
|
- Contract provision drafting |
|
- Structured JSON information extraction (an illustrative prompt sketch follows this list)
|
- Foundation for downstream optimization |
|
- Base model for domain-specific fine-tuning |
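
Because the model is not instruction-tuned, structured extraction works as completion rather than instruction
following. The prompt below is purely hypothetical; no official extraction format is documented for this model, and
the clause text and JSON keys are illustrative only.

```python
# Hypothetical JSON-extraction prompt; NOT an officially supported format.
from transformers import pipeline

p = pipeline("text-generation", "alea-institute/kl3m-002-520m", device="cpu")
prompt = (
    'Clause: "This Agreement is made as of January 1, 2024, between '
    'Acme Corp and Widget LLC."\n'
    'JSON: {"effective_date": '
)
result = p(prompt, do_sample=True, temperature=0.3, max_new_tokens=48)
print(result[0]["generated_text"])  # validate/parse the completion before use
```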
|
|
|
## Key Features |
|
|
|
- **Clean Training Data**: Built on what was originally referred to as the Kelvin Legal DataPack, ensuring all training data is ethically sourced and legally permissible. |
|
- **Low Toxicity**: [Empirically lower toxicity and bias](https://github.com/alea-institute/kl3m-toxicity) |
|
- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows. |
|
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware. |
|
|
|
## Usage |
|
|
|
Basic usage for text generation: |
|
|
|
```python
import json

from transformers import pipeline

# Load the model and tokenizer
p = pipeline('text-generation', 'alea-institute/kl3m-002-520m', device='cpu')

# Example usage on CPU
text = "Under this"
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=32)
        ],
        indent=2
    )
)
```
|
|
|
```json
[
  "Under this rule, the operator of a vessel in the Gulf reef fish fishery ",
  "Under this proposed rule, the Department is proposing to amend the regulations in §§ 51.2 ",
  "Under this proposed rule, CBP would need to collect information from all entities to perform the necessary"
]
```
|
|
|
### Contract Example |
|
|
|
```python
text = "Governing Law."
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=32)
        ],
        indent=2
    )
)
```
|
|
|
```json
[
  "Governing Law.\n (a) No provision of this Agreement shall be interpreted or construed to confer ",
  "Governing Law.\nThe law of the United States shall be interpreted and enforced in accordance",
  "Governing Law.\n (a) The validity of any contract or agreement to which the \nUnited States is "
]
```
|
|
|
### Generation Parameters |
|
|
|
The model supports various parameters to control the generation process (a combined example follows this list):
|
|
|
- `temperature`: Controls randomness (lower = more deterministic) |
|
- `top_p`: Nucleus sampling parameter (lower = more focused) |
|
- `top_k`: Limits vocabulary selection to top k tokens |
|
- `max_new_tokens`: Maximum number of tokens to generate |
|
- `do_sample`: Whether to use sampling vs. greedy decoding |
|
- `num_return_sequences`: Number of different sequences to generate |
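
The sketch below combines these parameters in a single call using the `p` pipeline created in the Usage section
above; the prompt and values are illustrative defaults, not recommended settings.

```python
# Illustrative parameter values only; tune per task.
outputs = p(
    "The Commission shall",
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.3,         # lower = more deterministic
    top_p=0.9,               # nucleus sampling cutoff
    top_k=50,                # restrict sampling to the 50 most likely tokens
    max_new_tokens=32,       # cap on generated length
    num_return_sequences=2,  # return two candidate completions
)
for r in outputs:
    print(r["generated_text"])
```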
|
|
|
## Training |
|
|
|
The model was originally trained between November 2023 and January 2024 on a single 12x RTX 4090 node using
distributed data parallel (DDP) training. A similar model is being provided with complete source and data
replication as part of the `kl3m-004` family, to be released in Q4 2024.
|
|
|
Several techniques were used during training (an illustrative sketch of one follows this list):
|
|
|
- Hybrid NTP and SFT cotraining |
|
- Dynamic, document-aware segmentation |
|
- Randomized padding |
|
- Traditional fixed-attention mechanisms |
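
The original training code has not been released, so these techniques are only named here. Purely to illustrate the
general idea behind randomized padding, the hypothetical sketch below pads each batch to a random length between the
longest sequence and the context limit instead of a fixed size; it is not the kl3m training code.

```python
# Hypothetical sketch of randomized padding; NOT the actual kl3m training code.
import random

import torch

def randomized_padding_collate(sequences, pad_token_id=2, max_length=1024):
    """Pad a list of token-id lists to a randomly chosen common length."""
    longest = max(len(ids) for ids in sequences)
    target = random.randint(longest, max_length)  # random pad target per batch
    padded = [ids + [pad_token_id] * (target - len(ids)) for ids in sequences]
    return torch.tensor(padded)

# Three sequences of different lengths -> one (3, target) tensor.
print(randomized_padding_collate([[0, 5, 9], [0, 7], [0, 2, 4, 6, 8]]).shape)
```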
|
|
|
### Training Data |
|
|
|
While the original training data collection and training infrastructure rely on software that was not donated by
273 Ventures, the ALEA Institute is open-sourcing an improved dataset, including both replication and an API.
|
|
|
[https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data) |
|
|
|
Data is currently available upon request via S3 under a Requester Pays model. We are actively working on a
zero-cost distribution model and will adopt it as soon as we can obtain additional support.
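
For reference, a Requester Pays download generally looks like the sketch below; the bucket name and key are
placeholders (access details are provided on request, not published here).

```python
# Hypothetical Requester Pays download; bucket and key are placeholders.
# Requires AWS credentials, and the requester is billed for transfer costs.
import boto3

s3 = boto3.client("s3")
s3.download_file(
    Bucket="example-kl3m-data-bucket",        # placeholder, not a real bucket
    Key="example/path/to/shard.jsonl.gz",     # placeholder key
    Filename="shard.jsonl.gz",
    ExtraArgs={"RequestPayer": "requester"},  # required for Requester Pays
)
```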
|
|
|
This model, the original `kl3m-002-520m`, was trained on a US-only subset of the Kelvin Legal DataPack that
we believe consists entirely of public domain material. However, to provide maximum transparency to all
downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.
|
|
|
## Intended Usage |
|
|
|
This model is intended for use in: |
|
|
|
- Legal and regulatory document processing systems |
|
- Contract drafting assistance |
|
- Financial and enterprise document workflows |
|
- Educational contexts for learning about domain-specific language models |
|
- Research on small, efficient language models with Mixture of Experts architecture |
|
|
|
## Special Tokens |
|
|
|
kl3m-002-520m uses the following special tokens (a quick verification snippet follows this list):
|
|
|
- `<s>` (ID: 0): Beginning of sequence token (BOS) |
|
- `</s>` (ID: 1): End of sequence token (EOS) |
|
- `<pad>` (ID: 2): Padding token |
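
Assuming the tokenizer is loaded from the same repository, these IDs can be confirmed directly:

```python
# Quick check of the special tokens listed above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-520m")
print(tokenizer.bos_token, tokenizer.bos_token_id)  # expected: <s> 0
print(tokenizer.eos_token, tokenizer.eos_token_id)  # expected: </s> 1
print(tokenizer.pad_token, tokenizer.pad_token_id)  # expected: <pad> 2
```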
|
|
|
## Limitations |
|
|
|
- Limited to a 1,024 token context window with a 256 token sliding window |
|
- As a small language model (520M parameters), it has limited general knowledge |
|
- Not instruction-tuned or aligned with human preferences |
|
- May generate plausible-sounding but incorrect legal or regulatory text |
|
- Not a substitute for professional legal advice or domain expertise |
|
- Performance is optimized for legal and financial domains; general performance may be lower |
|
|
|
## Ethical Considerations |
|
|
|
- This model should not be used to generate legal advice without human expert review |
|
- The model may reflect biases present in the training data despite efforts to use clean data |
|
- Generated text should be reviewed by qualified professionals before use in formal legal contexts |
|
- While trained on ethically sourced data, users should verify outputs for accuracy and appropriateness |
|
|
|
## Source |
|
|
|
[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research) |
|
|
|
## References |
|
|
|
- [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247) |
|
- Additional tokenizer, dataset, and model publications are pending. |
|
|
|
## Citation |
|
|
|
```bibtex
@misc{kl3m-002-520m,
  author = {ALEA Institute},
  title = {kl3m-002-520m: A Small Language Model for Legal and Regulatory Text},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/alea-institute/kl3m-002-520m}}
}

@article{bommarito2025kl3m,
  title = {KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications},
  author = {Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal = {arXiv preprint arXiv:2503.17247},
  year = {2025}
}
```
|
|
|
## License |
|
|
|
This model was originally developed by 273 Ventures and has been donated to the ALEA Institute. |
|
|
|
The model weights are released under the CC-BY 4.0 License. |
|
|
|
## Contact |
|
|
|
The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries: |
|
|
|
- GitHub: https://github.com/alea-institute/kl3m-model-research |
|
- Email: [email protected] |
|
- Website: https://aleainstitute.ai |
|
|
|
 |