kl3m-doc-pico-long-001

kl3m-doc-pico-long-001 is a domain-specific model based on the RoBERTa architecture, specifically designed for feature extraction in legal and financial document analysis with support for longer context windows. It shares the same basic architecture as the kl3m-doc-pico-001 model but has been trained with extended context capabilities of up to 4,096 tokens. While the model architecture supports masked language modeling (MLM), it is primarily optimized for feature extraction tasks.

Model Details

Use Cases

This model is particularly useful for:

  • Document classification in legal and financial domains with lengthy documents
  • Analyzing relationships between distant parts of a document
  • Processing lengthy agreements, contracts, and regulatory filings
  • Feature extraction for downstream legal analysis tasks requiring longer context

The extended context length makes this model especially suited for working with longer portions of legal documents, contracts, and financial statements that often exceed the context limits of standard models.

Usage

The primary use case for this model is feature extraction for document embedding and downstream classification tasks. Here's how to use it for feature extraction:

from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-long-001")
model = AutoModel.from_pretrained("alea-institute/kl3m-doc-pico-long-001")

# Example with legal document context
text = "<|cls|> This Credit Agreement is made and entered into as of [DATE], by and between [BORROWER NAME], a Delaware corporation with its principal place of business at [ADDRESS] (the \"Borrower\"), and [LENDER NAME], a national banking association (the \"Lender\"). <|sep|>"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get the embeddings
# The CLS token embedding is typically used for classification tasks
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(f"CLS embedding shape: {cls_embedding.shape}")  # Should be [1, 256]

# For document similarity, you can use the mean of all token embeddings
mean_embedding = outputs.last_hidden_state.mean(dim=1)
print(f"Mean embedding shape: {mean_embedding.shape}")  # Should be [1, 256]

# You can also process multiple documents in a batch
texts = [
    "<|cls|> This Credit Agreement is made and entered into as of [DATE]... <|sep|>",
    "<|cls|> Form 10-Q is a report filed with the Securities and Exchange Commission... <|sep|>"
]
batch_inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
batch_outputs = model(**batch_inputs)

# Get batch embeddings
batch_embeddings = batch_outputs.last_hidden_state[:, 0, :]
print(f"Batch embeddings shape: {batch_embeddings.shape}")  # Should be [2, 256]

While the model architecture supports masked language modeling (MLM), it was not specifically trained for this task. If you still want to experiment with it for MLM, you can use the following code, but be aware that performance may be limited:

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load model with MLM head (note: this part wasn't specifically trained)
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-long-001")
mlm_model = AutoModelForMaskedLM.from_pretrained("alea-institute/kl3m-doc-pico-long-001")

# Example with masked token
text = "<|cls|> Form 10-<|mask|> is a report filed with the Securities and Exchange Commission... <|sep|>"
inputs = tokenizer(text, return_tensors="pt")
outputs = mlm_model(**inputs)

# Get predictions for masked token
masked_index = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)[0].item()
probs = outputs.logits[0, masked_index].softmax(dim=0)
top_5 = torch.topk(probs, 5)

print("Top 5 predictions for masked token:")
for i, (score, idx) in enumerate(zip(top_5.values, top_5.indices)):
    token = tokenizer.decode(idx).strip()
    print(f"{i+1}. {token} ({score.item():.3f})")

Long Context Capabilities

This model extends the context window from the standard 512 tokens to 4,096 tokens, enabling it to process much longer documents. The extended context window allows the model to:

  1. Process full legal agreements and contracts without truncation
  2. Maintain awareness of context from the beginning of a document when analyzing later sections
  3. Better handle documents with complex cross-references and definitions
  4. Reduce the need for document chunking in downstream applications
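
The sketch below illustrates single-pass encoding of a long document, reusing the tokenizer, model, and imports from the feature-extraction example above and assuming the full 4,096-token window described in this section; the repeated sentence is only a stand-in for a real filing:

# Encode a long document in one forward pass, up to the extended 4,096-token window
long_text = (
    "<|cls|> "
    + " ".join(["This Credit Agreement contains extensive definitions, covenants, and schedules."] * 400)
    + " <|sep|>"
)
long_inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
print(f"Token count: {long_inputs.input_ids.shape[1]}")  # No chunking needed below 4,096 tokens

with torch.no_grad():
    long_outputs = model(**long_inputs)

# A single document-level embedding from the CLS position
long_cls_embedding = long_outputs.last_hidden_state[:, 0, :]
print(f"Long-document CLS embedding shape: {long_cls_embedding.shape}")  # Should be [1, 256]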

Standard Test Examples

Using our standardized test examples for comparing embedding models:

Fill-Mask Results

While this model is primarily designed for feature extraction rather than masked language modeling, we tested it on standard examples for comparison purposes. Note that performance on the MLM task is extremely limited compared to models specifically optimized for it:

  1. Contract Clause Heading:
    "<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"

    Top 5 predictions:

    1. Programs (0.0003)
    2. for (0.0003)
    3. to (0.0002)
    4. , (0.0002)
    5. the (0.0002)

    Note: Unlike the other models in the family, this model shows virtually no confidence in its masked token predictions, with extremely low probability scores that suggest it should not be used for masked language modeling tasks.

  2. Defined Term Example:
    "<|cls|> \"Effective<|mask|>\" means the date on which all conditions precedent set forth in Article V are satisfied or waived by the Administrative Agent. <|sep|>"

    Top 5 predictions:

    1. Applicants (0.0002)
    2. volunteers (0.0001)
    3. Individuals (0.0001)
    4. Carriers (0.0001)
    5. inventors (0.0001)

  3. Regulation Example:
    "<|cls|> All transactions shall comply with the requirements set forth in the Truth in<|mask|> Act and its implementing Regulation Z. <|sep|>"

    Top 5 predictions:

    1. warrant (0.0002)
    2. service (0.0001)
    3. protective (0.0001)
    4. permit (0.0001)
    5. authorization (0.0001)

Document Similarity Results

Where this model truly shines is in document embedding and similarity, especially with its mean pooling strategy:

Document Pair                        | Cosine Similarity (CLS token) | Cosine Similarity (Mean pooling)
Court Complaint vs. Consumer Terms   | 0.595                         | 0.659
Court Complaint vs. Credit Agreement | 0.604                         | 0.854
Consumer Terms vs. Credit Agreement  | 0.658                         | 0.727

Both pooling strategies produce usable similarity scores, but they differ in magnitude. CLS token embeddings show moderate similarity (e.g., 0.658 between Consumer Terms and Credit Agreement), while mean pooling yields stronger document-level similarity (e.g., 0.854 between Court Complaint and Credit Agreement). Both strategies are therefore effective for document similarity with this model, though mean pooling tends to produce the more pronounced signal.
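
As an illustration, similarity scores like those above can be computed from the embeddings produced by the feature-extraction example. This is a minimal sketch reusing the tokenizer and model loaded earlier, with placeholder texts rather than the actual test documents:

import torch.nn.functional as F

# Embed two documents and compare them with cosine similarity
doc_texts = [
    "<|cls|> COMPLAINT FOR DAMAGES. Plaintiff alleges as follows... <|sep|>",
    "<|cls|> This Credit Agreement is made and entered into as of [DATE]... <|sep|>",
]
doc_inputs = tokenizer(doc_texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    doc_outputs = model(**doc_inputs)

# Mean pooling over the token dimension (see the mask-aware variant above for padded batches)
doc_embeddings = doc_outputs.last_hidden_state.mean(dim=1)

# Cosine similarity between the two document vectors
similarity = F.cosine_similarity(doc_embeddings[0:1], doc_embeddings[1:2]).item()
print(f"Cosine similarity (mean pooling): {similarity:.3f}")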

Training

The model was trained on a diverse corpus of legal and financial documents, ensuring high-quality performance in these domains. This version has been specifically optimized for feature extraction with longer contexts, incorporating the following key aspects:

  1. Position embeddings extended to 4,096 tokens
  2. Training focused on dense document representation for retrieval and classification tasks
  3. Objectives optimized for feature extraction rather than token prediction
  4. Additional training on full-length documents to ensure contextual understanding

It leverages the KL3M tokenizer, which provides 9-17% more efficient tokenization for domain-specific content than general-purpose tokenizers.
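
You can check the token-count difference on your own text. The sketch below compares the KL3M tokenizer with GPT-2's tokenizer, chosen here only as an arbitrary general-purpose baseline; the 9-17% figure itself comes from the KL3M tokenizer paper, not from this snippet:

from transformers import AutoTokenizer

# Compare token counts on a short domain-specific passage
kl3m_tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-long-001")
baseline_tokenizer = AutoTokenizer.from_pretrained("gpt2")

passage = (
    "The Borrower shall deliver to the Administrative Agent audited consolidated "
    "financial statements within ninety (90) days after the end of each fiscal year."
)

kl3m_count = len(kl3m_tokenizer.encode(passage, add_special_tokens=False))
baseline_count = len(baseline_tokenizer.encode(passage, add_special_tokens=False))
print(f"KL3M tokens: {kl3m_count}, GPT-2 tokens: {baseline_count}")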

Intended Usage

This model was specifically designed and trained for:

  1. Document embedding: Generating fixed-length vector representations of documents for similarity comparison
  2. Feature extraction: Creating inputs for downstream classification and regression tasks
  3. Semantic search: Finding similar documents across large collections
  4. Document clustering: Discovering patterns across legal and financial document collections
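
As one example of items 1 and 2 above, the extracted embeddings can serve as features for a lightweight downstream classifier. This is a minimal sketch using scikit-learn, reusing the tokenizer, model, and imports loaded earlier; the documents and labels are illustrative placeholders, not the model's training or evaluation data:

from sklearn.linear_model import LogisticRegression

# Illustrative documents and labels (0 = credit agreement, 1 = SEC filing)
docs = [
    "<|cls|> This Credit Agreement is made and entered into as of [DATE]... <|sep|>",
    "<|cls|> Form 10-Q is a report filed with the Securities and Exchange Commission... <|sep|>",
]
labels = [0, 1]

# Extract CLS embeddings as fixed-length feature vectors
clf_inputs = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    features = model(**clf_inputs).last_hidden_state[:, 0, :].numpy()

# Fit a simple classifier on the embeddings (a real application would use far more documents)
classifier = LogisticRegression(max_iter=1000).fit(features, labels)
print(classifier.predict(features))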

While the model architecture includes the capability for masked language modeling, the model weights were not specifically optimized for this task. For masked language modeling applications, consider using a model explicitly trained for that purpose.

Special Tokens

This model includes the following special tokens:

  • CLS token: <|cls|> (ID: 5) - Used for the beginning of input text
  • MASK token: <|mask|> (ID: 6) - Used to mark tokens for prediction
  • SEP token: <|sep|> (ID: 4) - Used for the end of input text
  • PAD token: <|pad|> (ID: 2) - Used for padding sequences to a uniform length
  • BOS token: <|start|> (ID: 0) - Beginning of sequence
  • EOS token: <|end|> (ID: 1) - End of sequence
  • UNK token: <|unk|> (ID: 3) - Unknown token

The model also includes additional special tokens for chat and instruction contexts:

  • <|system|> (ID: 7)
  • </|system|> (ID: 8)
  • <|user|> (ID: 9)
  • </|user|> (ID: 10)
  • <|instruction|> (ID: 11)
  • </|instruction|> (ID: 12)

Important usage notes:

When using the MASK token for predictions, be aware that this model uses a space-prefixed BPE tokenizer. The <|mask|> token should be placed IMMEDIATELY after the previous token with NO space, because most tokens in this tokenizer have an initial space encoded within them. For example: "word<|mask|>" rather than "word <|mask|>".

This space-aware placement is crucial for getting accurate predictions.
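
The effect is easy to inspect by tokenizing both placements. This short sketch reuses the tokenizer loaded earlier; the exact token splits depend on the tokenizer's vocabulary:

# Correct: the mask token attached directly to the preceding text, with no space
good_ids = tokenizer("<|cls|> Form 10-<|mask|> is a report. <|sep|>")["input_ids"]
# Incorrect: an extra space before the mask, which may shift how the surrounding text is tokenized
bad_ids = tokenizer("<|cls|> Form 10- <|mask|> is a report. <|sep|>")["input_ids"]

print(tokenizer.convert_ids_to_tokens(good_ids))
print(tokenizer.convert_ids_to_tokens(bad_ids))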

Limitations

While providing extended context capabilities, this model has some limitations:

  • Smaller parameter count (41M) compared to larger language models
  • Primarily focused on English legal and financial texts
  • Best suited for domain-specific rather than general-purpose tasks
  • Requires domain expertise to interpret results effectively
  • May show decreased performance at the far edges of the context window

References

Citation

If you use this model in your research, please cite:

@misc{bommarito2025kl3m,
  title={KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications},
  author={Bommarito II, Michael J. and Katz, Daniel Martin and Bommarito, Jillian},
  year={2025},
  eprint={2503.17247},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{bommarito2025kl3mdata,
  title={The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models},
  author={Bommarito II, Michael J. and Bommarito, Jillian and Katz, Daniel Martin},
  year={2025},
  eprint={2504.07854},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

License

This model is licensed under CC-BY 4.0.

Contact

The KL3M model family is maintained by the ALEA Institute. For technical support, collaboration opportunities, or general inquiries:

https://aleainstitute.ai
