kl3m-doc-nano-long-002

kl3m-doc-nano-long-002 is a domain-specific model based on the ModernBERT architecture, designed for feature extraction in legal and financial document analysis with support for longer context windows. It combines the representation capacity of the nano variant (84M parameters, larger than the pico variants) with an extended context length of 4,096 tokens. While the model can perform masked language modeling (MLM), it is primarily optimized for feature extraction tasks. As a ModernBERT model, it differs from RoBERTa in implementation details and does not use token_type_ids inputs.

Model Details

Use Cases

This model is particularly useful for:

  • Document classification in legal and financial domains with lengthy documents
  • Analyzing relationships between distant parts of a document
  • Processing lengthy agreements, contracts, and regulatory filings
  • Feature extraction for downstream legal analysis tasks requiring longer context
  • Semantic search and retrieval across collections of long documents

The extended context length combined with the more powerful nano-sized architecture makes this model especially suited for working with longer portions of legal documents, contracts, and financial statements that often exceed the context limits of standard models.
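For the semantic search use case above, the following minimal sketch ranks a small corpus against a query using CLS-token embeddings, the same approach shown in the Usage section below. The query and corpus strings are illustrative placeholders, not the standardized test texts.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-nano-long-002")
model = AutoModel.from_pretrained("alea-institute/kl3m-doc-nano-long-002")

def embed(texts):
    # Texts already carry the model's <|cls|> / <|sep|> special tokens
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=4096, return_tensors="pt")
    inputs.pop("token_type_ids", None)  # ModernBERT does not use token_type_ids
    with torch.no_grad():
        output = model(**inputs)
    # CLS-token embeddings, L2-normalized so the dot product equals cosine similarity
    return F.normalize(output.last_hidden_state[:, 0, :], p=2, dim=1)

# Illustrative corpus and query (placeholders only)
corpus = [
    "<|cls|> CREDIT AGREEMENT dated as of April 10, 2025, among the Borrower and the Lenders party hereto. <|sep|>",
    "<|cls|> TERMS AND CONDITIONS governing your access to and use of the Service. <|sep|>",
]
query = ["<|cls|> loan agreement between a borrower and an administrative agent <|sep|>"]

# Rank corpus documents by cosine similarity to the query
scores = embed(query) @ embed(corpus).T
for rank, idx in enumerate(torch.argsort(scores[0], descending=True).tolist(), start=1):
    print(f"{rank}. document {idx} (score {scores[0, idx].item():.3f})")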

Standard Test Examples

Using our standardized test examples for comparing embedding models:

Fill-Mask Results

While this model is primarily designed for feature extraction, it performs reasonably well on masked language modeling tasks:

  1. Contract Clause Heading:
    "<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"

    Top 5 predictions:

    1. WARRANTIES (0.4808)
    2. RESERVED (0.1020)
    3. CONTINGENCIES (0.0774)
    4. DUTIES (0.0564)
    5. WARR (0.0256)
  2. Defined Term Example:
    "<|cls|> \"Effective<|mask|>\" means the date on which all conditions precedent set forth in Article V are satisfied or waived by the Administrative Agent. <|sep|>"

    Top 5 predictions:

    1. Date (0.8864)
    2. date (0.0721)
    3. Time (0.0158)
    4. Period (0.0051)
    5. Dates (0.0040)
  3. Regulation Example:
    "<|cls|> All transactions shall comply with the requirements set forth in the Truth in<|mask|> Act and its implementing Regulation Z. <|sep|>"

    Top 5 predictions:

    1. Lending (0.7695)
    2. Exchange (0.0437)
    3. the (0.0237)
    4. Tax (0.0119)
    5. in (0.0106)

Document Similarity Results

The model shows strong performance in document embedding and similarity tasks:

Document Pair                          Cosine Similarity (CLS token)   Cosine Similarity (Mean pooling)
Court Complaint vs. Consumer Terms     0.757                           0.700
Court Complaint vs. Credit Agreement   0.805                           0.844
Consumer Terms vs. Credit Agreement    0.850                           0.756

This model shows generally higher similarity scores across document pairs compared to the nano-001 variant, particularly between Consumer Terms and Credit Agreement (0.850) when using CLS tokens. With mean pooling, it shows the strongest similarity between Court Complaint and Credit Agreement (0.844), suggesting the model effectively captures domain-specific relationships.

Usage

The primary use case for this model is feature extraction for document embedding and downstream classification tasks. Since this is a ModernBERT model, there are a few important differences from standard BERT/RoBERTa models:

  1. The model doesn't use token_type_ids as inputs, so these should be removed if present
  2. The model's architecture is optimized for longer context windows (4,096 tokens)
  3. Some standard transformers pipeline APIs might need adjustments for ModernBERT models

Here's how to use it:

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-nano-long-002")
model = AutoModel.from_pretrained("alea-institute/kl3m-doc-nano-long-002")

# Standard test samples for embedding comparison
texts = [
    # Court Complaint
    "<|cls|> IN THE UNITED STATES DISTRICT COURT FOR THE EASTERN DISTRICT OF PENNSYLVANIA\n\nJOHN DOE,\nPlaintiff,\n\nvs.\n\nACME CORPORATION,\nDefendant.\n\nCIVIL ACTION NO. 21-12345\n\nCOMPLAINT\n\nPlaintiff John Doe, by and through his undersigned counsel, hereby files this Complaint against Defendant Acme Corporation, and in support thereof, alleges as follows: <|sep|>",
    
    # Consumer Terms and Conditions
    "<|cls|> TERMS AND CONDITIONS\n\nLast Updated: April 10, 2025\n\nThese Terms and Conditions (\"Terms\") govern your access to and use of the Service. By accessing or using the Service, you agree to be bound by these Terms. If you do not agree to these Terms, you may not access or use the Service. These Terms constitute a legally binding agreement between you and the Company. <|sep|>",
    
    # Credit Agreement
    "<|cls|> CREDIT AGREEMENT\n\nDated as of April 10, 2025\n\nAmong\n\nACME BORROWER INC.,\nas the Borrower,\n\nBANK OF FINANCE,\nas Administrative Agent,\n\nand\n\nTHE LENDERS PARTY HERETO\n\nThis CREDIT AGREEMENT (\"Agreement\") is entered into as of April 10, 2025, among ACME BORROWER INC., a Delaware corporation (the \"Borrower\"), each lender from time to time party hereto (collectively, the \"Lenders\"), and BANK OF FINANCE, as Administrative Agent. <|sep|>"
]

# Tokenize and encode (512 tokens is plenty for these short samples;
# the extended 4,096-token window is exercised in the long-document example below)
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)

# Remove token_type_ids if present (this model doesn't use them)
if 'token_type_ids' in inputs:
    del inputs['token_type_ids']

# Generate embeddings
with torch.no_grad():
    model_output = model(**inputs)
    
# Strategy 1: CLS token embeddings
# Uses only the embedding from the CLS token for document representation
cls_embeddings = model_output.last_hidden_state[:, 0, :]
print(f"CLS embeddings shape: {cls_embeddings.shape}")  # Should be [3, 512]

# Strategy 2: Mean Pooling
# Averages all token embeddings, weighted by attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Get mean-pooled embeddings
mean_embeddings = mean_pooling(model_output, inputs["attention_mask"])
print(f"Mean-pooled embeddings shape: {mean_embeddings.shape}")  # Should be [3, 512]

# Normalize embeddings (important for cosine similarity)
cls_embeddings = F.normalize(cls_embeddings, p=2, dim=1)
mean_embeddings = F.normalize(mean_embeddings, p=2, dim=1)

# Calculate document similarities using both methods
cls_similarity = cosine_similarity(cls_embeddings.numpy())
mean_similarity = cosine_similarity(mean_embeddings.numpy())

print("\nCLS token embeddings similarity matrix:")
print(np.round(cls_similarity, 3))
# Output:
# [[1.    0.757 0.805]
#  [0.757 1.    0.850]
#  [0.805 0.850 1.   ]]

print("\nMean pooling embeddings similarity matrix:")
print(np.round(mean_similarity, 3))
# Output:
# [[1.    0.700 0.844]
#  [0.700 1.    0.756]
#  [0.844 0.756 1.   ]]

# Print pairwise similarities
doc_names = ["Court Complaint", "Consumer Terms", "Credit Agreement"]
print("\nPairwise similarities:")
for i in range(len(doc_names)):
    for j in range(i+1, len(doc_names)):
        print(f"{doc_names[i]} vs. {doc_names[j]}:")
        print(f"  - CLS token: {cls_similarity[i, j]:.4f}")
        print(f"  - Mean pooling: {mean_similarity[i, j]:.4f}")

# Example: For a long document that utilizes the extended context
long_text = "<|cls|> [LONG DOCUMENT TEXT THAT SPANS THOUSANDS OF TOKENS] <|sep|>"
long_inputs = tokenizer(long_text, return_tensors="pt", max_length=4096, truncation=True)

# Remove token_type_ids here too
if 'token_type_ids' in long_inputs:
    del long_inputs['token_type_ids']

with torch.no_grad():
    long_output = model(**long_inputs)
    
long_embedding = long_output.last_hidden_state[:, 0, :]  # CLS token embedding

The model can also perform masked language modeling. You can use either the pipeline API (which handles token_type_ids internally) or call the model directly with the appropriate modifications:

from transformers import pipeline

# Load the fill-mask pipeline - this handles ModernBERT compatibility 
# by automatically managing token_type_ids
fill_mask = pipeline('fill-mask', model="alea-institute/kl3m-doc-nano-long-002")

# Example: Contract term with mask token
# Note: The mask token is placed immediately after the previous word with no space
text = "<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"
results = fill_mask(text)

# Display predictions
print("Top 5 predictions:")
for i, result in enumerate(results[:5]):
    print(f"{i+1}. {result['token_str']} ({result['score']:.4f})")

# Output:
# 1. WARRANTIES (0.4808)
# 2. RESERVED (0.1020) 
# 3. CONTINGENCIES (0.0774)
# 4. DUTIES (0.0564)
# 5. WARR (0.0256)

Alternatively, if you need more control, you can use the model directly:

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load model with MLM head
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-nano-long-002")
mlm_model = AutoModelForMaskedLM.from_pretrained("alea-institute/kl3m-doc-nano-long-002")

# Example text with mask token
text = "<|cls|> \"Effective<|mask|>\" means the date on which all conditions precedent set forth in Article V are satisfied or waived by the Administrative Agent. <|sep|>"
inputs = tokenizer(text, return_tensors="pt")

# Remove token_type_ids if present (this model doesn't use them)
if 'token_type_ids' in inputs:
    del inputs['token_type_ids']

# Get predictions
with torch.no_grad():
    outputs = mlm_model(**inputs)

# Find the masked token index
masked_index = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)[0].item()
probs = outputs.logits[0, masked_index].softmax(dim=0)
top_5 = torch.topk(probs, 5)

# Print top predictions
print("Top 5 predictions:")
for i, (score, idx) in enumerate(zip(top_5.values, top_5.indices)):
    token = tokenizer.decode(idx).strip()
    print(f"{i+1}. {token} ({score.item():.4f})")

# Output:
# 1. Date (0.8864)
# 2. date (0.0721)
# 3. Time (0.0158)
# 4. Period (0.0051)
# 5. Dates (0.0040)

Long Context Capabilities with ModernBERT

This model uses the ModernBERT architecture, which extends the context window from the standard 512 tokens to 4,096 tokens, allowing it to process much longer documents. ModernBERT includes modified attention mechanisms and positional encodings to handle longer contexts effectively. These capabilities allow the model to:

  1. Process full legal agreements and contracts without truncation
  2. Maintain awareness of context from the beginning of a document when analyzing later sections
  3. Better handle documents with complex cross-references and definitions
  4. Reduce the need for document chunking in downstream applications
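As a quick check on the last point, the sketch below counts tokens to decide whether a document fits in the 4,096-token window before falling back to chunking. The document string is a placeholder; 4,096 is the context length documented for this model.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-nano-long-002")

document = "<|cls|> [FULL AGREEMENT TEXT] <|sep|>"  # placeholder for a real document
MAX_TOKENS = 4096  # context length documented for this model

# Count tokens without adding extra special tokens (the text already includes them)
token_count = len(tokenizer(document, add_special_tokens=False)["input_ids"])
if token_count <= MAX_TOKENS:
    print(f"{token_count} tokens: encode in a single pass, no chunking needed")
else:
    print(f"{token_count} tokens: exceeds the window, fall back to chunking or truncation")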

Training with ModernBERT

The model was trained using the ModernBERT architecture on a diverse corpus of legal and financial documents, ensuring high-quality performance in these domains. ModernBERT differs from standard BERT/RoBERTa in its handling of longer sequences and internal architecture details. This version has been specifically optimized for feature extraction with longer contexts, incorporating the following key aspects:

  1. ModernBERT architecture that enables efficient processing of sequences up to 4,096 tokens
  2. Modified attention mechanisms and positional embeddings optimized for long contexts
  3. Training focused on dense document representation for retrieval and classification tasks
  4. Larger hidden size (512) compared to pico variants for more powerful representations
  5. Additional training on full-length documents to ensure contextual understanding

It leverages the KL3M tokenizer, which provides 9-17% more efficient tokenization for domain-specific content than general-purpose tokenizers. Compared to BERT or RoBERTa models, the ModernBERT implementation removes the token_type_ids input parameter and includes other implementation differences that improve performance on longer documents.
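The efficiency claim can be spot-checked on your own text; the sketch below compares token counts against a general-purpose tokenizer. The gpt2 baseline and the sample sentence are only illustrative choices, not part of the published comparison.

from transformers import AutoTokenizer

kl3m = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-nano-long-002")
baseline = AutoTokenizer.from_pretrained("gpt2")  # illustrative general-purpose baseline

sample = (
    "The Borrower shall deliver to the Administrative Agent a compliance "
    "certificate substantially in the form of Exhibit C within 45 days after "
    "the end of each fiscal quarter."
)

kl3m_tokens = len(kl3m(sample, add_special_tokens=False)["input_ids"])
baseline_tokens = len(baseline(sample, add_special_tokens=False)["input_ids"])
print(f"KL3M: {kl3m_tokens} tokens | baseline: {baseline_tokens} tokens "
      f"({100 * (1 - kl3m_tokens / baseline_tokens):.1f}% fewer)")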

Special Tokens

This model includes the following special tokens:

  • CLS token: <|cls|> (ID: 5) - Used for the beginning of input text
  • MASK token: <|mask|> (ID: 6) - Used to mark tokens for prediction
  • SEP token: <|sep|> (ID: 4) - Used for the end of input text
  • PAD token: <|pad|> (ID: 2) - Used for padding sequences to a uniform length
  • BOS token: <|start|> (ID: 0) - Beginning of sequence
  • EOS token: <|end|> (ID: 1) - End of sequence
  • UNK token: <|unk|> (ID: 3) - Unknown token

Important usage notes:

When using the MASK token for predictions, be aware that this model uses a space-prefixed BPE tokenizer. The <|mask|> token should be placed IMMEDIATELY after the previous token with NO space, because most tokens in this tokenizer have an initial space encoded within them. For example: "word<|mask|>" rather than "word <|mask|>".

This space-aware placement is crucial for getting accurate predictions.
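One way to see the effect is to run the same fill-mask example with and without a space before <|mask|>. This is only a verification sketch, and the exact scores will vary.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="alea-institute/kl3m-doc-nano-long-002")

# Correct: mask attached directly to the previous word (the leading space is
# already encoded inside most tokens of this BPE vocabulary)
correct = "<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"

# Incorrect: an extra space before the mask changes the tokenization and
# typically degrades the predictions
incorrect = "<|cls|> 8. REPRESENTATIONS AND <|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"

for label, text in [("no space before mask", correct), ("space before mask", incorrect)]:
    top = fill_mask(text)[0]
    print(f"{label}: {top['token_str']!r} ({top['score']:.4f})")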

Limitations

While providing a good balance between powerful representations and extended context capabilities, this model has some limitations:

  • Moderate parameter count (84M) compared to billion-parameter large language models
  • Primarily focused on English legal and financial texts
  • Best suited for domain-specific rather than general-purpose tasks
  • May show decreased performance at the far edges of the context window
  • Requires domain expertise to interpret results effectively

References

Citation

If you use this model in your research, please cite:

@misc{bommarito2025kl3m,
  title={KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications},
  author={Bommarito II, Michael J. and Katz, Daniel Martin and Bommarito, Jillian},
  year={2025},
  eprint={2503.17247},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
@misc{bommarito2025kl3mdata,
  title={The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models},
  author={Bommarito II, Michael J. and Bommarito, Jillian and Katz, Daniel Martin},
  year={2025},
  eprint={2504.07854},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

License

This model is licensed under CC-BY 4.0.

Contact

The KL3M model family is maintained by the ALEA Institute. For technical support, collaboration opportunities, or general inquiries:

https://aleainstitute.ai
