CharBoundary large ONNX Model

This is the large ONNX model for the CharBoundary library (v0.5.0), a fast character-based sentence and paragraph boundary detection system optimized for legal text.

Model Details

  • Size: large
  • Model Size: 12.0 MB (ONNX compressed)
  • Memory Usage: 5,734 MB at runtime (non-ONNX version)
  • Training Data: Legal text with ~5,000,000 samples from KL3M dataset
  • Model Type: Random Forest (100 trees, max depth 24) converted to ONNX
  • Format: ONNX optimized for inference
  • Task: Character-level boundary detection for text segmentation
  • License: MIT
  • Throughput: ~518K characters/second (base model; ONNX is typically 2-4x faster)

Usage

Security Advantage: The ONNX format provides enhanced security compared to SKOPS models: loading it does not require opting out of safety checks with trust_model=True. ONNX models are the recommended option for security-sensitive environments.

# Make sure to install with the onnx extra to get ONNX runtime support
# pip install charboundary[onnx]
from charboundary import get_large_onnx_segmenter

# First load can be slow
segmenter = get_large_onnx_segmenter()

# Use the model
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# Segment to spans
sentence_spans = segmenter.get_sentence_spans(text)
print(sentence_spans)
# Output: [(0, 24), (24, 44)]
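The spans are (start, end) character offsets into the original string, so slicing the text with them reproduces the sentence list exactly. A minimal, self-contained sketch (no charboundary install required) using the example output above:

```python
# Example text and the spans shown in the output above
text = "This is a test sentence. Here's another one!"
spans = [(0, 24), (24, 44)]

# Slicing the original text with each (start, end) span recovers the
# sentences, including the leading space kept on the second sentence.
sentences = [text[start:end] for start, end in spans]
print(sentences)
# → ['This is a test sentence.', " Here's another one!"]
```

Because the spans tile the string with no gaps, concatenating the slices reconstructs the original text, which makes span output convenient for highlighting or annotation pipelines.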

Performance

ONNX models provide significantly faster inference than the standard scikit-learn models while maintaining the same accuracy metrics. The performance differences between model sizes are shown below.

Base Model Performance

Dataset                 Precision   F1      Recall
ALEA SBD Benchmark      0.637       0.727   0.847
SCOTUS                  0.950       0.778   0.658
Cyber Crime             0.968       0.853   0.762
BVA                     0.963       0.881   0.813
Intellectual Property   0.954       0.890   0.834
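F1 is the harmonic mean of precision and recall, so each row of the table can be checked for internal consistency. A quick sanity check for the ALEA SBD Benchmark row:

```python
# F1 = 2PR / (P + R), using the ALEA SBD Benchmark row from the table above
precision, recall = 0.637, 0.847
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
# → 0.727, matching the reported F1
```

The high-recall/lower-precision profile on the ALEA SBD Benchmark contrasts with the high-precision/lower-recall profile on the legal-domain datasets (SCOTUS, Cyber Crime, BVA, Intellectual Property).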

Size and Speed Comparison

Model    Format         Size (MB)     Memory Usage   Throughput (chars/sec)   F1 Score
Small    SKOPS / ONNX   3.0 / 0.5     1,026 MB       ~748K                    0.773
Medium   SKOPS / ONNX   13.0 / 2.6    1,897 MB       ~587K                    0.779
Large    SKOPS / ONNX   60.0 / 13.0   5,734 MB       ~518K                    0.782
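The ONNX export also shrinks each model's on-disk footprint relative to the SKOPS artifact; the ratio can be read directly off the size columns of the table above:

```python
# On-disk SKOPS vs. ONNX sizes (MB) from the comparison table above
sizes_mb = {"Small": (3.0, 0.5), "Medium": (13.0, 2.6), "Large": (60.0, 13.0)}

for name, (skops, onnx) in sizes_mb.items():
    print(f"{name}: {skops / onnx:.1f}x smaller as ONNX")
# → Small: 6.0x, Medium: 5.0x, Large: 4.6x
```

Note that memory usage and throughput in the table are measured for the non-ONNX models; ONNX runtime figures will differ.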

Paper and Citation

This model is part of the research presented in the following paper:

@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}

For more details on the model architecture, training, and evaluation, please see the paper above.

Contact

This model is developed and maintained by the ALEA Institute.

For technical support, collaboration opportunities, or general inquiries, contact the ALEA Institute at [email protected] or open an issue on the GitHub repository.

https://aleainstitute.ai
