CharBoundary large ONNX Model

This is the large ONNX model for the CharBoundary library (v0.5.0), a fast character-based sentence and paragraph boundary detection system optimized for legal text.

Model Details

  • Size: large
  • Model Size: 12.0 MB (ONNX compressed)
  • Memory Usage: 5,734 MB at runtime (non-ONNX version)
  • Training Data: Legal text with ~5,000,000 samples from KL3M dataset
  • Model Type: Random Forest (100 trees, max depth 24) converted to ONNX
  • Format: ONNX optimized for inference
  • Task: Character-level boundary detection for text segmentation
  • License: MIT
  • Throughput: ~518K characters/second (base model; ONNX is typically 2-4x faster)

Usage

Security Advantage: The ONNX format provides enhanced security compared to SKOPS models: loading it does not require opting out of safety checks with trust_model=True. ONNX models are the recommended option for security-sensitive environments.

# Make sure to install with the onnx extra to get ONNX runtime support
# pip install charboundary[onnx]
from charboundary import get_large_onnx_segmenter

# First load can be slow
segmenter = get_large_onnx_segmenter()

# Use the model
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# Segment to spans
sentence_spans = segmenter.get_sentence_spans(text)
print(sentence_spans)
# Output: [(0, 24), (24, 44)]
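The spans are (start, end) character offsets into the original string, so slicing the text with them reproduces the sentence list exactly. A minimal, self-contained sketch (no charboundary install required) using the example output above:

```python
# Example text and the spans shown in the output above
text = "This is a test sentence. Here's another one!"
spans = [(0, 24), (24, 44)]

# Slicing the original text with each (start, end) span recovers the
# sentences, including the leading space kept on the second sentence.
sentences = [text[start:end] for start, end in spans]
print(sentences)
# → ['This is a test sentence.', " Here's another one!"]
```

Because the spans tile the string with no gaps, concatenating the slices reconstructs the original text, which makes span output convenient for highlighting or annotation pipelines.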

Performance

ONNX models provide significantly faster inference than the standard scikit-learn models while maintaining the same accuracy metrics. The performance differences between model sizes are shown below.

Base Model Performance

Dataset                 Precision   F1      Recall
ALEA SBD Benchmark      0.637       0.727   0.847
SCOTUS                  0.950       0.778   0.658
Cyber Crime             0.968       0.853   0.762
BVA                     0.963       0.881   0.813
Intellectual Property   0.954       0.890   0.834
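F1 is the harmonic mean of precision and recall, so each row of the table can be checked for internal consistency. A quick sanity check for the ALEA SBD Benchmark row:

```python
# F1 = 2PR / (P + R), using the ALEA SBD Benchmark row from the table above
precision, recall = 0.637, 0.847
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
# → 0.727, matching the reported F1
```

The high-recall/lower-precision profile on the ALEA SBD Benchmark contrasts with the high-precision/lower-recall profile on the legal-domain datasets (SCOTUS, Cyber Crime, BVA, Intellectual Property).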

Size and Speed Comparison

Model    Format         Size (MB)     Memory Usage   Throughput (chars/sec)   F1 Score
Small    SKOPS / ONNX   3.0 / 0.5     1,026 MB       ~748K                    0.773
Medium   SKOPS / ONNX   13.0 / 2.6    1,897 MB       ~587K                    0.779
Large    SKOPS / ONNX   60.0 / 13.0   5,734 MB       ~518K                    0.782
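The ONNX export also shrinks each model's on-disk footprint relative to the SKOPS artifact; the ratio can be read directly off the size columns of the table above:

```python
# On-disk SKOPS vs. ONNX sizes (MB) from the comparison table above
sizes_mb = {"Small": (3.0, 0.5), "Medium": (13.0, 2.6), "Large": (60.0, 13.0)}

for name, (skops, onnx) in sizes_mb.items():
    print(f"{name}: {skops / onnx:.1f}x smaller as ONNX")
# → Small: 6.0x, Medium: 5.0x, Large: 4.6x
```

Note that memory usage and throughput in the table are measured for the non-ONNX models; ONNX runtime figures will differ.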

Paper and Citation

This model is part of the research presented in the following paper:

@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}

For more details on the model architecture, training, and evaluation, please see the paper above.

Contact

This model is developed and maintained by the ALEA Institute.

For technical support, collaboration opportunities, or general inquiries, contact the ALEA Institute at [email protected] or open an issue on the GitHub repository.

https://aleainstitute.ai
