CharBoundary Medium (Default) ONNX Model

This is the medium (default) ONNX model for the CharBoundary library (v0.5.0), a fast character-based sentence and paragraph boundary detection system optimized for legal text.

Model Details

  • Size: medium (default)
  • Model Size: 2.6 MB (ONNX compressed)
  • Memory Usage: 1897 MB at runtime (non-ONNX version)
  • Training Data: Legal text with ~500,000 samples from KL3M dataset
  • Model Type: Random Forest (64 trees, max depth 20) converted to ONNX
  • Format: ONNX optimized for inference
  • Task: Character-level boundary detection for text segmentation
  • License: MIT
  • Throughput: ~587K characters/second (base model; ONNX is typically 2-4x faster)
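For a rough sense of what that throughput means in practice, here is a back-of-envelope sketch (illustrative arithmetic only, not a benchmark; the 300,000-character document length is a hypothetical example):

```python
# Illustrative estimate from the base-model throughput figure above
# (~587K characters/second). The document length is hypothetical.
throughput_cps = 587_000   # characters per second
doc_chars = 300_000        # e.g. a long legal brief

seconds = doc_chars / throughput_cps
print(f"{seconds:.2f} s")  # ≈ 0.51 s
```

With the typical 2-4x ONNX speedup, the same document would segment in well under a quarter of a second.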

Usage

Security Advantage: This ONNX model format provides enhanced security compared to SKOPS models, as it doesn't require bypassing security measures with trust_model=True. ONNX models are the recommended option for security-sensitive environments.

# Make sure to install with the onnx extra to get ONNX runtime support
# pip install charboundary[onnx]
from charboundary import get_medium_onnx_segmenter

# First load can be slow
segmenter = get_medium_onnx_segmenter()

# Use the model
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# Segment to spans
sentence_spans = segmenter.get_sentence_spans(text)
print(sentence_spans)
# Output: [(0, 24), (24, 44)]
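The spans returned above are half-open `(start, end)` character offsets into the original string, so plain slicing recovers each sentence, and concatenating the slices reconstructs the exact input. A minimal sketch using the example output shown above:

```python
text = "This is a test sentence. Here's another one!"
# Spans as returned by get_sentence_spans: half-open character offsets
spans = [(0, 24), (24, 44)]

sentences = [text[start:end] for start, end in spans]
assert sentences == ['This is a test sentence.', " Here's another one!"]

# Adjacent spans tile the input with no gaps or overlaps,
# so joining the slices yields the original text exactly.
assert "".join(sentences) == text
```

This lossless property is what makes span output useful for highlighting or aligning annotations back onto the source document.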

Performance

ONNX models provide significantly faster inference compared to the standard scikit-learn models while maintaining the same accuracy metrics. The performance differences between model sizes are shown below.

Base Model Performance

| Dataset               | Precision | F1    | Recall |
|-----------------------|-----------|-------|--------|
| ALEA SBD Benchmark    | 0.631     | 0.722 | 0.842  |
| SCOTUS                | 0.938     | 0.775 | 0.661  |
| Cyber Crime           | 0.961     | 0.853 | 0.767  |
| BVA                   | 0.957     | 0.875 | 0.806  |
| Intellectual Property | 0.948     | 0.889 | 0.837  |

Size and Speed Comparison

| Model  | Format       | Size (MB)   | Memory Usage | Throughput (chars/sec) | F1 Score |
|--------|--------------|-------------|--------------|------------------------|----------|
| Small  | SKOPS / ONNX | 3.0 / 0.5   | 1,026 MB     | ~748K                  | 0.773    |
| Medium | SKOPS / ONNX | 13.0 / 2.6  | 1,897 MB     | ~587K                  | 0.779    |
| Large  | SKOPS / ONNX | 60.0 / 13.0 | 5,734 MB     | ~518K                  | 0.782    |
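A quick check of the size column: ONNX conversion shrinks each model roughly 4.5-6x relative to its SKOPS counterpart (all values taken from the comparison table above):

```python
# (SKOPS MB, ONNX MB) pairs from the comparison table above
sizes = {
    "small":  (3.0, 0.5),
    "medium": (13.0, 2.6),
    "large":  (60.0, 13.0),
}

for name, (skops_mb, onnx_mb) in sizes.items():
    # e.g. medium: 13.0 / 2.6 -> 5.0x smaller
    print(f"{name}: {skops_mb / onnx_mb:.1f}x smaller as ONNX")
```

Note that the memory-usage column reflects the non-ONNX runtime footprint, so disk size and runtime memory scale differently across model sizes.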

Paper and Citation

This model is part of the research presented in the following paper:

@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}

For more details on the model architecture, training, and evaluation, please see the paper above.

Contact

This model is developed and maintained by the ALEA Institute.

For technical support, collaboration opportunities, or general inquiries, contact the ALEA Institute at [email protected] or open an issue on the GitHub repository.

https://aleainstitute.ai
