alea-institute committed · Commit fb2c028 · verified · 1 Parent(s): de7f46d

Update README for large ONNX model

Files changed (1)
  1. README.md +110 -0
README.md ADDED
@@ -0,0 +1,110 @@
+ ---
+ language:
+ - en
+ tags:
+ - charboundary
+ - sentence-boundary-detection
+ - paragraph-detection
+ - legal-text
+ - legal-nlp
+ - text-segmentation
+ - onnx
+ - cpu
+ - document-processing
+ - rag
+ - optimized-inference
+ license: mit
+ library_name: charboundary
+ pipeline_tag: text-classification
+ datasets:
+ - alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
+ - alea-institute/kl3m-data-snapshot-20250324
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ - throughput
+ papers:
+ - https://arxiv.org/abs/2504.04131
+ ---
+
+ # CharBoundary large ONNX Model
+
+ This is the large ONNX model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0),
+ a fast character-based sentence and paragraph boundary detection system optimized for legal text.
+
+ ## Model Details
+
+ - **Size**: large
+ - **Model Size**: 12.0 MB (ONNX compressed)
+ - **Memory Usage**: 5734 MB at runtime (non-ONNX version)
+ - **Training Data**: Legal text with ~5,000,000 samples from [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
+ - **Model Type**: Random Forest (100 trees, max depth 24) converted to ONNX
+ - **Format**: ONNX optimized for inference
+ - **Task**: Character-level boundary detection for text segmentation
+ - **License**: MIT
+ - **Throughput**: ~518K characters/second (base model; ONNX is typically 2-4x faster)
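As a rough illustration of the throughput figure above, the sketch below estimates processing time for a document of a given length. The ~518K characters/second baseline and the 2-4x ONNX range are taken from the list above; the helper `estimate_seconds` and the 1,000,000-character document are hypothetical, and actual throughput will depend on hardware and on the text itself.

```python
# Back-of-the-envelope estimate only; not part of the charboundary API.
BASE_CHARS_PER_SEC = 518_000      # reported base-model throughput
ONNX_SPEEDUP_RANGE = (2.0, 4.0)   # "typically 2-4x faster" per the list above


def estimate_seconds(num_chars: int, chars_per_sec: float) -> float:
    """Return the estimated processing time in seconds."""
    if chars_per_sec <= 0:
        raise ValueError("chars_per_sec must be positive")
    return num_chars / chars_per_sec


doc_chars = 1_000_000  # hypothetical document size
base_time = estimate_seconds(doc_chars, BASE_CHARS_PER_SEC)
onnx_slowest = estimate_seconds(doc_chars, BASE_CHARS_PER_SEC * ONNX_SPEEDUP_RANGE[0])
onnx_fastest = estimate_seconds(doc_chars, BASE_CHARS_PER_SEC * ONNX_SPEEDUP_RANGE[1])

print(f"base model: ~{base_time:.2f} s")
print(f"ONNX (2-4x): ~{onnx_fastest:.2f} to ~{onnx_slowest:.2f} s")
```

Treat the result as an order-of-magnitude guide for capacity planning rather than a benchmark.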
+
+ ## Usage
+
+ > **Security Advantage:** Unlike the SKOPS models, this ONNX model loads without `trust_model=True`, so no security checks are bypassed. ONNX models are the recommended option for security-sensitive environments.
+
+ ```python
+ from charboundary import get_large_onnx_segmenter
+
+ # First load can be slow
+ segmenter = get_large_onnx_segmenter()
+
+ # Segment text into sentences
+ text = "This is a test sentence. Here's another one!"
+ sentences = segmenter.segment_to_sentences(text)
+ print(sentences)
+ # Output: ['This is a test sentence.', " Here's another one!"]
+
+ # Get sentence boundaries as (start, end) character spans
+ sentence_spans = segmenter.get_sentence_spans(text)
+ print(sentence_spans)
+ # Output: [(0, 24), (24, 44)]
+ ```
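The spans in the example output are `(start, end)` character offsets with an exclusive end, so they can be used to slice the original string directly. The sketch below hard-codes the spans from that output instead of calling the model, so it runs without `charboundary` installed; `slice_spans` is a hypothetical helper, not part of the library.

```python
from typing import List, Tuple


def slice_spans(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Return the substring for each (start, end) span; end is exclusive."""
    return [text[start:end] for start, end in spans]


text = "This is a test sentence. Here's another one!"
spans = [(0, 24), (24, 44)]  # copied from the example output above

sentences = slice_spans(text, spans)
print(sentences)

# These spans are contiguous, so joining the slices rebuilds the input.
assert "".join(sentences) == text
```

Working with spans rather than strings keeps offsets into the source document, which can be useful when segments need to be mapped back to their original positions.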
+
+ ## Performance
+
+ ONNX models run inference faster than the standard scikit-learn models (typically 2-4x)
+ with the same accuracy metrics. The tables below report the base model's accuracy by dataset and compare the three model sizes.
+
+ ### Base Model Performance
+
+ | Dataset | Precision | F1 | Recall |
+ |---------|-----------|-------|--------|
+ | ALEA SBD Benchmark | 0.637 | 0.727 | 0.847 |
+ | SCOTUS | 0.950 | 0.778 | 0.658 |
+ | Cyber Crime | 0.968 | 0.853 | 0.762 |
+ | BVA | 0.963 | 0.881 | 0.813 |
+ | Intellectual Property | 0.954 | 0.890 | 0.834 |
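For readers relating the three columns, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values in the table; `f1_score` here is a local helper written for this illustration, and because the inputs are rounded to three decimals the recomputed values can differ from the reported ones by about 0.001.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (dataset, precision, reported F1, recall) copied from the table above
rows = [
    ("ALEA SBD Benchmark", 0.637, 0.727, 0.847),
    ("SCOTUS", 0.950, 0.778, 0.658),
    ("Cyber Crime", 0.968, 0.853, 0.762),
    ("BVA", 0.963, 0.881, 0.813),
    ("Intellectual Property", 0.954, 0.890, 0.834),
]

for name, precision, reported_f1, recall in rows:
    computed = f1_score(precision, recall)
    # Inputs are rounded, so compare against the reported value loosely.
    print(f"{name}: computed {computed:.3f}, reported {reported_f1:.3f}")
```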
+
+ ### Size and Speed Comparison
+
+ | Model | Format | Size in MB (SKOPS / ONNX) | Memory Usage | Throughput (chars/sec) | F1 Score |
+ |-------|--------|---------------------------|--------------|------------------------|----------|
+ | Small | [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) | 3.0 / 0.5 | 1,026 MB | ~748K | 0.773 |
+ | Medium | [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) | 13.0 / 2.6 | 1,897 MB | ~587K | 0.779 |
+ | Large | [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) | 60.0 / 13.0 | 5,734 MB | ~518K | 0.782 |
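To make the file-size difference between formats concrete, the sketch below computes the SKOPS-to-ONNX size ratio for each row. The sizes are copied from the table above; the variable names are illustrative only, and the ratios describe file size on disk, not memory use or speed.

```python
# (model, SKOPS size in MB, ONNX size in MB) copied from the table above
sizes = [
    ("Small", 3.0, 0.5),
    ("Medium", 13.0, 2.6),
    ("Large", 60.0, 13.0),
]

# Ratio of SKOPS file size to ONNX file size, rounded to one decimal place.
ratios = {name: round(skops_mb / onnx_mb, 1) for name, skops_mb, onnx_mb in sizes}

for name, ratio in ratios.items():
    print(f"{name}: ONNX file is ~{ratio}x smaller than SKOPS")
```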
+
+ ## Paper and Citation
+
+ This model is part of the research presented in the following paper:
+
+ ```bibtex
+ @article{bommarito2025precise,
+ title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
+ author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
+ journal={arXiv preprint arXiv:2504.04131},
+ year={2025}
+ }
+ ```
+
+ For more details on the model architecture, training, and evaluation, please see:
+ - [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
+ - [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
+ - [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)