Update README.md
README.md
CHANGED
@@ -1,3 +1,291 @@
|
|
1 |
-
---
|
2 |
-
license: mit
|
3 |
-
1 |
+
---
|
2 |
+
license: mit
|
3 |
+
language:
|
4 |
+
- en
|
5 |
+
metrics:
|
6 |
+
- precision
|
7 |
+
- recall
|
8 |
+
- f1
|
9 |
+
- accuracy
|
10 |
+
new_version: v1.0
|
11 |
+
datasets:
|
12 |
+
- BookCorpus
|
13 |
+
- Wikipedia
|
14 |
+
tags:
|
15 |
+
- BERT
|
16 |
+
- MNLI
|
17 |
+
- NLI
|
18 |
+
- transformer
|
19 |
+
- pre-training
|
20 |
+
- NLP
|
21 |
+
- MIT-NLP-v1
|
22 |
+
base_model:
|
23 |
+
- google-bert/bert-base-uncased
|
24 |
+
library_name: transformers
|
25 |
+
---
|
26 |
+
|
27 |
+
[License: MIT](https://opensource.org/licenses/MIT)
|
28 |
+
|
31 |
+
|
32 |
+
# Model Card for boltuix/bert-mobile
|
33 |
+
|
34 |
+
The `boltuix/bert-mobile` model is a mobile-optimized BERT variant for natural language processing on resource-constrained devices such as mobile phones and edge hardware. It is pretrained on English text with masked language modeling (MLM) and next sentence prediction (NSP) objectives and is intended for fine-tuning on tasks such as sequence classification, token classification, and question answering. The checkpoint is ~140 MB and can be quantized to ~25 MB with no major loss in performance, which suits mobile and edge applications that need strong accuracy with minimal resource usage.
|
35 |
+
|
36 |
+
## Model Details
|
37 |
+
|
38 |
+
### Model Description
|
39 |
+
|
40 |
+
The `boltuix/bert-mobile` model is a PyTorch-based transformer model derived from TensorFlow checkpoints in the Google BERT repository. It builds on research from *Well-Read Students Learn Better: On the Importance of Pre-training Compact Models* ([arXiv](https://arxiv.org/abs/1908.08962)) and *Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics* ([arXiv](https://arxiv.org/abs/2110.01518)). Ported to Hugging Face, this uncased model (~140 MB) targets NLP applications on mobile and edge devices, such as sentiment analysis, named entity recognition, and natural language inference.
|
41 |
+
|
42 |
+
- **Developed by:** BoltUIX
|
43 |
+
- **Funded by:** BoltUIX Research Fund
|
44 |
+
- **Shared by:** Hugging Face
|
45 |
+
- **Model type:** Transformer (BERT)
|
46 |
+
- **Language(s) (NLP):** English (`en`)
|
47 |
+
- **License:** MIT
|
48 |
+
- **Finetuned from model:** google-bert/bert-base-uncased
|
49 |
+
|
50 |
+
### Model Sources
|
51 |
+
|
52 |
+
- **Repository:** [Hugging Face Model Hub](https://huggingface.co/boltuix/bert-mobile)
|
53 |
+
- **Paper:** [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](http://arxiv.org/abs/1810.04805)
|
54 |
+
|
55 |
+
## Model Variants
|
56 |
+
|
57 |
+
BoltUIX offers a range of BERT-based models tailored to different performance and resource requirements. The `boltuix/bert-mobile` model is optimized for mobile and edge devices, offering strong performance with the ability to quantize to ~25 MB without significant loss. Below is a summary of available models:
|
58 |
+
|
59 |
+
| Tier     | Model ID              | Size    | Notes                                                   |
|----------|-----------------------|---------|---------------------------------------------------------|
| Micro    | boltuix/bert-micro    | ~15 MB  | Smallest, blazing-fast, moderate accuracy               |
| Mini     | boltuix/bert-mini     | ~17 MB  | Ultra-compact, fast, slightly better accuracy           |
| Tinyplus | boltuix/bert-tinyplus | ~20 MB  | Slightly bigger, better capacity                        |
| Small    | boltuix/bert-small    | ~45 MB  | Good compact/accuracy balance                           |
| Mid      | boltuix/bert-mid      | ~50 MB  | Well-rounded mid-tier performance                       |
| Medium   | boltuix/bert-medium   | ~160 MB | Strong general-purpose model                            |
| Large    | boltuix/bert-large    | ~365 MB | Top performer below full-BERT                           |
| Pro      | boltuix/bert-pro      | ~420 MB | Use only if max accuracy is mandatory                   |
| Mobile   | boltuix/bert-mobile   | ~140 MB | Mobile-optimized; quantize to ~25 MB with no major loss |
|
70 |
+
|
71 |
+
For more details on each variant, visit the [BoltUIX Model Hub](https://huggingface.co/boltuix).
|
72 |
+
|
73 |
+
## Uses
|
74 |
+
|
75 |
+
### Direct Use
|
76 |
+
|
77 |
+
The model can be used directly for masked language modeling or next sentence prediction tasks, such as predicting missing words in sentences or determining sentence coherence, delivering strong accuracy for mobile applications.
|
78 |
+
|
79 |
+
### Downstream Use
|
80 |
+
|
81 |
+
The model is designed for fine-tuning on a variety of downstream NLP tasks optimized for mobile and edge devices, including:
|
82 |
+
- Sequence classification (e.g., sentiment analysis, intent detection)
|
83 |
+
- Token classification (e.g., named entity recognition, part-of-speech tagging)
|
84 |
+
- Question answering (e.g., extractive QA, reading comprehension)
|
85 |
+
- Natural language inference (e.g., MNLI, RTE)
|
86 |
+
It is recommended for developers and enterprises deploying NLP solutions on mobile devices or edge hardware where efficiency and performance are critical.
|
87 |
+
|
88 |
+
### Out-of-Scope Use
|
89 |
+
|
90 |
+
The model is not suitable for:
|
91 |
+
- Text generation tasks (use generative models like GPT-3 instead).
|
92 |
+
- Non-English language tasks without significant fine-tuning.
|
93 |
+
- Applications requiring maximum accuracy (use `boltuix/bert-large` or `boltuix/bert-pro` instead).
|
94 |
+
|
95 |
+
## Bias, Risks, and Limitations
|
96 |
+
|
97 |
+
The model may inherit biases from its training data (BookCorpus and English Wikipedia), potentially reinforcing stereotypes, such as gender or occupational biases. For example:
|
98 |
+
```python
from transformers import pipeline

# Compare fill-mask predictions for two prompts that differ only in gender
unmasker = pipeline('fill-mask', model='boltuix/bert-mobile')
unmasker("The man worked as a [MASK].")
```
**Output**:
```python
[
  {'sequence': '[CLS] the man worked as a engineer. [SEP]', 'token_str': 'engineer'},
  {'sequence': '[CLS] the man worked as a doctor. [SEP]', 'token_str': 'doctor'},
  ...
]
```
```python
unmasker("The woman worked as a [MASK].")
```
**Output**:
```python
[
  {'sequence': '[CLS] the woman worked as a teacher. [SEP]', 'token_str': 'teacher'},
  {'sequence': '[CLS] the woman worked as a nurse. [SEP]', 'token_str': 'nurse'},
  ...
]
```
|
122 |
+
These biases may propagate to downstream tasks. While the model’s size (~140 MB, quantizable to ~25 MB) makes it suitable for mobile devices, its performance may be limited for complex tasks compared to larger variants.
|
123 |
+
|
124 |
+
### Recommendations
|
125 |
+
|
126 |
+
Users should:
|
127 |
+
- Conduct bias audits tailored to their application.
|
128 |
+
- Fine-tune with diverse, representative datasets to reduce bias.
|
129 |
+
- Apply quantization to reduce the model size to ~25 MB for ultra-efficient mobile deployment.
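
To make the quantization recommendation more concrete, the sketch below applies symmetric per-tensor int8 quantization to a toy list of weights in plain Python. It only illustrates the general idea: the function names and the weight values are hypothetical, and this card does not state which quantization scheme or tooling produces the ~25 MB figure.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.52, -1.27, 0.003, 0.9, -0.41]  # toy values, not real model weights
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# Each weight now needs 1 byte instead of 4 (float32), at the cost of a
# rounding error of at most scale / 2 per weight.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

In practice a framework's quantization tooling would be applied to the real checkpoint, and accuracy should be re-checked on the target task afterwards.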
|
130 |
+
|
131 |
+
## How to Get Started with the Model
|
132 |
+
|
133 |
+
Use the code below to get started with the model.
|
134 |
+
|
135 |
+
```python
from transformers import pipeline, BertTokenizer, BertModel

# Masked Language Modeling: predict the token hidden behind [MASK]
unmasker = pipeline('fill-mask', model='boltuix/bert-mobile')
result = unmasker("Hello I'm a [MASK] model.")
print(result)

# Feature Extraction (PyTorch): get contextual embeddings for a sentence
tokenizer = BertTokenizer.from_pretrained('boltuix/bert-mobile')
model = BertModel.from_pretrained('boltuix/bert-mobile')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# output.last_hidden_state has shape (batch, sequence_length, hidden_size)
```
|
150 |
+
|
151 |
+
## Training Details
|
152 |
+
|
153 |
+
### Training Data
|
154 |
+
|
155 |
+
The model was pretrained on:
|
156 |
+
- **BookCorpus**: 11,038 unpublished books, providing diverse narrative text.
|
157 |
+
- **English Wikipedia**: Excluding lists, tables, and headers for clean, factual content.
|
158 |
+
|
159 |
+
See the [BoltUIX Dataset Card](https://huggingface.co/boltuix/datasets) for more details.
|
160 |
+
|
161 |
+
### Training Procedure
|
162 |
+
|
163 |
+
#### Preprocessing
|
164 |
+
|
165 |
+
- Texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000.
|
166 |
+
- Inputs are formatted as: `[CLS] Sentence A [SEP] Sentence B [SEP]`.
|
167 |
+
- 50% of the time, Sentence A and B are consecutive; otherwise, Sentence B is random.
|
168 |
+
- Masking:
|
169 |
+
- 15% of tokens are masked.
|
170 |
+
- 80% of masked tokens are replaced with `[MASK]`.
|
171 |
+
- 10% are replaced with a random token.
|
172 |
+
- 10% are left unchanged.
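
The preprocessing steps above can be sketched in plain Python. This is a simplified illustration, not the actual data pipeline: the function names and toy vocabulary are hypothetical, each token is selected independently with probability 0.15 (rather than selecting a fixed number of positions), and `[CLS]`/`[SEP]` are assumed never to be masked.

```python
import random

MASK_PROB = 0.15  # fraction of tokens selected for prediction


def build_pair(sentence_a, next_sentence, random_sentence, rng):
    """Return (tokens, is_next): 50% the true next sentence, 50% a random one."""
    is_next = rng.random() < 0.5
    sentence_b = next_sentence if is_next else random_sentence
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    return tokens, is_next


def mask_tokens(tokens, vocab, rng):
    """Select ~15% of tokens and apply the 80/10/10 replacement rule."""
    masked = list(tokens)
    labels = [None] * len(tokens)  # None means "not a prediction target"
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or rng.random() >= MASK_PROB:
            continue
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary
tokens, is_next = build_pair(["the", "cat", "sat"], ["on", "the", "mat"],
                             ["the", "dog", "ran"], rng)
masked, labels = mask_tokens(tokens, vocab, rng)
print(tokens, is_next)
print(masked, labels)
```

The `labels` list keeps the original token at every selected position, including the positions left unchanged, since the model is still asked to predict those.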
|
173 |
+
|
174 |
+
#### Training Hyperparameters
|
175 |
+
|
176 |
+
- **Training regime:** fp16 mixed precision
|
177 |
+
- **Optimizer**: Adam (learning rate 1e-4, β1=0.9, β2=0.999, weight decay 0.01)
|
178 |
+
- **Batch size**: 256
|
179 |
+
- **Steps**: 900,000
|
180 |
+
- **Sequence length**: 128 tokens (90% of steps), 512 tokens (10% of steps)
|
181 |
+
- **Warmup**: 9,000 steps of learning rate warmup, followed by linear learning rate decay
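
As an illustration of the schedule these hyperparameters imply, the sketch below computes the learning rate per step. It assumes the warmup is linear from zero and that the decay reaches zero at the final step; neither detail is stated in this card, and `learning_rate` is a hypothetical helper.

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 9_000
TOTAL_STEPS = 900_000


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 4_500, 9_000, 454_500, 900_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks at step 9,000 and is back to half the peak at step 454,500, the midpoint of the decay phase.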
|
182 |
+
|
183 |
+
#### Speeds, Sizes, Times
|
184 |
+
|
185 |
+
- **Training time**: Approximately 180 hours
|
186 |
+
- **Checkpoint size**: ~140 MB (quantizable to ~25 MB)
|
187 |
+
- **Throughput**: ~110 sentences/second on TPU infrastructure
|
188 |
+
|
189 |
+
## Evaluation
|
190 |
+
|
191 |
+
### Testing Data, Factors & Metrics
|
192 |
+
|
193 |
+
#### Testing Data
|
194 |
+
|
195 |
+
The model was evaluated on the GLUE benchmark, including MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, and RTE.
|
196 |
+
|
197 |
+
#### Factors
|
198 |
+
|
199 |
+
- **Subpopulations**: General English text, academic, and professional domains
|
200 |
+
- **Domains**: News, books, Wikipedia, scientific articles
|
201 |
+
|
202 |
+
#### Metrics
|
203 |
+
|
204 |
+
- **Accuracy**: For classification tasks (e.g., MNLI, SST-2)
|
205 |
+
- **F1 Score**: For tasks like QQP, MRPC
|
206 |
+
- **Pearson/Spearman Correlation**: For STS-B
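
For reference, the metrics listed above can be computed as in the following plain-Python sketch. The labels and scores are made-up toy data, not model outputs, and the helper names are hypothetical. Spearman correlation is omitted; it is the Pearson correlation computed on ranks.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


y_true = [1, 0, 1, 1, 0, 1]  # toy gold labels
y_pred = [1, 0, 0, 1, 1, 1]  # toy predictions
print(accuracy(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```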
|
207 |
+
|
208 |
+
### Results
|
209 |
+
|
210 |
+
GLUE test results (fine-tuned):
|
211 |
+
| Task  | MNLI-(m/mm) | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Average |
|-------|-------------|------|------|-------|------|-------|------|------|---------|
| Score | 83.9/82.7   | 71.5 | 89.9 | 92.7  | 51.8 | 85.0  | 88.0 | 66.2 | 79.0    |
|
214 |
+
|
215 |
+
#### Summary
|
216 |
+
|
217 |
+
The model delivers strong performance across GLUE tasks for a mobile-optimized model, with notable results in SST-2 and QNLI. It outperforms smaller variants like `boltuix/bert-mid` in tasks such as RTE and CoLA, making it a robust choice for mobile applications.
|
218 |
+
|
219 |
+
## Model Examination
|
220 |
+
|
221 |
+
The model’s attention mechanisms were analyzed to ensure effective contextual understanding optimized for mobile deployment, with no significant overfitting observed during pretraining. Ablation studies validated the training configuration for efficient performance.
|
222 |
+
|
223 |
+
## Environmental Impact
|
224 |
+
|
225 |
+
Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
|
226 |
+
|
227 |
+
- **Hardware Type**: 4 cloud TPUs (16 TPU chips)
|
228 |
+
- **Hours used**: 180 hours
|
229 |
+
- **Cloud Provider**: Google Cloud
|
230 |
+
- **Compute Region**: us-central1
|
231 |
+
- **Carbon Emitted**: ~130 kg CO2eq (estimated based on TPU energy consumption and regional grid carbon intensity)
|
232 |
+
|
233 |
+
## Technical Specifications
|
234 |
+
|
235 |
+
### Model Architecture and Objective
|
236 |
+
|
237 |
+
- **Architecture**: BERT (transformer-based, bidirectional)
|
238 |
+
- **Objective**: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
|
239 |
+
- **Layers**: 8
|
240 |
+
- **Hidden Size**: 512
|
241 |
+
- **Attention Heads**: 8
|
242 |
+
|
243 |
+
### Compute Infrastructure
|
244 |
+
|
245 |
+
#### Hardware
|
246 |
+
|
247 |
+
- 4 cloud TPUs in Pod configuration (16 TPU chips total)
|
248 |
+
|
249 |
+
#### Software
|
250 |
+
|
251 |
+
- PyTorch
|
252 |
+
- Transformers library (Hugging Face)
|
253 |
+
|
254 |
+
## Citation
|
255 |
+
|
256 |
+
**BibTeX:**
|
257 |
+
```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805}
}
```
|
269 |
+
|
270 |
+
**APA:**
|
271 |
+
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. *CoRR, abs/1810.04805*. http://arxiv.org/abs/1810.04805
|
272 |
+
|
273 |
+
## Glossary
|
274 |
+
|
275 |
+
- **MLM**: Masked Language Modeling, where 15% of tokens are masked for prediction.
|
276 |
+
- **NSP**: Next Sentence Prediction, determining if two sentences are consecutive.
|
277 |
+
- **WordPiece**: Tokenization method splitting words into subword units.
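
To illustrate the WordPiece entry, here is a minimal sketch of greedy longest-match-first subword splitting in plain Python. The vocabulary is a hypothetical toy set and `wordpiece_tokenize` is an illustrative helper, not the tokenizer shipped with the model; the `##` prefix for word-internal pieces follows the usual BERT convention.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk_token]  # no vocabulary entry covers this position
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "play", "##ing", "##ed"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, a word that cannot be fully covered by known pieces maps to a single `[UNK]` token rather than a partial split.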
|
278 |
+
|
279 |
+
## More Information
|
280 |
+
|
281 |
+
- See the [Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/bert) for advanced usage details.
|
282 |
+
- Contact: [email protected]
|
283 |
+
|
284 |
+
## Model Card Authors
|
285 |
+
|
286 |
+
- Hugging Face team
|
287 |
+
- BoltUIX contributors
|
288 |
+
|
289 |
+
## Model Card Contact
|
290 |
+
|
291 |
+
For questions, please contact [email protected] or open an issue on the [model repository](https://huggingface.co/boltuix/bert-mobile).
|