# Language Detection Model
A BERT-based language detection model trained on hac541309/open-lid-dataset, which contains 121 million sentences across 200 languages. The model is optimized for fast and accurate language identification.
## Model Details
- Architecture: BertForSequenceClassification
- Hidden Size: 384
- Number of Layers: 4
- Attention Heads: 6
- Max Sequence Length: 512
- Dropout: 0.1
- Vocabulary Size: 50,257
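For orientation, the details above map onto a standard `BertConfig`. This is a minimal sketch, not the repository's actual configuration: the label count (200, per the card) is an assumption, and the `config.json` shipped with the model remains the source of truth.

```python
from transformers import BertConfig, BertForSequenceClassification

# Hypothetical reconstruction from the numbers above; the config.json shipped
# with the model is authoritative (e.g. for intermediate size and label count).
config = BertConfig(
    vocab_size=50_257,
    hidden_size=384,
    num_hidden_layers=4,
    num_attention_heads=6,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    num_labels=200,  # one output class per supported language, per the card
)
model = BertForSequenceClassification(config)
```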
## Training Process
- Dataset:
  - Used the open-lid-dataset
  - Split into train (90%) and test (10%)
- Tokenizer: A custom `BertTokenizerFast` with special tokens for `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, and `[MASK]`
- Hyperparameters:
  - Learning Rate: 2e-5
  - Batch Size: 256 (training) / 512 (testing)
  - Epochs: 1
  - Scheduler: Cosine
- Trainer: Leveraged the Hugging Face Trainer API with Weights & Biases for logging
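Put together, a training run matching these hyperparameters could look roughly like the sketch below. The dataset column names (`text`, `label`), the omitted label-to-id mapping, and loading the tokenizer from the published checkpoint are assumptions; the actual training scripts are not reproduced here.

```python
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          Trainer, TrainingArguments)

# Hypothetical reconstruction of the described setup; column names and the
# label-to-id mapping (omitted here) are assumptions, not taken from the repo.
dataset = load_dataset("hac541309/open-lid-dataset")["train"].train_test_split(test_size=0.1)

tokenizer = BertTokenizerFast.from_pretrained("alexneakameni/language_detection")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

model = BertForSequenceClassification.from_pretrained("alexneakameni/language_detection")

args = TrainingArguments(
    output_dir="language_detection",
    learning_rate=2e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=512,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    report_to="wandb",  # Weights & Biases logging
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
trainer.train()
```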
## Evaluation
The model was evaluated on the test split. Below are the overall metrics:
- Accuracy: 0.969466
- Precision: 0.969586
- Recall: 0.969466
- F1 Score: 0.969417
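The closeness of precision, recall, and F1 suggests support-weighted averaging over the 200 classes. A minimal `compute_metrics` sketch that produces metrics in this form with scikit-learn (the averaging choice is an assumption; the actual evaluation code lives in the repository):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Weighted averaging is an assumption based on the reported numbers.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```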
Detailed evaluation by script (Size is the number of supported languages written in each script):
Script | Support | Precision | Recall | F1 Score | Size |
---|---|---|---|---|---|
Arab | 819219 | 0.9038 | 0.9014 | 0.9023 | 21 |
Latn | 7924704 | 0.9678 | 0.9663 | 0.9670 | 125 |
Ethi | 144403 | 0.9967 | 0.9964 | 0.9966 | 2 |
Beng | 163983 | 0.9949 | 0.9935 | 0.9942 | 3 |
Deva | 423895 | 0.9495 | 0.9326 | 0.9405 | 10 |
Cyrl | 831949 | 0.9899 | 0.9883 | 0.9891 | 12 |
Tibt | 35683 | 0.9925 | 0.9930 | 0.9927 | 2 |
Grek | 131155 | 0.9984 | 0.9990 | 0.9987 | 1 |
Gujr | 86912 | 0.99999 | 0.9999 | 0.99995 | 1 |
Hebr | 100530 | 0.9966 | 0.9995 | 0.9981 | 2 |
Armn | 67203 | 0.9999 | 0.9998 | 0.9998 | 1 |
Jpan | 88004 | 0.9983 | 0.9987 | 0.9985 | 1 |
Knda | 67170 | 0.9999 | 0.9998 | 0.9999 | 1 |
Geor | 70769 | 0.99997 | 0.9998 | 0.9999 | 1 |
Khmr | 39708 | 1.0000 | 0.9997 | 0.9999 | 1 |
Hang | 108509 | 0.9997 | 0.9999 | 0.9998 | 1 |
Laoo | 29389 | 0.9999 | 0.9999 | 0.9999 | 1 |
Mlym | 68418 | 0.99996 | 0.9999 | 0.9999 | 1 |
Mymr | 100857 | 0.9999 | 0.9992 | 0.9995 | 2 |
Orya | 44976 | 0.9995 | 0.9998 | 0.9996 | 1 |
Guru | 67106 | 0.99999 | 0.9999 | 0.9999 | 1 |
Olck | 22279 | 1.0000 | 0.9991 | 0.9995 | 1 |
Sinh | 67492 | 1.0000 | 0.9998 | 0.9999 | 1 |
Taml | 76373 | 0.99997 | 0.9999 | 0.9999 | 1 |
Tfng | 41325 | 0.8512 | 0.8246 | 0.8247 | 2 |
Telu | 62387 | 0.99997 | 0.9999 | 0.9999 | 1 |
Thai | 83820 | 0.99995 | 0.9998 | 0.9999 | 1 |
Hant | 152723 | 0.9945 | 0.9954 | 0.9949 | 2 |
Hans | 92689 | 0.9893 | 0.9870 | 0.9882 | 1 |
A detailed per-script classification report is also provided in the repository for further analysis.
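For reference, per-script rows like those above can be derived from per-language results by grouping on the script part of each label, assuming the FLORES-200-style `lang_Script` codes (e.g. `eng_Latn`) used by the open-lid-dataset. A hypothetical sketch with dummy numbers:

```python
from collections import defaultdict

# Dummy per-language results purely for illustration; real values come from
# the per-language classification report in the repository.
per_language = {
    "eng_Latn": {"support": 120_000, "f1": 0.99},
    "fra_Latn": {"support": 80_000, "f1": 0.98},
    "arb_Arab": {"support": 60_000, "f1": 0.95},
}

by_script = defaultdict(lambda: {"support": 0, "f1_sum": 0.0, "size": 0})
for label, stats in per_language.items():
    script = label.split("_")[1]
    by_script[script]["support"] += stats["support"]
    by_script[script]["f1_sum"] += stats["f1"] * stats["support"]  # support-weighted
    by_script[script]["size"] += 1

for script, agg in by_script.items():
    print(script, agg["support"], round(agg["f1_sum"] / agg["support"], 4), agg["size"])
```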
## How to Use
You can quickly load and run inference with this model using the Transformers pipeline:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("alexneakameni/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("alexneakameni/language_detection")

language_detection = pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "Hello world!"
predictions = language_detection(text)
print(predictions)
```
This will output the predicted language code or label with the corresponding confidence score.
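If you need more than the single best guess, the text-classification pipeline also accepts a `top_k` argument in recent transformers releases:

```python
# Return the five most likely languages instead of only the top one.
predictions = language_detection("Bonjour tout le monde!", top_k=5)
print(predictions)  # list of {"label": ..., "score": ...} dicts, sorted by score
```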
Note: The model’s performance may vary depending on text length, language variety, and domain-specific vocabulary. Always validate results against your own datasets for critical applications.
For more information, see the repository documentation.
Thank you for using this model—feedback and contributions are welcome!