---
license: mit
datasets:
- oscar-corpus/OSCAR-2301
language:
- ta
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: fill-mask
library_name: transformers
---
# **Fine-Tuned mBERT for Enhanced Tamil NLP**
### *Optimized with 100K OSCAR Tamil Data Points*
## **Model Overview**
This model is a fine-tuned version of **Multilingual BERT (mBERT)** on the **OSCAR Tamil dataset**, using 100,000 samples to strengthen Tamil language understanding. The fine-tuning improves the model's handling of Tamil text, making it a suitable backbone for NLP tasks such as text classification, named entity recognition, and masked-token prediction.
## **Dataset Details**
- **Dataset Name**: OSCAR (Open Super-large Crawled Aggregated coRpus) – Tamil subset
- **Size**: 100K samples
- **Preprocessing**: Text normalization, tokenization using the mBERT tokenizer, and removal of noise for improved data quality.
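The exact normalization and noise-removal steps are not published; the sketch below is an assumption-based illustration of such a cleaning pass (the specific regex filters and the `clean_tamil_text` helper are hypothetical):

```python
import re
import unicodedata

def clean_tamil_text(text: str) -> str:
    """Hypothetical cleaning pass: Unicode normalization plus light noise removal."""
    text = unicodedata.normalize("NFC", text)  # canonical form for Tamil codepoints
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs (assumed noise filter)
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

print(clean_tamil_text("தமிழ்   மொழி   https://example.com"))  # -> "தமிழ் மொழி"
```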
## **Model Specifications**
- **Base Model**: `bert-base-multilingual-cased`
- **Training**: Continued pretraining with the masked language modeling (MLM) objective on Tamil text (see the sketch after this list)
- **Tokenizer Used**: The original mBERT WordPiece tokenizer, unchanged
- **Batch Size**: Tuned for training throughput (exact value not published)
- **Objective**: Improve Tamil language representation in mBERT for downstream NLP tasks
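The training recipe is not published in detail; below is a minimal sketch of standard MLM continued pretraining with `transformers`, assuming access to the gated `oscar-corpus/OSCAR-2301` dataset. The 100K slice, batch size, epoch count, and learning rate are illustrative assumptions, not the values actually used:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "google-bert/bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Tamil subset of OSCAR-2301 (gated; the "ta" config and 100K slice are assumptions)
raw = load_dataset("oscar-corpus/OSCAR-2301", "ta", split="train[:100000]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Standard BERT objective: mask 15% of tokens at random
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tamil-mlm",
    per_device_train_batch_size=16,  # illustrative; actual batch size not reported
    num_train_epochs=1,              # illustrative
    learning_rate=5e-5,              # illustrative
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```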
## **Usage**
This model can serve as a Tamil-aware backbone for a range of downstream NLP tasks (typically after task-specific fine-tuning), such as:
✅ Text Classification
✅ Named Entity Recognition (NER)
✅ Sentiment Analysis
✅ Question Answering
✅ Sentence Embeddings
## **How to Use the Model**
To load the model in Python using **Hugging Face Transformers**, use the following code snippet:
```python
from transformers import AutoTokenizer, AutoModel

model_name = "viswadarshan06/Tamil-MLM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a sample Tamil sentence
# ("Natural language processing is important in the Tamil language!")
text = "தமிழ் மொழியில் இயற்கை மொழி செயலாக்கம் முக்கியம்!"
tokens = tokenizer(text, return_tensors="pt")

# Run the encoder to obtain contextual embeddings
outputs = model(**tokens)
print(outputs.last_hidden_state.shape)  # (batch_size, seq_length, hidden_size)
```
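Since the card's pipeline tag is `fill-mask`, the model can also be queried for masked-token predictions directly. A minimal sketch, assuming the uploaded checkpoint includes the MLM head (the example sentence, "Tamil is a [MASK] language.", is illustrative):

```python
from transformers import pipeline

# Fill-mask pipeline; assumes the checkpoint was saved with its MLM head
fill_mask = pipeline("fill-mask", model="viswadarshan06/Tamil-MLM")

# "Tamil is a [MASK] language." (illustrative input; [MASK] is mBERT's mask token)
for prediction in fill_mask("தமிழ் ஒரு [MASK] மொழி."):
    print(prediction["token_str"], round(prediction["score"], 3))
```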
## **Performance & Evaluation**
The model was evaluated on downstream Tamil tasks to validate the improved language representation, and it shows better contextual understanding of Tamil text than the base mBERT model.
## **Conclusion**
This fine-tuned mBERT model helps close the resource gap in Tamil NLP by combining mBERT's large-scale multilingual pretraining with continued fine-tuning on Tamil text, making it a valuable starting point for researchers and developers building Tamil NLP applications.