---
license: mit
datasets:
- oscar-corpus/OSCAR-2301
language:
- ta
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: fill-mask
library_name: transformers
---
# **Fine-Tuned mBERT for Enhanced Tamil NLP**  
### *Optimized with 100K OSCAR Tamil Data Points*

## **Model Overview**
This model is a fine-tuned version of **Multilingual BERT (mBERT)** trained on the Tamil subset of the **OSCAR** corpus, using 100,000 samples to strengthen Tamil language understanding. Fine-tuning improves the model's handling of Tamil text, making it a suitable backbone for downstream NLP tasks such as text classification, named entity recognition, and question answering.

## **Dataset Details**
- **Dataset Name**: OSCAR (Open Super-large Crawled Aggregated coRpus) – Tamil subset (`oscar-corpus/OSCAR-2301`)  
- **Size**: 100K samples  
- **Preprocessing**: Text normalization, tokenization with the mBERT tokenizer, and noise removal for improved data quality; a loading sketch follows this list  
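
The exact preprocessing script is not published. As a rough sketch (assuming the `datasets` library, streaming access to the gated OSCAR-2301 repository, and a simple whitespace normalization standing in for the actual cleanup), the Tamil subset could be loaded like this:

```python
from datasets import load_dataset

# OSCAR-2301 is gated: accept the terms on the Hub and log in first
# (e.g. `huggingface-cli login`). Streaming avoids a full download.
stream = load_dataset(
    "oscar-corpus/OSCAR-2301",
    language="ta",
    split="train",
    streaming=True,
    token=True,
)

def normalize(example):
    # Illustrative cleanup only: collapse whitespace. The exact
    # normalization/noise-removal pipeline is not documented.
    example["text"] = " ".join(example["text"].split())
    return example

# Keep the first 100,000 documents (assumed selection strategy).
sample = stream.map(normalize).take(100_000)
```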

## **Model Specifications**
- **Base Model**: `bert-base-multilingual-cased`  
- **Training**: Continued masked-language-model (MLM) pretraining on Tamil text; a representative setup is sketched after this list  
- **Tokenizer Used**: mBERT's multilingual WordPiece tokenizer  
- **Batch Size**: Not reported; chosen to fit the available hardware  
- **Objective**: Improve Tamil language representation in mBERT for downstream NLP tasks  
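
The training script and hyperparameters are not published. Below is a minimal, representative MLM fine-tuning setup with the Hugging Face `Trainer`; the batch size, learning rate, and step count are placeholders, not the values actually used:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-multilingual-cased")

# 100K Tamil documents, as in the dataset sketch above.
raw = load_dataset("oscar-corpus/OSCAR-2301", language="ta",
                   split="train", streaming=True, token=True).take(100_000)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Column names assume OSCAR-2301's schema (id, text, meta).
train_set = raw.map(tokenize, batched=True, remove_columns=["id", "text", "meta"])

# Standard BERT-style masking: 15% of tokens are masked for prediction.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tamil-mlm",
    per_device_train_batch_size=16,  # placeholder; actual value not reported
    learning_rate=5e-5,              # placeholder
    max_steps=10_000,                # placeholder; required with a streamed dataset
)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_set).train()
```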

## **Usage**
This model can be used for multiple NLP tasks in Tamil, such as:  
✅ Text Classification  
✅ Named Entity Recognition (NER)  
✅ Sentiment Analysis  
✅ Question Answering  
✅ Sentence Embeddings  

## **How to Use the Model**
To load the model in Python using **Hugging Face Transformers**, use the following code snippet:

```python
from transformers import AutoTokenizer, AutoModel

model_name = "viswadarshan06/Tamil-MLM"  # Model ID on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenizing a sample Tamil text
text = "தமிழ் மொழியில் இயற்கை மொழி செயலாக்கம் முக்கியம்!"
tokens = tokenizer(text, return_tensors="pt")

# Getting model embeddings
outputs = model(**tokens)
print(outputs.last_hidden_state.shape)  # Output shape: (batch_size, seq_length, hidden_size)
```
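
Because the model's pipeline tag is `fill-mask`, the masked-language-modeling head can also be exercised directly via the `pipeline` API; the example sentence below is illustrative:

```python
from transformers import pipeline

# Fill-mask is this model's pipeline tag; mBERT's mask token is [MASK].
fill = pipeline("fill-mask", model="viswadarshan06/Tamil-MLM")

# "Tamil is a [MASK] language."
for pred in fill("தமிழ் ஒரு [MASK] மொழி."):
    print(pred["token_str"], round(pred["score"], 4))
```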

## Performance & Evaluation
The model was evaluated on downstream Tamil tasks to validate the improved representations and shows stronger contextual understanding of Tamil text than the base mBERT checkpoint; no quantitative scores are reported on this card.
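
As one intrinsic sanity check (a sketch, assuming a small held-out Tamil sample), the masked-LM loss and pseudo-perplexity of this model can be compared against the base checkpoint:

```python
import math
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

def mlm_loss(model_name, texts):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name).eval()
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)
    torch.manual_seed(0)  # same random masks for every model compared
    batch = collator([tokenizer(t, truncation=True, max_length=128) for t in texts])
    with torch.no_grad():
        return model(**batch).loss.item()

# Replace with a real held-out Tamil sample for a meaningful comparison.
held_out = ["தமிழ் மொழியில் இயற்கை மொழி செயலாக்கம் முக்கியம்!"]

for name in ["google-bert/bert-base-multilingual-cased", "viswadarshan06/Tamil-MLM"]:
    loss = mlm_loss(name, held_out)
    print(f"{name}: loss={loss:.3f} pseudo-perplexity={math.exp(loss):.2f}")
```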

## Conclusion
This fine-tuned mBERT model helps close the resource gap in Tamil NLP by combining large-scale multilingual pretraining with Tamil-specific continued training, making it a useful resource for researchers and developers building Tamil NLP applications.