ahs95 committed
Commit 83f8d5b · verified · 1 Parent(s): 8b3d44a

Update README.md

Files changed (1)
  1. README.md +80 -41
README.md CHANGED
@@ -4,69 +4,108 @@ language:
  - bn
  metrics:
  - f1
+ - precision
+ - recall
  base_model:
  - csebuetnlp/banglabert_small
  pipeline_tag: text-classification
  library_name: transformers
  tags:
- - bangla-nlp
+ - bangla
  - sentiment-analysis
  - sarcasm-detection
+ - low-resource
+ - sports-analytics
+ - social-media
  ---
 
- # Bangla Sentiment and Sarcasm Detection Model

- This repository hosts the trained model for detecting sentiment and sarcasm in Bangla social media comments, specifically focusing on reactions to Bangladesh's performance in the 2023 ICC Cricket World Cup. The model is designed to classify comments into sentiment categories (positive, negative, neutral) and identify sarcasm (sarcastic, non-sarcastic).

- ## 📚 Overview

- The model is based on a dual-head transformer architecture fine-tuned using **BanglaBERT**. It addresses class imbalance through focal loss and employs multilabel stratified k-fold cross-validation for robust evaluation.

- ## 🧠 Key Features

- - **Manually Annotated Dataset**: Utilizes a comprehensive collection of **5,635** Bangla comments.
- - **Custom Dual-Head Classification Model**: Jointly detects sentiment and sarcasm.
- - **Focal Loss Integration**: Effectively manages class imbalance in the dataset.
- - **Multilabel Stratified K-Fold Cross-Validation**: Ensures reliable model evaluation.
- - **Interactive Gradio Interface**: Provides real-time predictions and user interaction.
- - **Open Source**: Publicly available [code and dataset](https://github.com/ahs95/sentiment-analysis-cwcbd23) for reproducibility and further research.

- ## 📁 Dataset

- The dataset used for training is the largest publicly available collection of Bangla comments focused on sentiment and sarcasm detection:

- - **Source**: Social media comments related to Bangladesh’s 2023 ICC Cricket World Cup performance.
- - **Size**: **5,635** manually annotated samples.
- - **Labels**:
- - **Sentiment**: Positive / Negative / Neutral
- - **Sarcasm**: Sarcastic / Non-sarcastic

- ## 🤖 Model Architecture

- - **Base Model**: BanglaBERT
- - **Custom Head**: Dual-output head for multi-task classification.
- - **Loss Function**: Combined focal loss for both tasks.
- - **Training Strategy**: Multilabel stratified k-fold cross-validation to enhance model performance and reliability.

- ## 🚀 Usage

- To use the model for inference, you can follow these steps:

- 1. Install the required libraries:
- ```bash
- pip install transformers torch
- ```

- 2. Load the model:
- ```python
- from transformers import AutoModelForSequenceClassification, AutoTokenizer

- model = AutoModelForSequenceClassification.from_pretrained("ahs95/sentiment-sarcasm-detection-BanglaBERT")
- tokenizer = AutoTokenizer.from_pretrained("ahs95/sentiment-sarcasm-detection-BanglaBERT")
- ```

- 3. Make predictions:
- ```python
- inputs = tokenizer("মায়ের দোয়া ক্রিকেট বোর্ডে আপনাকে স্বাগতম", return_tensors="pt")
- outputs = model(**inputs)
- ```
 
+ # BanglaBERT Dual-Head Model for Sentiment and Sarcasm Detection

+ ## Overview

+ This repository contains a **fine-tuned BanglaBERT model** for **dual-head multi-label classification**, detecting both **sentiment** (positive, neutral, negative) and **sarcasm** (sarcastic, non-sarcastic) in Bangla social media text.
+ The model is designed for **low-resource NLP** and is trained on a manually annotated dataset of **5,635 Bangla Facebook and YouTube comments** related to Bangladesh’s performance in the **2023 ICC Cricket World Cup**.

+ ## Model Architecture

+ * **Base Model:** [csebuetnlp/banglabert_small](https://huggingface.co/csebuetnlp/banglabert_small)
+ * **Architecture:** Transformer-based dual-head classification (a minimal sketch follows below)
+   * Head 1: Sentiment Classification (3 classes)
+   * Head 2: Sarcasm Detection (2 classes)
+ * **Training Techniques:**
+   * Focal Loss with class weighting to handle **severe data imbalance**
+   * Multilabel stratified K-fold cross-validation
+   * Domain-specific data preprocessing for Bangla text

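+ As a rough illustration of the dual-head setup and focal loss described above, the training-time architecture can be sketched as follows. This is a minimal sketch only: the class names, head layout, and focal-loss variant are assumptions for illustration, not the exact code behind this checkpoint.

+ ```python
+ import torch
+ import torch.nn as nn
+ from transformers import AutoModel

+ class FocalLoss(nn.Module):
+     """Cross-entropy scaled by (1 - p_t)^gamma, with optional per-class weights."""
+     def __init__(self, gamma=2.0, weight=None):
+         super().__init__()
+         self.gamma = gamma
+         self.ce = nn.CrossEntropyLoss(weight=weight, reduction="none")

+     def forward(self, logits, targets):
+         ce = self.ce(logits, targets)
+         pt = torch.exp(-ce)  # probability the model assigned to the true class
+         return ((1.0 - pt) ** self.gamma * ce).mean()

+ class DualHeadBanglaBert(nn.Module):
+     """Shared BanglaBERT encoder with separate sentiment and sarcasm heads."""
+     def __init__(self, base="csebuetnlp/banglabert_small"):
+         super().__init__()
+         self.encoder = AutoModel.from_pretrained(base)
+         hidden = self.encoder.config.hidden_size
+         self.sentiment_head = nn.Linear(hidden, 3)  # positive / neutral / negative
+         self.sarcasm_head = nn.Linear(hidden, 2)    # sarcastic / non-sarcastic

+     def forward(self, input_ids, attention_mask=None, token_type_ids=None):
+         out = self.encoder(input_ids=input_ids, attention_mask=attention_mask,
+                            token_type_ids=token_type_ids)
+         cls = out.last_hidden_state[:, 0]  # [CLS] representation
+         return self.sentiment_head(cls), self.sarcasm_head(cls)

+ # During training, one focal loss per head is computed and the two are summed:
+ #   loss = FocalLoss()(sent_logits, sent_labels) + FocalLoss()(sarc_logits, sarc_labels)
+ ```
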
+ ## Dataset
 
+ * **Size:** 5,635 manually annotated comments
+ * **Labels:**
+   * Sentiment: Positive, Neutral, Negative
+   * Sarcasm: Sarcastic, Non-Sarcastic
+ * **Source:** Publicly available Facebook & YouTube comments (2023 ICC Cricket World Cup)

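+ For the multilabel stratified K-fold cross-validation listed under Training Techniques, the sentiment and sarcasm labels have to be stratified jointly when splitting these comments. A minimal sketch, assuming the `iterative-stratification` package and an illustrative one-hot label layout (not the dataset's actual schema):

+ ```python
+ import numpy as np
+ from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

+ # One row per comment; indicator columns:
+ # [positive, neutral, negative, sarcastic, non_sarcastic]
+ labels = np.array([
+     [0, 0, 1, 1, 0],  # negative, sarcastic
+     [1, 0, 0, 0, 1],  # positive, non-sarcastic
+     [0, 1, 0, 0, 1],  # neutral, non-sarcastic
+ ] * 20)
+ comments = np.arange(len(labels))  # stand-in for the comment texts

+ mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
+ for fold, (train_idx, val_idx) in enumerate(mskf.split(comments, labels)):
+     print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
+ ```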
 
+ ## Performance

+ | Task | Weighted F1 | Class-wise F1 (Minority) | Class-wise F1 (Majority) |
+ | ----------------- | ----------- | ----------------------------- | ------------------------ |
+ | Sentiment | **0.89** | Neutral: 0.69, Positive: 0.73 | Negative: 0.96 |
+ | Sarcasm Detection | **0.84** | Sarcastic: 0.60 | Non-Sarcastic: 0.91 |

+ **Key Gains:**

+ * +0.20 F1 improvement for Neutral sentiment
+ * +0.18 F1 improvement for Sarcastic content
+ * Attributed to focal loss + inverse class weighting

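+ For reference, the weighted and class-wise F1 scores reported above can be computed with scikit-learn as shown below (the arrays are dummy values, not the model's actual predictions):

+ ```python
+ from sklearn.metrics import f1_score

+ # Dummy gold labels and predictions for the 3-class sentiment task
+ y_true = [2, 2, 0, 1, 2, 0, 2, 1]  # e.g. 0 = positive, 1 = neutral, 2 = negative
+ y_pred = [2, 2, 0, 0, 2, 1, 2, 1]

+ print(f1_score(y_true, y_pred, average="weighted"))  # single weighted F1
+ print(f1_score(y_true, y_pred, average=None))        # per-class F1, one value per label
+ ```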
 
+ ## Example Usage

+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch

+ # Load tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained("ahs95/sentiment-sarcasm-detection-BanglaBERT")
+ model = AutoModelForSequenceClassification.from_pretrained("ahs95/sentiment-sarcasm-detection-BanglaBERT")

+ # Example Bangla text
+ text = "শিক্ষা সফর 2023 বাংলাদেশ টু ইন্ডিয়া সফল হোক"

+ # Tokenize
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

+ # Predict
+ with torch.no_grad():
+     outputs = model(**inputs)

+ # Raw logits
+ print(outputs.logits)
+ ```

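+ The raw logits still need to be mapped to labels. Continuing the snippet above, and assuming the checkpoint behaves as a standard single-head sequence-classification model (check `model.config.id2label` for the actual label order rather than hard-coding names):

+ ```python
+ import torch.nn.functional as F

+ probs = F.softmax(outputs.logits, dim=-1)           # logits -> probabilities
+ pred_id = int(probs.argmax(dim=-1))                 # highest-probability class index
+ print(pred_id, model.config.id2label.get(pred_id))  # label name stored in the config
+ ```
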
+ ## Intended Use
 
+ * **Sports analytics:** Track fan sentiment and sarcasm during live matches
+ * **Social media monitoring:** Identify sarcastic backlash and emotional trends
+ * **Brand reputation analysis:** Understand nuanced customer feedback in Bangla

+ ## Limitations

+ * Domain-specific: Trained on cricket-related data; performance may drop in other contexts
+ * Context sensitivity: Some sarcasm requires cultural or multimodal cues (e.g., emojis)
+ * Not suitable for toxic speech moderation without additional fine-tuning

+ ## Citation

+ If you use this model in your work, please cite:

+ ```bibtex
+ @misc{hoque2025banglabertsentimentsarcasm,
+   author    = {Arshadul Hoque and Nasrin Sultana and Risul Islam Rasel},
+   title     = {Bangla Sentiment and Sarcasm Detection: Reactions to Bangladesh's 2023 World Cup},
+   note      = {Manuscript under review},
+   year      = {2025},
+   publisher = {Hugging Face},
+   url       = {https://huggingface.co/ahs95/sentiment-sarcasm-detection-BanglaBERT}
+ }
+ ```