---
license: apache-2.0
language:
- lt
library_name: transformers
pipeline_tag: text-classification
tags:
- DistilBERT
- SentimentAnalysis
- LithuanianReviews
---
# DistilBERT Base Model for Lithuanian Reviews Sentiment Analysis

## Overview
This repository contains a fine-tuned version of the distilbert/distilbert-base-multilingual-cased model for sentiment classification.
It was trained on Lithuanian internet reviews from several domains as part of a master's degree research project on the topic
"Sentiment Analysis of Lithuanian Online Reviews Using Deep Language Models".

DistilBERT is a smaller, faster, and more efficient version of BERT, retaining 97% of BERT's language understanding while being 60% faster and 40% smaller.
The base DistilBERT model was pre-trained on Wikipedia data across 104 languages, including Lithuanian. Because the model is case-sensitive, it can differentiate between 'labai nepatiko' and 'LABAI nepatiko'.
For more architectural details, refer to the [distilbert/distilbert-base-multilingual-cased](https://huggingface.co/distilbert/distilbert-base-multilingual-cased) model description.

### Model Details

#### Model Description

- **Developed by:** Brigita Vileikytė
- **Model type:** Transformer-based language model
- **Language(s) (NLP):** fine-tuned for Lithuanian; pre-trained on 104 languages
- **License:** Apache 2.0
- **Finetuned from model:** distilbert/distilbert-base-multilingual-cased

#### Bias, Risks, and Limitations

While the fine-tuned DistilBERT model shows promising results in classifying sentiments from Lithuanian reviews, it is important to be aware of potential biases and limitations:

##### Dataset Bias

1. **Imbalance in Sentiment Distribution**: The dataset contains more positive reviews than negative or neutral ones. This imbalance can lead the model to perform better on positive sentiments and less accurately on neutral or negative ones.
2. **Source Bias**: Reviews were collected from specific sources (Pigu.lt, Atsiliepimai.lt, Google Maps). These sources may not represent the full spectrum of sentiments expressed across all Lithuanian internet domains.

##### Practical Considerations

1. **Interpretation of Sentiments**: Sentiments are subjective, and the model's classification might not always align with human judgment. Users should consider the model's predictions as one of several tools for sentiment analysis.
2. **Updates and Maintenance**: The model's performance may degrade as language usage evolves. Regular updates and retraining with new data can help maintain accuracy.

## Training Details

### Training Data

The dataset for fine-tuning the model was collected from three sources:
1. [Pigu.lt](https://pigu.lt/lt/) - 5993 reviews
2. [Atsiliepimai.lt](https://atsiliepimai.lt/) - 3212 reviews
3. [Google Maps](https://www.google.com/maps) - 122795 reviews

The reviews were classified into five categories based on a 5-star rating system:
- **5 stars**: Emotionally positive sentiment (Category 4)
- **4 stars**: Rationally positive sentiment (Category 3)
- **3 stars**: Neutral sentiment (Category 2)
- **2 stars**: Rationally negative sentiment (Category 1)
- **1 star**: Emotionally negative sentiment (Category 0)

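The star-to-category mapping above amounts to a simple offset. A minimal sketch of how such labelling could be derived during preprocessing (the function and constant names are illustrative, not taken from the project code):

```python
# Category 0 = emotionally negative ... Category 4 = emotionally positive,
# matching the mapping listed above.
CATEGORY_NAMES = [
    "emotionally negative",
    "rationally negative",
    "neutral",
    "rationally positive",
    "emotionally positive",
]

def stars_to_category(stars: int) -> int:
    """Convert a 1-5 star review rating to a 0-4 category index."""
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating between 1 and 5, got {stars}")
    return stars - 1

print(stars_to_category(5), CATEGORY_NAMES[stars_to_category(5)])  # 4 emotionally positive
print(stars_to_category(1), CATEGORY_NAMES[stars_to_category(1)])  # 0 emotionally negative
```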
## Evaluation

### Performance Metrics

| Model      | Accuracy | F1 Score Overall | F1 Scores by Category (0-4)            |
|------------|----------|------------------|----------------------------------------|
| DistilBERT | 0.6845   | 0.6751           | 0.7601, 0.3556, 0.4938, 0.4513, 0.8354 |

### Results

The model's performance was evaluated using a confusion matrix and various metrics. The table below presents the results for all five sentiment categories:

| True Category        | Emotionally Negative | Rationally Negative | Neutral      | Rationally Positive | Emotionally Positive |
|----------------------|----------------------|---------------------|--------------|---------------------|----------------------|
| Emotionally Negative | 2135 (80.74%)        | 248 (9.38%)         | 197 (7.45%)  | 82 (3.10%)          | 83 (3.14%)           |
| Rationally Negative  | 362 (26.32%)         | 402 (29.20%)        | 232 (16.85%) | 71 (5.15%)          | 40 (2.91%)           |
| Neutral              | 237 (12.76%)         | 217 (11.69%)        | 984 (53.00%) | 396 (21.31%)        | 280 (15.08%)         |
| Rationally Positive  | 48 (2.63%)           | 32 (1.75%)          | 299 (16.41%) | 1030 (56.51%)       | 978 (53.60%)         |
| Emotionally Positive | 71 (1.14%)           | 25 (0.40%)          | 149 (2.37%)  | 590 (9.39%)         | 5645 (89.61%)        |

The table below presents the results for three sentiment categories:

| True Category | Negative      | Neutral      | Positive      |
|---------------|---------------|--------------|---------------|
| Negative      | 3147 (75.79%) | 429 (10.34%) | 276 (6.65%)   |
| Neutral       | 454 (14.90%)  | 984 (32.18%) | 676 (22.09%)  |
| Positive      | 217 (2.98%)   | 445 (6.11%)  | 8243 (91.01%) |

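Aggregate metrics follow directly from the raw confusion-matrix counts. A minimal sketch using the three-category counts above (plain Python, no dependencies; note that the percentages shown in the table use a normalisation that is not stated, so the per-class recalls computed here will not necessarily match them):

```python
# 3-class confusion matrix from the table above: rows = true, cols = predicted.
# Class order: negative, neutral, positive.
matrix = [
    [3147, 429, 276],   # true negative
    [454, 984, 676],    # true neutral
    [217, 445, 8243],   # true positive
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(3))
accuracy = correct / total

# Per-class recall: diagonal count divided by the row total.
recalls = [matrix[i][i] / sum(matrix[i]) for i in range(3)]

print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.8321
print("recall (neg, neu, pos):", [f"{r:.4f}" for r in recalls])
```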
## Getting Started

### Model Usage

To use the fine-tuned model for sentiment analysis, you can follow the steps below:

```python
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and tokenizer
model_output_dir = "brivil1/lithuanian-sentiment-analysis-ByT5"
trained_model = AutoModelForSequenceClassification.from_pretrained(model_output_dir)
trained_tokenizer = AutoTokenizer.from_pretrained(model_output_dir)

# Create a sentiment analysis pipeline
sentiment_pipeline = pipeline("text-classification", model=trained_model, tokenizer=trained_tokenizer)
```

### Example
```python
print(sentiment_pipeline("Blogai. ziauru ir nepatiko"))
print(sentiment_pipeline("Labai puiku"))
print(sentiment_pipeline("Nežinau, visai nepatinka"))
```
Results:
```
[{'label': 'negative', 'score': 0.9424479007720947}]
[{'label': 'positive', 'score': 0.8821539282798767}]
[{'label': 'neutral', 'score': 0.9761189222335815}]
```