---
license: apache-2.0
language:
- lt
library_name: transformers
pipeline_tag: text-classification
tags:
- DistilBERT
- SentimentAnalysis
- LithuanianReviews
---
# DistilBERT Base Model for Lithuanian Reviews Sentiment Analysis

## Overview
This repository contains a fine-tuned version of the distilbert/distilbert-base-multilingual-cased model for sentiment classification.
It was trained on Lithuanian internet reviews from several domains as part of a master's degree research project on the topic
"Sentiment Analysis of Lithuanian Online Reviews Using Deep Language Models".

DistilBERT is a smaller, faster, and more efficient version of BERT, retaining 97% of BERT's language understanding while being 60% faster and 40% smaller.
The base DistilBERT model was pre-trained on Wikipedia data across 104 languages, including Lithuanian. Because the model is case-sensitive, it can differentiate between 'labai nepatiko' and 'LABAI nepatiko'.
For more architectural details, refer to the [distilbert/distilbert-base-multilingual-cased](https://huggingface.co/distilbert/distilbert-base-multilingual-cased) model description.

### Model Details

#### Model Description

- **Developed by:** Brigita Vileikytė
- **Model type:** Transformer-based language model
- **Language(s) (NLP):** fine-tuned for Lithuanian; pre-trained on 104 languages
- **License:** Apache 2.0
- **Finetuned from model:** distilbert/distilbert-base-multilingual-cased

#### Bias, Risks, and Limitations

While the fine-tuned DistilBERT model shows promising results in classifying sentiments from Lithuanian reviews, it is important to be aware of potential biases and limitations:

##### Dataset Bias

1. **Imbalance in Sentiment Distribution**: The dataset contains more positive reviews than negative or neutral ones. This imbalance can lead the model to perform better on positive sentiments and less accurately on neutral or negative ones.
2. **Source Bias**: Reviews were collected from specific sources (Pigu.lt, Atsiliepimai.lt, Google Maps). These sources may not represent the full spectrum of sentiments expressed across all Lithuanian internet domains.

##### Practical Considerations

1. **Interpretation of Sentiments**: Sentiments are subjective, and the model's classification might not always align with human judgment. Users should consider the model's predictions as one of several tools for sentiment analysis.
2. **Updates and Maintenance**: The model's performance may degrade as language usage evolves. Regular updates and retraining with new data can help maintain accuracy.

## Training Details

### Training Data

The dataset for fine-tuning the model was collected from three sources:
1. [Pigu.lt](https://pigu.lt/lt/) - 5993 reviews
2. [Atsiliepimai.lt](https://atsiliepimai.lt/) - 3212 reviews
3. [Google Maps](https://www.google.com/maps) - 122795 reviews

The reviews were classified into five categories based on a 5-star rating system:
- **5 stars**: Emotionally positive sentiment (Category 4)
- **4 stars**: Rationally positive sentiment (Category 3)
- **3 stars**: Neutral sentiment (Category 2)
- **2 stars**: Rationally negative sentiment (Category 1)
- **1 star**: Emotionally negative sentiment (Category 0)

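The star-to-category mapping above amounts to a simple offset. A minimal sketch of how such labelling could be derived during preprocessing (the function and constant names are illustrative, not taken from the project code):

```python
# Category 0 = emotionally negative ... Category 4 = emotionally positive,
# matching the mapping listed above.
CATEGORY_NAMES = [
    "emotionally negative",
    "rationally negative",
    "neutral",
    "rationally positive",
    "emotionally positive",
]

def stars_to_category(stars: int) -> int:
    """Convert a 1-5 star review rating to a 0-4 category index."""
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating between 1 and 5, got {stars}")
    return stars - 1

print(stars_to_category(5), CATEGORY_NAMES[stars_to_category(5)])  # 4 emotionally positive
print(stars_to_category(1), CATEGORY_NAMES[stars_to_category(1)])  # 0 emotionally negative
```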
## Evaluation

### Performance Metrics

| Model      | Accuracy | F1 Score Overall | F1 Scores by Category (0-4)            |
|------------|----------|------------------|----------------------------------------|
| DistilBERT | 0.6845   | 0.6751           | 0.7601, 0.3556, 0.4938, 0.4513, 0.8354 |

### Results

The model's performance was evaluated using a confusion matrix and various metrics. The table below presents the results for all five sentiment categories:

| True Category        | Emotionally Negative | Rationally Negative | Neutral      | Rationally Positive | Emotionally Positive |
|----------------------|----------------------|---------------------|--------------|---------------------|----------------------|
| Emotionally Negative | 2135 (80.74%)        | 248 (9.38%)         | 197 (7.45%)  | 82 (3.10%)          | 83 (3.14%)           |
| Rationally Negative  | 362 (26.32%)         | 402 (29.20%)        | 232 (16.85%) | 71 (5.15%)          | 40 (2.91%)           |
| Neutral              | 237 (12.76%)         | 217 (11.69%)        | 984 (53.00%) | 396 (21.31%)        | 280 (15.08%)         |
| Rationally Positive  | 48 (2.63%)           | 32 (1.75%)          | 299 (16.41%) | 1030 (56.51%)       | 978 (53.60%)         |
| Emotionally Positive | 71 (1.14%)           | 25 (0.40%)          | 149 (2.37%)  | 590 (9.39%)         | 5645 (89.61%)        |

The table below presents the results for three sentiment categories:

| True Category | Negative      | Neutral      | Positive      |
|---------------|---------------|--------------|---------------|
| Negative      | 3147 (75.79%) | 429 (10.34%) | 276 (6.65%)   |
| Neutral       | 454 (14.90%)  | 984 (32.18%) | 676 (22.09%)  |
| Positive      | 217 (2.98%)   | 445 (6.11%)  | 8243 (91.01%) |

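Aggregate metrics follow directly from the raw confusion-matrix counts. A minimal sketch using the three-category counts above (plain Python, no dependencies; note that the percentages shown in the table use a normalisation that is not stated, so the per-class recalls computed here will not necessarily match them):

```python
# 3-class confusion matrix from the table above: rows = true, cols = predicted.
# Class order: negative, neutral, positive.
matrix = [
    [3147, 429, 276],   # true negative
    [454, 984, 676],    # true neutral
    [217, 445, 8243],   # true positive
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(3))
accuracy = correct / total

# Per-class recall: diagonal count divided by the row total.
recalls = [matrix[i][i] / sum(matrix[i]) for i in range(3)]

print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.8321
print("recall (neg, neu, pos):", [f"{r:.4f}" for r in recalls])
```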
## Getting Started

### Model Usage

To use the fine-tuned model for sentiment analysis, you can follow the steps below:

```python
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and tokenizer
model_output_dir = "brivil1/lithuanian-sentiment-analysis-ByT5"
trained_model = AutoModelForSequenceClassification.from_pretrained(model_output_dir)
trained_tokenizer = AutoTokenizer.from_pretrained(model_output_dir)

# Create a sentiment analysis pipeline
sentiment_pipeline = pipeline("text-classification", model=trained_model, tokenizer=trained_tokenizer)
```

### Example
```python
print(sentiment_pipeline("Blogai. ziauru ir nepatiko"))
print(sentiment_pipeline("Labai puiku"))
print(sentiment_pipeline("Nežinau, visai nepatinka"))
```
Results:
```
[{'label': 'negative', 'score': 0.9424479007720947}]
[{'label': 'positive', 'score': 0.8821539282798767}]
[{'label': 'neutral', 'score': 0.9761189222335815}]
```