DistilBERT IMDb Sentiment Classifier
Model Description
This is a fine-tuned version of DistilBERT for sentiment analysis on the IMDb movie review dataset. DistilBERT is a smaller, faster, and lighter variant of BERT, designed to perform efficiently while retaining the core strengths of BERT in natural language understanding.
The model is trained to classify movie reviews as expressing either positive or negative sentiment, making it well suited to applications where sentiment analysis is needed, such as analyzing customer feedback, social media posts, or product reviews.
Intended Use
This model is intended for text classification tasks, specifically sentiment analysis. It can be used to automatically label a piece of text as having either a positive or a negative sentiment.
Use Cases
- Movie review sentiment analysis
- Customer feedback analysis
- Social media sentiment monitoring
- Product review classification
How to Use
Here is how you can use this model with the Hugging Face transformers library:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch
# Load the model and tokenizer
model_name = "Ashaduzzaman/imdb-distilbert-funetuned"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)
# Example text
text = "The movie was absolutely fantastic! The acting was superb and the story was gripping."
# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predictions = torch.softmax(logits, dim=1)
# Get the predicted label
predicted_label = torch.argmax(predictions).item()
labels = ["Negative", "Positive"]
print(f"Predicted sentiment: {labels[predicted_label]}")
Training Data
This model was trained on the IMDb movie review dataset, a large dataset for binary sentiment classification. The dataset contains 50,000 highly polarized movie reviews. This dataset is balanced, with 25,000 positive and 25,000 negative reviews.
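For reference, the same dataset is hosted on the Hugging Face Hub and can be loaded with the datasets library; a minimal sketch, assuming the standard "imdb" dataset id:
from datasets import load_dataset
# Load the IMDb dataset: 25,000 training and 25,000 test reviews, labeled 0 (negative) or 1 (positive)
imdb = load_dataset("imdb")
print(imdb)              # DatasetDict with "train", "test", and an unlabeled "unsupervised" split
print(imdb["train"][0])  # {'text': '...review text...', 'label': 0}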
Training Procedure
The model was fine-tuned using the IMDb dataset with the following configuration:
- Optimizer: AdamW (Adam with betas=(0.9,0.999) and epsilon=1e-08)
- Learning Rate: 2e-5
- Batch Size: 16
- Epochs: 2
- Max Sequence Length: 512 tokens
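The exact training script is not included in this card, but a minimal sketch of an equivalent setup with the Trainer API is shown below. The output directory and tokenization helper are illustrative assumptions; the hyperparameters mirror the configuration listed above, and Trainer's default optimizer is AdamW with the betas and epsilon given above.
from transformers import (
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

base_model = "distilbert/distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(base_model)
model = DistilBertForSequenceClassification.from_pretrained(base_model, num_labels=2)

# Tokenize reviews, truncating to the 512-token maximum sequence length
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
tokenized = dataset.map(tokenize, batched=True)

# Hyperparameters matching the configuration above: lr 2e-5, batch size 16, 2 epochs
training_args = TrainingArguments(
    output_dir="imdb-distilbert-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
)
trainer.train()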
Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---|---|---|---|---|
| 0.2239 | 1.0 | 1563 | 0.2026 | 0.9227 |
| 0.1468 | 2.0 | 3126 | 0.2319 | 0.9320 |
Final results on the validation set:
- Loss: 0.2319
- Accuracy: 0.9320
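Accuracy of this kind is typically reported by passing a compute_metrics function to the Trainer; a rough sketch using the evaluate library, assuming the IMDb test split served as the validation set:
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) tuple produced by the Trainer during evaluation
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# Passed to the Trainer from the sketch above: Trainer(..., compute_metrics=compute_metrics)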
Limitations
- The model is specifically trained on the IMDb dataset, so its effectiveness may be reduced when applied to other domains or types of text.
- Sentiment detection is binary (positive or negative). Neutral sentiments or more nuanced emotions are not captured.
- The model may not perform well on text that is highly sarcastic, contains slang, or is very short (e.g., one-word reviews).
Ethical Considerations
- Bias: The model may reflect biases present in the IMDb dataset. Users should be cautious about applying this model to sensitive applications.
- Content: Since the IMDb dataset includes movie reviews, the model might not generalize well to text outside of this context.
Acknowledgments
- The original DistilBERT model was developed by Hugging Face.
- The IMDb dataset (the Large Movie Review Dataset) was compiled by researchers at Stanford and is available at https://ai.stanford.edu/~amaas/data/sentiment/.
Framework versions
- Transformers 4.42.4
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1
Model tree for ashaduzzaman/imdb-distilbert-funetuned
- Base model: distilbert/distilbert-base-uncased