dasdristanta13
committed on
Upload folder using huggingface_hub
Browse files
- .gitignore +3 -0
- README.md +97 -12
- __pycache__/inference.cpython-310.pyc +0 -0
- freepik__pixel-art-8bits-create-an-icon-for-tweet-sentiment__72933.jpeg +0 -0
- inference.py +96 -0
- requirements.txt +10 -0
- run_gradio_v3.py +105 -0
- train_model_v2.py +246 -0
.gitignore
ADDED
@@ -0,0 +1,3 @@
*.xlsx
.gradio/
emotion_model/
README.md
CHANGED
@@ -1,12 +1,97 @@
(The previous 12-line README front-matter stub, beginning with "---" and "title:", was removed and replaced by the content below.)
---
title: Twitter_Emotion_and_Target_Prediction
app_file: run_gradio_v3.py
sdk: gradio
sdk_version: 5.1.0
---

# Tweet Emotion and Target Prediction

This project implements a machine learning pipeline for predicting the emotion and target of tweets. It includes model training, data preprocessing, data augmentation, inference, and a Gradio-based web interface for easy interaction.

## Project Structure

- `train_model_v2.py`: Trains the emotion classification model by fine-tuning a RoBERTa model.
- `inference.py`: Implements the prediction pipeline using the trained models.
- `run_gradio_v3.py`: Creates a Gradio web interface for interactive predictions.

## Setup and Installation

1. Clone this repository:
```bash
git clone https://github.com/yourusername/tweet-emotion-target-prediction.git
cd tweet-emotion-target-prediction
```

2. Install the required packages (a `requirements.txt` is also included; `train_model_v2.py` additionally imports `nltk` and `wandb`, and reading the `.xlsx` dataset requires `openpyxl`):
```bash
pip install pandas numpy scikit-learn transformers datasets torch gradio joblib imbalanced-learn xgboost nltk wandb openpyxl
```

3. Download the dataset file `NLP Engineer Assignment Dataset (1) (1) (1) (1).xlsx` and place it in the project root directory.

## Training the Models

### Emotion Classification Model

Run the following command to train the emotion classification model:

```bash
python train_model_v2.py
```

This script will:
- **Preprocess the data**: Apply basic cleaning (removing mentions, hashtags, and URLs) and tokenization.
- **Augment the data**: Use oversampling and text-augmentation techniques to handle class imbalance, so the model also learns from underrepresented emotions.
- **Fine-tune a RoBERTa model**: Use `cardiffnlp/twitter-roberta-base-sentiment` as the starting point for transfer learning, fine-tuning it on the tweet emotion dataset.
- **Save artifacts**: The fine-tuned model and tokenizer are saved to `./emotion_model` for inference.

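Once training finishes, the saved artifacts can be sanity-checked by reloading them (a minimal sketch, assuming the training run wrote `./emotion_model`):

```python
# Reload the fine-tuned model and tokenizer saved by train_model_v2.py.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("./emotion_model")
model = AutoModelForSequenceClassification.from_pretrained("./emotion_model")
print(model.config.num_labels)  # expected: 3 (negative, positive, no emotion)
```
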
## Data Augmentation and Handling Imbalance

- **Random Word Drop**: A function that removes a random subset of words from the input text.
  - This operation is applied probabilistically, so it reduces the largest class's dominance while augmenting the smaller classes.

- **Synonym Replacement**: A function leveraging the WordNet corpus to replace random words with their synonyms, generating alternative versions of the input text.
  - Synonym replacement is applied more heavily to the minority classes to balance the dataset.

- **Augmentation Strategy**:
  - The largest class undergoes minimal augmentation, while the smaller classes receive extra augmentation (both word drop and synonym replacement). The smallest class gets further augmentation by combining both techniques (word drop applied after synonym replacement); see the sketch below.

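For illustration, here is a minimal sketch of the word-drop augmentation; the actual `random_word_drop` and `synonym_replacement` used during training are defined in `train_model_v2.py`:

```python
# Illustrative sketch only; see train_model_v2.py for the training-time versions.
import random

def random_word_drop(text, max_words=3):
    words = text.split()
    if len(words) <= max_words:
        return text
    num_to_drop = random.randint(1, min(max_words, len(words) - 1))
    drop = set(random.sample(range(len(words)), num_to_drop))
    return ' '.join(w for i, w in enumerate(words) if i not in drop)

print(random_word_drop("the new ipad launch at sxsw looks great"))
# The smallest class is additionally augmented with both techniques chained:
# random_word_drop(synonym_replacement(text))
```
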
## Running Inference

To process the test data and generate predictions, run:

```bash
python inference.py
```

This script will:
- **Load the models**: Load the fine-tuned emotion classifier and the sentence-embedding model used for target prediction.
- **Process the test data**: Preprocess the test dataset in the same way as the training data.
- **Generate predictions**: Produce target and emotion predictions for each tweet.
- **Save the results**: Write the predictions to `test_results.csv` for analysis.

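The same functions can also be called on a single tweet (a small sketch using the functions defined in `inference.py`):

```python
# Predict target and emotion for one tweet using inference.py's helpers.
from inference import predict_target, predict_emotion

tweet = "Loving the new iPad announced at #SXSW!"
target = predict_target(tweet)
emotion, probs = predict_emotion(tweet, target)
print(target, emotion, probs)
```
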
## Launching the Gradio Interface

To launch the Gradio web interface for interactive predictions, run:

```bash
python run_gradio_v3.py
```

This will start a local server and print a URL for accessing the web interface.

## Model Details

### Emotion Classification Model
- **Model Architecture**: Fine-tuned `RoBERTa` model (`cardiffnlp/twitter-roberta-base-sentiment`).
- **Data Augmentation**: Uses random oversampling together with the text-augmentation techniques above to handle the imbalanced class distribution.
- **Transfer Learning**: Leverages the pre-trained sentiment `RoBERTa`, fine-tuning it on emotion-labeled tweet data.

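Both training and inference present the tweet and its target to the classifier as a single sequence; a short sketch of the format used in `train_model_v2.py` and `inference.py`:

```python
# Combined input format shared by training and inference.
text = "Loving the new iPad announced at #SXSW!"
target = "Apple"
combined_input = f"{text} [SEP] {target}"
# The tokenizer encodes combined_input and the model scores the
# negative, positive, and no-emotion classes.
```
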
## Gradio Interface

The Gradio interface provides:
- **Input field** for tweet text.
- **Text analysis**: Displays the word count, character count, hashtags, mentions, URLs, and emojis in the tweet.
- **Predicted target and emotion**: Real-time display of predictions based on user input.
- **Emotion probabilities**: Displays the probability distribution over the emotion classes.
- **Summary table of predictions**: A table summarizing the predicted target and emotion for the entered tweet.
__pycache__/inference.cpython-310.pyc
ADDED
Binary file (3.61 kB)
freepik__pixel-art-8bits-create-an-icon-for-tweet-sentiment__72933.jpeg
ADDED
inference.py
ADDED
@@ -0,0 +1,96 @@
import os
import numpy as np
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import pandas as pd
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Check if CUDA is available and set the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Load the fine-tuned emotion model and tokenizer (local copy if present, otherwise the Hub)
model_path = "./emotion_model"
hub_path = "dasdristanta13/twitter-emotion-model"
if os.path.isdir(model_path):
    emotion_model = AutoModelForSequenceClassification.from_pretrained(model_path)
    emotion_tokenizer = AutoTokenizer.from_pretrained(model_path)
else:
    emotion_model = AutoModelForSequenceClassification.from_pretrained(hub_path)
    emotion_tokenizer = AutoTokenizer.from_pretrained(hub_path)

# Move the model to the appropriate device
emotion_model = emotion_model.to(device)

# Load a pre-trained sentence transformer model for semantic similarity
semantic_model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
semantic_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
semantic_model = semantic_model.to(device)

# Candidate targets; fine-grained product names map to their parent brand
target_mapping = {
    'Google': 'Google',
    'Apple': 'Apple',
    'iPad': 'Apple',
    'iPhone': 'Apple',
    'Other Google product or service': 'Google',
    'Other Apple product or service': 'Apple',
    'Android': 'Google',
    'Android App': 'Google',
    'iPad or iPhone App': 'Apple',
}

def get_embedding(text):
    # Mean-pooled token embeddings from the sentence-transformer encoder
    inputs = semantic_tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = semantic_model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).cpu().numpy()

def predict_target(text):
    # Pick the candidate target whose embedding is most similar to the tweet embedding
    text_embedding = get_embedding(text)
    target_embeddings = {target: get_embedding(target) for target in target_mapping.keys()}

    similarities = {target: cosine_similarity(text_embedding, emb)[0][0] for target, emb in target_embeddings.items()}
    predicted_target = max(similarities, key=similarities.get)

    return predicted_target

def predict_emotion(text, target):
    # Classify the tweet together with its target, using the same "[SEP]" format as training
    combined_input = f"{text} [SEP] {target}"
    inputs = emotion_tokenizer(combined_input, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = emotion_model(**inputs)

    probabilities = outputs.logits.softmax(dim=-1).squeeze().cpu().numpy()

    emotion_labels = ['Negative emotion', 'Positive emotion', 'No emotion']
    predicted_emotion = emotion_labels[np.argmax(probabilities)]

    return predicted_emotion, {label: float(prob) for label, prob in zip(emotion_labels, probabilities)}

def process_test_data(test_df):
    results = []
    for _, row in test_df.iterrows():
        text = row['Tweet']
        predicted_target = predict_target(text)
        predicted_emotion, emotion_probs = predict_emotion(text, predicted_target)

        results.append({
            'Tweet': text,
            'Predicted Target': predicted_target,
            'Predicted Emotion': predicted_emotion,
            'Emotion Probabilities': emotion_probs
        })

    return pd.DataFrame(results)

if __name__ == "__main__":
    test_df = pd.read_excel('NLP Engineer Assignment Dataset (1) (1) (1) (1).xlsx', sheet_name='Test')
    results_df = process_test_data(test_df)
    results_df.to_csv('test_results.csv', index=False)
    print("Results saved to test_results.csv")
requirements.txt
ADDED
@@ -0,0 +1,10 @@
pandas
numpy
scikit-learn
transformers
datasets
torch
gradio
joblib
imbalanced-learn
xgboost
run_gradio_v3.py
ADDED
@@ -0,0 +1,105 @@
import gradio as gr
from inference import predict_target, predict_emotion
import pandas as pd
import re
from PIL import Image

def predict(text):
    predicted_target = predict_target(text)
    predicted_emotion, emotion_probs = predict_emotion(text, predicted_target)

    summary_df = pd.DataFrame({
        "Aspect": ["Target", "Emotion"],
        "Prediction": [predicted_target, predicted_emotion]
    })

    return predicted_target, predicted_emotion, emotion_probs, summary_df

def analyze_text(text):
    word_count = len(text.split())
    char_count = len(text)

    hashtags = re.findall(r'#\w+', text)
    mentions = re.findall(r'@\w+', text)
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    emojis = re.findall(r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]', text)

    analysis = f"""Word count: {word_count}
Character count: {char_count}
Hashtags: {len(hashtags)} {', '.join(hashtags)}
Mentions: {len(mentions)} {', '.join(mentions)}
URLs: {len(urls)}
Emojis: {len(emojis)} {''.join(emojis)}"""

    return analysis

def load_readme():
    with open("README.md", "r") as file:
        return file.read()

logo = Image.open("freepik__pixel-art-8bits-create-an-icon-for-tweet-sentiment__72933.jpeg")
logo.thumbnail((100, 100))

with gr.Blocks(title="Tweet Analysis Dashboard") as iface:
    page = gr.State("inference")

    with gr.Row():
        gr.Markdown("# Tweet Analysis Dashboard")
        gr.Image(logo, scale=1, min_width=100)

    with gr.Row():
        inference_btn = gr.Button("Inference")
        readme_btn = gr.Button("README")

    with gr.Column() as inference_page:
        gr.Markdown("## Tweet Emotion and Target Prediction")
        gr.Markdown("Enter a tweet to predict its target and emotion, and get additional text analysis.")

        with gr.Row():
            with gr.Column(scale=2):
                input_text = gr.Textbox(label="Tweet Text", lines=5)
                submit_btn = gr.Button("Analyze")
            with gr.Column(scale=1):
                text_analysis = gr.Textbox(label="Text Analysis", interactive=False)

        with gr.Row():
            target_output = gr.Textbox(label="Predicted Target")
            emotion_output = gr.Textbox(label="Predicted Emotion")

        emotion_probs_output = gr.Label(label="Emotion Probabilities")
        summary_output = gr.Dataframe(label="Prediction Summary", headers=["Aspect", "Prediction"])

    with gr.Column(visible=False) as readme_page:
        readme_content = gr.Markdown(load_readme())

    def show_inference():
        return {
            inference_page: gr.update(visible=True),
            readme_page: gr.update(visible=False),
            page: "inference"
        }

    def show_readme():
        return {
            inference_page: gr.update(visible=False),
            readme_page: gr.update(visible=True),
            page: "readme"
        }

    inference_btn.click(show_inference, outputs=[inference_page, readme_page, page])
    readme_btn.click(show_readme, outputs=[inference_page, readme_page, page])

    submit_btn.click(
        fn=predict,
        inputs=input_text,
        outputs=[target_output, emotion_output, emotion_probs_output, summary_output]
    )

    submit_btn.click(
        fn=analyze_text,
        inputs=input_text,
        outputs=text_analysis
    )

if __name__ == "__main__":
    iface.launch(share=True, debug=True)
train_model_v2.py
ADDED
@@ -0,0 +1,246 @@
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch
import joblib
from sklearn.metrics import classification_report
import numpy as np
import re
import wandb
from imblearn.over_sampling import RandomOverSampler
import random
from nltk.corpus import wordnet
import nltk
from warnings import filterwarnings
filterwarnings('ignore')

nltk.download('wordnet', quiet=True)
wandb.init(mode="disabled")

def clean_text(text):
    # Strip mentions, hashtags, "{links}" placeholders, URLs, and non-alphanumeric characters
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)
    text = re.sub(r'{links}', '', text)
    text = re.sub(r'http\S+|www.\S+', '', text)
    text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def preprocess_data(df):
    df.columns = ['text', 'target', 'emotion']
    df = df[df['emotion'] != 'I can\'t tell']
    df = df[~df['text'].isna()].reset_index(drop=True)

    target_mapping = {
        'Google': 'Google',
        'Apple': 'Apple',
        'iPad': 'iPad',
        'iPhone': 'iPhone',
        'Other Google product or service': 'Google',
        'Other Apple product or service': 'Apple',
        'Android': 'Android',
        'Android App': 'Android App',
        'iPad or iPhone App': 'iPad or iPhone App',
    }

    df['new_text'] = df['text'].apply(clean_text)
    df['new_target'] = df['target'].apply(lambda x: target_mapping.get(x, 'No Product'))
    df['target'] = df['target'].fillna('No Product')

    return df

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': (predictions == labels).mean(),
    }

def random_word_drop(text, max_words=3):
    words = text.split()
    if len(words) <= max_words:
        return text
    num_to_drop = random.randint(1, min(max_words, len(words) - 1))
    drop_indices = random.sample(range(len(words)), num_to_drop)
    return ' '.join([word for i, word in enumerate(words) if i not in drop_indices])

def synonym_replacement(text, n=1):
    words = text.split()
    new_words = words.copy()
    random_word_list = list(set([word for word in words if word.isalnum()]))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = []
        for syn in wordnet.synsets(random_word):
            for l in syn.lemmas():
                synonyms.append(l.name())
        if len(synonyms) >= 1:
            synonym = random.choice(list(set(synonyms)))
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        if num_replaced >= n:
            break
    return ' '.join(new_words)

def print_class_distribution(df, stage):
    class_dist = df['sentiment_label'].value_counts().sort_index()
    total = len(df)
    print(f"\nClass distribution - {stage}:")
    for label, count in class_dist.items():
        percentage = (count / total) * 100
        print(f"Class {label}: {count} ({percentage:.2f}%)")

def augment_data(df):
    class_counts = df['sentiment_label'].value_counts()
    max_class = class_counts.idxmax()
    min_class = class_counts.idxmin()

    augmented_data = []
    for _, row in df.iterrows():
        # Original data
        augmented_data.append({
            'new_text': row['new_text'],
            'new_target': row['new_target'],
            'sentiment_label': row['sentiment_label']
        })

        # Augment less for the highest class
        if row['sentiment_label'] == max_class:
            if random.random() < 0.4:  # 40% chance to augment
                new_text = random_word_drop(row['new_text'])
                augmented_data.append({
                    'new_text': new_text,
                    'new_target': row['new_target'],
                    'sentiment_label': row['sentiment_label']
                })
        else:
            # Random word drop
            new_text = random_word_drop(row['new_text'])
            augmented_data.append({
                'new_text': new_text,
                'new_target': row['new_target'],
                'sentiment_label': row['sentiment_label']
            })

            # Synonym replacement
            new_text = synonym_replacement(row['new_text'])
            augmented_data.append({
                'new_text': new_text,
                'new_target': row['new_target'],
                'sentiment_label': row['sentiment_label']
            })

            # Extra augmentation for the lowest class
            if row['sentiment_label'] == min_class:
                new_text = random_word_drop(synonym_replacement(row['new_text']))
                augmented_data.append({
                    'new_text': new_text,
                    'new_target': row['new_target'],
                    'sentiment_label': row['sentiment_label']
                })

    augmented_df = pd.DataFrame(augmented_data)
    augmented_df['combined_input'] = augmented_df['new_text'] + " [SEP] " + augmented_df['new_target']
    return augmented_df

def balance_data(df):
    X = df[['combined_input']]
    y = df['sentiment_label']

    ros = RandomOverSampler(random_state=42)
    X_resampled, y_resampled = ros.fit_resample(X, y)

    balanced_df = pd.DataFrame({
        'combined_input': X_resampled['combined_input'],
        'sentiment_label': y_resampled
    })

    return balanced_df

def train_model_v2(df):
    emo_dict = {'Negative emotion': 0, 'Positive emotion': 1, 'No emotion toward brand or product': 2}
    df['sentiment_label'] = df['emotion'].apply(lambda x: emo_dict[x])

    train_data, val_data = train_test_split(df, test_size=0.2, stratify=df['sentiment_label'], random_state=42)

    print_class_distribution(train_data, "Before augmentation and balancing")

    # Augment and balance only the training data
    train_data = augment_data(train_data)
    print_class_distribution(train_data, "After augmentation")

    train_data = balance_data(train_data)
    print_class_distribution(train_data, "After balancing")

    val_data['combined_input'] = val_data['new_text'] + " [SEP] " + val_data['new_target']

    train_data = train_data[['combined_input', 'sentiment_label']]
    val_data = val_data[['combined_input', 'sentiment_label']]

    tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")

    def preprocess_function(examples):
        tokenized_inputs = tokenizer(examples['combined_input'], padding='max_length', truncation=True, max_length=128)
        tokenized_inputs['labels'] = examples['sentiment_label']
        return tokenized_inputs

    train_dataset = Dataset.from_pandas(train_data)
    val_dataset = Dataset.from_pandas(val_data[['combined_input', 'sentiment_label']])

    train_dataset = train_dataset.map(preprocess_function, batched=True)
    val_dataset = val_dataset.map(preprocess_function, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment", num_labels=3)

    # Freeze everything except the last four encoder layers, the pooler, and the classifier head
    for name, param in model.named_parameters():
        if not any(layer in name for layer in ['encoder.layer.11', 'encoder.layer.10', 'encoder.layer.9', 'encoder.layer.8', 'pooler', 'classifier']):
            param.requires_grad = False

    training_args = TrainingArguments(
        output_dir='./results',
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_steps=50,
        learning_rate=1e-5,
        per_device_train_batch_size=256,
        per_device_eval_batch_size=256,
        num_train_epochs=8,
        weight_decay=0.01,
        logging_strategy="steps",
        load_best_model_at_end=True,
        report_to="none",
        metric_for_best_model="eval_loss",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    trainer.train()

    # Save the model and tokenizer
    output_dir = "./emotion_model"
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)

    print(f"Model and tokenizer saved to {output_dir}")

    # Generate and print classification report
    predictions = trainer.predict(val_dataset)
    preds = np.argmax(predictions.predictions, axis=-1)
    labels = predictions.label_ids

    print("\nClassification Report:")
    print(classification_report(labels, preds, target_names=list(emo_dict.keys())))

if __name__ == "__main__":
    df = pd.read_excel('NLP Engineer Assignment Dataset (1) (1) (1) (1).xlsx', sheet_name='Train')
    processed_df = preprocess_data(df)
    train_model_v2(processed_df)