dasdristanta13 committed · Commit 8ba260b · verified · 1 Parent(s): 6c7cc05

Upload folder using huggingface_hub

.gitignore ADDED
@@ -0,0 +1,3 @@
+ *.xlsx
+ .gradio/
+ emotion_model/
README.md CHANGED
@@ -1,12 +1,97 @@
- ---
- title: Twitter Emotion And Target Prediction
- emoji: 🌖
- colorFrom: yellow
- colorTo: purple
- sdk: gradio
- sdk_version: 5.1.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: Twitter_Emotion_and_Target_Prediction
+ app_file: run_gradio_v3.py
+ sdk: gradio
+ sdk_version: 5.1.0
+ ---
+ # Tweet Emotion and Target Prediction
+
+ This project implements a machine learning pipeline for predicting the emotion and target of tweets. It includes model training, data preprocessing, data augmentation, inference, and a Gradio-based web interface for easy interaction.
+
+ ## Project Structure
+
+ - `train_model_v2.py`: Trains the emotion classification model by fine-tuning a RoBERTa model.
+ - `inference.py`: Implements the prediction pipeline for target and emotion using the trained model.
+ - `run_gradio_v3.py`: Creates a Gradio web interface for interactive predictions.
+
+ ## Setup and Installation
+
+ 1. Clone this repository:
+ ```bash
+ git clone https://github.com/yourusername/tweet-emotion-target-prediction.git
+ cd tweet-emotion-target-prediction
+ ```
+
+ 2. Install the required packages:
+ ```bash
+ pip install pandas numpy scikit-learn transformers datasets torch gradio joblib imbalanced-learn xgboost
+ ```
+
+ 3. Download the dataset file `NLP Engineer Assignment Dataset (1) (1) (1) (1).xlsx` and place it in the project root directory.
+
+ ## Training the Models
+
+ ### Emotion Classification Model
+
+ Run the following command to train the emotion classification model:
+
+ ```bash
+ python train_model_v2.py
+ ```
+
+ This script will:
+ - **Preprocess the data**: Apply basic cleaning techniques and tokenization.
+ - **Data augmentation**: Use text augmentation and oversampling to handle class imbalance, ensuring the model learns well from underrepresented emotions.
+ - **Fine-tune a RoBERTa model**: Use `cardiffnlp/twitter-roberta-base-sentiment` for transfer learning, fine-tuning it on the tweet emotion dataset.
+ - **Save artifacts**: The fine-tuned RoBERTa model and tokenizer will be saved for inference.
+
+ ## Data Augmentation and Handling Imbalance
+
+ - **Random Word Drop**: A function that removes a random subset of words from the input text.
+   - This operation is applied with only a 40% probability to examples of the largest class, and to every example of the smaller classes.
+
+ - **Synonym Replacement**: A function leveraging the WordNet corpus to replace random words with their synonyms, generating alternative versions of the input text.
+   - Synonym replacement is applied only to the smaller classes to help balance the dataset.
+
+ - **Augmentation Strategy**:
+   - The largest class undergoes minimal augmentation, while the smaller classes receive extra augmentation (both word drop and synonym replacement). The smallest class gets further augmentation by combining both techniques (word drop after synonym replacement). After augmentation, `RandomOverSampler` from imbalanced-learn equalizes the remaining class imbalance.
+
+ ## Running Inference
+
+ To process the test data and generate predictions, run:
+
+ ```bash
+ python inference.py
+ ```
+
+ This script will:
+ - **Load the models**: Load the fine-tuned emotion classifier and the pre-trained sentence-transformer used to match each tweet to a target.
+ - **Process the test data**: Read each tweet from the test sheet and pass it through the prediction pipeline.
+ - **Generate predictions**: Predictions for both target and emotion are produced for each tweet.
+ - **Save the results**: The predictions are saved to `test_results.csv` for analysis.
+
+ ## Launching the Gradio Interface
+
+ To launch the Gradio web interface for interactive predictions, run:
+
+ ```bash
+ python run_gradio_v3.py
+ ```
+
+ This will start a local server and provide a URL to access the web interface.
+
+ ## Model Details
+
+ ### Emotion Classification Model
+ - **Model Architecture**: Fine-tuned `RoBERTa` model (`cardiffnlp/twitter-roberta-base-sentiment`).
+ - **Data Augmentation**: Uses random oversampling (imbalanced-learn) together with word-drop and synonym-replacement augmentation to handle the imbalanced class distribution.
+ - **Transfer Learning**: Leverages the pre-trained `RoBERTa` sentiment model, fine-tuning it on emotion-labeled tweet data.
+
+ ## Gradio Interface
+
+ The Gradio interface provides:
+ - **Input field** for tweet text.
+ - **Text analysis**: Displays word count, character count, hashtags, mentions, URLs, and emojis in the tweet.
+ - **Predicted target and emotion**: Real-time display of predictions based on user input.
+ - **Emotion probabilities**: Displays the probability distribution of the predicted emotions.
+ - **Summary table of predictions**: A table summarizing the tweet text, predicted target, emotion, and associated probabilities.
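The augmentation strategy the README describes maps onto two helpers defined in `train_model_v2.py`, added later in this commit. A minimal sketch of how they behave on a made-up sample tweet, assuming the training script's imports (including `nltk` and `wandb`, which are not listed in `requirements.txt`) are installed; both functions are randomized, so outputs vary between runs:

```python
# Illustrative only: importing train_model_v2 runs its top-level setup
# (nltk.download('wordnet') and wandb.init(mode="disabled")).
from train_model_v2 import random_word_drop, synonym_replacement

sample = "the new ipad app from apple looks great at sxsw"

# Used sparingly (40% of the time) on the largest class.
print(random_word_drop(sample, max_words=3))

# Used on the smaller classes to create paraphrased variants.
print(synonym_replacement(sample, n=1))

# The smallest class additionally gets the combination of both.
print(random_word_drop(synonym_replacement(sample, n=1)))
```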
__pycache__/inference.cpython-310.pyc ADDED
Binary file (3.61 kB).
 
freepik__pixel-art-8bits-create-an-icon-for-tweet-sentiment__72933.jpeg ADDED
inference.py ADDED
@@ -0,0 +1,96 @@
+
+ import os
+ import numpy as np
+ from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
+ import pandas as pd
+ import torch
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ # Check if CUDA is available and set the device
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+ print(f"Using device: {device}")
+
+
+ # Load the model and tokenizer
+ model_path = "./emotion_model"
+ hub_path = "dasdristanta13/twitter-emotion-model"
+ if os.path.isdir(model_path):
+     emotion_model = AutoModelForSequenceClassification.from_pretrained(model_path)
+     emotion_tokenizer = AutoTokenizer.from_pretrained(model_path)
+ else:
+     emotion_model = AutoModelForSequenceClassification.from_pretrained(hub_path)
+     emotion_tokenizer = AutoTokenizer.from_pretrained(hub_path)
+
+ # Move the model to the appropriate device
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+ emotion_model = emotion_model.to(device)
+
+ # Load a pre-trained sentence transformer model for semantic similarity
+ semantic_model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
+ semantic_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
+ semantic_model = semantic_model.to(device)
+
+ target_mapping = {
+     'Google': 'Google',
+     'Apple': 'Apple',
+     'iPad': 'Apple',
+     'iPhone': 'Apple',
+     'Other Google product or service': 'Google',
+     'Other Apple product or service': 'Apple',
+     'Android': 'Google',
+     'Android App': 'Google',
+     'iPad or iPhone App': 'Apple',
+ }
+
+ def get_embedding(text):
+     inputs = semantic_tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=128)
+     inputs = {k: v.to(device) for k, v in inputs.items()}
+     with torch.no_grad():
+         outputs = semantic_model(**inputs)
+     return outputs.last_hidden_state.mean(dim=1).cpu().numpy()
+
+ def predict_target(text):
+     text_embedding = get_embedding(text)
+     target_embeddings = {target: get_embedding(target) for target in target_mapping.keys()}
+
+     similarities = {target: cosine_similarity(text_embedding, emb)[0][0] for target, emb in target_embeddings.items()}
+     predicted_target = max(similarities, key=similarities.get)
+
+     return predicted_target
+
+ def predict_emotion(text, target):
+     combined_input = f"{text} [SEP] {target}"
+     inputs = emotion_tokenizer(combined_input, return_tensors="pt", truncation=True, padding=True)
+     inputs = {k: v.to(device) for k, v in inputs.items()}
+
+     with torch.no_grad():
+         outputs = emotion_model(**inputs)
+
+     probabilities = outputs.logits.softmax(dim=-1).squeeze().cpu().numpy()
+
+     emotion_labels = ['Negative emotion', 'Positive emotion', 'No emotion']
+     predicted_emotion = emotion_labels[np.argmax(probabilities)]
+
+     return predicted_emotion, {label: float(prob) for label, prob in zip(emotion_labels, probabilities)}
+
+ def process_test_data(test_df):
+     results = []
+     for _, row in test_df.iterrows():
+         text = row['Tweet']
+         predicted_target = predict_target(text)
+         predicted_emotion, emotion_probs = predict_emotion(text, predicted_target)
+
+         results.append({
+             'Tweet': text,
+             'Predicted Target': predicted_target,
+             'Predicted Emotion': predicted_emotion,
+             'Emotion Probabilities': emotion_probs
+         })
+
+     return pd.DataFrame(results)
+
+ if __name__ == "__main__":
+     test_df = pd.read_excel('NLP Engineer Assignment Dataset (1) (1) (1) (1).xlsx', sheet_name='Test')
+     results_df = process_test_data(test_df)
+     results_df.to_csv('test_results.csv', index=False)
+     print("Results saved to test_results.csv")
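For readers who want to try the inference helpers above without the Excel file, here is a minimal usage sketch on an in-memory DataFrame (it assumes the dependencies from `requirements.txt` are installed and that the emotion model can be loaded from `./emotion_model` or the Hub; the sample tweets are made up):

```python
# Illustrative only: the 'Tweet' column name is what process_test_data expects.
import pandas as pd
from inference import predict_target, predict_emotion, process_test_data

sample_df = pd.DataFrame({
    "Tweet": [
        "Loving the new iPad app I tried at #SXSW!",
        "My Android keeps crashing after the update.",
    ]
})

# Batch prediction: one row per tweet with target, emotion, and probabilities.
results = process_test_data(sample_df)
print(results[["Tweet", "Predicted Target", "Predicted Emotion"]])

# Single-tweet prediction, mirroring what the Gradio app does per input.
target = predict_target(sample_df.loc[0, "Tweet"])
emotion, probs = predict_emotion(sample_df.loc[0, "Tweet"], target)
print(target, emotion, probs)
```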
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ pandas
+ numpy
+ scikit-learn
+ transformers
+ datasets
+ torch
+ gradio
+ joblib
+ imbalanced-learn
+ xgboost
run_gradio_v3.py ADDED
@@ -0,0 +1,105 @@
+ import gradio as gr
+ from inference import predict_target, predict_emotion
+ import pandas as pd
+ import re
+ from PIL import Image
+
+ def predict(text):
+     predicted_target = predict_target(text)
+     predicted_emotion, emotion_probs = predict_emotion(text, predicted_target)
+
+     summary_df = pd.DataFrame({
+         "Aspect": ["Target", "Emotion"],
+         "Prediction": [predicted_target, predicted_emotion]
+     })
+
+     return predicted_target, predicted_emotion, emotion_probs, summary_df
+
+ def analyze_text(text):
+     word_count = len(text.split())
+     char_count = len(text)
+
+     hashtags = re.findall(r'#\w+', text)
+     mentions = re.findall(r'@\w+', text)
+     urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
+     emojis = re.findall(r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]', text)
+
+     analysis = f"""Word count: {word_count}
+ Character count: {char_count}
+ Hashtags: {len(hashtags)} {', '.join(hashtags)}
+ Mentions: {len(mentions)} {', '.join(mentions)}
+ URLs: {len(urls)}
+ Emojis: {len(emojis)} {''.join(emojis)}"""
+
+     return analysis
+
+ def load_readme():
+     with open("README.md", "r") as file:
+         return file.read()
+
+ logo = Image.open("freepik__pixel-art-8bits-create-an-icon-for-tweet-sentiment__72933.jpeg")
+ logo.thumbnail((100, 100))
+
+ with gr.Blocks(title="Tweet Analysis Dashboard") as iface:
+     page = gr.State("inference")
+
+     with gr.Row():
+         gr.Markdown("# Tweet Analysis Dashboard")
+         gr.Image(logo, scale=1, min_width=100)
+
+     with gr.Row():
+         inference_btn = gr.Button("Inference")
+         readme_btn = gr.Button("README")
+
+     with gr.Column() as inference_page:
+         gr.Markdown("## Tweet Emotion and Target Prediction")
+         gr.Markdown("Enter a tweet to predict its target and emotion, and get additional text analysis.")
+
+         with gr.Row():
+             with gr.Column(scale=2):
+                 input_text = gr.Textbox(label="Tweet Text", lines=5)
+                 submit_btn = gr.Button("Analyze")
+             with gr.Column(scale=1):
+                 text_analysis = gr.Textbox(label="Text Analysis", interactive=False)
+
+         with gr.Row():
+             target_output = gr.Textbox(label="Predicted Target")
+             emotion_output = gr.Textbox(label="Predicted Emotion")
+
+         emotion_probs_output = gr.Label(label="Emotion Probabilities")
+         summary_output = gr.Dataframe(label="Prediction Summary", headers=["Aspect", "Prediction"])
+
+     with gr.Column(visible=False) as readme_page:
+         readme_content = gr.Markdown(load_readme())
+
+     def show_inference():
+         return {
+             inference_page: gr.update(visible=True),
+             readme_page: gr.update(visible=False),
+             page: "inference"
+         }
+
+     def show_readme():
+         return {
+             inference_page: gr.update(visible=False),
+             readme_page: gr.update(visible=True),
+             page: "readme"
+         }
+
+     inference_btn.click(show_inference, outputs=[inference_page, readme_page, page])
+     readme_btn.click(show_readme, outputs=[inference_page, readme_page, page])
+
+     submit_btn.click(
+         fn=predict,
+         inputs=input_text,
+         outputs=[target_output, emotion_output, emotion_probs_output, summary_output]
+     )
+
+     submit_btn.click(
+         fn=analyze_text,
+         inputs=input_text,
+         outputs=text_analysis
+     )
+
+ if __name__ == "__main__":
+     iface.launch(share=True, debug=True)
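The app's `predict` and `analyze_text` wrappers can also be exercised headlessly, which is a quick way to inspect the outputs listed in the README's "Gradio Interface" section. A small sketch, assuming it is run from the repository root so the logo image and `README.md` resolve when the module is imported; the sample tweet is made up:

```python
# Illustrative only: importing run_gradio_v3 builds the Blocks UI but does not
# launch the server (that only happens under __main__).
from run_gradio_v3 import analyze_text, predict

tweet = "Great #SXSW demo of the new @Google Android app! http://example.com"

target, emotion, probs, summary_df = predict(tweet)
print(target)       # most similar entry from inference.target_mapping
print(emotion)      # 'Negative emotion', 'Positive emotion', or 'No emotion'
print(probs)        # dict of emotion label -> probability
print(summary_df)   # the two-row table shown as "Prediction Summary"

print(analyze_text(tweet))  # word/character counts, hashtags, mentions, URLs, emojis
```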
train_model_v2.py ADDED
@@ -0,0 +1,246 @@
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
+ from datasets import Dataset
+ import torch
+ import joblib
+ from sklearn.metrics import classification_report
+ import numpy as np
+ import re
+ import wandb
+ from imblearn.over_sampling import RandomOverSampler
+ import random
+ from nltk.corpus import wordnet
+ import nltk
+ from warnings import filterwarnings
+ filterwarnings('ignore')
+
+ nltk.download('wordnet', quiet=True)
+ wandb.init(mode="disabled")
+
+ def clean_text(text):
+     text = re.sub(r'@\w+', '', text)
+     text = re.sub(r'#\w+', '', text)
+     text = re.sub(r'{links}', '', text)
+     text = re.sub(r'http\S+|www.\S+', '', text)
+     text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)
+     text = re.sub(r'\s+', ' ', text).strip()
+     return text
+
+ def preprocess_data(df):
+     df.columns = ['text', 'target', 'emotion']
+     df = df[df['emotion'] != 'I can\'t tell']
+     df = df[~df['text'].isna()].reset_index(drop=True)
+
+     target_mapping = {
+         'Google': 'Google',
+         'Apple': 'Apple',
+         'iPad': 'iPad',
+         'iPhone': 'iPhone',
+         'Other Google product or service': 'Google',
+         'Other Apple product or service': 'Apple',
+         'Android': 'Android',
+         'Android App': 'Android App',
+         'iPad or iPhone App': 'iPad or iPhone App',
+     }
+
+     df['new_text'] = df['text'].apply(clean_text)
+     df['new_target'] = df['target'].apply(lambda x: target_mapping.get(x, 'No Product'))
+     df['target'] = df['target'].fillna('No Product')
+
+     return df
+
+ def compute_metrics(eval_pred):
+     logits, labels = eval_pred
+     predictions = np.argmax(logits, axis=-1)
+     return {
+         'accuracy': (predictions == labels).mean(),
+     }
+
+ def random_word_drop(text, max_words=3):
+     words = text.split()
+     if len(words) <= max_words:
+         return text
+     num_to_drop = random.randint(1, min(max_words, len(words) - 1))
+     drop_indices = random.sample(range(len(words)), num_to_drop)
+     return ' '.join([word for i, word in enumerate(words) if i not in drop_indices])
+
+ def synonym_replacement(text, n=1):
+     words = text.split()
+     new_words = words.copy()
+     random_word_list = list(set([word for word in words if word.isalnum()]))
+     random.shuffle(random_word_list)
+     num_replaced = 0
+     for random_word in random_word_list:
+         synonyms = []
+         for syn in wordnet.synsets(random_word):
+             for l in syn.lemmas():
+                 synonyms.append(l.name())
+         if len(synonyms) >= 1:
+             synonym = random.choice(list(set(synonyms)))
+             new_words = [synonym if word == random_word else word for word in new_words]
+             num_replaced += 1
+         if num_replaced >= n:
+             break
+     return ' '.join(new_words)
+
+ def print_class_distribution(df, stage):
+     class_dist = df['sentiment_label'].value_counts().sort_index()
+     total = len(df)
+     print(f"\nClass distribution - {stage}:")
+     for label, count in class_dist.items():
+         percentage = (count / total) * 100
+         print(f"Class {label}: {count} ({percentage:.2f}%)")
+
+ def augment_data(df):
+     class_counts = df['sentiment_label'].value_counts()
+     max_class = class_counts.idxmax()
+     min_class = class_counts.idxmin()
+
+     augmented_data = []
+     for _, row in df.iterrows():
+         # Original data
+         augmented_data.append({
+             'new_text': row['new_text'],
+             'new_target': row['new_target'],
+             'sentiment_label': row['sentiment_label']
+         })
+
+         # Augment less for the highest class
+         if row['sentiment_label'] == max_class:
+             if random.random() < 0.4:  # 40% chance to augment
+                 new_text = random_word_drop(row['new_text'])
+                 augmented_data.append({
+                     'new_text': new_text,
+                     'new_target': row['new_target'],
+                     'sentiment_label': row['sentiment_label']
+                 })
+         else:
+             # Random word drop
+             new_text = random_word_drop(row['new_text'])
+             augmented_data.append({
+                 'new_text': new_text,
+                 'new_target': row['new_target'],
+                 'sentiment_label': row['sentiment_label']
+             })
+
+             # Synonym replacement
+             new_text = synonym_replacement(row['new_text'])
+             augmented_data.append({
+                 'new_text': new_text,
+                 'new_target': row['new_target'],
+                 'sentiment_label': row['sentiment_label']
+             })
+
+             # Extra augmentation for the lowest class
+             if row['sentiment_label'] == min_class:
+                 new_text = random_word_drop(synonym_replacement(row['new_text']))
+                 augmented_data.append({
+                     'new_text': new_text,
+                     'new_target': row['new_target'],
+                     'sentiment_label': row['sentiment_label']
+                 })
+
+     augmented_df = pd.DataFrame(augmented_data)
+     augmented_df['combined_input'] = augmented_df['new_text'] + " [SEP] " + augmented_df['new_target']
+     return augmented_df
+
+ def balance_data(df):
+     X = df[['combined_input']]
+     y = df['sentiment_label']
+
+     ros = RandomOverSampler(random_state=42)
+     X_resampled, y_resampled = ros.fit_resample(X, y)
+
+     balanced_df = pd.DataFrame({
+         'combined_input': X_resampled['combined_input'],
+         'sentiment_label': y_resampled
+     })
+
+     return balanced_df
+
+ def train_model_v2(df):
+     emo_dict = {'Negative emotion': 0, 'Positive emotion': 1, 'No emotion toward brand or product': 2}
+     df['sentiment_label'] = df['emotion'].apply(lambda x: emo_dict[x])
+
+     train_data, val_data = train_test_split(df, test_size=0.2, stratify=df['sentiment_label'], random_state=42)
+
+     print_class_distribution(train_data, "Before augmentation and balancing")
+
+     # Augment and balance only the training data
+     train_data = augment_data(train_data)
+     print_class_distribution(train_data, "After augmentation")
+
+     train_data = balance_data(train_data)
+     print_class_distribution(train_data, "After balancing")
+
+     val_data['combined_input'] = val_data['new_text'] + " [SEP] " + val_data['new_target']
+
+     train_data = train_data[['combined_input', 'sentiment_label']]
+     val_data = val_data[['combined_input', 'sentiment_label']]
+
+     tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
+
+     def preprocess_function(examples):
+         tokenized_inputs = tokenizer(examples['combined_input'], padding='max_length', truncation=True, max_length=128)
+         tokenized_inputs['labels'] = examples['sentiment_label']
+         return tokenized_inputs
+
+     train_dataset = Dataset.from_pandas(train_data)
+     val_dataset = Dataset.from_pandas(val_data[['combined_input', 'sentiment_label']])
+
+     train_dataset = train_dataset.map(preprocess_function, batched=True)
+     val_dataset = val_dataset.map(preprocess_function, batched=True)
+
+     model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment", num_labels=3)
+
+     for name, param in model.named_parameters():
+         if not any(layer in name for layer in ['encoder.layer.11', 'encoder.layer.10', 'encoder.layer.9', 'encoder.layer.8', 'pooler', 'classifier']):
+             param.requires_grad = False
+
+     training_args = TrainingArguments(
+         output_dir='./results',
+         evaluation_strategy="epoch",
+         save_strategy="epoch",
+         logging_steps=50,
+         learning_rate=1e-5,
+         per_device_train_batch_size=256,
+         per_device_eval_batch_size=256,
+         num_train_epochs=8,
+         weight_decay=0.01,
+         logging_strategy="steps",
+         load_best_model_at_end=True,
+         report_to="none",
+         metric_for_best_model="eval_loss",
+     )
+
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         train_dataset=train_dataset,
+         eval_dataset=val_dataset,
+         tokenizer=tokenizer,
+         compute_metrics=compute_metrics
+     )
+
+     trainer.train()
+
+     # Save the model and tokenizer
+     output_dir = "./emotion_model"
+     trainer.save_model(output_dir)
+     tokenizer.save_pretrained(output_dir)
+
+     print(f"Model and tokenizer saved to {output_dir}")
+
+     # Generate and print classification report
+     predictions = trainer.predict(val_dataset)
+     preds = np.argmax(predictions.predictions, axis=-1)
+     labels = predictions.label_ids
+
+     print("\nClassification Report:")
+     print(classification_report(labels, preds, target_names=list(emo_dict.keys())))
+
+ if __name__ == "__main__":
+     df = pd.read_excel('NLP Engineer Assignment Dataset (1) (1) (1) (1).xlsx', sheet_name='Train')
+     processed_df = preprocess_data(df)
+     train_model_v2(processed_df)
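Once `train_model_v2.py` has been run and the artifacts are saved to `./emotion_model`, a quick sanity check of the saved model, independent of `inference.py`, might look like the sketch below. It reuses the `"<text> [SEP] <target>"` input format and the label order from `emo_dict`; the example tweet is made up:

```python
# Illustrative sanity check of the saved ./emotion_model directory.
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("./emotion_model")
tokenizer = AutoTokenizer.from_pretrained("./emotion_model")

# Index order follows emo_dict in train_model_v2.py.
labels = ['Negative emotion', 'Positive emotion', 'No emotion toward brand or product']

inputs = tokenizer("the new ipad app looks great [SEP] iPad or iPhone App",
                   return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1).squeeze().numpy()
print(labels[int(np.argmax(probs))], dict(zip(labels, probs.tolist())))
```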