Model Card for kangelamw/negative-reviews-into-actionable-insights
Model Details
Parameter-Efficient Fine-tuned Causal Language Model (PEFT/QLoRA) for Review Analysis
Model Description
A fine-tuned language model specialized in analyzing and generating insights from customer reviews. This model builds upon Microsoft's Phi-2 (2.7B parameters) base architecture and has been specifically tailored to understand sentiment, extract key points, and provide analytical feedback on review content.
Technical Approach:
- Base Model: microsoft/phi-2 (2.7B parameters)
- Training Data: Yelp/yelp_review_full dataset, consisting of restaurant and business reviews with various sentiment levels
- Knowledge Distillation: Enhanced with review analysis patterns from Mistral 7B, which was used to generate high-quality analytical feedback that served as training examples
- Emotion Analysis: Incorporated emotion extraction from SamLowe's roberta-base-go_emotions model to enhance sentiment understanding
- Quantization Method: QLoRA 4-bit quantization to enable efficient fine-tuning with limited computational resources while preserving model performance
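For reference, the following is a minimal sketch of how a 4-bit QLoRA setup for Phi-2 can be configured with bitsandbytes and PEFT. The LoRA rank, alpha, dropout, and target modules shown are illustrative assumptions rather than the exact values used; check the repo for the actual configuration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization so the 2.7B base model fits comfortably in 12GB VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

# LoRA adapters on the attention/projection layers (values are illustrative)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()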
Primary Capability: Extracting actionable insights from negative reviews for business improvement
- Developed by: kangelamw
- Funded by [optional]: Personal/Private
- Shared by [optional]: Personal/Private
- Model type: Text Generation
- Language(s) (NLP): English
- License: MIT
- Finetuned from model [optional]: microsoft/phi-2
Model Sources [optional]
- Repository: https://huggingface.co/kangelamw/beyond_sentiment
- Training Dataset: Yelp/yelp_review_full
- Emotion Extraction: SamLowe/roberta-base-go_emotions
- GitHub Project: LLM-Project-DS
Uses
This model is designed to help businesses and organizations extract valuable insights from customer reviews, particularly negative ones, to drive improvements.
Direct Use
This model can be directly used for:
- Customer Feedback Analysis: Processing and analyzing customer reviews to identify key issues and concerns
- Action Item Extraction: Converting negative feedback into specific, actionable insights
- Business Intelligence: Providing structured analysis of customer sentiment to inform decision-making
- Review Summarization: Condensing lengthy reviews into key points and themes
Technical Requirements (Check the repo for more info)
- Python 3.8+
- PyTorch 1.10+
- Transformers 4.20+
- 12GB VRAM recommended: training peaked at ~10.02GB and inference at ~7.02GB (see the snippet after this list for one way to measure this)
- GPU recommended for faster inference
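As a small, hedged sketch (not part of the released code), this is one way to check total and peak VRAM usage with PyTorch, which is how figures like the ones above can be measured:
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"Total VRAM: {total_gb:.2f} GB")

    torch.cuda.reset_peak_memory_stats()
    # ... run training or inference here ...
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"Peak VRAM allocated: {peak_gb:.2f} GB")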
Out-of-Scope Use
This model is NOT suitable for:
- Making final business decisions without human oversight. Don't be the problem.
- Generating fake or misleading reviews
- Processing sensitive personal information
- Medical, legal, or financial advice generation
Bias, Risks, and Limitations
This model inherits biases and limitations from several sources that users should be aware of:
Inherited Biases
- Base Model Biases: As a fine-tuned version of Microsoft's Phi-2, this model likely contains biases present in the original pre-training data.
- Training Data Biases: The Yelp/yelp_review_full dataset may contain demographic, cultural, and geographic biases that influence the model's understanding of reviews and sentiment.
- Knowledge Distillation Biases: Patterns learned from Mistral 7B may transfer biases from that model into this one.
- Emotion Detection Biases: The emotion extraction from SamLowe's go_emotions model may be biased toward certain emotional expressions over others.
Technical Limitations
- Size Constraints: Being based on Phi-2 (2.7B parameters), the model is significantly smaller than state-of-the-art models, which limits its reasoning capabilities. That said, Phi-2 is reported to be competitive with, or better than, many models below 13B parameters on common benchmarks.
- Quantization Effects: The 4-bit QLoRA quantization, while enabling efficient fine-tuning, may introduce subtle performance degradation compared to full-precision models.
- Limited Training: Due to hardware and time constraints, the model has not received optimal training and would benefit from further refinement. Of the roughly 15k train and 3k test rows available, only 999 train / 333 test rows were used. Check the repo for the before/after performance: it improves, but the model could definitely use more training. The train/test data will be uploaded to Hugging Face once I figure out where datasets can be hosted; otherwise it will live in the repo. (This saves you the step of distilling another model to create the training data.)
- Domain Specificity: The model is optimized for business review analysis and may perform poorly on other text types.
Risks in Deployment
- Over-reliance Risk: Businesses may over-rely on automated insights without human verification.
- Misinterpretation: The model may misinterpret sarcasm, cultural references, or domain-specific terminology.
- Language Limitations: The model performs best on English language reviews and may struggle with other languages or code-switched text.
- Unfair Characterizations: May occasionally extract insights that unfairly characterize customers or their concerns.
- Weird Responses: It sometimes narrates the customer's story and how the business owner must have felt. This happens far less on the fine-tuned version (maybe once or twice in 1,000 generations), but it was essentially all the base model would do. The model also had a tendency to repeat itself like a parrot, hence the no_repeat_ngram_size=3 generation setting.
Users should employ this model as a supplementary tool for review analysis rather than a complete replacement for human judgment.
Recommendations
Addressing Biases
- Human-in-the-loop Validation: Always have human reviewers validate model outputs, particularly for reviews from underrepresented demographics or geographic regions not well-represented in the Yelp dataset.
- Diverse Testing: Test the model with reviews from diverse businesses, geographic locations, and cultural contexts to identify potential bias blind spots.
- Bias Monitoring: Regularly monitor outputs for demographic or cultural biases and document any patterns observed.
Mitigating Technical Limitations
- Context Window Awareness: Be mindful of the model's context window limitations when processing longer reviews.
- Complementary Techniques: For critical business insights, consider using this model alongside other analytical approaches rather than relying solely on its outputs.
- Further Training: If resources become available, continue training with a larger subset of the data to improve performance.
- Hardware Optimization: When deploying, optimize inference parameters according to available hardware.
Best Practices for Deployment
- Confidence Scoring: Implement a confidence scoring system to flag low-confidence insights for human review (a minimal sketch follows this list).
- Clear Documentation: Clearly communicate the model's capabilities and limitations to end-users.
- Regular Evaluation: Periodically evaluate the model against new reviews to ensure continued performance.
- Domain Adaptation: For businesses outside the restaurant/service industry, consider additional fine-tuning on domain-specific reviews.
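As a starting point for the confidence-scoring recommendation above, here is a hedged sketch (not part of the released code) that uses the token scores exposed by transformers' generate to compute a crude confidence value. It assumes the model and tokenizer loaded in the usage example further below; the generate_with_confidence name and the 0.5 threshold are illustrative, and compute_transition_scores requires a reasonably recent transformers release.
import torch

def generate_with_confidence(prompt, threshold=0.5):
    """Generate a response plus a crude confidence score (mean probability of the sampled tokens)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.5,
            return_dict_in_generate=True,   # return a structured generation output
            output_scores=True,             # keep per-step logits
            pad_token_id=tokenizer.eos_token_id,
        )
    # Per-token log-probabilities of the tokens that were actually generated
    transition_scores = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    confidence = transition_scores.exp().mean().item()
    text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    needs_human_review = confidence < threshold   # flag low-confidence insights for a human
    return text, confidence, needs_human_review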
Future Improvements
- Expanded Training Data: Incorporate more diverse review sources beyond Yelp to broaden the model's understanding.
- Multilingual Capabilities: Explore extending the model to handle reviews in languages beyond English.
- Explainability Features: Develop methods to highlight which parts of the review influenced specific insights.
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
How to Get Started with the Model
Review the code below to get started with the model.
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
import torch, re
# Load the model and tokenizer
model_id = "kangelamw/negative-reviews-into-actionable-insights" # Once pushed to Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # Use float16 for efficiency
    device_map="auto"           # Automatically choose the best device configuration
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.eval()
# Emotion extraction model (SamLowe/roberta-base-go_emotions, listed under Model Sources),
# used to label the customer's emotions before they go into the prompt
emotion_model_id = "SamLowe/roberta-base-go_emotions"
emotion_tokenizer = AutoTokenizer.from_pretrained(emotion_model_id)
emotion_model = AutoModelForSequenceClassification.from_pretrained(emotion_model_id).to(device)
emotion_model.eval()

# How to extract insights from a review
def get_emotion_label(text):
    """
    Get the top 3 predicted emotions for a given text.
    """
    # Tokenize text with the emotion model's tokenizer
    tokens = emotion_tokenizer(text,
                               padding=True,
                               truncation=True,
                               max_length=128,
                               return_tensors="pt").to(device)
    # Run inference
    with torch.no_grad():
        outputs = emotion_model(**tokens)
    # Convert logits to probabilities and take the top 3 indices
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=1)
    top3_indices = torch.argsort(probabilities, descending=True)[0][:3]
    # Convert indices to emotion labels using the classifier's id2label mapping
    top3_emotions = [emotion_model.config.id2label[int(i)] for i in top3_indices]
    # Return as a comma-separated string
    return ", ".join(top3_emotions)
def create_prompt(stars, review_text, emotions):
    return f"A customer left us a {stars}-star review: {review_text}. The customer feels {emotions}. Concisely, how can we best improve our services for this customer's experience? \n\nResponse: "
def clean_phi2_output(input_text, output_text):
    """
    Extracts the response from the model output by removing the input prompt.
    Works by detecting markers like "Answer:", "Response:", or "###".
    """
    # Normalize whitespace
    input_norm = re.sub(r'\s+', ' ', input_text.strip())
    output_norm = re.sub(r'\s+', ' ', output_text.strip())
    # Look for common response markers (pipes escaped so they aren't treated as regex alternation)
    response_markers = [r"Answer:",
                        r"Response:",
                        r"###",
                        r"##Your task:",
                        r"<\|Question\|>",
                        r"<\|Answer\|>",
                        r"Instruction: ",
                        r"OUTPUT:"]
    # Try to find the first occurrence of any marker
    for marker in response_markers:
        match = re.search(marker, output_norm, re.IGNORECASE)
        if match:
            return output_norm[match.end():].strip()  # Extract everything after the marker
    # If no marker is found, attempt a basic subtraction
    cleaned_output = re.sub(re.escape(input_norm), '', output_norm, flags=re.IGNORECASE).strip()
    return cleaned_output if cleaned_output else output_norm  # Default to original if no cleaning was effective
# ===== SINGLE USE: ===== #
review = "The food took over an hour to arrive and the staff ignored us the whole time."  # example review
stars = 1
prompt = create_prompt(stars, review, get_emotion_label(review))

inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate response
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=1000,
        temperature=0.5,  # Adjust for creativity vs consistency
        top_p=0.9,
        top_k=50,
        do_sample=True,
        no_repeat_ngram_size=3,  # Keeps the model from parroting itself
        pad_token_id=tokenizer.eos_token_id  # Important
    )

output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(clean_phi2_output(prompt, output))
# ===== BATCHED USE: ===== #
# The full batched pipeline is in the repo; this is the important part for cleaning
# up the responses by batch. (clean_phi2_output above was more of a monkey patch
# for single-prompt testing.)
batch_responses = []
for j in range(len(outputs)):
    try:
        full_text = tokenizer.decode(outputs[j], skip_special_tokens=True)
        prompt_text = tokenizer.decode(inputs['input_ids'][j], skip_special_tokens=True)
        response = full_text.replace(prompt_text, "", 1).strip()
        response = clean_phi2_output(prompt_text, response)  # or swap in your own cleaning helper
    except Exception as e:
        response = "ERROR_GENERATING_RESPONSE"
        print(f"Error: {e}")
    batch_responses.append(response)
Training Details
Training Data
- Balanced: stratified sampling by label (star rating).
- Relatively clean.
- Word count per review: min 32, max 315.
This is the full train data that I have: 15k Train.
This is the subset it was actually trained on: 999 Train.
Training Procedure
Preprocessing
import re
# Define punctuation to keep
keep_punctuation = {".", ",", "!", "?", "'"} # 'It was great.' vs 'It was great!!!' makes a difference.
# Cleaning function
def clean_text(review):
review = review.lower() # Lowercase, I'd say optional.
# Remove unwanted characters (keep only letters, numbers, and whitelisted punctuation)
cleaned_text = "".join(char if char.isalnum() or char in keep_punctuation or char.isspace() else " " for char in review)
# Remove extra spaces (from removed characters)
cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
return cleaned_text
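To illustrate how clean_text can be applied end to end, here is a hedged sketch of loading the Yelp dataset, keeping only 1-3 star reviews, and drawing a balanced subset per label. The column names follow the public yelp_review_full dataset (label 0-4, text); the per_label size and seed are illustrative, and the exact sampling code behind the 999/333 split is in the repo.
from datasets import load_dataset, concatenate_datasets

dataset = load_dataset("Yelp/yelp_review_full")   # columns: "label" (0-4), "text"

# Keep 1-3 star reviews only (labels 0, 1, 2) and clean the text
negative = dataset["train"].filter(lambda ex: ex["label"] <= 2)
negative = negative.map(lambda ex: {"text": clean_text(ex["text"])})

# Balanced subset: an equal number of rows per star label (333 x 3 = 999 rows)
per_label = 333
subsets = [
    negative.filter(lambda ex, lbl=lbl: ex["label"] == lbl)
            .shuffle(seed=42)
            .select(range(per_label))
    for lbl in (0, 1, 2)
]
train_subset = concatenate_datasets(subsets).shuffle(seed=42)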
Training Hyperparameters
- Training regime: fp16 mixed precision (a hedged setup sketch follows below)
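The following is a minimal sketch of what the fp16 fine-tuning loop can look like with the Hugging Face Trainer. It assumes the peft_model from the QLoRA sketch in the Model Description and tokenized train/eval datasets (tokenized_train, tokenized_eval) that are not shown here; batch size, accumulation steps, epochs, and learning rate are illustrative, not the exact values used.
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./phi2-review-insights",
    per_device_train_batch_size=1,      # small batch to fit in 12GB VRAM
    gradient_accumulation_steps=8,      # illustrative effective batch size of 8
    num_train_epochs=1,                 # illustrative
    learning_rate=2e-4,                 # a common LoRA learning rate, not the exact value used
    fp16=True,                          # the fp16 training regime noted above
    logging_steps=50,
    save_strategy="epoch",
)

trainer = Trainer(
    model=peft_model,                   # PEFT-wrapped Phi-2 from the QLoRA sketch
    args=training_args,
    train_dataset=tokenized_train,      # tokenized 999-row training subset
    eval_dataset=tokenized_eval,        # tokenized 333-row test subset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()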
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Balanced: stratified sampling by label (star rating).
- Relatively clean.
- Word count per review: min 32, max 315.
- Includes Mistral's responses (for evaluation purposes); drop them if using this data for training.
This is the full test data that I have: 3k Test.
This is the subset it was actually evaluated on: 333 Test.
Factors
- Review Length: Performance on reviews of varying lengths (32-315 words)
- Pre vs. Post Fine-tuning: Comparative performance between base and fine-tuned models
- Stars: 1-3 star reviews only.
Metrics
The model's performance was measured using several NLP evaluation metrics, with Mistral's outputs serving as the reference:
- BLEURT: A learned, BERT-based metric that correlates well with human judgments of text quality; used here to capture whether the meaning is preserved.
- BERTScore: Measures semantic similarity between the generated insights and the reference insights.
- METEOR: Evaluates the quality of generated text based on exact, stem, synonym, and paraphrase matches.
During testing, BERTScore is consistently high, which only means the output uses similar vocabulary and the model understands what it has to talk about. BLEURT is consistently low, so the model often talks about something almost entirely different (it tries making up stories about the review, i.e. it misses the meaning). The low METEOR score reinforces this reading.
Always consider ALL the metrics. They are useless by themselves.
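For reference, a minimal sketch of how these metrics can be computed with the evaluate library. The names model_outputs and mistral_responses are placeholders for the fine-tuned model's generations and Mistral's reference responses; bertscore, meteor, and bleurt each pull in extra dependencies (BLEURT in particular requires Google's bleurt package), and the default BLEURT checkpoint chosen by evaluate may differ from the one used here.
import evaluate

bertscore = evaluate.load("bertscore")
meteor = evaluate.load("meteor")
bleurt = evaluate.load("bleurt")   # needs the bleurt package installed

predictions = model_outputs        # list[str]: fine-tuned Phi-2 responses
references = mistral_responses     # list[str]: Mistral's reference responses from the test split

bert_results = bertscore.compute(predictions=predictions, references=references, lang="en")
meteor_results = meteor.compute(predictions=predictions, references=references)
bleurt_results = bleurt.compute(predictions=predictions, references=references)

print("BERTScore F1 (mean):", sum(bert_results["f1"]) / len(bert_results["f1"]))
print("METEOR:", meteor_results["meteor"])
print("BLEURT (mean):", sum(bleurt_results["scores"]) / len(bleurt_results["scores"]))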
Results
| Initial performance | BLEURT | BERTScore | METEOR |
|---|---|---|---|
| mean | -0.756606 | 0.838373 | 0.2222 |
| median | -0.70085 | 0.8395 | 0.2222 |

| Fine-tuned performance | BLEURT | BERTScore | METEOR |
|---|---|---|---|
| mean | -0.376466 | 0.885482 | 0.2318 |
| median | -0.33835 | 0.8873 | 0.2318 |
Check the repo's README if you want to see some visuals :)
Summary
Phi-2's semantic understanding of what needs to be done has improved. The fine-tuning process has led to better semantic alignment and contextual similarity, as shown by the improvements in BLEURT, BERTScore, and METEOR.
While Phi-2 now mirrors Mistral’s responses more effectively, there is still room for improvement. BLEURT is still negative, indicating that some fine-grained semantic details might be missing. Further refinements could help Phi-2 align even more closely with Mistral’s outputs.
This is encouraging: an extra round of training with 1-2k more rows at a time might get it on par with Mistral's outputs. I have about 14k/2k of train/test data still unused.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 12GB Nvidia RTX 3060TI
- Hours used: Approximately 8-16 hours per day for 2-3 weeks, covering fine-tuning and inference
- Cloud Provider: None - personal workstation/local machine
- Compute Region: North America
- Carbon Emitted: Not estimated (this GPU is not listed in the calculator's hardware options)
This fine-tuning process used parameter-efficient methods (QLoRA) specifically to reduce computational requirements and environmental impact. The significantly reduced resource usage compared to full model fine-tuning and inference represents a more sustainable approach to applied model development.
A note on the hours: this was a bootcamp project and my first real experience fine-tuning a model and running inference at this scale. Please don't use the model by itself; fix it and fine-tune it further as needed before using it in production.
Model Card Contact
- kangelamw (via the Hugging Face repository)