Model Card for kangelamw/negative-reviews-into-actionable-insights
Model Details
Parameter-Efficient Fine-tuned Causal Language Model (PEFT/QLoRA) for Review Analysis
Model Description
A fine-tuned language model specialized in analyzing and generating insights from customer reviews. This model builds upon Microsoft's Phi-2 (2.7B parameters) base architecture and has been specifically tailored to understand sentiment, extract key points, and provide analytical feedback on review content.
Technical Approach:
- Base Model: microsoft/phi-2 (2.7B parameters)
- Training Data: Yelp/yelp_review_full dataset, consisting of restaurant and business reviews with various sentiment levels
- Knowledge Distillation: Enhanced with review analysis patterns from Mistral 7B, which was used to generate high-quality analytical feedback that served as training examples
- Emotion Analysis: Incorporated emotion extraction from SamLowe's roberta-base-go_emotions model to enhance sentiment understanding
- Quantization Method: QLoRA 4-bit quantization to enable efficient fine-tuning with limited computational resources while preserving model performance
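For reference, the following is a minimal sketch of how a 4-bit QLoRA setup for Phi-2 can be configured with bitsandbytes and PEFT. The LoRA rank, alpha, dropout, and target modules shown are illustrative assumptions rather than the exact values used; check the repo for the actual configuration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization so the 2.7B base model fits comfortably in 12GB VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

# LoRA adapters on the attention/projection layers (values are illustrative)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()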
Primary Capability: Extracting actionable insights from negative reviews for business improvement
- Developed by: kangelamw
- Funded by [optional]: Personal/Private
- Shared by [optional]: Personal/Private
- Model type: Text Generation
- Language(s) (NLP): English
- License: MIT
- Finetuned from model [optional]: microsoft/phi-2
Model Sources [optional]
- Repository: https://huggingface.co/kangelamw/beyond_sentiment
- Training Dataset: Yelp/yelp_review_full
- Emotion Extraction: SamLowe/roberta-base-go_emotions
- GitHub Project: LLM-Project-DS
Uses
This model is designed to help businesses and organizations extract valuable insights from customer reviews, particularly negative ones, to drive improvements.
Direct Use
This model can be directly used for:
- Customer Feedback Analysis: Processing and analyzing customer reviews to identify key issues and concerns
- Action Item Extraction: Converting negative feedback into specific, actionable insights
- Business Intelligence: Providing structured analysis of customer sentiment to inform decision-making
- Review Summarization: Condensing lengthy reviews into key points and themes
Technical Requirements (Check the repo for more info)
- Python 3.8+
- PyTorch 1.10+
- Transformers 4.20+
- 12GB VRAM recommended: training peaked at ~10.02GB and inference at ~7.02GB (see the snippet after this list for one way to measure this)
- GPU recommended for faster inference
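As a small, hedged sketch (not part of the released code), this is one way to check total and peak VRAM usage with PyTorch, which is how figures like the ones above can be measured:
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"Total VRAM: {total_gb:.2f} GB")

    torch.cuda.reset_peak_memory_stats()
    # ... run training or inference here ...
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"Peak VRAM allocated: {peak_gb:.2f} GB")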
Out-of-Scope Use
This model is NOT suitable for:
- Making final business decisions without human oversight. Don't be the problem.
- Generating fake or misleading reviews
- Processing sensitive personal information
- Medical, legal, or financial advice generation
Bias, Risks, and Limitations
This model inherits biases and limitations from several sources that users should be aware of:
Inherited Biases
- Base Model Biases: As a fine-tuned version of Microsoft's Phi-2, this model likely contains biases present in the original pre-training data.
- Training Data Biases: The Yelp/yelp_review_full dataset may contain demographic, cultural, and geographic biases that influence the model's understanding of reviews and sentiment.
- Knowledge Distillation Biases: Patterns learned from Mistral 7B may transfer biases from that model into this one.
- Emotion Detection Biases: The emotion extraction from SamLowe's go_emotions model may be biased toward certain emotional expressions over others.
Technical Limitations
- Size Constraints: Being based on Phi-2 (2.7B parameters), the model is significantly smaller than state-of-the-art models, which limits its reasoning capabilities. That said, Phi-2 is reported to be competitive with, or better than, many models below 13B parameters on common benchmarks.
- Quantization Effects: The 4-bit QLoRA quantization, while enabling efficient fine-tuning, may introduce subtle performance degradation compared to full-precision models.
- Limited Training: Due to hardware and time constraints, the model has not received optimal training and would benefit from further refinement. Of the roughly 15k train and 3k test rows available, only 999 train / 333 test rows were used. Check the repo for the before/after performance: it improves, but the model could definitely use more training. The train/test data will be uploaded to Hugging Face once I figure out where datasets can be hosted; otherwise it will live in the repo. (This saves you the step of distilling another model to create the training data.)
- Domain Specificity: The model is optimized for business review analysis and may perform poorly on other text types.
Risks in Deployment
- Over-reliance Risk: Businesses may over-rely on automated insights without human verification.
- Misinterpretation: The model may misinterpret sarcasm, cultural references, or domain-specific terminology.
- Language Limitations: The model performs best on English language reviews and may struggle with other languages or code-switched text.
- Unfair Characterizations: May occasionally extract insights that unfairly characterize customers or their concerns.
- Weird Responses: It sometimes narrates the customer's story and how the business owner must have felt. This happens far less on the fine-tuned version (maybe once or twice in 1,000 generations), but it was essentially all the base model would do. The model also had a tendency to repeat itself like a parrot, hence the no_repeat_ngram_size=3 generation setting.
Users should employ this model as a supplementary tool for review analysis rather than a complete replacement for human judgment.
Recommendations
Addressing Biases
- Human-in-the-loop Validation: Always have human reviewers validate model outputs, particularly for reviews from underrepresented demographics or geographic regions not well-represented in the Yelp dataset.
- Diverse Testing: Test the model with reviews from diverse businesses, geographic locations, and cultural contexts to identify potential bias blind spots.
- Bias Monitoring: Regularly monitor outputs for demographic or cultural biases and document any patterns observed.
Mitigating Technical Limitations
- Context Window Awareness: Be mindful of the model's context window limitations when processing longer reviews.
- Complementary Techniques: For critical business insights, consider using this model alongside other analytical approaches rather than relying solely on its outputs.
- Further Training: If resources become available, continue training with a larger subset of the data to improve performance.
- Hardware Optimization: When deploying, optimize inference parameters according to available hardware.
Best Practices for Deployment
- Confidence Scoring: Implement a confidence scoring system to flag low-confidence insights for human review (a minimal sketch follows this list).
- Clear Documentation: Clearly communicate the model's capabilities and limitations to end-users.
- Regular Evaluation: Periodically evaluate the model against new reviews to ensure continued performance.
- Domain Adaptation: For businesses outside the restaurant/service industry, consider additional fine-tuning on domain-specific reviews.
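As a starting point for the confidence-scoring recommendation above, here is a hedged sketch (not part of the released code) that uses the token scores exposed by transformers' generate to compute a crude confidence value. It assumes the model and tokenizer loaded in the usage example further below; the generate_with_confidence name and the 0.5 threshold are illustrative, and compute_transition_scores requires a reasonably recent transformers release.
import torch

def generate_with_confidence(prompt, threshold=0.5):
    """Generate a response plus a crude confidence score (mean probability of the sampled tokens)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.5,
            return_dict_in_generate=True,   # return a structured generation output
            output_scores=True,             # keep per-step logits
            pad_token_id=tokenizer.eos_token_id,
        )
    # Per-token log-probabilities of the tokens that were actually generated
    transition_scores = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    confidence = transition_scores.exp().mean().item()
    text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    needs_human_review = confidence < threshold   # flag low-confidence insights for a human
    return text, confidence, needs_human_review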
Future Improvements
- Expanded Training Data: Incorporate more diverse review sources beyond Yelp to broaden the model's understanding.
- Multilingual Capabilities: Explore extending the model to handle reviews in languages beyond English.
- Explainability Features: Develop methods to highlight which parts of the review influenced specific insights.
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
How to Get Started with the Model
Review the code below to get started with the model.
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
import torch, re
# Load the model and tokenizer
model_id = "kangelamw/negative-reviews-into-actionable-insights" # Once pushed to Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # Use float16 for efficiency
    device_map="auto"           # Automatically choose the best device configuration
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.eval()
# Emotion extraction model (SamLowe/roberta-base-go_emotions, listed under Model Sources),
# used to label the customer's emotions before they go into the prompt
emotion_model_id = "SamLowe/roberta-base-go_emotions"
emotion_tokenizer = AutoTokenizer.from_pretrained(emotion_model_id)
emotion_model = AutoModelForSequenceClassification.from_pretrained(emotion_model_id).to(device)
emotion_model.eval()

# How to extract insights from a review
def get_emotion_label(text):
    """
    Get the top 3 predicted emotions for a given text.
    """
    # Tokenize text with the emotion model's tokenizer
    tokens = emotion_tokenizer(text,
                               padding=True,
                               truncation=True,
                               max_length=128,
                               return_tensors="pt").to(device)
    # Run inference
    with torch.no_grad():
        outputs = emotion_model(**tokens)
    # Convert logits to probabilities and take the top 3 indices
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=1)
    top3_indices = torch.argsort(probabilities, descending=True)[0][:3]
    # Convert indices to emotion labels using the classifier's id2label mapping
    top3_emotions = [emotion_model.config.id2label[int(i)] for i in top3_indices]
    # Return as a comma-separated string
    return ", ".join(top3_emotions)
def create_prompt(stars, review_text, emotions):
    return f"A customer left us a {stars}-star review: {review_text}. The customer feels {emotions}. Concisely, how can we best improve our services for this customer's experience? \n\nResponse: "
def clean_phi2_output(input_text, output_text):
    """
    Extracts the response from the model output by removing the input prompt.
    Works by detecting markers like "Answer:", "Response:", or "###".
    """
    # Normalize whitespace
    input_norm = re.sub(r'\s+', ' ', input_text.strip())
    output_norm = re.sub(r'\s+', ' ', output_text.strip())
    # Look for common response markers (pipes escaped so they aren't treated as regex alternation)
    response_markers = [r"Answer:",
                        r"Response:",
                        r"###",
                        r"##Your task:",
                        r"<\|Question\|>",
                        r"<\|Answer\|>",
                        r"Instruction: ",
                        r"OUTPUT:"]
    # Try to find the first occurrence of any marker
    for marker in response_markers:
        match = re.search(marker, output_norm, re.IGNORECASE)
        if match:
            return output_norm[match.end():].strip()  # Extract everything after the marker
    # If no marker is found, attempt a basic subtraction
    cleaned_output = re.sub(re.escape(input_norm), '', output_norm, flags=re.IGNORECASE).strip()
    return cleaned_output if cleaned_output else output_norm  # Default to original if no cleaning was effective
# ===== SINGLE USE: ===== #
review = "The food took over an hour to arrive and the staff ignored us the whole time."  # example review
stars = 1
prompt = create_prompt(stars, review, get_emotion_label(review))

inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate response
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=1000,
        temperature=0.5,  # Adjust for creativity vs consistency
        top_p=0.9,
        top_k=50,
        do_sample=True,
        no_repeat_ngram_size=3,  # Keeps the model from parroting itself
        pad_token_id=tokenizer.eos_token_id  # Important
    )

output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(clean_phi2_output(prompt, output))
# ===== BATCHED USE: ===== #
# The full batched pipeline is in the repo; this is the important part for cleaning
# up the responses by batch. (clean_phi2_output above was more of a monkey patch
# for single-prompt testing.)
batch_responses = []
for j in range(len(outputs)):
    try:
        full_text = tokenizer.decode(outputs[j], skip_special_tokens=True)
        prompt_text = tokenizer.decode(inputs['input_ids'][j], skip_special_tokens=True)
        response = full_text.replace(prompt_text, "", 1).strip()
        response = clean_phi2_output(prompt_text, response)  # or swap in your own cleaning helper
    except Exception as e:
        response = "ERROR_GENERATING_RESPONSE"
        print(f"Error: {e}")
    batch_responses.append(response)
Training Details
Training Data
- Balanced: stratified sampling by label (star rating).
- Relatively clean.
- Word count per review: min 32, max 315.
This is the full train data that I have: 15k Train.
This is the subset it was actually trained on: 999 Train.
Training Procedure
Preprocessing
import re
# Define punctuation to keep
keep_punctuation = {".", ",", "!", "?", "'"} # 'It was great.' vs 'It was great!!!' makes a difference.
# Cleaning function
def clean_text(review):
review = review.lower() # Lowercase, I'd say optional.
# Remove unwanted characters (keep only letters, numbers, and whitelisted punctuation)
cleaned_text = "".join(char if char.isalnum() or char in keep_punctuation or char.isspace() else " " for char in review)
# Remove extra spaces (from removed characters)
cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
return cleaned_text
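To illustrate how clean_text can be applied end to end, here is a hedged sketch of loading the Yelp dataset, keeping only 1-3 star reviews, and drawing a balanced subset per label. The column names follow the public yelp_review_full dataset (label 0-4, text); the per_label size and seed are illustrative, and the exact sampling code behind the 999/333 split is in the repo.
from datasets import load_dataset, concatenate_datasets

dataset = load_dataset("Yelp/yelp_review_full")   # columns: "label" (0-4), "text"

# Keep 1-3 star reviews only (labels 0, 1, 2) and clean the text
negative = dataset["train"].filter(lambda ex: ex["label"] <= 2)
negative = negative.map(lambda ex: {"text": clean_text(ex["text"])})

# Balanced subset: an equal number of rows per star label (333 x 3 = 999 rows)
per_label = 333
subsets = [
    negative.filter(lambda ex, lbl=lbl: ex["label"] == lbl)
            .shuffle(seed=42)
            .select(range(per_label))
    for lbl in (0, 1, 2)
]
train_subset = concatenate_datasets(subsets).shuffle(seed=42)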
Training Hyperparameters
- Training regime: fp16 mixed precision (a hedged setup sketch follows below)
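The following is a minimal sketch of what the fp16 fine-tuning loop can look like with the Hugging Face Trainer. It assumes the peft_model from the QLoRA sketch in the Model Description and tokenized train/eval datasets (tokenized_train, tokenized_eval) that are not shown here; batch size, accumulation steps, epochs, and learning rate are illustrative, not the exact values used.
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./phi2-review-insights",
    per_device_train_batch_size=1,      # small batch to fit in 12GB VRAM
    gradient_accumulation_steps=8,      # illustrative effective batch size of 8
    num_train_epochs=1,                 # illustrative
    learning_rate=2e-4,                 # a common LoRA learning rate, not the exact value used
    fp16=True,                          # the fp16 training regime noted above
    logging_steps=50,
    save_strategy="epoch",
)

trainer = Trainer(
    model=peft_model,                   # PEFT-wrapped Phi-2 from the QLoRA sketch
    args=training_args,
    train_dataset=tokenized_train,      # tokenized 999-row training subset
    eval_dataset=tokenized_eval,        # tokenized 333-row test subset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()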
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Balanced: stratified sampling by label (star rating).
- Relatively clean.
- Word count per review: min 32, max 315.
- Includes Mistral's responses (for evaluation purposes); drop them if using this data for training.
This is the full test data that I have: 3k Test.
This is the subset it was actually evaluated on: 333 Test.
Factors
- Review Length: Performance on reviews of varying lengths (32-315 words)
- Pre vs. Post Fine-tuning: Comparative performance between base and fine-tuned models
- Stars: 1-3 star reviews only.
Metrics
The model's performance was measured using several NLP evaluation metrics, with Mistral's outputs serving as the reference:
- BLEURT: A learned, BERT-based metric that correlates well with human judgments of text quality; used here to capture whether the meaning is preserved.
- BERTScore: Measures semantic similarity between the generated insights and the reference insights.
- METEOR: Evaluates the quality of generated text based on exact, stem, synonym, and paraphrase matches.
During testing, BERTScore is consistently high, which only means the output uses similar vocabulary and the model understands what it has to talk about. BLEURT is consistently low, so the model often talks about something almost entirely different (it tries making up stories about the review, i.e. it misses the meaning). The low METEOR score reinforces this reading.
Always consider ALL the metrics. They are useless by themselves.
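For reference, a minimal sketch of how these metrics can be computed with the evaluate library. The names model_outputs and mistral_responses are placeholders for the fine-tuned model's generations and Mistral's reference responses; bertscore, meteor, and bleurt each pull in extra dependencies (BLEURT in particular requires Google's bleurt package), and the default BLEURT checkpoint chosen by evaluate may differ from the one used here.
import evaluate

bertscore = evaluate.load("bertscore")
meteor = evaluate.load("meteor")
bleurt = evaluate.load("bleurt")   # needs the bleurt package installed

predictions = model_outputs        # list[str]: fine-tuned Phi-2 responses
references = mistral_responses     # list[str]: Mistral's reference responses from the test split

bert_results = bertscore.compute(predictions=predictions, references=references, lang="en")
meteor_results = meteor.compute(predictions=predictions, references=references)
bleurt_results = bleurt.compute(predictions=predictions, references=references)

print("BERTScore F1 (mean):", sum(bert_results["f1"]) / len(bert_results["f1"]))
print("METEOR:", meteor_results["meteor"])
print("BLEURT (mean):", sum(bleurt_results["scores"]) / len(bleurt_results["scores"]))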
Results
| Initial performance | BLEURT | BERTScore | METEOR |
|---|---|---|---|
| mean | -0.756606 | 0.838373 | 0.2222 |
| median | -0.70085 | 0.8395 | 0.2222 |

| Fine-tuned performance | BLEURT | BERTScore | METEOR |
|---|---|---|---|
| mean | -0.376466 | 0.885482 | 0.2318 |
| median | -0.33835 | 0.8873 | 0.2318 |
Check the repo's README if you want to see some visuals :)
Summary
Phi-2's semantic understanding of what needs to be done has improved. The fine-tuning process has led to better semantic alignment and contextual similarity, as shown by the improvements in BLEURT, BERTScore, and METEOR.
While Phi-2 now mirrors Mistral’s responses more effectively, there is still room for improvement. BLEURT is still negative, indicating that some fine-grained semantic details might be missing. Further refinements could help Phi-2 align even more closely with Mistral’s outputs.
This is encouraging: an extra round of training with 1-2k more rows at a time might get it on par with Mistral's outputs. I have about 14k/2k of train/test data still unused.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 12GB Nvidia RTX 3060TI
- Hours used: Approximately 8-16 hours per day for 2-3 weeks, covering fine-tuning and inference
- Cloud Provider: None - personal workstation/local machine
- Compute Region: North America
- Carbon Emitted: Not estimated (this GPU is not listed in the calculator's hardware options)
This fine-tuning process used parameter-efficient methods (QLoRA) specifically to reduce computational requirements and environmental impact. The significantly reduced resource usage compared to full model fine-tuning and inference represents a more sustainable approach to applied model development.
A note on the hours: this was a bootcamp project and my first real experience fine-tuning a model and running inference at this scale. Please don't use the model by itself; fix it and fine-tune it further as needed before using it in production.
Model Card Contact
- kangelamw (via the Hugging Face repository)