|
--- |
|
language: |
|
- en |
|
tags: |
|
- roberta |
|
- marketing mix |
|
- multi-label |
|
- classification |
|
- microblog |
|
- tweets |
|
|
|
widget: |
|
- text: "Best cushioning ever!!! ๐ค๐ค๐ค my zoom vomeros are the bomb๐๐ฝโโ๏ธ๐จ!!! @nike #run #training" |
|
- text: "Why is @BestBuy always sold-out of Apple's new airpods in their online shop ๐คฏ๐ก?" |
|
- text: "Theyโre closing the @Aldo at the Lehigh Vally Mall and KOP ๐ญ" |
|
- text: "@Sonyโs XM3โs ainโt as sweet as my broโs airpod pros but got a real steal ๐ค the other day #deal #headphonez" |
|
- text: "Nike needs to sponsor more e-sports atheletes with Air Jordans! #nike #esports" |
|
- text: "Say what you want about @Abercrombie's 90s shirtless males ads, they made dang good woll sweaters back in the day. This is one of 3 I have from the late 90s." |
|
- text: "To celebrate this New Year, @Nordstrom is DOUBLING all donations up to $25,000! ๐ Your donation will help us answer 2X the calls, texts, and chats that come in, and allow us to train 2X more volunteers!" |
|
- text: "It's inspiring to see religious leaders speaking up for workers' rights and fair wages. Every voice matters in the #FightFor15! ๐ช๐ฝโ๐ผ #Solidarity #WorkersRights" |
|
--- |
|
# Model Card for: mmx_classifier_microblog_ENv02 |
|
Multi-label classifier that identifies which marketing mix variable(s) a microblog post pertains to. |
|
|
|
Version 0.2 (August 16, 2023)
|
|
|
## Model Details |
|
You can use this classifier to determine which of the 4 Ps of marketing, also known as the marketing mix variables, a microblog post (e.g., a Tweet) pertains to:
|
|
|
1. Product |
|
2. Place |
|
3. Price |
|
4. Promotion |
|
|
|
### Model Description |
|
This classifier is a fine-tuned checkpoint of [cardiffnlp/twitter-roberta-large-2022-154m](https://huggingface.co/cardiffnlp/twitter-roberta-large-2022-154m).
|
It was trained on 15K Tweets that mentioned at least one of 699 brands. The Tweets were first cleaned and then labeled using OpenAI's GPT-4.
|
|
|
Because this is a multi-label classification problem, fine-tuning uses binary cross-entropy (BCE) with logits loss, which combines a sigmoid layer and BCELoss in a single, numerically stable class.

To obtain the probability of each label (i.e., marketing mix variable), push the model's logits through a sigmoid function. The accompanying Python notebook already does this.
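
For intuition, here is a minimal, self-contained sketch (not the actual training code) of how `BCEWithLogitsLoss` scores multi-hot targets and how logits become per-label probabilities; the numbers are made up:

```python
import torch
import torch.nn as nn

# Hypothetical logits for 2 posts x 4 labels (Product, Place, Price, Promotion)
logits = torch.tensor([[2.1, -1.3, 0.4, 3.0],
                       [-0.5, 1.8, -2.0, 0.1]])
# Multi-hot targets: post 1 is Product + Promotion, post 2 is Place
targets = torch.tensor([[1., 0., 0., 1.],
                        [0., 1., 0., 0.]])

loss = nn.BCEWithLogitsLoss()(logits, targets)  # sigmoid + BCE fused for numerical stability
probs = torch.sigmoid(logits)                   # independent per-label probabilities at inference
print(loss.item(), probs)
```

Because each label gets its own sigmoid, the probabilities do not sum to one; a post can score high on several marketing mix variables at once.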
|
|
|
***IMPORTANT*** At the time of writing this description, Hugging Face's `pipeline` did not support multi-label classifiers.
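
Depending on your `transformers` version, asking the text-classification pipeline to return all labels with a sigmoid applied may work as an alternative. This is an untested sketch, not this card's official usage:

```python
from transformers import pipeline

# Sketch only: top_k=None and function_to_apply="sigmoid" are accepted by
# recent transformers releases; if yours rejects them, use the Quickstart below.
clf = pipeline(
    "text-classification",
    model="dmr76/mmx_classifier_microblog_ENv02",
    top_k=None,                   # return scores for every label
    function_to_apply="sigmoid",  # independent per-label probabilities
)
print(clf("Why is @BestBuy always sold out of Apple's new airpods?"))
```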
|
|
|
### Working Paper |
|
Download the working paper from SSRN: ["Creating Synthetic Experts with Generative AI"](https://ssrn.com/abstract=4542949)
|
|
|
### Quickstart |
|
```python
# Imports
import re
import warnings

import torch
from bs4 import BeautifulSoup
from transformers import AutoModelForSequenceClassification, AutoTokenizer

warnings.filterwarnings("ignore", category=UserWarning, module="bs4")

# Helper Functions
def clean_and_parse_tweet(tweet):
    """Mask URLs, strip HTML markup, and normalize whitespace."""
    tweet = re.sub(r"https?://\S+|www\.\S+", " URL ", tweet)  # mask URLs
    text = BeautifulSoup(tweet, "html.parser").get_text()     # strip HTML
    text = re.sub(r"\\n+|\n+", " ", text)                     # collapse (escaped) newlines
    text = re.sub(r"^[.:]+", "", text).strip()                # drop leading dots/colons
    return re.sub(r" +", " ", text)                           # squeeze repeated spaces

def predict_tweet(tweet, model, tokenizer, device, threshold=0.5):
    """Return per-label probabilities and all labels at or above the threshold."""
    inputs = tokenizer(tweet, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    probs = torch.sigmoid(model(**inputs).logits).detach().cpu().numpy()[0]
    labels = [id2label[i] for i, p in enumerate(probs) if p >= threshold]
    return probs, labels

# Setup
device = "mps" if torch.backends.mps.is_built() and torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
synxp = "dmr76/mmx_classifier_microblog_ENv02"
model = AutoModelForSequenceClassification.from_pretrained(synxp).to(device)
tokenizer = AutoTokenizer.from_pretrained(synxp)
id2label = model.config.id2label

# ---->>> Define your Tweet <<<----
tweet = "Best cushioning ever!!! my zoom vomeros are the bomb!!! \n @nike #run #training https://randomurl.ai"

# Clean and Predict
cleaned_tweet = clean_and_parse_tweet(tweet)
probs, labels = predict_tweet(cleaned_tweet, model, tokenizer, device)

# Print Labels and Probabilities
print("Please don't forget to cite the paper: https://ssrn.com/abstract=4542949 if you use this code")
print(labels, probs)
```
|
*Conveniently predict thousands of Tweets with the* ***batch-processing Python notebook***, *available in my* [GitHub Repository](https://github.com/dringel/Synthetic-Experts)
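
If you prefer plain code over the notebook, a minimal batched variant of the Quickstart could look like this; it is my sketch, reusing the Quickstart's `model`, `tokenizer`, `device`, and `id2label` objects, not code from the repository:

```python
def predict_batch(tweets, model, tokenizer, device, threshold=0.5, batch_size=32):
    """Score a list of cleaned tweets in batches; returns one label list per tweet."""
    all_labels = []
    model.eval()
    with torch.no_grad():
        for i in range(0, len(tweets), batch_size):
            batch = tweets[i:i + batch_size]
            inputs = tokenizer(batch, return_tensors="pt", padding=True,
                               truncation=True, max_length=128).to(device)
            probs = torch.sigmoid(model(**inputs).logits).cpu().numpy()
            all_labels += [[id2label[j] for j, p in enumerate(row) if p >= threshold]
                           for row in probs]
    return all_labels
```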
|
|
|
### Citation |
|
Please cite the following reference if you use synthetic experts in your work: |
|
``` |
|
Ringel, Daniel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023). Available at SSRN: https://ssrn.com/abstract=4542949 |
|
``` |
|
|
|
### Additional Resources
|
[www.synthetic-experts.ai](http://www.synthetic-experts.ai) |
|
[GitHub Repository](https://github.com/dringel/Synthetic-Experts) |
|
|
|
|
|
|
|
|