BERT codemixed base model for Hinglish (cased)

This model was built using lingualytics, an open-source library that supports code-mixed analytics.

Model description

Input for the model: any codemixed Hinglish text.
Output for the model: sentiment (0 - Negative, 1 - Neutral, 2 - Positive).

I took the bert-base-multilingual-cased model from Hugging Face and fine-tuned it on the SAIL 2017 dataset.

Eval results

Performance of this model on the dataset

metric score
acc 0.55873
f1 0.558369
acc_and_f1 0.558549
precision 0.558075
recall 0.55873
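
The reported recall equals the accuracy, which is what weighted averaging over classes produces, and acc_and_f1 matches the mean of acc and f1 up to rounding. The following is a minimal sketch of how such figures could be computed, assuming weighted averaging (the card does not state the averaging mode). The function name `weighted_metrics` and the toy labels are made up for illustration and are not taken from the SAIL 2017 data.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 (assumed averaging mode)."""
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    pairs = list(zip(y_true, y_pred))
    acc = sum(t == p for t, p in pairs) / n

    precision = recall = f1 = 0.0
    for c in sorted(support):
        tp = sum(t == c and p == c for t, p in pairs)
        predicted_c = sum(p == c for p in y_pred)
        p_c = tp / predicted_c if predicted_c else 0.0
        r_c = tp / support[c]
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support[c] / n  # each class counts in proportion to its support
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {
        "acc": acc,
        "f1": f1,
        "acc_and_f1": (acc + f1) / 2,  # simple mean of the two scores
        "precision": precision,
        "recall": recall,
    }


# Toy labels only: 0 = Negative, 1 = Neutral, 2 = Positive
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
for name, value in weighted_metrics(toy_true, toy_pred).items():
    print(name, round(value, 6))
```

With weighted averaging, recall reduces to the fraction of correct predictions, so it always coincides with accuracy, as it does in the table above.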

How to use

Here is how to run this model on a given text in PyTorch:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned sequence classification model
tokenizer = AutoTokenizer.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
model = AutoModelForSequenceClassification.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
# The output holds one logit per sentiment class (0, 1, 2)
output = model(**encoded_input)

and in TensorFlow:

from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
model = TFBertModel.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
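
A sequence classification head returns one raw logit per class rather than a label. The following minimal sketch turns three logits into a sentiment label, using the label ids listed in the model description. The helper name `logits_to_sentiment` and the example logits are hypothetical; the sketch uses plain Python so it runs without downloading the model.

```python
import math

# Label ids as given in the model description
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def logits_to_sentiment(logits):
    """Return (label, probabilities) for a list of three class logits."""
    # Softmax, shifted by the maximum logit for numerical stability
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The predicted class is the index with the highest probability
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs


# Made-up logits for illustration, not real model output
label, probs = logits_to_sentiment([-1.2, 0.3, 2.1])
print(label, [round(p, 3) for p in probs])
```

With the PyTorch snippet, and assuming a recent transformers version where the output exposes a `logits` attribute, the input would be `output.logits[0].tolist()`.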

Preprocessing

Followed standard preprocessing techniques:

  • removed digits
  • removed punctuation
  • removed stopwords
  • removed excess whitespace

Here's the snippet:

from pathlib import Path
import pandas as pd
from lingualytics.preprocessing import remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords, en_stopwords
from texthero.preprocessing import remove_digits, remove_whitespace

root = Path('<path-to-data>')

# Clean each split in place: tab-separated files with a text and a label column
for file in 'test', 'train', 'validation':
  tochange = root / f'{file}.txt'
  df = pd.read_csv(tochange, header=None, sep='\t', names=['text', 'label'])
  # Apply the four preprocessing steps in order
  df['text'] = df['text'].pipe(remove_digits) \
                         .pipe(remove_punctuation) \
                         .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords)) \
                         .pipe(remove_whitespace)
  # Overwrite the original file without index or header rows
  df.to_csv(tochange, index=False, header=False, sep='\t')
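
To see what these steps do to a single sentence, here is a rough standard-library approximation. It is a sketch, not the lingualytics or texthero implementation, and it may differ from those libraries in edge cases. The function name `preprocess` and the tiny stopword set are hypothetical stand-ins for the real English and Hindi stopword lists.

```python
import re
import string

# Hypothetical stand-in for the combined English and Hindi stopword lists
STOPWORDS = {"the", "is", "a", "hai", "ka", "ki", "ke"}

# Translation table that maps every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, stopwords=STOPWORDS):
    """Approximate the four steps: digits, punctuation, stopwords, whitespace."""
    text = re.sub(r"\d+", " ", text)        # remove digits
    text = text.translate(PUNCT_TO_SPACE)   # remove punctuation
    tokens = [t for t in text.split() if t.lower() not in stopwords]  # remove stopwords
    return " ".join(tokens)                 # collapse excess whitespace


print(preprocess("Yeh movie 2 good hai!!!"))  # Yeh movie good
```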

Training data

The dataset and annotations are not good, but this is the best dataset I could find. I am working on procuring my own dataset and will try to come up with a better model!

Training procedure

I fine-tuned the bert-base-multilingual-cased model on this dataset.
