metadata
license: apache-2.0
language:
  - en
metrics:
  - accuracy
pipeline_tag: text-classification
tags:
  - Crypto
  - Bitcoin
  - Sentiment Analysis
  - RoBERTa
  - NLP
  - Cryptocurrency

CryptoBERTRefined

CryptoBERTRefined is a fine-tuned version of the CryptoBERT model by Elkulako.

Classification Example

Input:

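# Notebook-only install command; in a terminal, run `pip install transformers` instead.
!pip -q install transformers
from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and the 3-class model (Bearish, Neutral, Bullish) from the same checkpoint.
model_name = "AfterRain007/cryptobertRefined"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
# Truncate or pad every input to 128 tokens.
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, max_length=128, truncation=True, padding='max_length')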

post_3 = "Because Forex Markets have years of solidity and millions in budget, not to mention that they use their own datacenters. These lame cryptomarkets are all supported by some Amazon-cloud-style system. They delegate and delegate their security and in the end, get buttfucked..." 
post_2 = "Russian crypto market worth $500B despite bad regulation, says exec https://t.co/MZFoZIr2cN #CryptoCurrencies #Bitcoin #Technical Analysis" 
post_1 = "I really wouldn't be asking strangers such an important question. I'm sure you'd get well meaning answers but you probably need professional advice."

# Plain Python list of texts; predictions come back in the same order.
posts = [post_1, post_2, post_3]
preds = pipe(posts)
print(preds)

Output:

[{'label': 'Neutral', 'score': 0.8427615165710449}, {'label': 'Bullish', 'score': 0.5444369912147522}, {'label': 'Bearish', 'score': 0.8388379812240601}]
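
The pipeline returns one dictionary per post, in input order, holding the top label and its score. The sketch below shows one way such output might be post-processed. It reuses the predictions printed above, so it runs without downloading the model. The `confident_labels` helper and the 0.6 threshold are illustrative assumptions, not part of the model or the pipeline.

```python
from collections import Counter

# Predictions copied from the example output above.
preds = [
    {"label": "Neutral", "score": 0.8427615165710449},
    {"label": "Bullish", "score": 0.5444369912147522},
    {"label": "Bearish", "score": 0.8388379812240601},
]


def confident_labels(preds, threshold=0.6):
    """Keep a label only if its score reaches the (hypothetical) threshold."""
    return [p["label"] if p["score"] >= threshold else "Uncertain" for p in preds]


labels = confident_labels(preds)
counts = Counter(labels)  # how many posts fall under each label
print(labels)
print(counts)
```

With this threshold, the second post is marked "Uncertain" because its Bullish score is only about 0.54. A different cut-off may suit other use cases.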

Training Corpus

A total of 3,803 texts were labelled manually to fine-tune the model. Only non-duplicate texts with at least 4 words after cleaning were kept. The following sources were used for the training dataset:

  1. Bitcoin tweet dataset from Kaggle Datasets (randomly picked).
  2. Labelled crypto sentiment dataset from SurgeAI.
  3. "Daily Discussion" threads from the r/Bitcoin subreddit (randomly picked).
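
The selection rule described above (no duplicates, at least 4 words after cleaning) could be sketched as follows. The actual cleaning steps are not specified here, so `clean_text` is a hypothetical stand-in that only removes URLs and collapses whitespace. Duplicate detection that ignores case is also an assumption.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical cleaning step: drop URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()


def filter_corpus(texts, min_words=4):
    """Keep the first occurrence of each cleaned text with at least min_words words."""
    seen = set()
    kept = []
    for raw in texts:
        cleaned = clean_text(raw)
        key = cleaned.lower()  # assumption: duplicates are compared case-insensitively
        if len(cleaned.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept


sample = [
    "Bitcoin is going to the moon https://t.co/abc",
    "bitcoin is going to the moon",  # duplicate once cleaned and lowercased
    "HODL",                          # fewer than 4 words
    "I sold everything yesterday and regret it",
]
filtered = filter_corpus(sample)
print(filtered)
```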

Data augmentation was also performed to enrich the dataset. Back-translation was applied with the Google Translate API through 10 languages ('it', 'fr', 'sv', 'da', 'pt', 'id', 'pl', 'hr', 'bg', 'fi').
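
Back-translation sends an English text to a pivot language and back, keeping any round trip that differs from the original as a paraphrase. The sketch below illustrates that loop over the ten language codes listed above. The translator is passed in as a plain callable so the example runs offline. `fake_translate` is a hypothetical stand-in for a real Google Translate call and does not reflect that API's interface.

```python
from typing import Callable, List, Sequence

# Pivot languages listed in the text above.
PIVOT_LANGUAGES = ("it", "fr", "sv", "da", "pt", "id", "pl", "hr", "bg", "fi")

# (text, source_lang, target_lang) -> translated text
Translate = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translate,
                   pivots: Sequence[str] = PIVOT_LANGUAGES) -> List[str]:
    """Return the distinct English round trips that differ from the original text."""
    seen = {text}
    variants = []
    for lang in pivots:
        pivot_text = translate(text, "en", lang)        # English -> pivot language
        round_trip = translate(pivot_text, lang, "en")  # pivot language -> English
        if round_trip not in seen:
            seen.add(round_trip)
            variants.append(round_trip)
    return variants


def fake_translate(text: str, src: str, dest: str) -> str:
    """Hypothetical offline translator used only to make this sketch runnable."""
    if dest != "en":
        return f"<{dest}>{text}"      # tag the text with the pivot language
    body = text.split(">", 1)[1]      # strip the tag on the way back
    # Pretend that two pivot languages come back with a paraphrase.
    return body.replace("buying", "purchasing") if src in ("it", "fr") else body


variants = back_translate("I am buying more Bitcoin", fake_translate)
print(variants)
```

Round trips that come back identical to the original, or to an earlier variant, are dropped so the augmented set does not reintroduce duplicates.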

Source Code

See GitHub for the source code used to fine-tune the CryptoBERT model into CryptoBERTRefined.

Credit

Credit where credit is due. Thank you all!

  1. Muhaza Liebenlito, M.Si and Prof. Dr. Nur Inayah, M.Si. as my academic advisors.
  2. Risky Amalia Marhariyadi for helping label the dataset.
  3. SurgeAI for the dataset.
  4. Mikolaj Kulakowski and Flavius Frasincar for the original CryptoBERT model.
  5. Kaushik Suresh for the bitcoin tweets.