---
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- Crypto
- Bitcoin
- Sentiment Analysis
- RoBERTa
- NLP
- Cryptocurrency
---

# CryptoBERTRefined

CryptoBERTRefined is a model fine-tuned from [CryptoBERT by ElKulako](https://huggingface.co/ElKulako/cryptobert).

# Classification Example

Input:

```python
# Notebook shell command; from a terminal, run `pip install transformers` instead.
!pip -q install transformers

from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and the model from the same checkpoint.
model_name = "AfterRain007/cryptobertRefined"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Pad or truncate every input to 128 tokens.
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, max_length=128, truncation=True, padding='max_length')

post_3 = "Because Forex Markets have years of solidity and millions in budget, not to mention that they use their own datacenters. These lame cryptomarkets are all supported by some Amazon-cloud-style system. They delegate and delegate their security and in the end, get buttfucked..."
post_2 = "Russian crypto market worth $500B despite bad regulation, says exec https://t.co/MZFoZIr2cN #CryptoCurrencies #Bitcoin #Technical Analysis"
post_1 = "I really wouldn't be asking strangers such an important question. I'm sure you'd get well meaning answers but you probably need professional advice."

# Predictions are returned in the same order as the inputs.
df_posts = [post_1, post_2, post_3]
preds = pipe(df_posts)
print(preds)
```

Output:

```python
[{'label': 'Neutral', 'score': 0.8427615165710449}, {'label': 'Bullish', 'score': 0.5444369912147522}, {'label': 'Bearish', 'score': 0.8388379812240601}]
```

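Because the posts are defined in reverse order but passed as `[post_1, post_2, post_3]`, it can help to pair each prediction with its input explicitly. The following is a minimal sketch that uses the scores printed above; the `summarize` helper and the placeholder names are illustrative assumptions and are not part of the model's API.

```python
# Predictions copied from the output above, in the same order as df_posts.
preds = [
    {"label": "Neutral", "score": 0.8427615165710449},
    {"label": "Bullish", "score": 0.5444369912147522},
    {"label": "Bearish", "score": 0.8388379812240601},
]

# Placeholder names standing in for the full text of each post.
post_names = ["post_1", "post_2", "post_3"]


def summarize(names, predictions):
    """Pair each input name with its predicted label and rounded score."""
    return [
        (name, pred["label"], round(pred["score"], 3))
        for name, pred in zip(names, predictions)
    ]


summary = summarize(post_names, preds)
for name, label, score in summary:
    print(f"{name}: {label} ({score:.3f})")
```
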
# Training Corpus

A total of 3,803 texts were labelled manually to fine-tune the model. Duplicates were removed, and every text contains at least 4 words after cleaning. The following sources were used for the training dataset:

1. Bitcoin tweet dataset from [Kaggle Datasets](https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets) (randomly sampled).
2. Labelled crypto sentiment dataset from [SurgeAI](https://www.surgehq.ai/datasets/crypto-sentiment-dataset).
3. Reddit r/Bitcoin "Daily Discussion" threads (randomly sampled).

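The selection rule described above (no duplicates, at least 4 words after cleaning) could be sketched as follows. The cleaning step here (dropping URLs and collapsing whitespace) and the case-insensitive duplicate check are assumptions made for illustration; the actual cleaning logic lives in the linked source code and may differ.

```python
import re

MIN_WORDS = 4  # minimum word count stated in the section above


def clean(text):
    """Hypothetical cleaning step: drop URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def filter_corpus(texts, min_words=MIN_WORDS):
    """Keep cleaned texts that are long enough and not seen before."""
    seen = set()
    kept = []
    for raw in texts:
        cleaned = clean(raw)
        key = cleaned.lower()  # assumption: duplicates are compared case-insensitively
        if len(cleaned.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept


samples = [
    "Bitcoin to the moon https://t.co/abc",  # 4 words after cleaning, kept
    "bitcoin to the moon",                   # duplicate of the first, dropped
    "HODL forever",                          # fewer than 4 words, dropped
    "I think BTC will drop next week",       # kept
]
print(filter_corpus(samples))
```
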
Data augmentation was also performed to enrich the dataset: each text was back-translated with the Google Translate API through 10 languages (`it`, `fr`, `sv`, `da`, `pt`, `id`, `pl`, `hr`, `bg`, `fi`).

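As a rough illustration of back-translation, the sketch below sends a text from English to each pivot language and back, then collects the round-trip results. The `translate(text, src, dest)` callable is a hypothetical interface, not the Google Translate API itself, and `fake_translate` is only a stand-in so the example runs offline. Dropping round trips that match the original or each other is also an assumption.

```python
from typing import Callable, List, Sequence

# Pivot languages listed in the section above.
PIVOT_LANGS = ("it", "fr", "sv", "da", "pt", "id", "pl", "hr", "bg", "fi")

# Hypothetical translator interface: translate(text, src, dest) -> translated text.
Translator = Callable[[str, str, str], str]


def back_translate(
    text: str, translate: Translator, pivots: Sequence[str] = PIVOT_LANGS
) -> List[str]:
    """Return variants of `text` produced by en -> pivot -> en round trips."""
    variants: List[str] = []
    for lang in pivots:
        pivot_text = translate(text, "en", lang)
        round_trip = translate(pivot_text, lang, "en")
        # Assumption: skip round trips identical to the original or already collected.
        if round_trip != text and round_trip not in variants:
            variants.append(round_trip)
    return variants


def fake_translate(text: str, src: str, dest: str) -> str:
    """Offline stand-in: tags the text instead of translating it."""
    if dest == "en":
        # Strip the "<lang>:" tag and note which pivot language was used.
        return text.split(":", 1)[1] + f" (via {src})"
    return f"{dest}:{text}"


variants = back_translate("Bitcoin is rising", fake_translate)
print(len(variants), variants[0])
```

In practice, `fake_translate` would be replaced by a thin wrapper around whichever translation client is in use.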
# Source Code

See [GitHub](https://github.com/AfterRain007/cryptobertRefined) for the source code used to fine-tune CryptoBERT into CryptoBERTRefined.

# Credit

Credit where credit is due. Thank you all!

1. Muhaza Liebenlito, M.Si and Prof. Dr. Nur Inayah, M.Si. as my academic advisors.
2. Risky Amalia Marhariyadi for helping label the dataset.
3. SurgeAI for the dataset.
4. Mikolaj Kulakowski and Flavius Frasincar for the original CryptoBERT model.
5. Kaushik Suresh for the Bitcoin tweets.