---
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- Crypto
- Bitcoin
- Sentiment Analysis
- RoBERTa
- NLP
- Cryptocurrency
---
# CryptoBERTRefined
CryptoBERTRefined is a model fine-tuned from [CryptoBERT by ElKulako](https://huggingface.co/ElKulako/cryptobert).
# Classification Example
Input:
```python
# Install the dependency first (notebook syntax): !pip -q install transformers
from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer
# Hub ID of the model; used for both the tokenizer and the classifier
model_name = "AfterRain007/cryptobertRefined"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
# Pad or truncate every post to 128 tokens
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, max_length=128, truncation=True, padding='max_length')
post_3 = "Because Forex Markets have years of solidity and millions in budget, not to mention that they use their own datacenters. These lame cryptomarkets are all supported by some Amazon-cloud-style system. They delegate and delegate their security and in the end, get buttfucked..."
post_2 = "Russian crypto market worth $500B despite bad regulation, says exec https://t.co/MZFoZIr2cN #CryptoCurrencies #Bitcoin #Technical Analysis"
post_1 = "I really wouldn't be asking strangers such an important question. I'm sure you'd get well meaning answers but you probably need professional advice."
df_posts = [post_1, post_2, post_3]
preds = pipe(df_posts)
print(preds)
```
Output:
```python
[{'label': 'Neutral', 'score': 0.8427615165710449}, {'label': 'Bullish', 'score': 0.5444369912147522}, {'label': 'Bearish', 'score': 0.8388379812240601}]
```
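The pipeline returns one dictionary per input, in the same order as the input list. A minimal sketch of pairing each post with its prediction and flagging low-confidence results, using the output shown above; the helper name `summarize` and the 0.6 cutoff are illustrative assumptions, not part of the model.

```python
# Output copied from the example above; short tags stand in for the full posts.
preds = [
    {'label': 'Neutral', 'score': 0.8427615165710449},
    {'label': 'Bullish', 'score': 0.5444369912147522},
    {'label': 'Bearish', 'score': 0.8388379812240601},
]
posts = ["post_1", "post_2", "post_3"]

def summarize(posts, preds, min_score=0.6):
    """Pair each post with its label; mark scores below min_score (hypothetical cutoff)."""
    rows = []
    for post, pred in zip(posts, preds):
        rows.append({
            "post": post,
            "label": pred["label"],
            "score": round(pred["score"], 3),
            "confident": pred["score"] >= min_score,
        })
    return rows

for row in summarize(posts, preds):
    print(row)
```

With this cutoff, only the "Bullish" prediction for `post_2` is flagged as not confident; a suitable threshold depends on the application.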
# Training Corpus
A total of 3,803 texts were manually labelled to fine-tune the model. Only non-duplicate texts with at least 4 words after cleaning were kept. The following sources were used for the training dataset:
1. Bitcoin tweet dataset from [Kaggle Datasets](https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets) (Randomly picked).
2. Labelled crypto sentiment dataset from [SurgeAI](https://www.surgehq.ai/datasets/crypto-sentiment-dataset).
3. Reddit r/Bitcoin threads with the topic "Daily Discussion" (Randomly picked).
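
The two filtering criteria mentioned above (no duplicates, at least 4 words after cleaning) can be sketched as follows. The `clean` function is a hypothetical stand-in that only strips URLs and collapses whitespace, and the case-insensitive duplicate check is likewise an assumption; the cleaning actually applied to the corpus may differ.

```python
import re

def clean(text):
    """Hypothetical cleaning step: drop URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def filter_corpus(texts, min_words=4):
    """Keep cleaned texts that are unique and have at least min_words words."""
    seen = set()
    kept = []
    for raw in texts:
        cleaned = clean(raw)
        key = cleaned.lower()  # assumption: duplicates are compared case-insensitively
        if len(cleaned.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept

samples = [
    "Bitcoin is going up today https://t.co/abc",
    "bitcoin is going up today",  # duplicate of the first after cleaning
    "To the moon",                # fewer than 4 words
]
print(filter_corpus(samples))
```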
Data augmentation was also performed to enrich the dataset: back-translation with the Google Translate API through 10 languages ('it', 'fr', 'sv', 'da', 'pt', 'id', 'pl', 'hr', 'bg', 'fi').
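
Back-translation translates a text into a pivot language and back to English, which tends to produce a paraphrase. A minimal sketch of the loop over the 10 pivot languages listed above; the `translate` argument is a placeholder for any function with the signature `(text, src, dest) -> str` (for example a wrapper around a Google Translate client), and `fake_translate` exists only so the example runs offline. Skipping unchanged or repeated round trips is an assumption, not a documented step.

```python
PIVOT_LANGS = ('it', 'fr', 'sv', 'da', 'pt', 'id', 'pl', 'hr', 'bg', 'fi')

def back_translate(text, translate, pivots=PIVOT_LANGS, src="en"):
    """Return paraphrases of text obtained via src -> pivot -> src round trips."""
    augmented = []
    for lang in pivots:
        forward = translate(text, src, lang)
        back = translate(forward, lang, src)
        # Assumption: drop round trips that return the input or repeat a paraphrase
        if back != text and back not in augmented:
            augmented.append(back)
    return augmented

def fake_translate(text, src, dest):
    """Offline stand-in for a real translation call; NOT a real translator."""
    if dest == "en":
        # Strip the language tag and simulate a paraphrase on the way back
        return text.split("]", 1)[1].replace("going up", "rising")
    return f"[{dest}]{text}"

print(back_translate("Bitcoin is going up", fake_translate))
```

In practice each paraphrase would inherit the label of its source text, so the augmented set needs no further manual labelling.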
# Source Code
See [GitHub](https://github.com/AfterRain007/cryptobertRefined) for the source code used to fine-tune the CryptoBERT model into CryptoBERTRefined.
# Credit
Credit where credit is due, thank you all!
1. Muhaza Liebenlito, M.Si and Prof. Dr. Nur Inayah, M.Si. as my academic advisors.
2. Risky Amalia Marhariyadi for helping label the dataset.
3. SurgeAI for the dataset.
4. Mikolaj Kulakowski and Flavius Frasincar for the original CryptoBERT model.
5. Kaushik Suresh for the bitcoin tweets.