AfterRain007
committed on
Update README.md
README.md CHANGED
---

# CryptoBERTRefined

CryptoBERTRefined is a fine-tuned version of [CryptoBERT by ElKulako](https://huggingface.co/ElKulako/cryptobert).

# Classification Example

Input:
```
!pip -q install transformers
from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and tokenizer from the Hugging Face Hub
model_name = "AfterRain007/cryptobertRefined"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, max_length=128, truncation=True, padding='max_length')

# Example posts covering bearish, bullish, and neutral sentiment
post_3 = "Because Forex Markets have years of solidity and millions in budget, not to mention that they use their own datacenters. These lame cryptomarkets are all supported by some Amazon-cloud-style system. They delegate and delegate their security and in the end, get buttfucked..."
post_2 = "Russian crypto market worth $500B despite bad regulation, says exec https://t.co/MZFoZIr2cN #CryptoCurrencies #Bitcoin #Technical Analysis"
post_1 = "I really wouldn't be asking strangers such an important question. I'm sure you'd get well meaning answers but you probably need professional advice."

df_posts = [post_1, post_2, post_3]
preds = pipe(df_posts)
print(preds)
```
Output:
```
[{'label': 'Neutral', 'score': 0.8427615165710449}, {'label': 'Bullish', 'score': 0.5444369912147522}, {'label': 'Bearish', 'score': 0.8388379812240601}]
```
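Note: by default the pipeline returns only the top label per text. To see the scores for all three labels, recent `transformers` versions accept `top_k=None` in the pipeline call (older versions use `return_all_scores=True` instead):
```
# Return the full score distribution over Bearish / Neutral / Bullish per text.
all_scores = pipe(df_posts, top_k=None)
print(all_scores[0])  # three {'label': ..., 'score': ...} dicts for post_1
```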

# Training Corpus

A total of 3,803 texts were labelled manually to fine-tune the model, keeping only non-duplicate texts with a minimum of 4 words after cleaning (see the filtering sketch after the list below). The following sources were used for the training dataset:
1. Bitcoin tweet dataset from [Kaggle Datasets](https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets) (randomly picked).
2. Labelled crypto sentiment dataset from [SurgeAI](https://www.surgehq.ai/datasets/crypto-sentiment-dataset).
3. Reddit thread r/Bitcoin with the topic "Daily Discussion" (randomly picked).
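
The cleaning pipeline itself is not published; purely as an illustration, the non-duplicate and minimum-word filters could look like the following sketch, where `clean_text` is a hypothetical stand-in for the actual preprocessing:
```
# Illustrative only: deduplicate and keep texts with at least 4 words after cleaning.
import re

def clean_text(text):
    """Hypothetical cleaning step: strip URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def filter_corpus(texts):
    seen, kept = set(), []
    for raw in texts:
        cleaned = clean_text(raw)
        # Keep only non-duplicate texts with a minimum of 4 words.
        if cleaned not in seen and len(cleaned.split()) >= 4:
            seen.add(cleaned)
            kept.append(cleaned)
    return kept
```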

Data augmentation was performed to enrich the dataset: back-translation with the Google Translate API across 10 languages ('it', 'fr', 'sv', 'da', 'pt', 'id', 'pl', 'hr', 'bg', 'fi'), sketched below.
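
As a rough sketch of that augmentation step (the actual client and code are not published), back-translation can be approximated with the `deep-translator` package standing in for the Google Translate API:
```
# Hypothetical back-translation sketch; pip install deep-translator.
# deep-translator is an assumed stand-in for the "Google Translate API".
from deep_translator import GoogleTranslator

LANGS = ['it', 'fr', 'sv', 'da', 'pt', 'id', 'pl', 'hr', 'bg', 'fi']

def back_translate(text, pivot):
    """English -> pivot language -> English, producing a paraphrase."""
    pivot_text = GoogleTranslator(source='en', target=pivot).translate(text)
    return GoogleTranslator(source=pivot, target='en').translate(pivot_text)

# Each labelled text yields up to 10 paraphrases that keep its sentiment label.
def augment(texts):
    return [back_translate(t, lang) for t in texts for lang in LANGS]
```
Each pivot language yields a differently worded paraphrase, so one labelled text can expand into as many as ten augmented examples with the same label.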