AfterRain007 committed
Commit 6513634 · verified · 1 Parent(s): 1fd144d

Update README.md

Files changed (1)
  1. README.md +22 -4
README.md CHANGED
@@ -15,17 +15,35 @@ tags:
  ---
 
  # CryptoBERTRefined
- CryptoBERTRefined is a fine tuned model from [CryptoBERT by Elkulako](https://huggingface.co/ElKulako/cryptobert) model (See the base model to see it's training description).
 
  # Classification Example
  ```
- Import your code here!
  ```
 
  # Training Corpus
  Total of 3.803 text have been labelled manually to fine tune the model, with consideration of non-duplicate and a minimum of 4 words after cleaning. The following website were used for our training dataset:
- 1. Bitcoin tweet dataset from [kaggle datasets](https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets) (Randomly picked).
- 2. Labelled crypto sentiment dataset from [surgeAI](https://www.surgehq.ai/datasets/crypto-sentiment-dataset).
  3. Reddit thread r/Bitcoin with the topic "Daily Discussion" (Randomly picked).
  Data augmentation is done to enrich the dataset, Back-Translation were used with Google Translate API on 10 language ('it', 'fr', "sv", "da", 'pt', 'id', 'pl', 'hr', "bg", "fi").
 
  ---
 
  # CryptoBERTRefined
+ CryptoBERTRefined is a model fine-tuned from [CryptoBERT by ElKulako](https://huggingface.co/ElKulako/cryptobert).
 
  # Classification Example
+ Input:
  ```
+ !pip -q install transformers
+ from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer
+
+ # Load the fine-tuned model and tokenizer from the Hugging Face Hub.
+ model_name = "AfterRain007/cryptobertRefined"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
+ pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, max_length=128, truncation=True, padding='max_length')
+
+ post_3 = "Because Forex Markets have years of solidity and millions in budget, not to mention that they use their own datacenters. These lame cryptomarkets are all supported by some Amazon-cloud-style system. They delegate and delegate their security and in the end, get buttfucked..."
+ post_2 = "Russian crypto market worth $500B despite bad regulation, says exec https://t.co/MZFoZIr2cN #CryptoCurrencies #Bitcoin #Technical Analysis"
+ post_1 = "I really wouldn't be asking strangers such an important question. I'm sure you'd get well meaning answers but you probably need professional advice."
+
+ df_posts = [post_1, post_2, post_3]
+ preds = pipe(df_posts)
+ print(preds)
+ ```
+ Output:
+ ```
+ [{'label': 'Neutral', 'score': 0.8427615165710449}, {'label': 'Bullish', 'score': 0.5444369912147522}, {'label': 'Bearish', 'score': 0.8388379812240601}]
  ```
 
  # Training Corpus
  A total of 3,803 texts were labelled manually to fine-tune the model, keeping only non-duplicates with a minimum of 4 words after cleaning. The following sources were used for the training dataset:
+ 1. Bitcoin tweet dataset from [Kaggle Datasets](https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets) (Randomly picked).
+ 2. Labelled crypto sentiment dataset from [SurgeAI](https://www.surgehq.ai/datasets/crypto-sentiment-dataset).
  3. Reddit thread r/Bitcoin with the topic "Daily Discussion" (Randomly picked).
  Data augmentation was done to enrich the dataset: back-translation with the Google Translate API across 10 languages ('it', 'fr', 'sv', 'da', 'pt', 'id', 'pl', 'hr', 'bg', 'fi'), as sketched below.
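
The augmentation code itself is not part of this commit. Below is a minimal back-translation sketch, assuming the `deep-translator` package as the Google Translate client (the README does not say which client was actually used); `back_translate` and `augment` are hypothetical helper names for illustration, not code from this repository.
```
# Back-translation sketch. Assumption: deep-translator (pip install
# deep-translator) stands in for whichever Google Translate client was
# actually used; back_translate/augment are hypothetical helper names.
from deep_translator import GoogleTranslator

# The 10 pivot languages listed in the README.
PIVOTS = ['it', 'fr', 'sv', 'da', 'pt', 'id', 'pl', 'hr', 'bg', 'fi']

def back_translate(text, pivot):
    """Paraphrase text by translating English -> pivot -> English."""
    forward = GoogleTranslator(source='en', target=pivot).translate(text)
    return GoogleTranslator(source=pivot, target='en').translate(forward)

def augment(texts):
    """Enrich a corpus with back-translated paraphrases."""
    augmented = set(texts)  # a set drops exact duplicates
    for text in texts:
        for pivot in PIVOTS:
            paraphrase = back_translate(text, pivot)
            # Mirror the README's cleaning rule: keep texts of 4+ words.
            if paraphrase and len(paraphrase.split()) >= 4:
                augmented.add(paraphrase)
    return sorted(augmented)

print(augment(["Bitcoin is pumping hard today, the bulls are back!"]))
```
In the real pipeline each paraphrase would keep the sentiment label of its source text, which is what makes back-translation a cheap way to grow a labelled corpus.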
49