zohfur committed e8bfc31 (verified) · Parent: b997c9b

Update README.md

Files changed (1): README.md (+184 −3)

---
license: mit
language:
- en
base_model: distilbert/distilbert-base-uncased
library_name: transformers
tags:
- distilbert
- bert
- text-classification
- commission-detection
- social-media
pipeline_tag: text-classification
datasets:
- custom
model-index:
- name: distilbert-commissions
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Scraped Social Media Profiles (Bluesky & Twitter)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9506
      verified: false
    - name: Precision
      type: precision
      value: 0.9513
      verified: false
    - name: Recall
      type: recall
      value: 0.9506
      verified: false
    - name: F1 Score
      type: f1
      value: 0.9508
      verified: false
---

# DistilBERT Commission Detection Model

## Model Description

This is a fine-tuned DistilBERT model for detecting commission-related content in social media profiles and posts. It classifies a profile name, bio, or post to determine whether an artist is open for commissions, closed for commissions, or whether the text is unclear.

## Model Details

### Model Architecture

- **Base Model**: [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
- **Model Type**: Text Classification
- **Language**: English
- **License**: MIT

### Training Data

- **Sources**: Profile names, bios, and posts manually scraped from Bluesky and Twitter, with classifications uploaded by a crowd of furry volunteers via a custom browser extension built specifically to create this dataset
- **Dataset**: Custom dataset of roughly 1,000 user-classified rows, plus an equal amount of synthetic data to strengthen pattern recognition

## Performance

| Metric | Value |
|--------|-------|
| Accuracy | 95.06% |
| Precision | 95.13% |
| Recall | 95.06% |
| F1 Score | 95.08% |

*Note: These metrics have not been independently verified.*
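
The training details below mention that evaluation used `sklearn.metrics`; as a point of reference, here is a minimal sketch of how weighted metrics like these are typically computed. The exact evaluation script is not published, and `y_true`/`y_pred` are hypothetical placeholders:

```python
# Hypothetical sketch: weighted evaluation metrics with scikit-learn.
# y_true / y_pred are placeholders, not the model's actual eval split.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 0, 1, 2]  # gold labels (0=open, 1=closed, 2=unclear)
y_pred = [0, 1, 2, 0, 2, 2]  # model predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"Accuracy: {accuracy:.4f} | Precision: {precision:.4f} | "
      f"Recall: {recall:.4f} | F1: {f1:.4f}")
```

With weighted averaging, recall reduces to overall accuracy, which is consistent with the identical accuracy and recall values in the table above.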

## Usage

I recommend a high temperature when running inference to lower the model's confidence; I use values between 1.5 and 3.0. Dividing the logits by a temperature greater than 1 flattens the softmax distribution, which reduces overconfident predictions on borderline text.

```python
# Example inference
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch

# Load model and tokenizer
model_name = 'zohfur/distilbert-commissions'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Example inputs
example_sentences = [
    "Commissions are currently closed.",
    "Check my bio for commission status.",
    "C*mms 0pen on p-site",
    "DM for comms",
    "Taking art requests, dm me",
    "comm completed for personmcperson, thank you <3",
    "open for trades",
    "Comms are not open",
    "Comms form will be open soon, please check back later",
    "~ Furry artist - 25 y.o - he/him - c*mms 0pen: 2/5 - bots dni ~"
]

# Map label integers back to strings
label_map = {0: 'open', 1: 'closed', 2: 'unclear'}

def predict_with_temperature(model, tokenizer, sentences, temperature=1.5):
    # Tokenize the batch and move everything to the available device
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    encoded_input = {key: value.to(device) for key, value in encoded_input.items()}
    model.to(device)
    model.eval()

    # Make predictions with temperature scaling
    with torch.no_grad():
        outputs = model(**encoded_input)
        logits = outputs.logits / temperature  # Apply temperature scaling
        probabilities = torch.softmax(logits, dim=1)

    # Extract predicted classes and confidence scores
    predicted_class_indices = torch.argmax(probabilities, dim=1)
    confidences = torch.max(probabilities, dim=1).values

    # Collect results
    predictions = {
        'sentences': sentences,
        'labels': [label_map[idx.item()] for idx in predicted_class_indices],
        'confidences': [score.item() for score in confidences]
    }

    return predictions

def print_predictions(predictions):
    """Print formatted predictions with confidence scores."""
    print("\nClassification Results:")
    print("=" * 50)
    for i, (sentence, label, confidence) in enumerate(zip(
            predictions['sentences'],
            predictions['labels'],
            predictions['confidences']
    ), 1):
        print(f"\n{i}. Sentence: '{sentence}'")
        print(f"   Predicted Label: {label}")
        print(f"   Confidence Score: {confidence:.4f}")

# Make predictions with temperature scaling and print the results
predictions = predict_with_temperature(model, tokenizer, example_sentences, temperature=1.5)
print_predictions(predictions)
```
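
For quick experiments, the model should also load through the standard `transformers` pipeline API. Note that the pipeline applies a plain softmax with no temperature scaling, so confidence scores will run higher than with the function above, and the label names it returns depend on the `id2label` mapping stored in the model config:

```python
from transformers import pipeline

# Minimal sketch; no temperature scaling is applied here.
classifier = pipeline("text-classification", model="zohfur/distilbert-commissions")
print(classifier("DM for comms"))  # e.g. [{'label': ..., 'score': ...}]
```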

## Limitations and Biases

### Limitations

- **Language**: Only trained on English text
- **False Positives**: Requires a high temperature to avoid false positives (particularly around the words "open" and "closed")
- **Platform Bias**: Trained on Bluesky and Twitter/X data, so it may not perform as well on other platforms such as FurAffinity or Instagram

## Training Details

### Training Procedure

- **Base Model**: DistilBERT base uncased
- **Fine-tuning**: Fine-tuned with Hugging Face's `Trainer`; evaluated with `Trainer` and `sklearn.metrics`
- **Optimization**: Weights & Biases hyperparameter sweep using Bayesian optimization to maximize the F1 score, as sketched below

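The exact sweep configuration is not published; a minimal sketch of a Bayesian sweep targeting evaluation F1 with the W&B Python API might look like the following, where the search space is an illustrative assumption and `train` is a hypothetical wrapper around the `Trainer` run:

```python
# Hypothetical sketch of a W&B Bayesian hyperparameter sweep;
# the real configuration and search space are unpublished.
import wandb

sweep_config = {
    "method": "bayes",  # Bayesian optimization
    "metric": {"name": "eval/f1", "goal": "maximize"},
    "parameters": {  # illustrative search space
        "learning_rate": {"min": 1e-5, "max": 5e-5},
        "num_train_epochs": {"values": [2, 3, 4]},
        "per_device_train_batch_size": {"values": [8, 16, 32]},
    },
}

def train():
    # Hypothetical wrapper: build TrainingArguments from wandb.config,
    # run Trainer.train(), and log eval metrics (including eval/f1).
    ...

sweep_id = wandb.sweep(sweep_config, project="distilbert-commissions")
wandb.agent(sweep_id, function=train, count=20)
```
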
### Data Preprocessing

- Classifications uploaded voluntarily by users of the crowdsourcing extension
- Problematic Unicode characters cleaned from the dataset
- Labels encoded as integers for classification
- Class weights computed inversely proportional to class frequencies, as sketched below

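A minimal sketch of this weighting step, assuming scikit-learn's "balanced" heuristic and a weighted loss wired into a `Trainer` subclass (the actual training script is not published; `train_labels` is a hypothetical placeholder):

```python
# Hypothetical sketch: class weights inversely proportional to class
# frequency, applied through a weighted cross-entropy loss in Trainer.
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight
from transformers import Trainer

train_labels = np.array([0, 0, 0, 1, 2, 2])  # placeholder integer labels

weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(train_labels), y=train_labels
)
class_weights = torch.tensor(weights, dtype=torch.float)

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=class_weights.to(outputs.logits.device)
        )
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```
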
## Model Card Authors

All credit to the original author, Zohfur. The base model is attributed to the DistilBERT team.

## Model Card Contact

For questions or concerns about this model, please contact: [[email protected]]