---
tags:
- sklearn
- text-classification
language:
- nl
metrics:
- accuracy
- hamming-loss
---

# Model card for NOS Drug-Related Text Classification on Telegram

The NOS editorial team is investigating drug-related messages on Telegram. Thousands of Telegram messages have been labeled as drug-related or not, along with details about the specific type of drugs and the delivery method. The labeled data is used to train a model that scales this work up and automatically labels millions more messages.

## Methodology

The primary model is a logistic regression trained for binary classification. Text was converted to numeric features with scikit-learn's `TfidfVectorizer`, which weights each term by its term frequency-inverse document frequency (TF-IDF). This representation lets the model learn which terms are indicative of drug-related content. The model achieved 97% accuracy on the test set.
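
As a minimal sketch of the setup described above, the following chains a `TfidfVectorizer` and a `LogisticRegression` in a scikit-learn `Pipeline`. The toy messages, labels, and variable names are hypothetical and serve only as illustration; the actual training data, preprocessing, and hyperparameters of the NOS model are not documented here and may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data (not the NOS dataset): 1 = drug ad, 0 = no drug ad
train_texts = [
    "coke en xtc per gram te koop",
    "wiet en hasj snel bezorgd",
    "oude kleding te koop",
    "fiets te koop, stuur een berichtje",
]
train_labels = [1, 1, 0, 0]

# TF-IDF turns each message into a sparse, weighted term vector;
# logistic regression then learns one weight per term.
binary_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression()),
])
binary_clf.fit(train_texts, train_labels)

# Predict on unseen messages; the result is an array of 0/1 class labels
predictions = binary_clf.predict(["xtc te koop", "kleding repareren"])
print(predictions)
```

Wrapping both steps in one pipeline means raw strings can be passed straight to `predict`, which matches how the published model is called in the getting-started code.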

To handle tasks where a message can carry several labels, the model was extended with a `MultiOutputClassifier`. This covers the case where one text message belongs to multiple categories, such as "soft drugs", "hard drugs", and "medicines". The labels were one-hot encoded for the multi-label setup.
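
The multi-label extension could look roughly like the sketch below. It assumes that scikit-learn's `MultiLabelBinarizer` produces the 0/1 indicator columns (the encoding step actually used for the NOS model is not specified beyond "one-hot encoding"), and it uses hypothetical toy messages with three of the label names from this card.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data (not the NOS dataset); a message may carry 0..n labels
texts = [
    "wiet en hasj te koop",
    "coke en xtc per gram",
    "oxycodon pillen zonder recept",
    "wiet en coke snel bezorgd",
    "fiets te koop",
    "oude kleding te koop",
]
label_sets = [
    ["softdrugs"],
    ["harddrugs"],
    ["medicijnen"],
    ["softdrugs", "harddrugs"],
    [],
    [],
]

# Assumption: encode each label set as a row of 0/1 indicators,
# one column per label (columns are sorted alphabetically)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)

# MultiOutputClassifier fits one independent logistic regression per column
multi_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression())),
])
multi_clf.fit(texts, Y)

# Predict an indicator row and map it back to label names
pred = multi_clf.predict(["wiet en coke te koop"])
print(mlb.inverse_transform(pred))
```

Each label column needs both positive and negative examples in the training data, because the underlying logistic regression cannot be fitted on a single class.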

Performance was evaluated with Hamming loss, a metric suited to multi-label classification. The model reached a Hamming loss of 0.04, meaning that about 96% of the individual label assignments were correct.
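
To make the metric concrete, here is a small plain-Python sketch of Hamming loss next to exact-match (subset) accuracy. The indicator matrices are hypothetical and were chosen so that one label slot out of 25 is wrong; they are not taken from the NOS evaluation.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots where prediction and truth differ."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of messages whose full label row is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical indicator matrices: 5 messages x 5 labels
y_true = [
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
y_pred = [
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],  # second label missed: the only wrong slot
    [0, 0, 1, 0, 0],
]

print(hamming_loss(y_true, y_pred))     # 1 wrong slot out of 25
print(subset_accuracy(y_true, y_pred))  # 4 of 5 rows fully correct
```

The per-label accuracy is one minus the Hamming loss, which is why it can be noticeably higher than the share of messages labeled entirely correctly. scikit-learn ships the same metric as `sklearn.metrics.hamming_loss`.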

### Tools used to train the model
- Python
- scikit-learn
- pandas
- numpy

### How to Get Started with the Model

Use the code below to get started with the model.

```python
from joblib import load

# Load the trained model
clf = load('model.joblib')

# Example messages to classify.
# First: "Old clothing for sale! Send a message / We also do repairs!"
# Second: a price list for cocaine and XTC
text_messages = [
    """
    Oud kleding te koop! Stuur een berichtje
    We repareren ook!
    """,

    """
    COKE/XTC
    * 1Gram = €50
    * 5Gram = €230
    """]

# Output column index -> label name
# (bezorging = delivery, geendrugsad = no drug ad, medicijnen = medicines)
mapping = {0: "bezorging", 1: "bulk", 2: "designer", 3: "drugsad", 4: "geendrugsad",
           5: "harddrugs", 6: "medicijnen", 7: "pickup", 8: "post", 9: "softdrugs"}

# For each message, collect the names of all labels predicted as 1
labels = []

for message in clf.predict(text_messages):
    label = []
    for idx, labeled in enumerate(message):
        if labeled == 1:
            label.append(mapping[idx])
    labels.append(label)

print(labels)
```

## Details
- **Shared by:** Dutch Public Broadcasting Foundation (NOS)
- **Model type:** text-classification
- **Language:** Dutch
- **License:** Creative Commons Attribution-NonCommercial-NoDerivatives 4.0