nosdigitalmedia/telegram-drugs-classification

Text Classification

nosdigitalmedia committed on Jan 25, 2024

Commit 813ae9b · verified · 1 Parent(s): c78b4ec

Update README.md

Files changed (1)

README.md +68 -1

@@ -1,3 +1,70 @@
 ---
-license: apache-2.0
 ---

 ---
+tags:
+- sklearn
+- text-classification
+language:
+- nl
+metrics:
+- accuracy
+- hamming-loss
 ---
+# Model card for NOS Drug-Related Text Classification on Telegram
+The NOS editorial team is investigating drug-related messages on Telegram. Thousands of Telegram messages have been labeled as drug-related or not, along with details on the specific type of drugs and the delivery method. This data is used to train a model that can automatically label millions more messages.
+## Methodology
+A Logistic Regression model was first trained for binary classification. Text was converted to numeric features with scikit-learn's TfidfVectorizer, which weights terms by term frequency-inverse document frequency (TF-IDF). This transformation lets the model learn which words are associated with each class. The model achieved 97% accuracy on the test set.
+To handle messages that can carry several labels at once, the approach was extended with a MultiOutputClassifier. This makes it possible to associate one message with multiple categories such as "soft drugs", "hard drugs", and "medicines". The labels were one-hot encoded into a binary indicator matrix, one column per label.
+Performance was evaluated with Hamming Loss, a metric suited to multi-label classification that measures the fraction of individual label assignments that are wrong. The model reached a Hamming Loss of 0.04, meaning about 96% of label assignments are correct.
+### Tools used to train the model
+- Python
+- scikit-learn
+- pandas
+- numpy
+### How to Get Started with the Model
+Use the code below to get started with the model.
+```python
+from joblib import load
+# load the model
+clf = load('model.joblib')
+# make some predictions
+text_messages = [
+    """
+    Oud kleding te koop! Stuur een berichtje
+    We repareren ook!
+    """,
+    """
+    COKE/XTC
+    * 1Gram = €50
+    * 5Gram = €230
+    """]
+# map each output column index of the classifier to its label name
+mapping = {0:"bezorging", 1:"bulk", 2:"designer", 3:"drugsad", 4:"geendrugsad", 5:"harddrugs", 6:"medicijnen", 7: "pickup", 8: "post", 9:"softdrugs"}
+labels = []
+for message in clf.predict(text_messages):
+    label = []
+    for idx, labeled in enumerate(message):
+        if labeled == 1:
+            label.append(mapping[idx])
+    labels.append(label)
+print(labels)
+```
+## Details
+- **Shared by:** Dutch Public Broadcasting Foundation (NOS)
+- **Model type:** text-classification
+- **Language:** Dutch
+- **License:** Creative Commons Attribution Non Commercial No Derivatives 4.0
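
The approach described under Methodology (TF-IDF features, Logistic Regression, a MultiOutputClassifier for multiple labels, and Hamming Loss for evaluation) can be sketched roughly as follows. This is a minimal illustration and not the NOS training code: the toy messages, the two-label subset (`harddrugs`, `softdrugs`), and all hyperparameters are assumptions made for the example, and the actual model may be assembled differently.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# Toy, made-up messages (NOT the NOS dataset), for illustration only.
texts = [
    "coke xtc 1 gram 50 euro",
    "wiet hasj te koop",
    "oude kleding te koop",
    "fiets repareren stuur bericht",
    "xtc pillen bulk",
    "wiet bezorging vanavond",
]
# Binary indicator matrix, one column per label.
# Hypothetical subset of labels: column 0 = harddrugs, column 1 = softdrugs.
y = np.array([
    [1, 0],
    [0, 1],
    [0, 0],
    [0, 0],
    [1, 0],
    [0, 1],
])

# TF-IDF features feed one Logistic Regression per label column.
pipeline = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(LogisticRegression()),
)
pipeline.fit(texts, y)
pred = pipeline.predict(texts)  # array of 0/1 values, shape (n_messages, n_labels)

# Hamming Loss is the fraction of individual label assignments that are wrong.
# Worked example: 5 messages x 5 labels = 25 assignments, exactly 1 of them wrong.
y_true = np.array([
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
])
y_pred = y_true.copy()
y_pred[0, 1] = 1  # introduce a single wrong label
loss = hamming_loss(y_true, y_pred)  # 1 / 25 = 0.04
print(f"Hamming loss: {loss:.2f}, per-label accuracy: {1 - loss:.2f}")
```

A Hamming Loss of 0.04 therefore corresponds to 24 of 25 label assignments being correct, which is how the 96% per-label figure in the model card should be read. It does not mean that 96% of messages have every label correct.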