Malay Claim Classification Model
This is a fine-tuned BERT model built to classify claims in Malay (and English) into 21 categories.
Categories
The model classifies claims into the following categories:
- Politik (Politics)
- Perpaduan (Unity)
- Keluarga (Family)
- Belia (Youth)
- Perumahan (Housing)
- Internet (Internet)
- Pengguna (Consumer)
- Makanan (Food)
- Pekerjaan (Employment)
- Pengangkutan (Transportation)
- Sukan (Sports)
- Ekonomi (Economy)
- Hiburan (Entertainment)
- Jenayah (Crime)
- Alam Sekitar (Environment)
- Teknologi (Technology)
- Pendidikan (Education)
- Agama (Religion)
- Sosial (Social)
- Kesihatan (Health)
- Halal (Halal)
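For display purposes, the Malay category names can be paired with their English glosses in a small lookup table. The sketch below is illustrative only: `CATEGORY_GLOSSES` and `gloss` are hypothetical names, and it assumes the model returns the Malay label strings spelled exactly as listed here, which should be checked against `model.config.id2label`.

```python
# Malay category names paired with the English glosses from the list above.
# Assumption: the model's labels use these exact Malay spellings.
CATEGORY_GLOSSES = {
    "Politik": "Politics",
    "Perpaduan": "Unity",
    "Keluarga": "Family",
    "Belia": "Youth",
    "Perumahan": "Housing",
    "Internet": "Internet",
    "Pengguna": "Consumer",
    "Makanan": "Food",
    "Pekerjaan": "Employment",
    "Pengangkutan": "Transportation",
    "Sukan": "Sports",
    "Ekonomi": "Economy",
    "Hiburan": "Entertainment",
    "Jenayah": "Crime",
    "Alam Sekitar": "Environment",
    "Teknologi": "Technology",
    "Pendidikan": "Education",
    "Agama": "Religion",
    "Sosial": "Social",
    "Kesihatan": "Health",
    "Halal": "Halal",
}


def gloss(label: str) -> str:
    """Return 'Malay (English)' for a known label, or the label unchanged."""
    english = CATEGORY_GLOSSES.get(label)
    return f"{label} ({english})" if english else label


print(len(CATEGORY_GLOSSES))  # 21
print(gloss("Jenayah"))       # Jenayah (Crime)
```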
Base Model
Fine-tuned from bert-base-multilingual-cased, which supports both Malay and English text.
Example Usage
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "rmtariq/malay_classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Function to classify a claim
def classify_claim(claim):
    # Prepare the input (claims longer than 128 tokens are truncated)
    inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=128)

    # Get the prediction without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted class (argmax over the class dimension)
    logits = outputs.logits
    predicted_class_id = logits.argmax(dim=-1).item()

    # Get the confidence score (softmax probability of the predicted class)
    probabilities = torch.nn.functional.softmax(logits, dim=-1)[0]
    confidence = probabilities[predicted_class_id].item()

    # Map the class id to its category name
    category = model.config.id2label[predicted_class_id]
    return category, confidence

# Example claims
examples = [
    "Projek mega kerajaan penuh dengan ketirisan.",
    "Harga barang keperluan naik setiap bulan.",
    "Program vaksinasi tidak mencakupi golongan luar bandar.",
    "Makanan di hotel lima bintang tidak jelas status halalnya.",
]

# Classify each example
for claim in examples:
    category, confidence = classify_claim(claim)
    print(f"Claim: {claim}")
    print(f"Category: {category}")
    print(f"Confidence: {confidence:.4f}")
    print("-" * 50)
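The confidence score returned above is the softmax probability of the highest-scoring class. To make that step concrete without downloading the model, here is a minimal pure-Python sketch of the same calculation. The three logits are made-up values for illustration (the real model produces 21), and `softmax` here is a hand-written helper, not the PyTorch function.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three classes; not taken from the real model.
toy_logits = [2.0, 0.5, -1.0]
probs = softmax(toy_logits)

# The predicted class is the index with the highest probability,
# and the confidence is that probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[predicted]
print(predicted, f"{confidence:.4f}")
```

Because softmax normalizes across all classes, a confidence near 1.0 only says the top class dominates the others; it is not by itself a guarantee that the prediction is correct.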
Dataset
Fine-tuned on a custom dataset with 3,675 claims labeled by category, with an 80/20 train/test split.
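As a rough illustration of what an 80/20 split of 3,675 claims looks like, the sketch below shuffles placeholder records and holds out 20% for testing. It is an assumption-laden example, not the procedure used for this model: the card does not say whether the split was stratified by category or which seed was used, and `split_claims` is a hypothetical helper.

```python
import random


def split_claims(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items and hold out test_fraction of them as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records standing in for the 3,675 labeled claims.
claims = [f"claim_{i}" for i in range(3675)]
train, test = split_claims(claims)
print(len(train), len(test))  # 2940 735
```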
Evaluation
The model achieves high accuracy on the test set, with most predictions having confidence scores above 0.95.
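For readers who want to check these two quantities on their own data, the following sketch computes accuracy and the share of predictions with confidence above 0.95. The `summarize` helper and the four toy records are hypothetical and are not the model's actual results.

```python
def summarize(predictions, threshold=0.95):
    """Summarize (true_label, predicted_label, confidence) records."""
    total = len(predictions)
    correct = sum(1 for true, pred, _ in predictions if true == pred)
    confident = sum(1 for _, _, conf in predictions if conf > threshold)
    return {
        "accuracy": correct / total,
        "share_above_threshold": confident / total,
    }


# Toy records for illustration only; these are not the model's results.
toy_predictions = [
    ("Jenayah", "Jenayah", 0.99),
    ("Agama", "Agama", 0.97),
    ("Ekonomi", "Pengguna", 0.62),
    ("Politik", "Politik", 0.96),
]
summary = summarize(toy_predictions)
print(summary)  # {'accuracy': 0.75, 'share_above_threshold': 0.75}
```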
Specific Claim Patterns
The model includes special handling for specific claim patterns:
Police-related claims: Claims about the police chief, summons, or threats
- Example: "Ketua Polis Negara (KPN) Tan Sri Razarudin Husain hantar e-mel berkaitan saman dan berbaur ugutan kepada orang awam"
- Category: Jenayah (Crime)
Zakat-related claims: Claims about zakat fitrah, rice types, or payment validity
- Example: "Zakat fitrah tidak sah jika dibayar tidak mengikut jenis beras yang dimakan"
- Category: Agama (Religion)
Tax-related claims: Claims about government taxes, especially on palm oil
- Example: "Kerajaan akan memperkenalkan cukai khas minyak sawit mentah"
- Category: Ekonomi (Economy)
Consumer product claims: Claims about contact lenses or online sales
- Example: "Kanta lekap tidak boleh dijual secara dalam talian"
- Category: Pengguna (Consumer)
National security claims: Claims about ammunition, colonization, or enemies
- Example: "Penemuan 50 tan kelongsong dan peluru petanda negara bakal dijajah musuh"
- Category: Politik (Politics)
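The card does not document how this special handling is implemented. One plausible way to get similar behavior is a keyword lookup that runs before (or overrides) the classifier, sketched below. Everything in it is an assumption for illustration: `PATTERN_RULES`, `match_pattern`, and the keyword choices are hypothetical and are not taken from the model's code.

```python
# Hypothetical keyword rules; the real mechanism is not documented in the card.
PATTERN_RULES = [
    (("ketua polis", "saman", "ugutan"), "Jenayah"),
    (("zakat",), "Agama"),
    (("cukai",), "Ekonomi"),
    (("kanta lekap", "dalam talian"), "Pengguna"),
    (("peluru", "dijajah"), "Politik"),
]


def match_pattern(claim):
    """Return the category of the first rule with a keyword in the claim, else None."""
    text = claim.lower()
    for keywords, category in PATTERN_RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return None


print(match_pattern("Kerajaan akan memperkenalkan cukai khas minyak sawit mentah"))  # Ekonomi
print(match_pattern("Harga barang keperluan naik setiap bulan."))  # None
```

A claim that matches no rule returns `None`, in which case the caller would fall back to the classifier's own prediction. Plain substring matching is crude (a keyword can match inside a longer word), so a real implementation would likely need word-boundary checks.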
License
MIT License