|
--- |
|
datasets: |
|
- kenhktsui/FineFineWeb-First100K |
|
tags: |
|
- fasttext |
|
language: |
|
- en |
|
metrics: |
|
- f1 |
|
pipeline_tag: text-classification |
|
--- |
|
# finefineweb-domain-fasttext-classifier |
|
|
|
This is part of my [fasttext classifier collection](https://huggingface.co/collections/kenhktsui/fasttext-model-for-pretraining-data-curation-67220374c8acb97a1839553c) for curating pretraining datasets.

This classifier classifies a text into one of the domains specified in [m-a-p/FineFineWeb](https://huggingface.co/datasets/m-a-p/FineFineWeb).

It can be used for LLM pretraining data curation, to enhance model capability across many domains.

It is ultra fast ⚡, with a throughput of ~2,000 docs/s on CPU.
|
|
|
Don't underestimate the "old" fasttext classifier! It remains a good and scalable practice.

For example, [Qwen2.5-Math](https://arxiv.org/pdf/2409.12122) leverages fasttext to curate pretraining data, although its classifier is not open-sourced.
|
|
|
|
|
## 🛠️Usage |
|
```python |
|
from typing import List |
|
import re |
|
from huggingface_hub import hf_hub_download |
|
import fasttext |
|
|
|
|
|
model = fasttext.load_model(hf_hub_download("kenhktsui/finefineweb-domain-fasttext-classifier", "model.bin"))
|
|
|
|
|
def replace_newlines(text: str) -> str:

    # fasttext expects single-line inputs, so collapse newlines into spaces
    return re.sub(r"\n+", " ", text)
|
|
|
|
|
def predict(text_list: List[str]) -> List[dict]:

    text_list = [replace_newlines(text) for text in text_list]

    pred = model.predict(text_list)

    # strip the "__label__" prefix (9 characters) from each predicted label
    return [{"label": l[0][9:], "score": s[0]}

            for l, s in zip(*pred)]
|
|
|
|
|
predict( |
|
[ |
|
"Arsenal is the best team in the world", |
|
"Macroeconomics is a branch of economics that deals with the performance, structure, behavior, and decision-making of an economy as a whole.[1] This includes regional, national, and global economies.[2][3] Macroeconomists study topics such as output/GDP (gross domestic product) and national income, unemployment (including unemployment rates), price indices and inflation, consumption, saving, investment, energy, international trade, and international finance.", |
|
"Quantum entanglement is the phenomenon of a group of particles being generated, interacting, or sharing spatial proximity in a manner such that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance. The topic of quantum entanglement is at the heart of the disparity between classical physics and quantum physics: entanglement is a primary feature of quantum mechanics not present in classical mechanics.", |
|
"Any program written in a high-level programming language must be translated to object code before it can be executed, so all programmers using such a language use a compiler or an interpreter, sometimes even both. Improvements to a compiler may lead to a large number of improved features in executable programs." |
|
] |
|
) |
|
|
|
# [{'label': 'sports', 'score': 0.5640762}, |
|
# {'label': 'economics', 'score': 0.53133816}, |
|
# {'label': 'physics', 'score': 0.9524484}, |
|
# {'label': 'computer_science_and_technology', 'score': 0.41515663}] |
|
|
|
``` |
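For data curation, predictions like the above can be turned into a keep/drop decision per document. Below is a minimal sketch of threshold-based filtering; the function name `keep_for_domains`, the 0.5 default threshold, and the sample texts are illustrative assumptions, not part of this model card:

```python
from typing import Dict, List, Set


def keep_for_domains(
    texts: List[str],
    preds: List[Dict[str, object]],
    target_domains: Set[str],
    min_score: float = 0.5,  # illustrative threshold; tune for your corpus
) -> List[str]:
    """Keep documents whose predicted domain is targeted and confident enough."""
    return [
        text
        for text, pred in zip(texts, preds)
        if pred["label"] in target_domains and pred["score"] >= min_score
    ]


# Example with predictions shaped like the output of predict() above
texts = ["doc about physics", "doc about sports"]
preds = [
    {"label": "physics", "score": 0.95},
    {"label": "sports", "score": 0.31},
]
print(keep_for_domains(texts, preds, {"physics", "sports"}))  # ['doc about physics']
```

In practice you would plug the output of `predict()` into `preds` and pick the threshold by inspecting score distributions per domain.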
|
## 📊Evaluation |
|
Classification report (full version):
|
``` |
|
precision recall f1-score support |
|
|
|
aerospace 0.69 0.72 0.71 10000 |
|
agronomy 0.68 0.74 0.71 10000 |
|
artistic 0.37 0.24 0.29 10000 |
|
astronomy 0.67 0.76 0.71 10000 |
|
atmospheric_science 0.82 0.92 0.87 10000 |
|
automotive 0.66 0.74 0.70 10000 |
|
beauty 0.82 0.86 0.84 10000 |
|
biology 0.44 0.45 0.45 10000 |
|
celebrity 0.69 0.81 0.75 10000 |
|
chemistry 0.51 0.49 0.50 10000 |
|
christianity 0.80 0.84 0.82 10000 |
|
civil_engineering 0.58 0.58 0.58 10000 |
|
communication_engineering 0.63 0.67 0.65 10000 |
|
computer_science_and_technology 0.63 0.59 0.61 10000 |
|
design 0.51 0.42 0.46 10000 |
|
drama_and_film 0.53 0.53 0.53 10000 |
|
economics 0.34 0.26 0.29 10000 |
|
electronic_science 0.42 0.35 0.38 10000 |
|
entertainment 0.43 0.29 0.34 10000 |
|
environmental_science 0.42 0.35 0.38 10000 |
|
fashion 0.72 0.77 0.74 10000 |
|
finance 0.49 0.52 0.50 10000 |
|
food 0.81 0.86 0.83 10000 |
|
gamble 0.78 0.93 0.85 10000 |
|
game 0.67 0.67 0.67 10000 |
|
geography 0.42 0.33 0.37 10000 |
|
health 0.43 0.29 0.34 10000 |
|
history 0.64 0.71 0.67 10000 |
|
hobby 0.45 0.37 0.41 10000 |
|
hydraulic_engineering 0.95 0.98 0.96 10000 |
|
instrument_science 0.48 0.50 0.49 10000 |
|
journalism_and_media_communication 0.26 0.11 0.16 10000 |
|
landscape_architecture 0.78 0.83 0.80 10000 |
|
law 0.50 0.55 0.53 10000 |
|
library 0.53 0.51 0.52 10000 |
|
literature 0.52 0.53 0.52 10000 |
|
materials_science 0.49 0.50 0.50 10000 |
|
mathematics 0.87 0.90 0.88 10000 |
|
mechanical_engineering 0.48 0.37 0.42 10000 |
|
medical 0.41 0.42 0.41 10000 |
|
mining_engineering 0.84 0.93 0.89 10000 |
|
movie 0.59 0.71 0.64 10000 |
|
music_and_dance 0.75 0.86 0.80 10000 |
|
news 0.23 0.13 0.16 10000 |
|
nuclear_science 0.92 0.96 0.94 10000 |
|
ocean_science 0.83 0.92 0.88 10000 |
|
optical_engineering 0.70 0.78 0.74 10000 |
|
painting 0.91 0.96 0.94 10000 |
|
pet 0.91 0.95 0.93 10000 |
|
petroleum_and_natural_gas_engineering 0.92 0.96 0.94 10000 |
|
philosophy 0.63 0.66 0.64 10000 |
|
photo 0.80 0.85 0.82 10000 |
|
physics 0.40 0.35 0.37 10000 |
|
politics 0.38 0.41 0.39 10000 |
|
psychology 0.62 0.66 0.64 10000 |
|
public_administration 0.35 0.33 0.34 10000 |
|
relationship 0.84 0.88 0.86 10000 |
|
sociology 0.46 0.50 0.48 10000 |
|
sports 0.66 0.82 0.73 10000 |
|
statistics 0.60 0.70 0.65 10000 |
|
systems_science 0.53 0.53 0.53 10000 |
|
textile_science 0.81 0.86 0.83 10000 |
|
topicality 0.97 0.99 0.98 10000 |
|
transportation_engineering 0.51 0.52 0.51 10000 |
|
travel 0.68 0.72 0.70 10000 |
|
urban_planning 0.56 0.62 0.59 10000 |
|
weapons_science 0.97 0.99 0.98 10000 |
|
|
|
accuracy 0.64 670000 |
|
macro avg 0.62 0.64 0.63 670000 |
|
weighted avg 0.62 0.64 0.63 670000 |
|
|
|
``` |
|
|
|
|
|
## ⚠️Known Limitation |
|
The classifier does not handle short texts well, which is perhaps unsurprising: a bag-of-n-grams model has little signal to work with in only a few words.
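One simple mitigation is to skip very short inputs before classification and handle them with another signal. The sketch below assumes a 20-word cutoff, which is an illustrative value rather than a tuned one, and uses a stand-in predictor in place of the real model:

```python
from typing import Callable, List, Optional

MIN_WORDS = 20  # illustrative cutoff; tune on your own data


def classify_if_long_enough(
    text: str,
    predict_fn: Callable[[List[str]], List[dict]],
) -> Optional[dict]:
    """Return a prediction only when the text is long enough to be reliable."""
    if len(text.split()) < MIN_WORDS:
        return None  # too short: fall back to another signal or skip the doc
    return predict_fn([text])[0]


# Usage with a stand-in predictor (replace with predict() from the Usage section)
fake_predict = lambda batch: [{"label": "sports", "score": 0.9} for _ in batch]
print(classify_if_long_enough("Arsenal won", fake_predict))  # None
```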