|
--- |
|
datasets: |
|
- kenhktsui/FineFineWeb-First100K |
|
tags: |
|
- fasttext |
|
language: |
|
- en |
|
metrics: |
|
- f1 |
|
pipeline_tag: text-classification |
|
--- |
|
# finefineweb-domain-fasttext-classifier |
|
|
|
This is part of my [fasttext classifier collection](https://huggingface.co/collections/kenhktsui/fasttext-model-for-pretraining-data-curation-67220374c8acb97a1839553c) for curating pretraining datasets.

This classifier classifies a text into one of the domains specified in [m-a-p/FineFineWeb](https://huggingface.co/datasets/m-a-p/FineFineWeb).

It can be used for LLM pretraining data curation, to enhance model capability across many domains.

It is ultra fast ⚡, with a throughput of ~2,000 docs/s on CPU.
|
|
|
Don't underestimate the "old" fasttext classifier! It remains a good and scalable practice.

For example, [Qwen2.5-Math](https://arxiv.org/pdf/2409.12122) leverages fasttext to curate pretraining data, although its classifier is not open-sourced.
|
|
|
|
|
## 🛠️Usage |
|
```python |
|
from typing import List |
|
import re |
|
from huggingface_hub import hf_hub_download |
|
import fasttext |
|
|
|
|
|
model = fasttext.load_model(hf_hub_download("kenhktsui/finefineweb-domain-fasttext-classifier", "model.bin"))
|
|
|
|
|
def replace_newlines(text: str) -> str:

    # fasttext expects single-line inputs, so collapse newlines into spaces
    return re.sub(r"\n+", " ", text)
|
|
|
|
|
def predict(text_list: List[str]) -> List[dict]:

    text_list = [replace_newlines(text) for text in text_list]

    pred = model.predict(text_list)

    # strip the "__label__" prefix (9 characters) from each predicted label
    return [{"label": l[0][9:], "score": s[0]}

            for l, s in zip(*pred)]
|
|
|
|
|
predict( |
|
[ |
|
"Arsenal is the best team in the world", |
|
"Macroeconomics is a branch of economics that deals with the performance, structure, behavior, and decision-making of an economy as a whole.[1] This includes regional, national, and global economies.[2][3] Macroeconomists study topics such as output/GDP (gross domestic product) and national income, unemployment (including unemployment rates), price indices and inflation, consumption, saving, investment, energy, international trade, and international finance.", |
|
"Quantum entanglement is the phenomenon of a group of particles being generated, interacting, or sharing spatial proximity in a manner such that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance. The topic of quantum entanglement is at the heart of the disparity between classical physics and quantum physics: entanglement is a primary feature of quantum mechanics not present in classical mechanics.", |
|
"Any program written in a high-level programming language must be translated to object code before it can be executed, so all programmers using such a language use a compiler or an interpreter, sometimes even both. Improvements to a compiler may lead to a large number of improved features in executable programs." |
|
] |
|
) |
|
|
|
# [{'label': 'sports', 'score': 0.5640762}, |
|
# {'label': 'economics', 'score': 0.53133816}, |
|
# {'label': 'physics', 'score': 0.9524484}, |
|
# {'label': 'computer_science_and_technology', 'score': 0.41515663}] |
|
|
|
``` |
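For data curation, predictions like the above can be turned into a keep/drop decision per document. Below is a minimal sketch of threshold-based filtering; the function name `keep_for_domains`, the 0.5 default threshold, and the sample texts are illustrative assumptions, not part of this model card:

```python
from typing import Dict, List, Set


def keep_for_domains(
    texts: List[str],
    preds: List[Dict[str, object]],
    target_domains: Set[str],
    min_score: float = 0.5,  # illustrative threshold; tune for your corpus
) -> List[str]:
    """Keep documents whose predicted domain is targeted and confident enough."""
    return [
        text
        for text, pred in zip(texts, preds)
        if pred["label"] in target_domains and pred["score"] >= min_score
    ]


# Example with predictions shaped like the output of predict() above
texts = ["doc about physics", "doc about sports"]
preds = [
    {"label": "physics", "score": 0.95},
    {"label": "sports", "score": 0.31},
]
print(keep_for_domains(texts, preds, {"physics", "sports"}))  # ['doc about physics']
```

In practice you would plug the output of `predict()` into `preds` and pick the threshold by inspecting score distributions per domain.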
|
## 📊Evaluation |
|
Classification report (full version):
|
``` |
|
precision recall f1-score support |
|
|
|
aerospace 0.69 0.72 0.71 10000 |
|
agronomy 0.68 0.74 0.71 10000 |
|
artistic 0.37 0.24 0.29 10000 |
|
astronomy 0.67 0.76 0.71 10000 |
|
atmospheric_science 0.82 0.92 0.87 10000 |
|
automotive 0.66 0.74 0.70 10000 |
|
beauty 0.82 0.86 0.84 10000 |
|
biology 0.44 0.45 0.45 10000 |
|
celebrity 0.69 0.81 0.75 10000 |
|
chemistry 0.51 0.49 0.50 10000 |
|
christianity 0.80 0.84 0.82 10000 |
|
civil_engineering 0.58 0.58 0.58 10000 |
|
communication_engineering 0.63 0.67 0.65 10000 |
|
computer_science_and_technology 0.63 0.59 0.61 10000 |
|
design 0.51 0.42 0.46 10000 |
|
drama_and_film 0.53 0.53 0.53 10000 |
|
economics 0.34 0.26 0.29 10000 |
|
electronic_science 0.42 0.35 0.38 10000 |
|
entertainment 0.43 0.29 0.34 10000 |
|
environmental_science 0.42 0.35 0.38 10000 |
|
fashion 0.72 0.77 0.74 10000 |
|
finance 0.49 0.52 0.50 10000 |
|
food 0.81 0.86 0.83 10000 |
|
gamble 0.78 0.93 0.85 10000 |
|
game 0.67 0.67 0.67 10000 |
|
geography 0.42 0.33 0.37 10000 |
|
health 0.43 0.29 0.34 10000 |
|
history 0.64 0.71 0.67 10000 |
|
hobby 0.45 0.37 0.41 10000 |
|
hydraulic_engineering 0.95 0.98 0.96 10000 |
|
instrument_science 0.48 0.50 0.49 10000 |
|
journalism_and_media_communication 0.26 0.11 0.16 10000 |
|
landscape_architecture 0.78 0.83 0.80 10000 |
|
law 0.50 0.55 0.53 10000 |
|
library 0.53 0.51 0.52 10000 |
|
literature 0.52 0.53 0.52 10000 |
|
materials_science 0.49 0.50 0.50 10000 |
|
mathematics 0.87 0.90 0.88 10000 |
|
mechanical_engineering 0.48 0.37 0.42 10000 |
|
medical 0.41 0.42 0.41 10000 |
|
mining_engineering 0.84 0.93 0.89 10000 |
|
movie 0.59 0.71 0.64 10000 |
|
music_and_dance 0.75 0.86 0.80 10000 |
|
news 0.23 0.13 0.16 10000 |
|
nuclear_science 0.92 0.96 0.94 10000 |
|
ocean_science 0.83 0.92 0.88 10000 |
|
optical_engineering 0.70 0.78 0.74 10000 |
|
painting 0.91 0.96 0.94 10000 |
|
pet 0.91 0.95 0.93 10000 |
|
petroleum_and_natural_gas_engineering 0.92 0.96 0.94 10000 |
|
philosophy 0.63 0.66 0.64 10000 |
|
photo 0.80 0.85 0.82 10000 |
|
physics 0.40 0.35 0.37 10000 |
|
politics 0.38 0.41 0.39 10000 |
|
psychology 0.62 0.66 0.64 10000 |
|
public_administration 0.35 0.33 0.34 10000 |
|
relationship 0.84 0.88 0.86 10000 |
|
sociology 0.46 0.50 0.48 10000 |
|
sports 0.66 0.82 0.73 10000 |
|
statistics 0.60 0.70 0.65 10000 |
|
systems_science 0.53 0.53 0.53 10000 |
|
textile_science 0.81 0.86 0.83 10000 |
|
topicality 0.97 0.99 0.98 10000 |
|
transportation_engineering 0.51 0.52 0.51 10000 |
|
travel 0.68 0.72 0.70 10000 |
|
urban_planning 0.56 0.62 0.59 10000 |
|
weapons_science 0.97 0.99 0.98 10000 |
|
|
|
accuracy 0.64 670000 |
|
macro avg 0.62 0.64 0.63 670000 |
|
weighted avg 0.62 0.64 0.63 670000 |
|
|
|
``` |
|
|
|
|
|
## ⚠️Known Limitation |
|
The classifier does not handle short texts well, which is perhaps unsurprising: a bag-of-n-grams model has little signal to work with in only a few words.
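One simple mitigation is to skip very short inputs before classification and handle them with another signal. The sketch below assumes a 20-word cutoff, which is an illustrative value rather than a tuned one, and uses a stand-in predictor in place of the real model:

```python
from typing import Callable, List, Optional

MIN_WORDS = 20  # illustrative cutoff; tune on your own data


def classify_if_long_enough(
    text: str,
    predict_fn: Callable[[List[str]], List[dict]],
) -> Optional[dict]:
    """Return a prediction only when the text is long enough to be reliable."""
    if len(text.split()) < MIN_WORDS:
        return None  # too short: fall back to another signal or skip the doc
    return predict_fn([text])[0]


# Usage with a stand-in predictor (replace with predict() from the Usage section)
fake_predict = lambda batch: [{"label": "sports", "score": 0.9} for _ in batch]
print(classify_if_long_enough("Arsenal won", fake_predict))  # None
```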