agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier

Text Classification

Generated from Trainer


agentlans committed on Jul 19

Commit fed64ef · verified · 1 Parent(s): b676a2a

Update README.md

Files changed (1)

README.md +46 -16


@@ -6,34 +6,64 @@ tags:
 - generated_from_trainer
 metrics:
 - accuracy
-model-index:
-- name: multilingual-e5-small-aligned-v2-fineweb2hq-vs-c4-classifier-run2
-  results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# multilingual-e5-small-aligned-v2-fineweb2hq-vs-c4-classifier-run2
-This model is a fine-tuned version of [agentlans/multilingual-e5-small-aligned-v2](https://huggingface.co/agentlans/multilingual-e5-small-aligned-v2) on an unknown dataset.
-It achieves the following results on the evaluation set:
 - Loss: 0.1983
 - Accuracy: 0.9515
 - Combined Score: 1.3494
 - Num Input Tokens Seen: 122880000
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
 ## Training procedure
@@ -62,4 +92,4 @@ The following hyperparameters were used during training:
 - Transformers 4.51.3
 - Pytorch 2.6.0+cu124
 - Datasets 3.2.0
-- Tokenizers 0.21.0

 - generated_from_trainer
 metrics:
 - accuracy
+language:
+- ar
+- zh
+- cs
+- da
+- nl
+- fr
+- de
+- el
+- hu
+- id
+- it
+- ja
+- fa
+- pl
+- pt
+- ru
+- es
+- sv
+- tr
+- vi
+datasets:
+- agentlans/fineweb2hq-vs-c4
+pipeline_tag: text-classification
 ---
+# agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier
+> [!IMPORTANT]
+> **Note:** This model is provided for reference and reproducibility, not for standalone use.
+This model is a fine-tuned version of [agentlans/multilingual-e5-small-aligned-v2](https://huggingface.co/agentlans/multilingual-e5-small-aligned-v2)
+on the [agentlans/fineweb2hq-vs-c4](https://huggingface.co/datasets/agentlans/fineweb2hq-vs-c4) dataset.
+The aim is to classify text as higher quality (FineWeb 2 HQ) or lower quality (C4) for AI training.
+Training dataset:
+On the validation set:
 - Loss: 0.1983
 - Accuracy: 0.9515
 - Combined Score: 1.3494
 - Num Input Tokens Seen: 122880000
+## Example
+```python
+from transformers import pipeline
+classifier = pipeline("text-classification", model="agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier")
+classifier("Your text here.")
+```
+## Limitations
+- **Not trained on English data**
+- Tends to be overly permissive, labelling most texts outside training data as high quality
+- May be biased against some text types
 ## Training procedure
 - Transformers 4.51.3
 - Pytorch 2.6.0+cu124
 - Datasets 3.2.0
+- Tokenizers 0.21.0
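
The updated README notes that the classifier tends to be overly permissive. One way to compensate is to keep a text only when the predicted label matches the desired class and its score clears a stricter threshold. The sketch below is illustrative and is not part of the commit. It works on hand-written results in the `label`/`score` shape that a `transformers` text-classification pipeline returns. The helper name `filter_by_label`, the label string `LABEL_1`, and the 0.9 threshold are all assumptions; check the model's `config.id2label` for the real label names.

```python
from typing import Dict, List, Union

# One pipeline result: {"label": <str>, "score": <float>}
Result = Dict[str, Union[str, float]]


def filter_by_label(
    texts: List[str],
    results: List[Result],
    keep_label: str,
    min_score: float = 0.9,
) -> List[str]:
    """Keep texts predicted as keep_label with a score of at least min_score."""
    if len(texts) != len(results):
        raise ValueError("texts and results must have the same length")
    kept = []
    for text, result in zip(texts, results):
        if result["label"] == keep_label and result["score"] >= min_score:
            kept.append(text)
    return kept


# Hand-written stand-ins for classifier(texts) output; the label names
# here are placeholders, not the model's confirmed labels.
texts = ["first document", "second document", "third document"]
results = [
    {"label": "LABEL_1", "score": 0.99},
    {"label": "LABEL_1", "score": 0.62},
    {"label": "LABEL_0", "score": 0.97},
]

kept = filter_by_label(texts, results, keep_label="LABEL_1", min_score=0.9)
print(kept)
```

In real use, `results` would come from calling the pipeline on the same list of texts. The threshold would need tuning on held-out data instead of being taken from this sketch.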