agentlans committed on
Commit fed64ef · verified · 1 Parent(s): b676a2a

Update README.md

Files changed (1)
  1. README.md +46 -16
README.md CHANGED
@@ -6,34 +6,64 @@ tags:
  - generated_from_trainer
  metrics:
  - accuracy
- model-index:
- - name: multilingual-e5-small-aligned-v2-fineweb2hq-vs-c4-classifier-run2
- results: []
+ language:
+ - ar
+ - zh
+ - cs
+ - da
+ - nl
+ - fr
+ - de
+ - el
+ - hu
+ - id
+ - it
+ - ja
+ - fa
+ - pl
+ - pt
+ - ru
+ - es
+ - sv
+ - tr
+ - vi
+ datasets:
+ - agentlans/fineweb2hq-vs-c4
+ pipeline_tag: text-classification
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
+ # agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier

- # multilingual-e5-small-aligned-v2-fineweb2hq-vs-c4-classifier-run2
+ > [!IMPORTANT]
+ > **Note:** This model is provided for reference and reproducibility, not for standalone use.

- This model is a fine-tuned version of [agentlans/multilingual-e5-small-aligned-v2](https://huggingface.co/agentlans/multilingual-e5-small-aligned-v2) on an unknown dataset.
- It achieves the following results on the evaluation set:
+ This model is a fine-tuned version of [agentlans/multilingual-e5-small-aligned-v2](https://huggingface.co/agentlans/multilingual-e5-small-aligned-v2)
+ on the [agentlans/fineweb2hq-vs-c4](https://huggingface.co/datasets/agentlans/fineweb2hq-vs-c4) dataset.
+
+ The aim is to classify text as higher quality (FineWeb 2 HQ) or lower quality (C4) for AI training.
+
+ Training dataset:
+
+ On the validation set:
  - Loss: 0.1983
  - Accuracy: 0.9515
  - Combined Score: 1.3494
  - Num Input Tokens Seen: 122880000

- ## Model description
-
- More information needed

- ## Intended uses & limitations
+ ## Example

- More information needed
+ ```python
+ from transformers import pipeline

- ## Training and evaluation data
+ classifier = pipeline("text-classification", model="agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier")
+ classifier("Your text here.")
+ ```

- More information needed
+ ## Limitations
+ - **Not trained on English data**
+ - Tends to be overly permissive, labelling most texts outside training data as high quality
+ - May be biased against some text types

  ## Training procedure

@@ -62,4 +92,4 @@ The following hyperparameters were used during training:
  - Transformers 4.51.3
  - Pytorch 2.6.0+cu124
  - Datasets 3.2.0
- - Tokenizers 0.21.0
+ - Tokenizers 0.21.0
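
The Limitations list added in this commit notes that the classifier tends to label text outside its training data as high quality. One possible way to act on that, sketched here under stated assumptions and not taken from the commit, is to keep only texts whose top prediction is the high-quality class with a score above a strict cutoff. The label name `LABEL_1`, the 0.9 cutoff, and the helper names `keep_confident` and `classify` are hypothetical; inspect `classifier.model.config.id2label` to see which labels the checkpoint actually uses.

```python
from typing import Dict, List, Sequence

# Hypothetical label name: check classifier.model.config.id2label to find
# the label this checkpoint uses for the FineWeb 2 HQ class.
HIGH_QUALITY_LABEL = "LABEL_1"


def keep_confident(
    texts: Sequence[str],
    predictions: Sequence[Dict[str, object]],
    high_quality_label: str = HIGH_QUALITY_LABEL,
    min_score: float = 0.9,  # assumed cutoff, tune on held-out data
) -> List[str]:
    """Keep texts predicted as high quality with a score of at least min_score."""
    if len(texts) != len(predictions):
        raise ValueError("texts and predictions must have the same length")
    kept = []
    for text, pred in zip(texts, predictions):
        is_high = pred["label"] == high_quality_label
        if is_high and float(pred["score"]) >= min_score:
            kept.append(text)
    return kept


def classify(texts: List[str]) -> List[Dict[str, object]]:
    """Run the pipeline from the Example section on a batch of texts."""
    # Imported lazily so keep_confident can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier",
    )
    # truncation=True clips inputs that exceed the model's maximum length.
    return classifier(texts, truncation=True)
```

Calling `keep_confident(texts, classify(texts))` would return the subset of `texts` that passes the cutoff. Raising `min_score` makes the filter stricter, which may offset some of the permissiveness described above at the cost of discarding more text.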