Update README.md

README.md
This project provides a high-performance script for classifying the quality of Polish texts using a pre-trained XGBoost model. The classifier assigns one of three quality categories (`LOW`, `MEDIUM`, `HIGH`) to each text and provides a confidence score (probability).

The classification is based on over 200 linguistic features extracted from each text, such as the counts of nouns and verbs, NER entities, sentence-length statistics, and the number of out-of-vocabulary words. These features are calculated by companion modules located in the `features` folder.

The script supports processing files in both **Parquet** and **JSONL** formats.
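
For orientation, the snippet below is a minimal sketch of how such saved artifacts are typically applied, assuming the model was stored as a scikit-learn-compatible `XGBClassifier`. The class order in `LABELS` and the placeholder feature vector are assumptions; in the real pipeline the features come from the extraction modules described below.

```python
# Minimal usage sketch (not the project's actual code): apply the saved scaler
# and model to a single feature vector. The class order in LABELS and the
# 200-dimensional placeholder vector are assumptions.
import pickle

import joblib
import numpy as np

model = joblib.load("models/model.joblib")         # trained XGBoost classifier
with open("models/scaler.pkl", "rb") as f:         # scikit-learn scaler
    scaler = pickle.load(f)

LABELS = ["LOW", "MEDIUM", "HIGH"]                 # assumed class order -- verify against the model

features = np.random.rand(1, 200)                  # placeholder for the real linguistic features
probabilities = model.predict_proba(scaler.transform(features))[0]

category = LABELS[int(np.argmax(probabilities))]
confidence = round(float(probabilities.max()) * 100, 2)  # percentage, as in the example output
print(category, confidence)
```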

## 2. Features

- **Efficient Batch Processing**: Processes all texts from a file at once, minimizing I/O and leveraging vectorized computations for high performance.
- **Dual Format Support**: Ingests data from either `.parquet` or `.jsonl` files.
- **Robust Feature Extraction**: Relies on a sophisticated feature engineering module (`predictor.py`) to generate over 200 linguistic metrics for accurate classification.
- **Scalable**: Capable of handling millions of documents by processing files sequentially and texts in parallel.
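
As a rough sketch of the two ingestion paths (the function names here are hypothetical; the `text` field matches the column used in the workflow below):

```python
# Illustrative ingestion paths (function names are hypothetical); both yield
# the list of texts that the workflow below sends to prediction.
import json

import pandas as pd

def load_texts_parquet(path: str) -> list[str]:
    # Parquet: the whole file is read into a DataFrame in one go.
    return pd.read_parquet(path)["text"].tolist()

def load_texts_jsonl(path: str) -> list[str]:
    # JSONL: stream line by line; each non-empty line is one JSON object.
    texts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                texts.append(json.loads(line)["text"])
    return texts
```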

The script follows a simple yet powerful workflow:

- For **JSONL files**, it streams the file line by line, parsing each line as a separate JSON object.
4. **Parallel Processing**: The core task of text analysis is distributed across a pool of worker processes.
   - The `text` column is extracted into a list, forming a complete batch.
   - This batch of texts is passed to the `predict_batch` function (see the sketch after this list).
   - Inside the function, the `TextAnalyzer` calculates features for all texts; this step may itself use mini-batches for memory efficiency.
5. **Output Generation**:
   - The results (category and confidence) are collected from all worker processes.
   - The script appends two new fields to each original record: `quality_ai` and `confidence`.
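
The sketch below ties steps 4 and 5 together. `predict_batch` is a stand-in with the interface implied above (one category/confidence pair per text); the real function runs the `TextAnalyzer` feature extraction plus the model prediction. The field names match the example output in section 5.

```python
# Sketch of steps 4-5 under assumed interfaces; `predict_batch` is a stand-in
# for the real function (TextAnalyzer features + scaled XGBoost prediction).
from multiprocessing import Pool

import pandas as pd

def predict_batch(texts):
    """Return one (category, confidence) pair per input text."""
    return [("HIGH", 99.9) for _ in texts]        # placeholder result

def process_file(path: str) -> pd.DataFrame:
    df = pd.read_parquet(path)
    texts = df["text"].tolist()                   # step 4: the complete batch

    # Split into mini-batches and distribute across worker processes.
    chunks = [texts[i:i + 1000] for i in range(0, len(texts), 1000)]
    with Pool() as pool:
        pairs = [p for chunk in pool.map(predict_batch, chunks) for p in chunk]

    # Step 5: append the two new fields to every original record.
    df["quality_ai"] = [category for category, _ in pairs]
    df["confidence"] = [confidence for _, confidence in pairs]
    return df
```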

Ensure your project follows this structure:

```
.
├── input_parquet/
│   └── test.parquet
├── input_jsonl/
│   └── test.jsonl
├── models/
│   ├── model.joblib      # The trained XGBoost model
│   └── scaler.pkl        # The scikit-learn scaler
├── dummy.py              # The interactive testing script
├── main_jsonl.py         # The main processing script (JSONL)
├── main_parquet.py       # The main processing script (Parquet)
└── text_analyzer/        # The feature extraction module
    ├── __init__.py
    ├── analyzer.py
    ├── utils.py
    ├── constants.py
    └── features/
        ├── base_features.py
        ├── linguistic_features.py
        ├── regex_features.py
        ├── spacy_features.py
        └── structural_features.py
```

## 5. Usage

The script is configured to run out-of-the-box. Simply place your data files in the input directory and execute the main script.

### Step 1: Place Your Data Files

The script automatically skips files that have already been processed and exist in the output directory.
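
The check itself can be as simple as comparing file names against the output directory; a sketch under the assumption (consistent with the example run below) that an output file keeps its input file's name:

```python
# Sketch of the skip check, assuming an input file's output keeps the same
# file name inside the output/ directory, as in the example run below.
import os

def should_skip(input_path: str, output_dir: str = "output") -> bool:
    out_path = os.path.join(output_dir, os.path.basename(input_path))
    return os.path.exists(out_path)

# e.g. should_skip("input_parquet/test.parquet") is True once
# output/test.parquet has been written.
```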

```
python -W ignore main_parquet.py
Analiza cech: 100%|███████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 12.47it/s]
                                                text quality_ai  confidence
0  Pierwszy kamienny kościół w stylu romańskim po...       HIGH       99.97
1  FJM.B.ZP \n cykl kształcenia 2019-2024\nKARTA ...        LOW       99.97
2  Sztuka romańska (styl romański, romanizm, roma...       HIGH       99.92
3  Przypisy\n Jerzy Z. Łoziński: Pomniki sztuki w...        LOW       92.54
4  Na temat historii sakramentarza w wiekach XII–...       HIGH       96.03
5  Przednia okładka\nPrzednia okładka\n \nMiniatu...        LOW       92.64
6  Uchwała Nr 19\nZarządu Związku Rzemiosła Polsk...     MEDIUM       62.49
7  Alternatywy 4 to jeden z najważniejszych i naj...       HIGH       99.98
8  Akslop, może to jakieś duńskie miasto\njestem ...       HIGH       73.60
9  Bielik - orzeł, czy nie orzeł?\nBielik, birkut...       HIGH       99.92
Pomyślnie zapisano przetworzone dane do pliku output\test.parquet
Processing time: 0.8603 seconds

Wszystkie pliki zostały przetworzone!
```