Update README architecture
README.md
CHANGED
@@ -160,8 +160,10 @@ The following environment variables can be set to customize the behavior of the
 
 
 ### Architecture
-The input text is first preprocessed and tokenized using `spaCy` where:
--
+The input text is first preprocessed and tokenized using `re` and `spaCy` where:
+- The text is cleaned up by removing any HTML tags and converting emojis to text
+- Stop words and punctuation are removed
+- URLs, email addresses and numbers are removed
 - Words are converted to lowercase
 - Lemmatization is performed (words are converted to their base form based on the surrounding context)
 
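
The regex-based part of the preprocessing described in the updated Architecture section could look roughly like the sketch below. It is an illustration only: the patterns and the `clean_text` name are assumptions, not taken from the project, and the steps the README attributes to `spaCy` (stop-word and punctuation removal, lemmatization) as well as the emoji conversion are deliberately left out because the diff does not show how they are implemented.

```python
import re

# Hypothetical patterns; the project's actual expressions are not shown in the diff.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)*\b")
WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, email addresses and numbers, then lowercase."""
    text = HTML_TAG.sub(" ", text)   # remove HTML tags
    text = URL.sub(" ", text)        # remove URLs before emails and numbers
    text = EMAIL.sub(" ", text)      # remove email addresses
    text = NUMBER.sub(" ", text)     # remove standalone numbers
    # Collapse the gaps left by the removals and normalize case.
    return WHITESPACE.sub(" ", text).strip().lower()


if __name__ == "__main__":
    sample = "<p>Visit https://example.com or mail me@example.com, I paid 42 dollars!</p>"
    print(clean_text(sample))
```

Punctuation survives this step on purpose, since the README lists punctuation removal among the token-level steps; in a full pipeline the cleaned string would presumably be passed to a `spaCy` model next.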