Commit d29d6fe, committed by Tymec
1 Parent(s): 632adc4

Update README architecture

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -160,8 +160,10 @@ The following environment variables can be set to customize the behavior of the
 
 
 ### Architecture
-The input text is first preprocessed and tokenized using `spaCy` where:
-- Stop words, punctuation and any non-alphabetic words are removed
+The input text is first preprocessed and tokenized using `re` and `spaCy` where:
+- The text is cleaned up by removing any HTML tags and converting emojis to text
+- Stop words and punctuation are removed
+- URLs, email addresses and numbers are removed
 - Words are converted to lowercase
 - Lemmatization is performed (words are converted to their base form based on the surrounding context)
 
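
The preprocessing steps listed in the updated README could be sketched roughly as follows. This is a minimal illustration, not the repository's code: the regular expressions and the names `clean_text` and `tokenize` are assumptions, since the diff shows only the README text. Emoji-to-text conversion is left out because it would need a third-party package, and the token stage expects a loaded spaCy pipeline to be passed in as `nlp`.

```python
import re

# Hypothetical patterns; the project's actual expressions are not shown in this diff.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Regex stage: strip HTML tags, URLs, email addresses and numbers."""
    for pattern in (HTML_TAG_RE, URL_RE, EMAIL_RE, NUMBER_RE):
        text = pattern.sub(" ", text)
    # Collapse the gaps left behind by the removals.
    return WHITESPACE_RE.sub(" ", text).strip()


def tokenize(text: str, nlp) -> list[str]:
    """Token stage: drop stop words and punctuation, lowercase, lemmatize.

    `nlp` is assumed to be a loaded spaCy pipeline with a lemmatizer,
    e.g. the result of spacy.load("en_core_web_sm").
    """
    return [
        token.lemma_.lower()
        for token in nlp(clean_text(text))
        if not (token.is_stop or token.is_punct or token.is_space)
    ]
```

With spaCy and an English model installed, `tokenize(text, spacy.load("en_core_web_sm"))` would return the cleaned, lowercased lemmas. Running the regex stage before spaCy keeps markup and URLs from being split into stray tokens, which may be why the README now mentions `re` alongside `spaCy`.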