Tymec committed on
Commit
71069d7
1 Parent(s): c5ed75e

Update documentation

  1. README.md +200 -14
README.md CHANGED
@@ -16,20 +16,206 @@ models:
16
  - spacy/en_core_web_sm
17
  ---
18
 
19
- # Sentiment Analysis
20
 
21
- ### Usage
22
- 1. Clone the repository
23
- 2. `cd` into the repository
24
- 3. Run `just install` to install the dependencies
25
- 4. Run `just run --help` to see the available commands
26
 
27
  ### Datasets
28
- - [Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140)
29
- - [Amazon Reviews](https://www.kaggle.com/datasets/bittlingmayer/amazonreviews)
30
- - [IMDB](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)
31
- - [Multiclass Sentiment Analysis](https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset) (Used only testing)
32
-
33
- ### Required tools
34
- - `just`
35
- - `poetry`
16
  - spacy/en_core_web_sm
17
  ---
18
 
 
19
 
20
+ # Sentiment Analysis [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Tymec/sentiment-analysis)
21
+
22
+
23
+ ### Table of Contents
24
+ - [Description](#description)
25
+ - [Installation](#installation)
26
+ - [Prerequisites](#prerequisites)
27
+ - [Usage](#usage)
28
+ - [Predict](#predict)
29
+ - [GUI](#gui)
30
+ - [Training](#training)
31
+ - [Evaluation](#evaluation)
32
+ - [Options](#options)
33
+ - [Datasets](#datasets)
34
+ - [Vectorizers](#vectorizers)
35
+ - [Environment Variables](#environment-variables)
36
+ - [Implementation](#implementation)
37
+ - [Architecture](#architecture)
38
+ - [Pre-trained Models](#pre-trained-models)
39
+ - [License](#license)
40
+
41
+
42
+ ## Description
43
+ This is a simple sentiment analysis model written in Python that predicts whether a given text has a positive or negative sentiment. The project includes both a graphical user interface (GUI) and a command-line interface (CLI). You can train the model on one of several datasets and then evaluate the trained model on another dataset. Once trained, the model can predict the sentiment of any text through the GUI or CLI.
44
+
45
+
46
+ ## Installation
47
+ Clone the repository and, from inside its directory, run the following command to install the dependencies:
48
+ ```bash
49
+ python -m pip install -r requirements.txt
50
+ ```
51
+
52
+ Ensure that you have **at least** one dataset downloaded and placed in the data directory before running `train`.
53
+ For `evaluate`, you will need the `test` dataset. See [Datasets](#datasets) for more information.
54
+
55
+ The project comes with pre-trained models that can be used for prediction. See [Pre-trained Models](#pre-trained-models) for more information.
56
+
57
+
58
+ ### Prerequisites
59
+ - Python 3.11+
60
+
61
+
62
+ ## Usage
63
+ To see the available commands and options, run:
64
+ ```bash
65
+ python -m app --help
66
+ ```
67
+
68
+ <!-- Image of the output -->
69
+
70
+
71
+ ### Predict
72
+ To perform sentiment analysis on a given text, run the following command:
73
+ ```bash
74
+ python -m app predict --model <model> I love this movie
75
+ ```
76
+ where `<model>` is the path to the trained model.
77
+
78
+ Alternatively, you can pipe the text into the command:
79
+ ```bash
80
+ echo "I love this movie" | python -m app predict --model <model>
81
+ ```
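
A command that accepts text either as arguments or from a pipe needs a small fallback rule. The sketch below shows one way this could work; it is not taken from the project's source, and the helper name `read_text` is hypothetical.

```python
import sys
from typing import Sequence, TextIO


def read_text(words: Sequence[str], stream: TextIO = sys.stdin) -> str:
    """Return the text to classify (hypothetical helper, for illustration).

    Words given on the command line take priority; if there are none,
    the text is read from the piped input stream instead.
    """
    if words:
        # `predict I love this movie` arrives as separate words.
        return " ".join(words)
    # `echo "..." | predict` arrives on stdin with a trailing newline.
    return stream.read().strip()
```

Under this rule, both invocations shown above would hand the same string to the model.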
82
+
83
+ <!-- Image of the output -->
84
+
85
+
86
+ ### GUI
87
+ To launch the GUI, run the following command:
88
+ ```bash
89
+ python -m app gui --model <model>
90
+ ```
91
+ where `<model>` is the path to the trained model. Add the `--share` flag to create a publicly accessible link.
92
+
93
+ After running the command, open the link from the terminal in your browser to access the GUI.
94
+
95
+ <!-- Image of the output -->
96
+ <!-- Image of the GUI -->
97
+
98
+
99
+ ### Training
100
+ Before training the model, ensure that the specified dataset is downloaded and can be accessed at its respective path. To train the model, run the following command:
101
+ ```bash
102
+ python -m app train --dataset <dataset> {options}
103
+ ```
104
+ where `<dataset>` is the name of the dataset to train the model on. For available datasets, see [Datasets](#datasets).
105
+
106
+ The trained model will be exported to the models directory.
107
+
108
+ To see all available options, run:
109
+ ```bash
110
+ python -m app train --help
111
+ ```
112
+
113
+ <!-- Image of the output -->
114
+
115
+
116
+ ### Evaluation
117
+ Once the model is trained, you can evaluate it on a different dataset by running the following command:
118
+ ```bash
119
+ python -m app evaluate --model <model>
120
+ ```
121
+ where `<model>` is the path to the trained model. For available datasets, see [Datasets](#datasets).
122
+
123
+ To see all available options, run:
124
+ ```bash
125
+ python -m app evaluate --help
126
+ ```
127
+
128
+ <!-- Image of the output -->
129
+
130
+
131
+ ## Options
132
 
133
  ### Datasets
134
+ | Option | Path | Notes | Dataset |
135
+ | --- | --- | --- | --- |
136
+ | sentiment140 | `data/sentiment140.csv` | | [Twitter Sentiment Analysis](https://www.kaggle.com/kazanova/sentiment140) |
137
+ | amazonreviews | `data/amazonreviews.bz2` | only train is used | [Amazon Product Reviews](https://www.kaggle.com/bittlingmayer/amazonreviews) |
138
+ | imdb50k | `data/imdb50k.csv` | | [IMDB Movie Reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) |
139
+ | test | `data/test.csv` | required for `evaluate` | [Multiclass Sentiment Analysis](https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset) |
140
+
141
+
142
+ ### Vectorizers
143
+ | Option | Description | When to Use |
144
+ | --- | --- | --- |
145
+ | `count` | Count Vectorizer | When the frequency of words is important |
146
+ | `tfidf` | TF-IDF Vectorizer | When words should be weighted by how distinctive they are across documents |
147
+ | `hashing` | Hashing Vectorizer | When memory is a concern |
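
To illustrate why `hashing` helps when memory is a concern, here is a minimal standard-library sketch of the hashing trick next to a vocabulary-based approach. It is not the project's code: the function names are hypothetical, and `zlib.crc32` stands in for the hash function (scikit-learn's `HashingVectorizer`, which the option names suggest is used here, relies on MurmurHash3 and defaults to 2**20 features).

```python
import zlib
from collections import Counter
from typing import Dict, Iterable


def build_vocabulary(corpus: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign each distinct token an index; this table grows with the corpus."""
    vocabulary: Dict[str, int] = {}
    for tokens in corpus:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def hash_features(tokens: Iterable[str], n_features: int = 2**20) -> Dict[int, int]:
    """Count tokens in a fixed number of buckets, with no stored vocabulary."""
    counts: Counter = Counter()
    for token in tokens:
        # crc32 is stable across runs, unlike Python's salted built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % n_features
        counts[index] += 1
    return dict(counts)
```

The trade-off is that distinct tokens can collide in the same bucket, and bucket indices cannot be mapped back to words.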
148
+
149
+
150
+ ### Environment Variables
151
+ The following environment variables can be set to customize the behavior of the application:
152
+ | Name | Description | Default |
153
+ | --- | --- | --- |
154
+ | `MODEL_DIR` | the directory where the trained models are stored | `models` |
155
+ | `DATA_DIR` | the directory where the datasets are stored | `data` |
156
+ | `CACHE_DIR` | the directory where cached files are stored | `.cache` |
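
As a rough sketch of how such settings are commonly resolved, the snippet below reads each variable and falls back to the default from the table. The helper name `env_path` is hypothetical and the project's own configuration code may differ.

```python
import os
from pathlib import Path


def env_path(name: str, default: str) -> Path:
    """Read a directory path from the environment, else use the default."""
    return Path(os.environ.get(name, default))


# Defaults mirror the table above.
MODEL_DIR = env_path("MODEL_DIR", "models")
DATA_DIR = env_path("DATA_DIR", "data")
CACHE_DIR = env_path("CACHE_DIR", ".cache")
```

With this pattern, setting `DATA_DIR` in the shell before running `python -m app train --dataset imdb50k` would redirect where the dataset is looked up.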
157
+
158
+
159
+ ## Implementation
160
+
161
+
162
+ ### Architecture
163
+ The input text is first preprocessed and tokenized using `spaCy` where:
164
+ - Stop words, punctuation and any non-alphabetic words are removed
165
+ - Words are converted to lowercase
166
+ - Lemmatization is performed (words are converted to their base form based on the surrounding context)
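
The steps above can be sketched as a single filter over spaCy tokens. This is an illustration under assumptions, not the project's implementation: the helper name `clean_tokens` is hypothetical, and lowercasing is applied to the lemma as one possible ordering. It only relies on the `is_stop`, `is_punct`, `is_alpha` and `lemma_` attributes that spaCy's `Token` objects expose.

```python
from typing import Iterable, List


def clean_tokens(doc: Iterable) -> List[str]:
    """Return lowercase lemmas for the tokens worth keeping.

    `doc` is any iterable of spaCy-like tokens (hypothetical helper).
    """
    return [
        token.lemma_.lower()
        for token in doc
        # Drop stop words, punctuation and non-alphabetic tokens.
        if not token.is_stop and not token.is_punct and token.is_alpha
    ]


# With spaCy and the en_core_web_sm model installed, usage would look like:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   clean_tokens(nlp("I loved the movies, they were great!"))
```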
167
+
168
+ After tokenization, features are extracted from the tokens using the chosen vectorizer. Each vectorizer has its own trade-offs, and the choice can affect both the speed and the accuracy of the model (see [Vectorizers](#vectorizers)). The extracted features are then passed to the classifier, which predicts the class, in this case the sentiment of the text. Both the vectorizer and the classifier are trained on the specified dataset.
169
+
170
+ ```mermaid
171
+ %%{ init : { "flowchart" : { "curve" : "monotoneX" }}}%%
172
+ graph LR
173
+ START:::hidden --> |text|Preprocessing
174
+
175
+ subgraph Preprocessing
176
+ direction TB
177
+ A[Tokenizer]
178
+ B1[HashingVectorizer]
179
+ B2[CountVectorizer]
180
+ B3[TfidfVectorizer]
181
+
182
+ A --> B1
183
+ A --> |tokens|B2
184
+ A --> B3
185
+
186
+ B1 --> C1:::hidden
187
+ B2 --> C2:::hidden
188
+ B3 --> C3:::hidden
189
+ end
190
+
191
+ Preprocessing --> |features|Classification
192
+
193
+ subgraph Classification
194
+ direction LR
195
+ D1[LogisticRegression]
196
+ D2[LinearSVC]
197
+ end
198
+
199
+ Classification --> |sentiment|END:::hidden
200
+
201
+ classDef hidden display: none;
202
+ ```
203
+
204
+
205
+ ### Pre-trained Models
206
+ The following pre-trained models are available for use:
207
+ | Dataset | Vectorizer | Classifier | Features | Accuracy | Model |
208
+ | --- | --- | --- | --- | --- | --- |
209
+ | `sentiment140` | `tfidf` | `LogisticRegression` | 20 000 | ? | [Here](models/sentiment140_tfidf_ft-20000.pkl) |
210
+ | `imdb50k` | `tfidf` | `LogisticRegression` | 20 000 | ? | [Here](models/imdb50k_tfidf_ft-20000.pkl) |
211
+ | `imdb50k` | `tfidf` | `LogisticRegression` | 800 | ? | [Here](models/imdb50k_tfidf_ft-800.pkl) |
212
+ | `imdb50k` | `hashing` | `LogisticRegression` | 1 048 576 | 55.65% ± 1.07% | [Here](models/imdb50k_hashing_ft1048576.pkl) |
213
+
214
+ The accuracy of the models is based on the cross-validation score using the `test` dataset and `5` folds.
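
For readers unfamiliar with the `mean ± spread` notation in the table, the sketch below shows how a 5-fold score could be summarized. It is a standard-library illustration only: the function names are hypothetical, and treating the `±` value as the standard deviation across folds is an assumption, since the README does not say how it was computed.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous (train, test) index folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def summarize(scores: Sequence[float]) -> str:
    """Format per-fold accuracies as 'mean% ± std%' (assumed convention)."""
    return f"{mean(scores):.2%} ± {pstdev(scores):.2%}"
```

Each fold's accuracy would be computed on its held-out indices, and `summarize` then condenses the five values into one table entry.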
215
+
216
+ #### Note
217
+ Due to the size of the `amazonreviews` dataset, it was not possible to train a model with a vectorizer other than `hashing`.
218
+
219
+
220
+ ## License
221
+ Distributed under the MIT License. See [LICENSE](LICENSE) for more information.