Deepak Sahu committed · 0720e54
Parent(s): 59b441e

section update
Files changed:
- .resources/eval2.png +3 -0
- .resources/eval3.png +3 -0
- .resources/eval4.png +3 -0
- .resources/eval5.png +3 -0
- .resources/eval6.png +3 -0
- README.md +23 -11
- app.py +4 -1
- z_evaluate.py +3 -1
.resources/eval2.png ADDED (Git LFS image)
.resources/eval3.png ADDED (Git LFS image)
.resources/eval4.png ADDED (Git LFS image)
.resources/eval5.png ADDED (Git LFS image)
.resources/eval6.png ADDED (Git LFS image)
README.md
CHANGED
@@ -21,9 +21,11 @@ Try it out: https://huggingface.co/spaces/LunaticMaestro/book-recommender

- All images are my actual work; please find their source PowerPoint in the `.resources` folder of this repo.

- Code documentation follows [Google's Python Style Guide](https://google.github.io/styleguide/pyguide.html).

- All file paths are set as CONST at the beginning of each script, to make the paths easier to use during inference & evaluation; hence they are not passed as CLI arguments (see the sketch after this list).

- The seed value for code reproducibility is set as a CONST as well.

- The prefix `z_` in filenames is just to avoid confusion (for humans) about which modules are prebuilt and which are custom during import.
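To make the CONST convention above concrete, here is a minimal sketch; `CLEAN_DF_UNIQUE_TITLES`, the seed value 53, `TOP_K`, and `TOP_P` mirror `z_evaluate.py` in this commit, while `HYPO_SUMMARIES_CACHE` is a hypothetical extra path shown only for illustration.

```python
# Sketch of the "paths and seed as module-level CONSTs" convention (no CLI args).
import random

# CONST
CLEAN_DF_UNIQUE_TITLES = "unique_titles_books_summary.csv"  # data path used by the scripts
HYPO_SUMMARIES_CACHE = "hypothetical_summaries.json"        # hypothetical extra path constant
TOP_K = 50       # token-sampling settings for the generator model
TOP_P = 0.85
random.seed(53)  # reproducibility seed, also kept as a CONST
```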
@@ -220,31 +222,41 @@ The generation is handled by functions in script `z_hypothetical_summary.py`. Un



**Function Preview** I did minimal post-processing to chop off the `prompt` from the generated summaries before returning the result.


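A minimal sketch of that post-processing step, assuming the generator returns the prompt followed by its continuation; `chop_prompt` is an illustrative name, not necessarily the helper in `z_hypothetical_summary.py`.

```python
def chop_prompt(generated_texts: list[str], prompt: str) -> list[str]:
    """Strip the leading prompt from each generated summary before returning it."""
    cleaned = []
    for text in generated_texts:
        if text.startswith(prompt):
            text = text[len(prompt):]   # keep only the generated continuation
        cleaned.append(text.strip())
    return cleaned

# Example (illustrative prompt and output)
print(chop_prompt(["Book: Dune. Summary: A desert planet holds ..."], "Book: Dune. Summary:"))
```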
### Similarity Matching




**Function Preview** Because there are 1230 unique titles, we get an averaged similarity vector of the same size.


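The following sketch shows one way the averaged similarity vector described above can be computed; it only illustrates the idea behind `computes_similarity_w_hypothetical` (whose real signature is not part of this diff), using hypothetical embedding arrays.

```python
import numpy as np

def averaged_similarity(hypo_embs: np.ndarray, summary_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity of each hypothetical-summary embedding against all
    stored book-summary embeddings, averaged over the hypothetical summaries.

    hypo_embs:    (n_generated, dim) embeddings of the generated summaries
    summary_embs: (1230, dim)        embeddings of the stored summaries
    returns:      (1230,)            one averaged score per unique title
    """
    hypo = hypo_embs / np.linalg.norm(hypo_embs, axis=1, keepdims=True)
    store = summary_embs / np.linalg.norm(summary_embs, axis=1, keepdims=True)
    sims = hypo @ store.T          # (n_generated, 1230) cosine similarities
    return sims.mean(axis=0)       # averaged similarity vector of size 1230
```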
### Evaluation Metric

So for a given input title we can get the rank (by descending cosine similarity) of the stored title. To evaluate the entire approach we are going to use a modified version of **Mean Reciprocal Rank (MRR)**.



Test Plan:

- Take 30 random samples and compute the mean of their reciprocal ranks (see the sketch after the RUN block).
- If we want our known book titles to be in the top 5 results, then MRR >= 1/5 = 0.2.

**RUN**

```SH
python z_evaluate.py
```
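To make the metric concrete, here is a minimal, self-contained sketch of a mean-reciprocal-rank computation over a similarity matrix; the toy numbers and array shapes are illustrative, not taken from the repo.

```python
import numpy as np

def mean_reciprocal_rank(similarity: np.ndarray, true_idx: np.ndarray) -> float:
    """similarity: (n_queries, n_titles) averaged similarity scores per query,
    true_idx:    (n_queries,)           index of the correct title per query.
    Rank = 1-based position of the correct title when titles are sorted by
    descending similarity; MRR is the mean of 1/rank over the queries."""
    order = np.argsort(-similarity, axis=1)                  # descending sort
    ranks = np.argmax(order == true_idx[:, None], axis=1) + 1
    return float(np.mean(1.0 / ranks))

# Toy usage: 3 sampled queries over 5 titles (the repo samples 30 over 1230).
sim = np.array([[0.2, 0.9, 0.1, 0.4, 0.3],
                [0.8, 0.1, 0.6, 0.2, 0.5],
                [0.3, 0.2, 0.4, 0.9, 0.1]])
print(mean_reciprocal_rank(sim, np.array([1, 0, 3])))        # -> 1.0
```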


The values of TOP_P and TOP_K (i.e. token sampling for our generator model) are set as `CONST` in `z_evaluate.py`; the current values are borrowed from this work: https://www.kaggle.com/code/tuckerarrants/text-generation-with-huggingface-gpt2#Top-K-and-Top-P-Sampling

MRR = 0.311 implies that there is a good chance that the target book will be at rank (1/0.311) ~ 3 (third rank), **i.e. within the top 5 recommendations**.
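For context, here is a hedged sketch of how top-k/top-p values like these are passed to a Hugging Face `generate` call; the `gpt2` checkpoint and the prompt are placeholders, since the repo's actual generator setup is not part of this diff.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

TOP_K = 50    # sample only from the 50 most likely next tokens
TOP_P = 0.85  # ...further restricted to the smallest set with 85% total probability

# "gpt2" is a placeholder checkpoint, not necessarily the repo's generator model.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Book title: The Silent Sea. Summary:", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,        # sampling must be enabled for top_k / top_p to apply
    top_k=TOP_K,
    top_p=TOP_P,
    max_new_tokens=60,
)
print(tok.decode(out[0], skip_special_tokens=True))
```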
app.py
CHANGED
@@ -16,7 +16,10 @@ GRADIO_TITLE = "Content Based Book Recommender"
GRADIO_DESCRIPTION = '''
This is a [HyDE](https://arxiv.org/abs/2212.10496) based searching mechanism that generates random summaries from your input book title and matches books whose summaries are similar to the generated ones. The books used for search come from the [Kaggle Dataset: arpansri/books-summary](https://www.kaggle.com/datasets/arpansri/books-summary).

**Should take ~15s to 30s** for inferencing.

## Is it slow 🐢? (Happens in the free HF space)
Cold starting in the HF space can lead to the model files being reloaded. The whole process can last ~300s and **decreases to ~15s once you have made roughly 10 to 15 calls**.
'''

# Caching mechanism for gradio
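The diff only touches the description string; as orientation, a minimal sketch of how such a title/description pair is typically wired into a `gradio.Interface` follows, where `recommend` is a hypothetical stand-in for the app's HyDE pipeline.

```python
import gradio as gr

GRADIO_TITLE = "Content Based Book Recommender"
GRADIO_DESCRIPTION = "..."  # the markdown string shown in the diff above

def recommend(book_title: str) -> str:
    """Hypothetical stand-in: generate hypothetical summaries for the title,
    match them against the stored book summaries, and return the top books."""
    return f"Top matches for: {book_title}"

demo = gr.Interface(
    fn=recommend,
    inputs=gr.Textbox(label="Book title"),
    outputs=gr.Textbox(label="Recommendations"),
    title=GRADIO_TITLE,
    description=GRADIO_DESCRIPTION,  # rendered as markdown above the inputs
)

if __name__ == "__main__":
    demo.launch()
```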
z_evaluate.py
CHANGED
@@ -1,3 +1,5 @@
# This is a one-time script, hence no functions, just sequential coding

import random
from z_utils import get_dataframe
from z_similarity import computes_similarity_w_hypothetical

@@ -8,7 +10,7 @@ import numpy as np

# CONST
random.seed(53)
CLEAN_DF_UNIQUE_TITLES = "unique_titles_books_summary.csv"
N_SAMPLES_EVAL = 30
TOP_K = 50
TOP_P = 0.85