Deepak Sahu committed on
Commit
0720e54
·
1 Parent(s): 59b441e

section update

.resources/eval2.png ADDED

Git LFS Details

  • SHA256: 601317f094ad4d76025f555146c403f57288e97a95b301cc4d3fb39105f7d4f4
  • Pointer size: 130 Bytes
  • Size of remote file: 48.5 kB
.resources/eval3.png ADDED

Git LFS Details

  • SHA256: daef5e7fdbc4bf37351e2a793b4dab6b631cb94a2fcdb2dcf6acc2147e288f62
  • Pointer size: 130 Bytes
  • Size of remote file: 28.5 kB
.resources/eval4.png ADDED

Git LFS Details

  • SHA256: 355834bc5121384e36fc8402b33dfc95f601662cc785eb36f189da91e6442eeb
  • Pointer size: 130 Bytes
  • Size of remote file: 32.3 kB
.resources/eval5.png ADDED

Git LFS Details

  • SHA256: 3ff52cc99e92a849f902a23895289e3dfcbd96963afe3ee246699c8a0918a1fc
  • Pointer size: 130 Bytes
  • Size of remote file: 23.1 kB
.resources/eval6.png ADDED

Git LFS Details

  • SHA256: 24d9c610b32317599761e3f7e9ef2605430da91155530f6d493b3c3b941aa863
  • Pointer size: 130 Bytes
  • Size of remote file: 37.2 kB
README.md CHANGED
@@ -21,9 +21,11 @@ Try it out: https://huggingface.co/spaces/LunaticMaestro/book-recommender
21
 
22
  - All images are my own work; please see their source PowerPoint in the `.resources` folder of this repo.
23
 
24
- - Code is documentation is as per [Google's Python Style Guide](https://google.github.io/styleguide/pyguide.html)
25
 
26
- - ALL files Paths are at set as CONST in beginning of each script, to make it easier while using the paths while inferencing & evaluation; hence not passing as CLI arguments
 
 
27
 
28
  - The `z_` prefix in filenames is just to avoid (human) confusion between prebuilt modules and custom ones during import.
29
 
@@ -220,31 +222,41 @@ The generation is handled by functions in script `z_hypothetical_summary.py`. Un
220
 
221
  ![image](.resources/eval1.png)
222
 
223
- Code Preview. I did the minimal post processing to chop of the `prompt` from the generated summaries before returning the result.
224
 
225
- ![image](https://github.com/user-attachments/assets/132e84a7-cb4f-49d2-8457-ff473224bad6)
226
 
227
  ### Similarity Matching
228
 
229
- ![image](https://github.com/user-attachments/assets/229ce58b-77cb-40b7-b033-c353ee41b0a6)
230
 
231
- ![image](https://github.com/user-attachments/assets/58613cd7-0b73-4042-b98d-e6cdf2184c32)
232
 
233
- Because there are 1230 unique titles so we get the averaged similarity vector of same size.
234
 
235
- ![image](https://github.com/user-attachments/assets/cc7b2164-a437-4517-8edb-cc0573c8a5e6)
236
 
237
  ### Evaluation Metric
238
 
239
  So for a given input title we can get the rank (by descending cosine similarity) of the stored title. To evaluate the entire approach we are going to use a modified version of **Mean Reciprocal Rank (MRR)**.
240
 
241
- ![image](https://github.com/user-attachments/assets/0cb8fc2a-8834-4cda-95d2-52a02ac9c11d)
 
 
 
 
 
 
242
 
243
- We are going to do this for random 30 samples and compute the mean of their reciprocal ranks. Ideally all the title should be ranked 1 and their MRR should be equal to 1. Closer to 1 is good.
 
 
 
 
244
 
245
  ![image](https://github.com/user-attachments/assets/d2c77d47-9244-474a-a850-d31fb914c9ca)
246
 
247
- The values of TOP_P and TOP_K (i.e. token sampling for our generator model) are sent as `CONST` in the `z_evaluate.py`; The current set of values of this are borrowed from the work: https://www.kaggle.com/code/tuckerarrants/text-generation-with-huggingface-gpt2#Top-K-and-Top-P-Sampling
248
 
249
  MRR = 0.311 implies that there's a good chance that the target book will appear around rank (1/0.311) ~ 3 (third rank), **i.e. within the top 5 recommendations**.
250
 
 
21
 
22
  - All images are my own work; please see their source PowerPoint in the `.resources` folder of this repo.
23
 
24
+ - Code documentation follows [Google's Python Style Guide](https://google.github.io/styleguide/pyguide.html).
25
 
26
+ - All file paths are set as CONST at the beginning of each script, to make the paths easier to use during inference & evaluation; hence they are not passed as CLI arguments.
27
+
28
+ - The seed value for reproducibility is set as a CONST as well.
29
 
30
  - The `z_` prefix in filenames is just to avoid (human) confusion between prebuilt modules and custom ones during import.
31
 
 
222
 
223
  ![image](.resources/eval1.png)
224
 
225
+ **Function Preview** I did minimal post-processing to chop off the `prompt` from the generated summaries before returning the result.
226
 
227
+ ![image](.resources/eval2.png)
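For illustration, here is a minimal sketch of that post-processing step; it is not the exact code from `z_hypothetical_summary.py`, and the function name and prompt handling are assumptions.

```python
def chop_prompt(generated_texts: list[str], prompt: str) -> list[str]:
    """Strip the leading prompt from each generated summary (sketch only)."""
    cleaned = []
    for text in generated_texts:
        # The generator echoes the prompt at the start of its output,
        # so drop that prefix and keep only the continuation.
        if text.startswith(prompt):
            text = text[len(prompt):]
        cleaned.append(text.strip())
    return cleaned

# Example: chop_prompt(["Summarize the book X: A tale of ..."], "Summarize the book X: ")
```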
228
 
229
  ### Similarity Matching
230
 
231
+ ![image](.resources/eval3.png)
232
 
233
+ ![image](.resources/eval4.png)
234
 
235
+ **Function Preview** Because there are 1230 unique titles, we get an averaged similarity vector of the same size.
236
 
237
+ ![image](.resources/eval5.png)
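A minimal NumPy sketch of how such an averaged similarity vector can be computed, assuming precomputed embeddings (`hypo_emb` for the generated summaries, `corpus_emb` for the 1230 stored summaries); this is not the exact code in `z_similarity.py`.

```python
import numpy as np

def averaged_similarity(hypo_emb: np.ndarray, corpus_emb: np.ndarray) -> np.ndarray:
    """hypo_emb: (k, d) generated-summary embeddings; corpus_emb: (1230, d) stored-summary embeddings."""
    # L2-normalise so the dot product equals cosine similarity
    hypo = hypo_emb / np.linalg.norm(hypo_emb, axis=1, keepdims=True)
    corpus = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = hypo @ corpus.T        # (k, 1230) cosine similarities
    return sims.mean(axis=0)      # (1230,) similarity averaged over the k summaries

# Ranking: titles with the highest averaged similarity come first.
# ranked_idx = np.argsort(-averaged_similarity(hypo_emb, corpus_emb))
```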
238
 
239
  ### Evaluation Metric
240
 
241
  So for a given input title we can get the rank (by descending cosine similarity) of the stored title. To evaluate the entire approach we are going to use a modified version of **Mean Reciprocal Rank (MRR)**.
242
 
243
+ ![image](.resources/eval6.png)
244
+
245
+
246
+
247
+ Test Plan:
248
+ - Take 30 random samples and compute the mean of their reciprocal ranks.
249
+ - If we want our known book titles to appear in the top 5 results, then we need MRR >= 1/5 = 0.2.
250
 
251
+ **RUN**
252
+
253
+ ```SH
254
+ python z_evaluate.py
255
+ ```
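Under the hood, the MRR computation is roughly the following; this is a self-contained sketch (assumed inputs, not the repo's exact code), where `sim` is the averaged similarity vector for one sampled title and `true_idx` is that title's index in the corpus.

```python
import numpy as np

def reciprocal_rank(sim: np.ndarray, true_idx: int) -> float:
    order = np.argsort(-sim)                            # best match first
    rank = int(np.where(order == true_idx)[0][0]) + 1   # 1-based rank of the true title
    return 1.0 / rank

def mean_reciprocal_rank(sims: list[np.ndarray], true_idxs: list[int]) -> float:
    return float(np.mean([reciprocal_rank(s, i) for s, i in zip(sims, true_idxs)]))

# Target: MRR >= 1/5 = 0.2 means the true title lands in the top 5 on average.
```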
256
 
257
  ![image](https://github.com/user-attachments/assets/d2c77d47-9244-474a-a850-d31fb914c9ca)
258
 
259
+ The values of TOP_P and TOP_K (i.e. token sampling for our generator model) are set as `CONST` in `z_evaluate.py`; the current values are borrowed from this work: https://www.kaggle.com/code/tuckerarrants/text-generation-with-huggingface-gpt2#Top-K-and-Top-P-Sampling
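For context, this is how top-k / top-p sampling is typically passed to a Hugging Face text-generation pipeline; the model name and prompt below are illustrative, not necessarily the repo's actual configuration.

```python
from transformers import pipeline

TOP_K = 50    # keep only the 50 most likely next tokens
TOP_P = 0.85  # nucleus sampling: keep tokens covering 85% of probability mass

generator = pipeline("text-generation", model="gpt2")
outputs = generator(
    "Book summary for 'The Hobbit':",
    max_new_tokens=80,
    do_sample=True,            # sample instead of greedy decoding
    top_k=TOP_K,
    top_p=TOP_P,
    num_return_sequences=3,    # multiple hypothetical summaries per title
)
```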
260
 
261
  MRR = 0.311 implies that there's a good chance that the target book will appear around rank (1/0.311) ~ 3 (third rank), **i.e. within the top 5 recommendations**.
262
 
app.py CHANGED
@@ -16,7 +16,10 @@ GRADIO_TITLE = "Content Based Book Recommender"
16
  GRADIO_DESCRIPTION = '''
17
  This is a [HyDE](https://arxiv.org/abs/2212.10496) based search mechanism that generates random summaries from your input book title and matches books whose summaries are similar to the generated ones. The books to search over come from the [Kaggle Dataset: arpansri/books-summary](https://www.kaggle.com/datasets/arpansri/books-summary)
18
 
19
- **Should take ~ 15s to 30s** for inferencing. If taking time then then its cold starting in HF space which lasts 300s and **decreases to 15s when you have made sufficiently many ~10 to 15 call**
 
 
 
20
  '''
21
 
22
  # Caching mechanism for gradio
 
16
  GRADIO_DESCRIPTION = '''
17
  This is a [HyDE](https://arxiv.org/abs/2212.10496) based search mechanism that generates random summaries from your input book title and matches books whose summaries are similar to the generated ones. The books to search over come from the [Kaggle Dataset: arpansri/books-summary](https://www.kaggle.com/datasets/arpansri/books-summary)
18
 
19
+ **Should take ~15s to 30s** for inference.
20
+
21
+ ## Is it slow 🐢? (Happens in free HF space)
22
+ Cold starting in the HF space can cause the model files to be reloaded. The entire process can last ~300s and **drops to ~15s once you have made sufficiently many (~10 to 15) calls**
23
  '''
24
 
25
  # Caching mechanism for gradio
z_evaluate.py CHANGED
@@ -1,3 +1,5 @@
 
 
1
  import random
2
  from z_utils import get_dataframe
3
  from z_similarity import computes_similarity_w_hypothetical
@@ -8,7 +10,7 @@ import numpy as np
8
  # CONST
9
  random.seed(53)
10
  CLEAN_DF_UNIQUE_TITLES = "unique_titles_books_summary.csv"
11
- N_SAMPLES_EVAL = 2
12
  TOP_K = 50
13
  TOP_P = 0.85
14
 
 
1
+ # This is a one-time script, hence sequential code rather than functions
2
+
3
  import random
4
  from z_utils import get_dataframe
5
  from z_similarity import computes_similarity_w_hypothetical
 
10
  # CONST
11
  random.seed(53)
12
  CLEAN_DF_UNIQUE_TITLES = "unique_titles_books_summary.csv"
13
+ N_SAMPLES_EVAL = 30
14
  TOP_K = 50
15
  TOP_P = 0.85
16