Deepak Sahu committed
Commit 3abee27 · 1 Parent(s): 77192ae
Files changed (2):
  1. README.md +166 -1
  2. app.py +36 -0
README.md CHANGED

---
title: Book Recommender
emoji: ⚡
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.6.0
app_file: app.py
pinned: false
short_description: A content based book recommender.
---

# Content-Based-Book-Recommender
A HyDE-based approach for building a recommendation engine.

## Libraries installed separately

I used Google Colab with the following extra libraries installed.

I am NOT storing a `requirements.txt` because of an issue while pushing to the HF Space.

```SH
pip install -U sentence-transformers datasets
```

## Training Steps

**All file paths are set as CONST values at the beginning of each script, to make the paths easier to reuse at inference time; hence they are not passed as CLI arguments.**
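
For example, a script might declare its paths like this (illustrative only; `UNIQUE_BOOKS_CSV` is an assumed name, while the other two constants appear in `app.py`):

```python
# Module-level CONST paths, shared style across the z_*.py scripts.
BOOKS_CSV = "clean_books_summary.csv"                   # cleaned dataset (all rows)
UNIQUE_BOOKS_CSV = "unique_titles_books_summary.csv"    # one row per unique title (assumed name)
SUMMARY_VECTORS = "app_cache/summary_vectors.npy"       # cached summary embeddings
```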

### Step 1: Data Clean

I do basic cleaning steps: removing the unwanted index column (the first column), dropping rows with missing values, and removing duplicate rows. Output screenshot attached.

I am NOT doing any text pre-processing steps like stopword removal, stemming/lemmatization, or special-character removal, because my approach is to use causal language modelling (in later steps), so it makes no sense to rip apart the word meaning with these word-level techniques.

A little tinkering around with the dataset shows that some titles can belong to multiple categories. (*I ran this code separately; it is not part of any script.*)

![image](https://github.com/user-attachments/assets/cdf9141e-21f9-481a-8b09-913a0006db87)

A descriptive analysis shows that there are just 1230 unique titles. (*I ran this code separately; it is not part of any script.*)

![image](https://github.com/user-attachments/assets/072b4ed7-7a4d-48b2-a93c-7b08fc5bee45)

We are not going to remove the rows that share the same title (and summary) across different categories; instead, we create a separate file of unique titles.

```SH
python z_clean_data.py
```

![image](https://github.com/user-attachments/assets/a466c20b-60ed-47ac-8bfc-e0a38ccdb88d)

Output: `clean_books_summary.csv`, `unique_titles_books_summary.csv`
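
For reference, a minimal sketch of the kind of cleaning `z_clean_data.py` performs (the raw file name and column names such as `title` are assumptions; the script itself is the source of truth):

```python
import pandas as pd

RAW_CSV = "books_summaries.csv"                     # assumed raw input file
CLEAN_CSV = "clean_books_summary.csv"
UNIQUE_CSV = "unique_titles_books_summary.csv"

df = pd.read_csv(RAW_CSV)
df = df.iloc[:, 1:]          # drop the unwanted index column (first column)
df = df.dropna()             # drop rows with missing values
df = df.drop_duplicates()    # drop duplicate rows
df.to_csv(CLEAN_CSV, index=False)

# Titles can repeat across categories, so also keep one row per title for semantic search.
df.drop_duplicates(subset=["title"]).to_csv(UNIQUE_CSV, index=False)
```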

### Step 2: Generate vectors of the books summaries

Here I am going to use a pretrained sentence encoder that helps capture the meaning of each summary; the semantic meaning of the summaries themselves is not changed.

We perform this over the `unique_titles_books_summary.csv` dataset.

![image](https://github.com/user-attachments/assets/21d2d92b-0ad5-4686-8e38-c47df10893f8)

Use the command:
```SH
python z_embedding.py
```

Using just a CPU, this should take <1 min.

![image](https://github.com/user-attachments/assets/5765d586-cc50-4adf-b714-5e371f757f38)

Output: `app_cache/summary_vectors.npy`
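
A minimal sketch of what this step does, assuming a sentence-transformers encoder such as `all-MiniLM-L6-v2` (the exact model name and the `summary` column name are assumptions):

```python
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

UNIQUE_CSV = "unique_titles_books_summary.csv"
SUMMARY_VECTORS = "app_cache/summary_vectors.npy"

df = pd.read_csv(UNIQUE_CSV)
model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed encoder choice

# One embedding per unique title's summary; shape: (num_titles, embedding_dim)
vectors = model.encode(df["summary"].tolist(), show_progress_bar=True)
np.save(SUMMARY_VECTORS, vectors)
```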

### Step 3: Fine-tune GPT-2 to hallucinate, but within some bounds

Let's address the **Hypothetical** part of the HyDE approach. It is all about generating made-up summaries, in short, hallucinating. The **Document Embeddings** part of HyDE is then about using these hallucinated summaries to run a semantic search against the database.

Two very important reasons to fine-tune GPT-2:
1. We want it to hallucinate, but within boundaries, i.e. speak the words/language that we have in `books_summaries.csv`, not wildly out-of-domain content.

2. Prompt-tune it so that we get consistent results. (Screenshot from https://huggingface.co/openai-community/gpt2); the screenshot shows the base model is only mildly consistent.

![image](https://github.com/user-attachments/assets/1b974da8-799b-48b8-8df7-be17a612f666)

> We are going to use the `clean_books_summary.csv` dataset in this training, to align with the prompt of ingesting different genres.

References:
- HyDE approach: Precise Zero-Shot Dense Retrieval without Relevance Labels, https://arxiv.org/pdf/2212.10496
- Prompt design and the book-summary idea are borrowed from https://github.com/pranavpsv/Genre-Based-Story-Generator
  - I did not use his model:
    - it lacks most of the categories (our dataset is different);
    - his code base is large, and editing it is not worth the effort.
- Fine-tuning code instructions are from https://huggingface.co/docs/transformers/en/tasks/language_modeling

Command

You must supply your Hugging Face token; it is required to push the model to HF:

```SH
huggingface-cli login
```

We are going to use the `clean_books_summary.csv` dataset while triggering this training:

```SH
python z_finetune_gpt.py
```
(Training takes ~30 mins for 10 epochs on a T4 GPU.)
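
For orientation, a condensed sketch of the kind of causal-LM fine-tuning `z_finetune_gpt.py` performs, following the Transformers language-modeling guide referenced above (the prompt format, column names, and hyperparameters here are assumptions, not the script's exact values):

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

df = pd.read_csv("clean_books_summary.csv")
# Assumed prompt format: condition each summary on its genre/category.
texts = [f"<BOS> <{row.categories}> {row.summary}" for row in df.itertuples()]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

ds = Dataset.from_dict({"text": texts}).map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal-LM labels

args = TrainingArguments(output_dir="gpt2-book-summaries", num_train_epochs=10,
                         per_device_train_batch_size=8, push_to_hub=True)
Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
```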

![image](https://github.com/user-attachments/assets/46253d48-903a-4977-b3f5-39ea1e6a6fd6)

The loss you see is the cross-entropy loss; as the fine-tuning instructions referenced above state: `Transformers models all have a default task-relevant loss function, so you don't need to specify one`.

![image](https://github.com/user-attachments/assets/13e9b868-6352-490c-9803-c5e49f8e8ae8)

So all we care about is: the lower the value, the better the model is trained :)

We are NOT going to test this model against a held-out test dataset, as the base model is already proven (it's GPT-2, duh!!).
But **we are going to evaluate our HyDE approach end-to-end next, to ensure the sanity of the approach**.

## Evaluation

Before discussing the evaluation metric, let me walk you through two important pieces: recommendation generation and similarity matching.

### Recommendation Generation

The generation is handled by the script `z_hypothetical_summary.py`. Under the hood, the following happens:

![image](https://github.com/user-attachments/assets/ee174c38-a1f3-438a-afb8-be2888c590da)

Code preview. I do minimal post-processing to chop off the `prompt` from the generated summaries before returning the result.

![image](https://github.com/user-attachments/assets/132e84a7-cb4f-49d2-8457-ff473224bad6)
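
A rough sketch of that generation step (the fine-tuned model id, the prompt, and the sampling values are placeholders; `z_hypothetical_summary.py` holds the real ones):

```python
from transformers import pipeline

# Assumed: the fine-tuned GPT-2 pushed to the Hub in Step 3.
generator = pipeline("text-generation", model="your-username/gpt2-book-summaries")

def generate_summaries(prompt: str, n: int = 5) -> list[str]:
    outputs = generator(prompt, num_return_sequences=n, do_sample=True,
                        top_k=50, top_p=0.85, max_new_tokens=100)
    # Chop off the prompt so only the hallucinated summary text remains.
    return [out["generated_text"][len(prompt):].strip() for out in outputs]

hypothetical_summaries = generate_summaries("<BOS> <thriller> ")
```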

### Similarity Matching

![image](https://github.com/user-attachments/assets/229ce58b-77cb-40b7-b033-c353ee41b0a6)

![image](https://github.com/user-attachments/assets/58613cd7-0b73-4042-b98d-e6cdf2184c32)

Because there are 1230 unique titles, we get an averaged similarity vector of the same size.

![image](https://github.com/user-attachments/assets/cc7b2164-a437-4517-8edb-cc0573c8a5e6)
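
Put differently, each hallucinated summary is embedded, compared against all 1230 stored summary vectors with cosine similarity, and the per-summary scores are averaged into one score per title. A sketch under those assumptions (the screenshots above show the actual code):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

encoder = SentenceTransformer("all-MiniLM-L6-v2")         # assumed: same encoder as Step 2
book_vectors = np.load("app_cache/summary_vectors.npy")   # shape: (1230, dim)

def rank_titles(hypothetical_summaries: list[str]) -> np.ndarray:
    gen_vectors = encoder.encode(hypothetical_summaries)   # shape: (n_generated, dim)
    sims = cosine_similarity(gen_vectors, book_vectors)    # shape: (n_generated, 1230)
    avg_sims = sims.mean(axis=0)                           # one averaged score per stored title
    return np.argsort(-avg_sims)                           # title indices, best match first
```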

### Evaluation Metric

So for a given input title, we can get the rank (by descending cosine similarity) of the stored title. To evaluate the entire approach, we use a modified version of the **Mean Reciprocal Rank (MRR)**.

![image](https://github.com/user-attachments/assets/0cb8fc2a-8834-4cda-95d2-52a02ac9c11d)

We do this for 30 random samples and compute the mean of their reciprocal ranks. Ideally every title should be ranked 1, so the MRR should equal 1; the closer to 1, the better.

![image](https://github.com/user-attachments/assets/d2c77d47-9244-474a-a850-d31fb914c9ca)
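
The metric itself is just the mean of the reciprocal ranks. A self-contained sketch (the sampling of the 30 titles and the ranking logic live in `z_evaluate.py`):

```python
def mean_reciprocal_rank(ranks: list[int]) -> float:
    """ranks[i] is the 1-based rank at which sample i's true title was retrieved."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Example: six sampled titles whose true entries came back at these ranks.
print(mean_reciprocal_rank([1, 3, 2, 5, 1, 4]))  # closer to 1 is better
```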

The values of TOP_P and TOP_K (i.e. the token-sampling parameters for our generator model) are set as `CONST` values in `z_evaluate.py`; the current values are borrowed from this work: https://www.kaggle.com/code/tuckerarrants/text-generation-with-huggingface-gpt2#Top-K-and-Top-P-Sampling

MRR = 0.311 implies that there is a good chance the target book appears around rank (1/0.311) ≈ 3 (third place), **i.e. within the top 5 recommendations**.
app.py ADDED

# from z_utils import get_dataframe
# import numpy as np

# # CONST
# SUMMARY_VECTORS = "app_cache/summary_vectors.npy"
# BOOKS_CSV = "clean_books_summary.csv"

# def get_recommendation(book_title: str) -> str:
#     return book_title


# def sanity_check():
#     '''Validates that the number of stored vectors matches the number of summaries, else raises an error.
#     '''
#     global BOOKS_CSV, SUMMARY_VECTORS
#     df = get_dataframe(BOOKS_CSV)
#     vectors = np.load(SUMMARY_VECTORS)
#     assert df.shape[0] == vectors.shape[0]


# Reference: https://huggingface.co/learn/nlp-course/en/chapter9/2

import gradio as gr


def greet(name):
    return "Hello " + name


# We instantiate the Textbox class
textbox = gr.Textbox(label="Write the truth you wanna know:", placeholder="John Doe", lines=2)


demo = gr.Interface(fn=greet, inputs=textbox, outputs="text")

demo.launch()