Deepak Sahu committed · Commit 3abee27 · 1 Parent(s): 77192ae
hf test 1
README.md
CHANGED
---
title: Book Recommender
emoji: ⚡
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.6.0
app_file: app.py
pinned: false
short_description: A content-based book recommender.
---

# Content-Based-Book-Recommender

A HyDE-based approach to building a recommendation engine.

## Libraries installed separately

I used Google Colab with the following extra libraries installed. I am NOT storing a `requirements.txt` because of an issue while pushing it to the HF Space.

```SH
pip install -U sentence-transformers datasets
```

## Training Steps

**All file paths are set as CONST values at the beginning of each script, so that the same paths are easy to reuse at inference time; hence they are not passed as CLI arguments.**

### Step 1: Data Clean

I am going to do basic cleaning steps: unwanted-column removal (the first column is just an index), missing-value removal (dropping rows), and duplicate-row removal. An output screenshot is attached below.

I am NOT doing any text pre-processing steps like stopword removal, stemming/lemmatization, or special-character removal, because my approach is to use causal language modelling (later steps), so it makes no sense to rip apart the word meanings with these word-level techniques.

A little tinkering around with the dataset shows that some titles can belong to multiple categories. (*I ran this code separately; it is not part of any script.*)



A descriptive analysis shows that there are just 1230 unique titles. (*I ran this code separately; it is not part of any script.*)



We are not going to remove the rows that show the same title (and summary) with different categories, but rather create a separate file for the unique titles.

```SH
python z_clean_data.py
```



Output: `clean_books_summary.csv`, `unique_titles_books_summary.csv`

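The cleaning itself lives in `z_clean_data.py`; the following is only a minimal sketch of the steps described above, with the raw input file name and column names assumed rather than taken from the actual script.

```python
# Hypothetical sketch of the cleaning in z_clean_data.py (file and column names are assumptions).
import pandas as pd

df = pd.read_csv("books_summary.csv")    # assumed raw input file
df = df.drop(columns=df.columns[0])      # drop the unwanted leading index column
df = df.dropna()                         # drop rows with missing values
df = df.drop_duplicates()                # drop duplicate rows
df.to_csv("clean_books_summary.csv", index=False)

# One row per title, for the semantic-search index used later.
df.drop_duplicates(subset=["book_name"]).to_csv("unique_titles_books_summary.csv", index=False)
```
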
### Step 2: Generate vectors of the book summaries

Here, I am going to use a pretrained sentence encoder to capture the meaning of each summary; the semantic meaning of the summaries themselves was not changed by the cleaning.

We perform this over the `unique_titles_books_summary.csv` dataset.



Use the command

```SH
python z_embedding.py
```

Using just a CPU, this should take <1 min.



Output: `app_cache/summary_vectors.npy`

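`z_embedding.py` is the reference implementation; a minimal sketch of the idea, assuming an `all-MiniLM-L6-v2` encoder and a `summaries` column, could look like this.

```python
# Hypothetical sketch of z_embedding.py (encoder checkpoint and column name are assumptions).
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("unique_titles_books_summary.csv")
model = SentenceTransformer("all-MiniLM-L6-v2")  # any pretrained sentence encoder

# One fixed-size vector per summary; shape = (num_titles, embedding_dim).
vectors = model.encode(df["summaries"].tolist(), show_progress_bar=True)
np.save("app_cache/summary_vectors.npy", vectors)
```
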
### Step 3: Fine-tune GPT-2 to hallucinate, but within some bounds

Let's address the **Hypothetical** part of the HyDE approach: it is all about generating made-up summaries, in short, hallucinating. The **Document Extraction** part of HyDE is then about using these hallucinated summaries to do a semantic search over the database.

There are two very important reasons to fine-tune GPT-2:
1. We want it to hallucinate, but within boundaries, i.e. speak the words/language that we have in `books_summaries.csv`, not wildly out-of-domain text.

2. Prompt-tune it so that we get consistent results. (Screenshot from https://huggingface.co/openai-community/gpt2; it shows the base model is only mildly consistent.)



> We are going to use the `clean_books_summary.csv` dataset in this training, to align with a prompt that ingests the different genres.

Reference:
- HyDE approach: Precise Zero-Shot Dense Retrieval without Relevance Labels, https://arxiv.org/pdf/2212.10496
- Prompt design and the book-summary idea are borrowed from https://github.com/pranavpsv/Genre-Based-Story-Generator
  - I did not use his model:
    - it lacks most of the categories (our dataset is different);
    - his code base is large, and editing it is not worth the effort.
- Fine-tuning code instructions are from https://huggingface.co/docs/transformers/en/tasks/language_modeling

Command

You must supply your Hugging Face token; it is required to push the model to the Hub.

```SH
huggingface-cli login
```

We use the `clean_books_summary.csv` dataset when triggering this training:

```SH
python z_finetune_gpt.py
```

(Training lasts ~30 mins for 10 epochs on a T4 GPU.)
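`z_finetune_gpt.py` follows the causal-language-modelling guide linked above; the following is only a condensed sketch of that recipe, with the prompt format, column names, and hyperparameters assumed rather than taken from the actual script.

```python
# Condensed, hypothetical sketch of the GPT-2 fine-tuning recipe; prompt format,
# column names, and hyperparameters are assumptions, not the repo's actual values.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = load_dataset("csv", data_files="clean_books_summary.csv")["train"]

def tokenize(batch):
    # One training text per book: a genre-conditioned prompt followed by its summary.
    texts = [f"Genre: {g}. Book: {t}. Summary: {s}"
             for g, t, s in zip(batch["categories"], batch["book_name"], batch["summaries"])]
    return tokenizer(texts, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-books", num_train_epochs=10, push_to_hub=True),
    train_dataset=tokenized,
    data_collator=collator,  # default causal-LM (cross-entropy) loss
)
trainer.train()
trainer.push_to_hub()
```
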



The loss you see is cross-entropy loss; as the fine-tuning instructions referenced above state: `Transformers models all have a default task-relevant loss function, so you don't need to specify one`.



So all we care about is that the lower this value, the better the model is trained :)

We are NOT going to test this individual model on some test dataset, as the model is already proven (it's GPT-2, after all).
But **we are going to evaluate our HyDE approach end-to-end next, to ensure the sanity of the approach**.

## Evaluation

Before discussing the evaluation metric, let me walk you through two important pieces: recommendation generation and similarity matching.

### Recommendation Generation

The generation is handled by the script `z_hypothetical_summary.py`. Under the hood, the following happens.



Code preview: I do minimal post-processing to chop the `prompt` off the generated summaries before returning the result.


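As a rough guide, a minimal sketch of this generation step might look like the following, assuming the fine-tuned model was pushed to the Hub under a hypothetical name and using illustrative sampling values.

```python
# Hypothetical sketch of the generation step in z_hypothetical_summary.py;
# the checkpoint name, prompt format, and sampling values are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="your-username/gpt2-books")  # assumed fine-tuned checkpoint

def generate_summaries(book_title: str, n: int = 5) -> list[str]:
    prompt = f"Book: {book_title}. Summary:"
    outputs = generator(prompt, max_new_tokens=100, num_return_sequences=n,
                        do_sample=True, top_k=50, top_p=0.85)
    # Chop the prompt off each generated text, keeping only the hallucinated summary.
    return [out["generated_text"][len(prompt):].strip() for out in outputs]
```
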
### Similarity Matching





Because there are 1230 unique titles, the averaged similarity vector is of that same size.


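A minimal sketch of the matching logic shown in the screenshots above (the encoder, shapes, and function names are assumptions carried over from the earlier sketches):

```python
# Hypothetical sketch of the similarity-matching step; names and shapes are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")           # same encoder as Step 2
summary_vectors = np.load("app_cache/summary_vectors.npy")  # (1230, dim)

def rank_titles(generated_summaries: list[str]) -> np.ndarray:
    # Encode each hallucinated summary and cosine-compare it against every stored summary.
    gen_vecs = encoder.encode(generated_summaries)           # (n, dim)
    gen_vecs = gen_vecs / np.linalg.norm(gen_vecs, axis=1, keepdims=True)
    db_vecs = summary_vectors / np.linalg.norm(summary_vectors, axis=1, keepdims=True)
    sims = gen_vecs @ db_vecs.T                              # (n, 1230) cosine similarities
    # Average over the n generated summaries -> one score per stored title.
    return sims.mean(axis=0)                                 # (1230,)
```
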
### Evaluation Metric

For a given input title, we can get the rank (by descending cosine similarity) of the stored title. To evaluate the entire approach, we are going to use a modified version of the **Mean Reciprocal Rank (MRR)**.



We do this for 30 random samples and compute the mean of their reciprocal ranks. Ideally every title would be ranked 1 and the MRR would equal 1; the closer to 1, the better.



The values of TOP_P and TOP_K (i.e. the token-sampling parameters for our generator model) are set as `CONST` values in `z_evaluate.py`; the current values are borrowed from https://www.kaggle.com/code/tuckerarrants/text-generation-with-huggingface-gpt2#Top-K-and-Top-P-Sampling

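Putting the pieces together, a minimal sketch of the evaluation loop (using the standard MRR form and the helpers assumed in the earlier sketches; `z_evaluate.py` uses its own constants and a modified variant of the metric):

```python
# Hypothetical sketch of the MRR evaluation in z_evaluate.py; it reuses the assumed
# helpers generate_summaries() and rank_titles() from the sketches above.
import numpy as np
import pandas as pd

df = pd.read_csv("unique_titles_books_summary.csv")  # row order assumed to match summary_vectors.npy
sample = df.sample(n=30, random_state=42)

reciprocal_ranks = []
for idx, row in sample.iterrows():
    scores = rank_titles(generate_summaries(row["book_name"]))  # (1230,) averaged similarities
    # Rank of the true title = 1 + number of titles scoring strictly higher.
    rank = 1 + int((scores > scores[idx]).sum())
    reciprocal_ranks.append(1.0 / rank)

print("MRR:", np.mean(reciprocal_ranks))
```
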
MRR = 0.311 implies that there is a good chance that the target book will sit around rank 1/0.311 ≈ 3 (third), **i.e. within the top 5 recommendations**.

app.py
ADDED
# from z_utils import get_dataframe
# import numpy as np

# # CONST
# SUMMARY_VECTORS = "app_cache/summary_vectors.npy"
# BOOKS_CSV = "clean_books_summary.csv"

# def get_recommendation(book_title: str) -> str:
#     return book_title


# def sanity_check():
#     '''Validates that the vector count matches the number of summaries, else RAISES an error.'''
#     df = get_dataframe(BOOKS_CSV)
#     vectors = np.load(SUMMARY_VECTORS)
#     assert df.shape[0] == vectors.shape[0]


# Reference: https://huggingface.co/learn/nlp-course/en/chapter9/2

import gradio as gr


def greet(name):
    return "Hello " + name


# Instantiate the Textbox component used as the input field.
textbox = gr.Textbox(label="Write the truth you wanna know:", placeholder="John Doe", lines=2)

demo = gr.Interface(fn=greet, inputs=textbox, outputs="text")

demo.launch()