---
title: Book Recommender
emoji: ⚡
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.6.0
app_file: app.py
pinned: false
short_description: A content based book recommender.
---
# Content-Based-Book-Recommender
A HyDE-based approach for building a recommendation engine.
Try it out: https://huggingface.co/spaces/LunaticMaestro/book-recommender

## Foreword
- All images are my own work; the source PowerPoint for them is in the `.resources` folder of this repo.
- Code documentation follows [Google's Python Style Guide](https://google.github.io/styleguide/pyguide.html).
- All file paths are set as CONSTs at the beginning of each script, rather than passed as CLI arguments, to make the paths easier to work with during inference & evaluation.
- The seed value for reproducibility is set as a CONST as well.
- The `z_` prefix in filenames is just to avoid confusion (for humans) between prebuilt modules and custom ones during import.
## Table of Contents
- [Running Inference Locally](#running-inference)
- [Colab 🏎️ & minimal set up](#google-colab)
- [10,000 feet Approach overview](#approach)
- Pipeline walkthrough in detail
  *For each part of the pipeline there is a separate script to execute, mentioned in the respective section along with output screenshots.*
- [Training](#training-steps)
- [Step 1: Data Clean](#step-1-data-clean)
- [Step 2: Generate vectors of the books summaries](#step-2-generate-vectors-of-the-books-summaries)
- [Step 3: Fine-tune GPT-2 to Hallucinate but with some bounds.](#step-3-fine-tune-gpt-2-to-hallucinate-but-with-some-bounds)
- [Parts of Inference](#parts-of-inference)
- [How Recommendation is working](#recommendation-generation)
- [How Similarity Matching is working](#similarity-matching)
- [Evaluation Metric & Result](#evaluation-metric--result)
## Running Inference
### Memory Requirements
The code needs <2 GB of RAM to use both of the following models. CPU-only works fine for inference.
- https://huggingface.co/openai-community/gpt2 ~500 MB
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 <500 MB
### Libraries
`requirements.txt` is set up so that HF Spaces can resolve it without conflicts. I developed the code in Google Colab, where the following libraries required manual installation:
```SH
pip install sentence-transformers datasets gradio
```
### Running
#### Google Colab
```SH
!pip install sentence-transformers datasets gradio
!git clone https://github.com/LunaticMaestro/Content-Based-Book-Recommender
%cd /content/Content-Based-Book-Recommender
```
```SH
!python app.py
```

**Access the app at the public link Gradio prints.**
Colab is fast 🏎️: even with CPU only, it takes ~16 s.

Sidenotes:
1. I rewrote the snippets from `z_evaluate.py` into `app.py`, because Gradio rendering needs to be handled differently.
2. DON'T set `debug=True` for Gradio in an HF Space, else the Space doesn't start.
3. Free HF Spaces handle model persistence (cache files) differently from a local run (tried in Colab), which works faster. **You will see a lot of commits in my HF Space from discovering this problem.**
#### Local System
```SH
python app.py
```
access at http://localhost:7860/
## Approach

References:
- This is the core idea: https://arxiv.org/abs/2212.10496
- Another work based on same, https://github.com/aws-samples/content-based-item-recommender
- For future, a very complex work https://github.com/HKUDS/LLMRec
## Training Steps
### Step 1: Data Clean
What is taken care of:
- unwanted column removal (the first column, an index)
- missing value removal (dropping those rows)
- duplicate row removal

What is not taken care of:
- stopword removal, stemming/lemmatization, or special character removal,
**because the approach is to use causal language modelling (later steps), so it makes no sense to rip apart the word meanings.**
### Observations from `z_cleand_data.ipynb`
- The same title corresponds to different categories

- There are 1230 unique titles in total.

**Action**: We are not going to remove the rows that show the same titles (& summaries) with different categories, but rather create a separate file of unique titles.
**RUN**:
```SH
python z_clean_data.py
```

Output: `clean_books_summary.csv`, `unique_titles_books_summary.csv`
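For reference, here is a minimal hedged sketch of what `z_clean_data.py` likely does; the input filename and the `book_name` column are assumptions, not confirmed from the script:
```python
# Hedged sketch of z_clean_data.py (input filename and column names are assumptions).
import pandas as pd

df = pd.read_csv("books_summary.csv", index_col=0)  # drop the unwanted index column
df = df.dropna().drop_duplicates()                  # drop missing-value rows and exact duplicates
df.to_csv("clean_books_summary.csv", index=False)

# The same title can repeat across categories, so also write a unique-titles file.
df.drop_duplicates(subset=["book_name"]).to_csv("unique_titles_books_summary.csv", index=False)
```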
### Step 2: Generate vectors of the books summaries.
**WHAT & WHY**
Here I use a pretrained sentence encoder that captures the meaning of a sentence. We run it over the `unique_titles_books_summary.csv` dataset.
We cache the vectors because the semantic meaning of the summaries (of the books to output) does not change during the entire runtime.

**RUN**:
Use command
```SH
python z_embedding.py
```
Using just the CPU, this should take <1 min.

Output: `app_cache/summary_vectors.npy`
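A minimal sketch of what `z_embedding.py` does, assuming the summaries live in a `summaries` column (the column name is an assumption):
```python
# Hedged sketch of z_embedding.py (the column name is an assumption).
import os
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("unique_titles_books_summary.csv")
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode every stored summary once and cache the matrix; the summaries
# never change at runtime, so this is computed a single time.
vectors = encoder.encode(df["summaries"].tolist(), show_progress_bar=True)
os.makedirs("app_cache", exist_ok=True)
np.save("app_cache/summary_vectors.npy", vectors)
```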
### Step 3: Fine-tune GPT-2 to Hallucinate but with some bounds.
**What & Why**
Hypothetical Document Extraction (HyDE) in a nutshell:
- The **Hypothetical** part of the HyDE approach is all about generating random summaries, in short hallucinating. **This is why the approach works for new book titles.**
- The **Document Extraction** part (of HyDE) is about using these hallucinated summaries to do a semantic search over the database.

**Why fine-tune GPT-2**
1. We want it to hallucinate, but within boundaries, i.e. speak the words/language that we have in `books_summaries.csv`, not some wildly out-of-this-world logic.
2. We prompt-tune so that we get consistent results. (Screenshot from https://huggingface.co/openai-community/gpt2.) The screenshot shows the base model is only mildly consistent; a quick way to check this yourself is sketched below.
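Here is a hedged snippet for eyeballing the base model's consistency; the prompt shape and seed are illustrative assumptions, not the repo's actual template or CONSTs:
```python
# Quick check of base GPT-2's consistency on a fixed prompt
# (the prompt and seed are illustrative assumptions).
from transformers import pipeline, set_seed

set_seed(42)  # arbitrary seed, for reproducibility
generator = pipeline("text-generation", model="openai-community/gpt2")
outputs = generator("Book title: The Hobbit. Summary:",
                    num_return_sequences=3, max_new_tokens=40, do_sample=True)
for out in outputs:
    print(out["generated_text"], "\n---")
```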

Reference:
- HyDE Approach, Precise Zero-Shot Dense Retrieval without Relevance Labels https://arxiv.org/pdf/2212.10496
- Prompt design and book summary idea I borrowed from https://github.com/pranavpsv/Genre-Based-Story-Generator
- I did not use his model:
  - it lacks most of the categories (our dataset is different);
  - his code base is large; it could be edited, but it's not worth the effort.
- Fine-tuning code instructions are from https://huggingface.co/docs/transformers/en/tasks/language_modeling
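For orientation, here is a condensed, hedged sketch of what `z_finetune_gpt.py` roughly looks like, following the HF causal-LM recipe linked above; the dataset column name and training-argument values are assumptions:
```python
# Condensed, hedged sketch of z_finetune_gpt.py following the HF causal-LM tutorial
# (column name and training arguments are assumptions).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

ds = load_dataset("csv", data_files="clean_books_summary.csv")["train"]
ds = ds.map(lambda batch: tokenizer(batch["summaries"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="gpt2-book-summary-generator",
    num_train_epochs=10,   # the published model was trained for 10 epochs
    push_to_hub=True,      # set to False to keep the model local (see below)
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.push_to_hub()  # comment this out if not pushing
```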
**RUN**
If you want to:
- push to HF: you must first supply your Hugging Face token, which is required to push the model to HF
```SH
huggingface-cli login
```
- NOT push to HF: then in `z_finetune_gpt.py`:
  - set `push_to_hub` on line 59 to `False`
  - comment out `trainer.push_to_hub()` on line 77

We use the dataset `clean_books_summary.csv` when triggering this training.
```SH
python z_finetune_gpt.py
```
The image below shows just 2 epochs, but the model pushed to my HF, https://huggingface.co/LunaticMaestro/gpt2-book-summary-generator, is trained for 10 epochs, which takes ~30 mins on a T4 GPU, **reducing the loss to 0.87 (perplexity ≈ 2.38)**.

The loss you see is cross-entropy loss; as noted in the [fine-tuning instructions](https://huggingface.co/docs/transformers/en/tasks/language_modeling): `Transformers models all have a default task-relevant loss function, so you don’t need to specify one`.
So all we care about is: the lower the value, the better the model is trained :)
We are NOT going to test this unit model on a separate test dataset, as the model is already proven (it's GPT-2, duh!!).
But **we are going to evaluate our HyDE approach end-to-end next to ensure the sanity of the approach**, which will inherently prove the goodness of this model.
## Parts of Inference
Before discussing the evaluation metric, let me walk you through the two important pieces: recommendation generation and similarity matching.
### Recommendation Generation
The generation is handled by functions in the script `z_hypothetical_summary.py`. Under the hood the following happens:

**Function Preview**: I do minimal post-processing to chop off the `prompt` from the generated summaries before returning the result.
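Here is a minimal sketch of how that generation step might look; the prompt template and sampling values are assumptions, not the script's actual constants:
```python
# Hedged sketch of the generation step in z_hypothetical_summary.py
# (prompt template and sampling values are assumptions).
from transformers import pipeline

generator = pipeline("text-generation",
                     model="LunaticMaestro/gpt2-book-summary-generator")

def generate_summaries(title: str, n: int = 5) -> list[str]:
    prompt = f"Book title: {title}. Summary:"  # assumed template
    outputs = generator(prompt, num_return_sequences=n,
                        max_new_tokens=100, do_sample=True)
    # Minimal post-processing: chop the prompt off each generated text.
    return [out["generated_text"][len(prompt):].strip() for out in outputs]
```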

### Similarity Matching


**Function Preview**: Because there are 1230 unique titles, we get an averaged similarity vector of the same size.
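A minimal sketch of the matching step against the cached vectors (the function and variable names are mine, for illustration):
```python
# Hedged sketch of the similarity-matching step: cosine similarity against the
# cached summary vectors, averaged over the generated (hypothetical) summaries.
import numpy as np
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
summary_vectors = np.load("app_cache/summary_vectors.npy")  # shape (1230, 384)

def rank_books(fake_summaries: list[str]) -> np.ndarray:
    query_vectors = encoder.encode(fake_summaries)
    # (n_fake, 1230) similarity matrix, averaged to one score per stored title.
    scores = util.cos_sim(query_vectors, summary_vectors).mean(dim=0).numpy()
    return scores.argsort()[::-1]  # title indices, best match first
```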

## Evaluation Metric & Result
So for a given input title we can get the rank (by descending cosine similarity) of the stored titles. To evaluate the entire approach we use a modified version of **Mean Reciprocal Rank (MRR)**.

Test Plan (a hedged sketch follows the list):
- Take 30 random samples and compute the mean of their reciprocal ranks.
- If we want our known book titles to appear in the top 5 results, then we need MRR >= 1/5 = 0.2.
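Putting the pieces above together, the evaluation might look like this; `generate_summaries` and `rank_books` are the illustrative helpers sketched earlier, and the seed and column name are assumptions:
```python
# Hedged sketch of the MRR computation in z_evaluate.py
# (seed and column name are assumptions).
import numpy as np
import pandas as pd

SEED, N_SAMPLES = 42, 30

df = pd.read_csv("unique_titles_books_summary.csv")
rng = np.random.default_rng(SEED)
sample_idx = rng.choice(len(df), size=N_SAMPLES, replace=False)

reciprocal_ranks = []
for idx in sample_idx:
    title = df.iloc[idx]["book_name"]                # column name assumed
    ranking = rank_books(generate_summaries(title))  # helpers sketched above
    rank = int(np.where(ranking == idx)[0][0]) + 1   # 1-based rank of the true title
    reciprocal_ranks.append(1.0 / rank)

print("MRR:", np.mean(reciprocal_ranks))
```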
**RUN**
```SH
python z_evaluate.py
```

The values of TOP_P and TOP_K (i.e. the token-sampling parameters for our generator model) are set as `CONST`s in `z_evaluate.py`; the current values are borrowed from this work: https://www.kaggle.com/code/tuckerarrants/text-generation-with-huggingface-gpt2#Top-K-and-Top-P-Sampling
MRR = 0.311 implies that there's a good chance the target book will be at rank (1/0.311) ≈ 3 (third rank), **i.e. within the top 5 recommendations**.
> TODO: A sampling study can be done to better make this conclusion.