Deepak Sahu committed on
Commit e446a52 · 1 Parent(s): 01e2b4e

update content

.resources/eval1.png ADDED

Git LFS Details

  • SHA256: 2a9c07521e3749a596bc0c4d11e3e49c307aa7ff79fc2977bb37b29351c39b0b
  • Pointer size: 130 Bytes
  • Size of remote file: 36.8 kB
.resources/fine-tune2.png ADDED

Git LFS Details

  • SHA256: 2d843b127ad5aad8cb9650aa5bd52a69aaedb04302df731fc43a122a868441cd
  • Pointer size: 130 Bytes
  • Size of remote file: 50.1 kB
README.md CHANGED
@@ -36,9 +36,13 @@ Try it out: https://huggingface.co/spaces/LunaticMaestro/book-recommender
36
  - Pipeline walkthrough in detail
37
 
38
  *For each part of the pipeline there is a separate script which needs to be executed; it is mentioned in the respective section along with output screenshots.*
39
- - Training
40
  - [Step 1: Data Clean](#step-1-data-clean)
 
 
41
 
 
 
42
  ## Running Inference Locally
43
 
44
  ### Memory Requirements
@@ -79,7 +83,7 @@ Modify app.py edit line 93 to `demo.launch(share=True)` then run following in ce
79
 
80
  References:
81
  - This is the core idea: https://arxiv.org/abs/2212.10496
82
- - https://github.com/aws-samples/content-based-item-recommender
83
  - For the future, a very complex work: https://github.com/HKUDS/LLMRec
84
 
85
  ## Training Steps
@@ -149,16 +153,20 @@ Output: `app_cache/summary_vectors.npy`
149
 
150
  ### Step 3: Fine-tune GPT-2 to Hallucinate but with some bounds.
151
 
152
- Lets address the **Hypothetical** part of HyDE approach. Its all about generating random summaries,in short hallucinating. While the **Document Extraction** (part of HyDE) is about using these hallucinated summaries to do semantic search on database.
 
 
 
 
 
 
 
153
 
154
- Two very important reasons why to fine-tune GPT-2
155
  1. We want it to hallucinate, but within boundaries, i.e. speak the words/language that we have in books_summaries.csv, NOT wildly out-of-this-world content.
156
 
157
  2. Prompt-tune it so that we can get consistent results. (Screenshot from https://huggingface.co/openai-community/gpt2); the screenshot shows the base model is only mildly consistent.
158
 
159
- ![image](https://github.com/user-attachments/assets/1b974da8-799b-48b8-8df7-be17a612f666)
160
-
161
- > we are going to use ``clean_books_summary.csv` dataset in this training to align with the prompt of ingesting different genre.
162
 
163
  Reference:
164
  - HyDE Approach, Precise Zero-Shot Dense Retrieval without Relevance Labels https://arxiv.org/pdf/2212.10496
@@ -168,32 +176,39 @@ Reference:
168
  - His code base is too large; it could be edited, but it's not worth the effort.
169
  - Fine-tuning code instructions are from https://huggingface.co/docs/transformers/en/tasks/language_modeling
170
 
171
- Command
172
 
173
- You must supply your token from huggingface, required to push model to HF
174
 
175
- ```SH
176
- huggingface-cli login
177
- ```
 
 
 
 
 
 
 
 
 
178
 
179
  We are going to use dataset `clean_books_summary.csv` while triggering this training.
180
 
181
  ```SH
182
  python z_finetune_gpt.py
183
  ```
184
- (Training lasts ~30 mins for 10 epochs with T4 GPU)
185
 
186
- ![image](https://github.com/user-attachments/assets/46253d48-903a-4977-b3f5-39ea1e6a6fd6)
187
 
 
188
 
189
- The loss you see is cross-entryopy loss; as ref in the fine-tuning instructions (see above reference) states : `Transformers models all have a default task-relevant loss function, so you don’t need to specify one `
190
 
191
- ![image](https://github.com/user-attachments/assets/13e9b868-6352-490c-9803-c5e49f8e8ae8)
192
 
193
  So all we care about is that the lower the value, the better the model is trained :)
194
 
195
- We are NOT going to test this unit model for some test dataset as the model is already proven (its GPT-2 duh!!).
196
- But **we are going to evaluate our HyDE approach end-2-end next to ensure sanity of the approach**.
197
 
198
  ## Evaluation
199
 
 
36
  - Pipeline walkthrough in detail
37
 
38
  *For each part of the pipeline there is a separate script which needs to be executed; it is mentioned in the respective section along with output screenshots.*
39
+ - [Training](#training-steps)
40
  - [Step 1: Data Clean](#step-1-data-clean)
41
+ - [Step 2: Generate vectors of the books summaries](#step-2-generate-vectors-of-the-books-summaries)
42
+ - [Step 3: Fine-tune GPT-2 to Hallucinate but with some bounds.](#step-3-fine-tune-gpt-2-to-hallucinate-but-with-some-bounds)
43
 
44
+ - [Evaluation](#evaluation)
45
+ - Inference
46
  ## Running Inference Locally
47
 
48
  ### Memory Requirements
 
83
 
84
  References:
85
  - This is the core idea: https://arxiv.org/abs/2212.10496
86
+ - Another work based on the same idea: https://github.com/aws-samples/content-based-item-recommender
87
  - For the future, a very complex work: https://github.com/HKUDS/LLMRec
88
 
89
  ## Training Steps
 
153
 
154
  ### Step 3: Fine-tune GPT-2 to Hallucinate but with some bounds.
155
 
156
+ **What & Why**
157
+
158
+ Hypothetical Document Extraction (HyDE) in a nutshell:
159
+ - The **Hypothetical** part of the HyDE approach is all about generating random summaries, in short, hallucinating. **This is why the approach will work for new book titles.**
160
+ - The **Document Extraction** part of HyDE is about using these hallucinated summaries to do a semantic search on the database (see the sketch below).
161
+
162
+
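A minimal sketch of those two steps, for orientation only: it assumes the fine-tuned generator pushed to the Hub (`LunaticMaestro/gpt2-book-summary-generator`) and the vectors cached in Step 2; the generation settings here are illustrative, and the real inference code lives in `app.py`.

```python
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# 1. "Hypothetical": hallucinate a plausible summary for the user's query.
generator = pipeline("text-generation", model="LunaticMaestro/gpt2-book-summary-generator")
hypothetical = generator("A mystery novel set in Victorian London", max_new_tokens=60)[0]["generated_text"]

# 2. "Document Extraction": embed the hallucinated summary and search the real ones.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = embedder.encode([hypothetical])                 # shape (1, 384)
summary_vectors = np.load("app_cache/summary_vectors.npy")  # cached in Step 2

# Cosine similarity against every stored summary vector.
sims = (query_vec @ summary_vectors.T) / (
    np.linalg.norm(query_vec, axis=1, keepdims=True) * np.linalg.norm(summary_vectors, axis=1)
)
top_k = np.argsort(-sims[0])[:5]
print(top_k)  # row indices into unique_titles_books_summary.csv
```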
163
+ **Why fine-tune GPT-2**
164
 
 
165
  1. We want it to hallucinate, but within boundaries, i.e. speak the words/language that we have in books_summaries.csv, NOT wildly out-of-this-world content.
166
 
167
  2. Prompt-tune it so that we can get consistent results. (Screenshot from https://huggingface.co/openai-community/gpt2); the screenshot shows the base model is only mildly consistent.
168
 
169
+ ![image](.resources/fine-tune.png)
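To see point 2 in practice, here is a tiny sketch of prompt-conditioned generation with the base checkpoint; the prompt template below is only an assumption for illustration, the actual prompt used for training is defined in `z_finetune_gpt.py`.

```python
from transformers import pipeline, set_seed

set_seed(42)  # same seeding as z_finetune_gpt.py, for repeatable outputs
generator = pipeline("text-generation", model="openai-community/gpt2")

# Illustrative prompt only; the real training prompt lives in z_finetune_gpt.py.
prompt = "Genre: fantasy. Book summary:"
for out in generator(prompt, max_new_tokens=40, num_return_sequences=2, do_sample=True):
    print(out["generated_text"])
```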
 
 
170
 
171
  Reference:
172
  - HyDE Approach, Precise Zero-Shot Dense Retrieval without Relevance Labels https://arxiv.org/pdf/2212.10496
 
176
  - His code base is too large; it could be edited, but it's not worth the effort.
177
  - Fine-tuning code instructions are from https://huggingface.co/docs/transformers/en/tasks/language_modeling
178
 
179
+ **RUN**
180
 
 
181
 
182
+ If you want to:
183
+
184
+ - Push to HF: you must supply your Hugging Face token, which is required to push the model to the HF Hub
185
+
186
+ ```SH
187
+ huggingface-cli login
188
+ ```
189
+
190
+ - Not push to HF: then in `z_finetune_gpt.py` (see the sketch below):
191
+
192
+ - set line 59 `push_to_hub` to `False`
193
+ - comment out line 77 `trainer.push_to_hub()`
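Illustrative only (the real values live at the referenced lines of `z_finetune_gpt.py`): the toggle maps to the standard `TrainingArguments` flag, roughly like this.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-book-summary-generator",  # also used as the HF Hub repo name
    push_to_hub=False,  # line 59: False trains purely locally, True needs `huggingface-cli login`
)
# ...and comment out the final `trainer.push_to_hub()` call (line 77).
```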
194
 
195
  We are going to use dataset `clean_books_summary.csv` while triggering this training.
196
 
197
  ```SH
198
  python z_finetune_gpt.py
199
  ```
 
200
 
201
+ The image below shows only 2 epochs, but the model pushed to my HF repo https://huggingface.co/LunaticMaestro/gpt2-book-summary-generator was trained for 10 epochs (~30 mins on a T4 GPU), **reducing the loss to ~0.87 (perplexity ≈ 2.38)**.
202
 
203
+ ![image](.resources/fine-tune2.png)
204
 
 
205
 
206
+ The loss you see is the cross-entropy loss; as the [fine-tuning instructions](https://huggingface.co/docs/transformers/en/tasks/language_modeling) state: `Transformers models all have a default task-relevant loss function, so you don’t need to specify one`
207
 
208
  So all we care about is that the lower the value, the better the model is trained :)
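For intuition, the perplexity quoted above is just the exponential of the cross-entropy loss, so the two numbers are consistent:

```python
import math

final_loss = 0.87            # cross-entropy loss reported above
print(math.exp(final_loss))  # ~2.39, matching the quoted perplexity of ~2.38
```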
209
 
210
+ We are NOT going to test this model on a separate test dataset, as the base model is already proven (it's GPT-2, duh!!).
211
+ But **we are going to evaluate our HyDE approach end-to-end next to ensure the sanity of the approach**, which will inherently validate this model.
212
 
213
  ## Evaluation
214
 
z_embedding.py CHANGED
@@ -3,15 +3,18 @@ import pandas as pd
3
  import numpy as np
4
  from z_utils import get_dataframe
5
  from tqdm import tqdm
 
6
 
7
  # CONST
8
  EMB_MODEL = "all-MiniLM-L6-v2"
9
  INP_DATASET_CSV = "unique_titles_books_summary.csv"
10
  CACHE_SUMMARY_EMB_NPY = "app_cache/summary_vectors.npy"
11
 
 
12
  model = None
13
 
14
  def load_model():
 
15
  global model
16
  if model is None:
17
  model = SentenceTransformer(EMB_MODEL)
@@ -53,15 +56,21 @@ def dataframe_compute_summary_vector(books_df: pd.DataFrame) -> np.ndarray:
53
 
54
  return summary_vectors
55
 
56
- def get_embeddings(summaries: list[str], model = None) -> np.ndarray:
57
  '''Utility function to take in hypothetical document(s) and return their embedding(s)
 
 
 
 
 
 
 
58
  '''
59
  model = model if model else load_model()
60
  if isinstance(summaries, str):
61
  summaries = [summaries, ]
62
  return model.encode(summaries)
63
 
64
-
65
  def cache_create_embeddings(books_csv_path: str, output_path: str) -> None:
66
  '''Read the books csv and generate vectors of the `summaries` columns and store in `output_path`
67
  '''
@@ -70,7 +79,6 @@ def cache_create_embeddings(books_csv_path: str, output_path: str) -> None:
70
  np.save(file=output_path, arr=vectors)
71
  print(f"Vectors saved to {output_path}")
72
 
73
-
74
  if __name__ == "__main__":
75
  print("Generating vectors of the summaries")
76
  cache_create_embeddings(books_csv_path=INP_DATASET_CSV, output_path=CACHE_SUMMARY_EMB_NPY)
 
3
  import numpy as np
4
  from z_utils import get_dataframe
5
  from tqdm import tqdm
6
+ from typing import Any
7
 
8
  # CONST
9
  EMB_MODEL = "all-MiniLM-L6-v2"
10
  INP_DATASET_CSV = "unique_titles_books_summary.csv"
11
  CACHE_SUMMARY_EMB_NPY = "app_cache/summary_vectors.npy"
12
 
13
+ # GLOBAL VAR
14
  model = None
15
 
16
  def load_model():
17
+ '''Workaround for HF Space; cross-script loading is slow.'''
18
  global model
19
  if model is None:
20
  model = SentenceTransformer(EMB_MODEL)
 
56
 
57
  return summary_vectors
58
 
59
+ def get_embeddings(summaries: list[str], model: Any = None) -> np.ndarray:
60
  '''Utility function to take in hypothetical document(s) and return their embedding(s)
61
+
62
+ Args:
63
+ summaries: different hypothetical summaries
64
+ model: The embedding model; see `app.py` for faster HF cross-script loading.
65
+
66
+ Returns:
67
+ list of embeddings
68
  '''
69
  model = model if model else load_model()
70
  if isinstance(summaries, str):
71
  summaries = [summaries, ]
72
  return model.encode(summaries)
73
 
 
74
  def cache_create_embeddings(books_csv_path: str, output_path: str) -> None:
75
  '''Read the books csv and generate vectors of the `summaries` columns and store in `output_path`
76
  '''
 
79
  np.save(file=output_path, arr=vectors)
80
  print(f"Vectors saved to {output_path}")
81
 
 
82
  if __name__ == "__main__":
83
  print("Generating vectors of the summaries")
84
  cache_create_embeddings(books_csv_path=INP_DATASET_CSV, output_path=CACHE_SUMMARY_EMB_NPY)
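For reference, a minimal usage sketch of the updated helper (assuming `z_embedding.py` is importable and the sentence-transformers model can be downloaded):

```python
from z_embedding import get_embeddings

# A single string is also accepted; it gets wrapped into a list internally.
vectors = get_embeddings(["A detective investigates a theft in a small coastal town."])
print(vectors.shape)  # (1, 384) for all-MiniLM-L6-v2
```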
z_finetune_gpt.py CHANGED
@@ -1,4 +1,7 @@
1
  # THIS file is meant to be used once hence not having functions just sequential code
 
 
 
2
  import pandas as pd
3
  from transformers import AutoTokenizer, set_seed
4
  from transformers import DataCollatorForLanguageModeling
@@ -9,11 +12,9 @@ from z_utils import get_dataframe
9
  # CONST
10
  INP_DATASET_CSV = "clean_books_summary.csv"
11
  BASE_CASUAL_MODEL = "openai-community/gpt2"
12
- # TRAINED_MODEL_OUTPUT_DIR = "gpt2-book-summary-generator" # same name for HF Hub
13
- TRAINED_MODEL_OUTPUT_DIR = "content" # same name for HF Hub
14
-
15
  set_seed(42)
16
- EPOCHS = 2
17
  LR = 2e-5
18
 
19
  # Load dataset
 
1
  # THIS file is meant to be used once hence not having functions just sequential code
2
+ # Fine-tuning code instructions are from https://huggingface.co/docs/transformers/en/tasks/language_modeling
3
+
4
+
5
  import pandas as pd
6
  from transformers import AutoTokenizer, set_seed
7
  from transformers import DataCollatorForLanguageModeling
 
12
  # CONST
13
  INP_DATASET_CSV = "clean_books_summary.csv"
14
  BASE_CASUAL_MODEL = "openai-community/gpt2"
15
+ TRAINED_MODEL_OUTPUT_DIR = "gpt2-book-summary-generator" # same name for HF Hub
 
 
16
  set_seed(42)
17
+ EPOCHS = 2  # set to 10 for the final model pushed to the HF Hub
18
  LR = 2e-5
19
 
20
  # Load dataset