Deepak Sahu committed on
Commit e446a52 · 1 Parent(s): 01e2b4e

update content

.resources/eval1.png ADDED

Git LFS Details

  • SHA256: 2a9c07521e3749a596bc0c4d11e3e49c307aa7ff79fc2977bb37b29351c39b0b
  • Pointer size: 130 Bytes
  • Size of remote file: 36.8 kB
.resources/fine-tune2.png ADDED

Git LFS Details

  • SHA256: 2d843b127ad5aad8cb9650aa5bd52a69aaedb04302df731fc43a122a868441cd
  • Pointer size: 130 Bytes
  • Size of remote file: 50.1 kB
README.md CHANGED
@@ -36,9 +36,13 @@ Try it out: https://huggingface.co/spaces/LunaticMaestro/book-recommender
36
  - Pipeline walkthrough in detail
37
 
38
  *For each part of the pipeline there is a separate script which needs to be executed; it is mentioned in the respective section along with output screenshots.*
39
- - Training
40
  - [Step 1: Data Clean](#step-1-data-clean)
 
 
41
 
 
 
42
  ## Running Inference Locally
43
 
44
  ### Memory Requirements
@@ -79,7 +83,7 @@ Modify app.py edit line 93 to `demo.launch(share=True)` then run following in ce
79
 
80
  References:
81
  - This is the core idea: https://arxiv.org/abs/2212.10496
82
- - https://github.com/aws-samples/content-based-item-recommender
83
  - For the future, a very complex work: https://github.com/HKUDS/LLMRec
84
 
85
  ## Training Steps
@@ -149,16 +153,20 @@ Output: `app_cache/summary_vectors.npy`
149
 
150
  ### Step 3: Fine-tune GPT-2 to Hallucinate but with some bounds.
151
 
152
- Lets address the **Hypothetical** part of HyDE approach. Its all about generating random summaries,in short hallucinating. While the **Document Extraction** (part of HyDE) is about using these hallucinated summaries to do semantic search on database.
 
 
 
 
 
 
 
153
 
154
- Two very important reasons why to fine-tune GPT-2
155
  1. We want it to hallucinate, but within boundaries, i.e. speak the words/language that we have in books_summaries.csv, NOT wildly out-of-this-world content.
156
 
157
  2. Prompt-tune it so that we can get consistent results. (Screenshot from https://huggingface.co/openai-community/gpt2); the screenshot shows the base model is only mildly consistent.
158
 
159
- ![image](https://github.com/user-attachments/assets/1b974da8-799b-48b8-8df7-be17a612f666)
160
-
161
- > we are going to use ``clean_books_summary.csv` dataset in this training to align with the prompt of ingesting different genre.
162
 
163
  Reference:
164
  - HyDE Approach, Precise Zero-Shot Dense Retrieval without Relevance Labels https://arxiv.org/pdf/2212.10496
@@ -168,32 +176,39 @@ Reference:
168
  - His code base is too large; it could be edited, but it's not worth the effort.
169
  - Fine-tuning code instructions are from https://huggingface.co/docs/transformers/en/tasks/language_modeling
170
 
171
- Command
172
 
173
- You must supply your token from huggingface, required to push model to HF
174
 
175
- ```SH
176
- huggingface-cli login
177
- ```
 
 
 
 
 
 
 
 
 
178
 
179
  We are going to use dataset `clean_books_summary.csv` while triggering this training.
180
 
181
  ```SH
182
  python z_finetune_gpt.py
183
  ```
184
- (Training lasts ~30 mins for 10 epochs with T4 GPU)
185
 
186
- ![image](https://github.com/user-attachments/assets/46253d48-903a-4977-b3f5-39ea1e6a6fd6)
187
 
 
188
 
189
- The loss you see is cross-entryopy loss; as ref in the fine-tuning instructions (see above reference) states : `Transformers models all have a default task-relevant loss function, so you don’t need to specify one `
190
 
191
- ![image](https://github.com/user-attachments/assets/13e9b868-6352-490c-9803-c5e49f8e8ae8)
192
 
193
  So all we care about is that the lower the value, the better the model is trained :)
194
 
195
- We are NOT going to test this unit model for some test dataset as the model is already proven (its GPT-2 duh!!).
196
- But **we are going to evaluate our HyDE approach end-2-end next to ensure sanity of the approach**.
197
 
198
  ## Evaluation
199
 
 
36
  - Pipeline walkthrough in detail
37
 
38
  *For each part of the pipeline there is a separate script which needs to be executed; it is mentioned in the respective section along with output screenshots.*
39
+ - [Training](#training-steps)
40
  - [Step 1: Data Clean](#step-1-data-clean)
41
+ - [Step 2: Generate vectors of the books summaries](#step-2-generate-vectors-of-the-books-summaries)
42
+ - [Step 3: Fine-tune GPT-2 to Hallucinate but with some bounds.](#step-3-fine-tune-gpt-2-to-hallucinate-but-with-some-bounds)
43
 
44
+ - [Evaluation](#evaluation)
45
+ - Inference
46
  ## Running Inference Locally
47
 
48
  ### Memory Requirements
 
83
 
84
  References:
85
  - This is the core idea: https://arxiv.org/abs/2212.10496
86
+ - Another work based on the same idea: https://github.com/aws-samples/content-based-item-recommender
87
  - For the future, a very complex work: https://github.com/HKUDS/LLMRec
88
 
89
  ## Training Steps
 
153
 
154
  ### Step 3: Fine-tune GPT-2 to Hallucinate but with some bounds.
155
 
156
+ **What & Why**
157
+
158
+ Hypothetical Document Extraction (HyDE) in a nutshell:
159
+ - The **Hypothetical** part of the HyDE approach is all about generating random summaries, in short, hallucinating. **This is why the approach will work for new book titles.**
160
+ - The **Document Extraction** part of HyDE is about using these hallucinated summaries to do a semantic search on the database (see the sketch below).
161
+
162
+
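A minimal sketch of those two steps, for orientation only: it assumes the fine-tuned generator pushed to the Hub (`LunaticMaestro/gpt2-book-summary-generator`) and the vectors cached in Step 2; the generation settings here are illustrative, and the real inference code lives in `app.py`.

```python
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# 1. "Hypothetical": hallucinate a plausible summary for the user's query.
generator = pipeline("text-generation", model="LunaticMaestro/gpt2-book-summary-generator")
hypothetical = generator("A mystery novel set in Victorian London", max_new_tokens=60)[0]["generated_text"]

# 2. "Document Extraction": embed the hallucinated summary and search the real ones.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = embedder.encode([hypothetical])                 # shape (1, 384)
summary_vectors = np.load("app_cache/summary_vectors.npy")  # cached in Step 2

# Cosine similarity against every stored summary vector.
sims = (query_vec @ summary_vectors.T) / (
    np.linalg.norm(query_vec, axis=1, keepdims=True) * np.linalg.norm(summary_vectors, axis=1)
)
top_k = np.argsort(-sims[0])[:5]
print(top_k)  # row indices into unique_titles_books_summary.csv
```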
163
+ **Why fine-tune GPT-2**
164
 
 
165
  1. We want it to hallucinate, but within boundaries, i.e. speak the words/language that we have in books_summaries.csv, NOT wildly out-of-this-world content.
166
 
167
  2. Prompt-tune it so that we can get consistent results. (Screenshot from https://huggingface.co/openai-community/gpt2); the screenshot shows the base model is only mildly consistent.
168
 
169
+ ![image](.resources/fine-tune.png)
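To see point 2 in practice, here is a tiny sketch of prompt-conditioned generation with the base checkpoint; the prompt template below is only an assumption for illustration, the actual prompt used for training is defined in `z_finetune_gpt.py`.

```python
from transformers import pipeline, set_seed

set_seed(42)  # same seeding as z_finetune_gpt.py, for repeatable outputs
generator = pipeline("text-generation", model="openai-community/gpt2")

# Illustrative prompt only; the real training prompt lives in z_finetune_gpt.py.
prompt = "Genre: fantasy. Book summary:"
for out in generator(prompt, max_new_tokens=40, num_return_sequences=2, do_sample=True):
    print(out["generated_text"])
```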
 
 
170
 
171
  Reference:
172
  - HyDE Approach, Precise Zero-Shot Dense Retrieval without Relevance Labels https://arxiv.org/pdf/2212.10496
 
176
  - His code base is too large; it could be edited, but it's not worth the effort.
177
  - Fine-tuning code instructions are from https://huggingface.co/docs/transformers/en/tasks/language_modeling
178
 
179
+ **RUN**
180
 
 
181
 
182
+ If you want to:
183
+
184
+ - Push to HF: you must supply your Hugging Face token, which is required to push the model to the HF Hub
185
+
186
+ ```SH
187
+ huggingface-cli login
188
+ ```
189
+
190
+ - Not push to HF: then in `z_finetune_gpt.py` (see the sketch below):
191
+
192
+ - set line 59 `push_to_hub` to `False`
193
+ - comment out line 77 `trainer.push_to_hub()`
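Illustrative only (the real values live at the referenced lines of `z_finetune_gpt.py`): the toggle maps to the standard `TrainingArguments` flag, roughly like this.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-book-summary-generator",  # also used as the HF Hub repo name
    push_to_hub=False,  # line 59: False trains purely locally, True needs `huggingface-cli login`
)
# ...and comment out the final `trainer.push_to_hub()` call (line 77).
```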
194
 
195
  We are going to use dataset `clean_books_summary.csv` while triggering this training.
196
 
197
  ```SH
198
  python z_finetune_gpt.py
199
  ```
 
200
 
201
+ The image below shows only 2 epochs, but the model pushed to my HF repo https://huggingface.co/LunaticMaestro/gpt2-book-summary-generator was trained for 10 epochs (~30 mins on a T4 GPU), **reducing the loss to ~0.87 (perplexity ≈ 2.38)**.
202
 
203
+ ![image](.resources/fine-tune2.png)
204
 
 
205
 
206
+ The loss you see is the cross-entropy loss; as the [fine-tuning instructions](https://huggingface.co/docs/transformers/en/tasks/language_modeling) state: `Transformers models all have a default task-relevant loss function, so you don’t need to specify one`
207
 
208
  So all we care about is that the lower the value, the better the model is trained :)
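For intuition, the perplexity quoted above is just the exponential of the cross-entropy loss, so the two numbers are consistent:

```python
import math

final_loss = 0.87            # cross-entropy loss reported above
print(math.exp(final_loss))  # ~2.39, matching the quoted perplexity of ~2.38
```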
209
 
210
+ We are NOT going to test this model on a separate test dataset, as the base model is already proven (it's GPT-2, duh!!).
211
+ But **we are going to evaluate our HyDE approach end-to-end next to ensure the sanity of the approach**, which will inherently validate this model.
212
 
213
  ## Evaluation
214
 
z_embedding.py CHANGED
@@ -3,15 +3,18 @@ import pandas as pd
3
  import numpy as np
4
  from z_utils import get_dataframe
5
  from tqdm import tqdm
 
6
 
7
  # CONST
8
  EMB_MODEL = "all-MiniLM-L6-v2"
9
  INP_DATASET_CSV = "unique_titles_books_summary.csv"
10
  CACHE_SUMMARY_EMB_NPY = "app_cache/summary_vectors.npy"
11
 
 
12
  model = None
13
 
14
  def load_model():
 
15
  global model
16
  if model is None:
17
  model = SentenceTransformer(EMB_MODEL)
@@ -53,15 +56,21 @@ def dataframe_compute_summary_vector(books_df: pd.DataFrame) -> np.ndarray:
53
 
54
  return summary_vectors
55
 
56
- def get_embeddings(summaries: list[str], model = None) -> np.ndarray:
57
  '''Utility function to take in hypothetical document(s) and return their embedding(s)
 
 
 
 
 
 
 
58
  '''
59
  model = model if model else load_model()
60
  if isinstance(summaries, str):
61
  summaries = [summaries, ]
62
  return model.encode(summaries)
63
 
64
-
65
  def cache_create_embeddings(books_csv_path: str, output_path: str) -> None:
66
  '''Read the books csv and generate vectors of the `summaries` columns and store in `output_path`
67
  '''
@@ -70,7 +79,6 @@ def cache_create_embeddings(books_csv_path: str, output_path: str) -> None:
70
  np.save(file=output_path, arr=vectors)
71
  print(f"Vectors saved to {output_path}")
72
 
73
-
74
  if __name__ == "__main__":
75
  print("Generating vectors of the summaries")
76
  cache_create_embeddings(books_csv_path=INP_DATASET_CSV, output_path=CACHE_SUMMARY_EMB_NPY)
 
3
  import numpy as np
4
  from z_utils import get_dataframe
5
  from tqdm import tqdm
6
+ from typing import Any
7
 
8
  # CONST
9
  EMB_MODEL = "all-MiniLM-L6-v2"
10
  INP_DATASET_CSV = "unique_titles_books_summary.csv"
11
  CACHE_SUMMARY_EMB_NPY = "app_cache/summary_vectors.npy"
12
 
13
+ # GLOBAL VAR
14
  model = None
15
 
16
  def load_model():
17
+ '''Workaround for HF Space; cross-script loading is slow.'''
18
  global model
19
  if model is None:
20
  model = SentenceTransformer(EMB_MODEL)
 
56
 
57
  return summary_vectors
58
 
59
+ def get_embeddings(summaries: list[str], model: Any = None) -> np.ndarray:
60
  '''Utility function to take in hypothetical document(s) and return their embedding(s)
61
+
62
+ Args:
63
+ summaries: different hypothetical summaries
64
+ model: The embedding model; see `app.py` for faster HF cross-script loading.
65
+
66
+ Returns:
67
+ list of embeddings
68
  '''
69
  model = model if model else load_model()
70
  if isinstance(summaries, str):
71
  summaries = [summaries, ]
72
  return model.encode(summaries)
73
 
 
74
  def cache_create_embeddings(books_csv_path: str, output_path: str) -> None:
75
  '''Read the books csv and generate vectors of the `summaries` columns and store in `output_path`
76
  '''
 
79
  np.save(file=output_path, arr=vectors)
80
  print(f"Vectors saved to {output_path}")
81
 
 
82
  if __name__ == "__main__":
83
  print("Generating vectors of the summaries")
84
  cache_create_embeddings(books_csv_path=INP_DATASET_CSV, output_path=CACHE_SUMMARY_EMB_NPY)
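For reference, a minimal usage sketch of the updated helper (assuming `z_embedding.py` is importable and the sentence-transformers model can be downloaded):

```python
from z_embedding import get_embeddings

# A single string is also accepted; it gets wrapped into a list internally.
vectors = get_embeddings(["A detective investigates a theft in a small coastal town."])
print(vectors.shape)  # (1, 384) for all-MiniLM-L6-v2
```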
z_finetune_gpt.py CHANGED
@@ -1,4 +1,7 @@
1
  # THIS file is meant to be used once hence not having functions just sequential code
 
 
 
2
  import pandas as pd
3
  from transformers import AutoTokenizer, set_seed
4
  from transformers import DataCollatorForLanguageModeling
@@ -9,11 +12,9 @@ from z_utils import get_dataframe
9
  # CONST
10
  INP_DATASET_CSV = "clean_books_summary.csv"
11
  BASE_CASUAL_MODEL = "openai-community/gpt2"
12
- # TRAINED_MODEL_OUTPUT_DIR = "gpt2-book-summary-generator" # same name for HF Hub
13
- TRAINED_MODEL_OUTPUT_DIR = "content" # same name for HF Hub
14
-
15
  set_seed(42)
16
- EPOCHS = 2
17
  LR = 2e-5
18
 
19
  # Load dataset
 
1
  # THIS file is meant to be used once hence not having functions just sequential code
2
+ # Fine-tuning code instructions are from https://huggingface.co/docs/transformers/en/tasks/language_modeling
3
+
4
+
5
  import pandas as pd
6
  from transformers import AutoTokenizer, set_seed
7
  from transformers import DataCollatorForLanguageModeling
 
12
  # CONST
13
  INP_DATASET_CSV = "clean_books_summary.csv"
14
  BASE_CASUAL_MODEL = "openai-community/gpt2"
15
+ TRAINED_MODEL_OUTPUT_DIR = "gpt2-book-summary-generator" # same name for HF Hub
 
 
16
  set_seed(42)
17
+ EPOCHS = 2  # set to 10 for the final model pushed to the HF Hub
18
  LR = 2e-5
19
 
20
  # Load dataset