kalle07 committed
Commit 6eda2e5 · verified · 1 Parent(s): f3492c9

Update README.md

Files changed (1):
  1. README.md +12 -11
README.md CHANGED

---

# <b>This is a collection of more than 25 types of embedding models and a brief introduction to what you should know about embedding. If you don't keep a few things in mind, you won't be satisfied with the results.</b>
<br>

# <b>All models were tested with ALLM (AnythingLLM) using LM-Studio as the server; all models should also work with ollama</b>
<b>The setup for local documents described below is almost the same everywhere; GPT4All has only one model (nomic), and koboldcpp has no built-in embedder yet, but one is in development</b><br>
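If you want to quickly check an embedding model under ollama, here is a minimal sketch (the model name nomic-embed-text and the default endpoint are assumptions; adjust them to your setup):

```python
# Minimal sketch: request one embedding from a locally running ollama server.
# Assumes ollama listens on its default port and that an embedding model
# (here nomic-embed-text, as an example) has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "test sentence"},
)
resp.raise_for_status()
vector = resp.json()["embedding"]
print(len(vector))  # embedding dimension of the model
```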

(sometimes the results are more truthful if the “chat with document only” option is used)<br>
BTW, the embedding model is only one part of a good RAG (Retrieval-Augmented Generation) pipeline<br>
<b>&#x21e8;</b> give me a ❤️, if you like ;)<br>
<br>
<b>My short impression:</b>
 
...

# Short usage hints (example for a large context with many expected hits):
Set the context length (Max Tokens) of your main LLM model to 16000 tokens <b>(with LM-Studio as server for ALLM, you must also set this in the LM-Studio settings!)</b>, set your embedding model's Max Embedding Chunk Length to 1024 tokens, and set Max Context Snippets to 14;
in ALLM, also set Text Splitting & Chunking Preferences - Text Chunk Size to 1024 characters and Search Preference to "accuracy", and set 14 snippets in your workspace.
<br>
Hint for ALLM: configure everything in LM-Studio, start both models there, and both will then be at the top in ALLM.<br>

-> OK, what does that mean?<br>
Your document will be embedded as x chunks (snippets) of 1024 tokens each.<br>
You can retrieve 14 snippets of 1024 tokens each (~14000 tokens) from your document, about 10000 words (10 pages), leaving ~2000 of the 16000 tokens for the answer, about 1000 words (2 pages).
<br>
You can tune this to your needs, e.g. 8 snippets of 2048 tokens, or 28 snippets of 512 tokens ... (every time you change the chunk length, the document must be embedded again). With these settings everything fits best for ONE answer; if you need room for a conversation, set the values lower and/or disable the document.
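If you want to check the arithmetic, here is a minimal Python sketch of the token budget above (the numbers are just the example settings, not fixed values):

```python
# Minimal sketch of the token budget from the example above.
context_length = 16000   # Max Tokens of the main LLM model
chunk_tokens = 1024      # Max Embedding Chunk Length of the embedding model
max_snippets = 14        # Max Context Snippets in ALLM

snippet_budget = max_snippets * chunk_tokens      # ~14000 tokens of snippets
answer_budget = context_length - snippet_budget   # ~2000 tokens for the answer

# Rough rule of thumb from this section: ~1.4 tokens per English word,
# about 50% more per word for German (German words are longer).
tokens_per_word = 1.4
print(f"snippets: {snippet_budget} tokens, answer: {answer_budget} tokens")
print(f"answer fits ~{answer_budget / tokens_per_word:.0f} English words")
```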
<ul style="line-height: 1.05;">
English vs. German differs by about 50% in tokens per word,<br>
but ~5000 characters is one page of a book (regardless of German/English). If you calculate with words instead: German words are longer, which means more tokens per word.<br>
The example above is English; for German you can add approximately 50% more tokens per word (1000 words ~1800 tokens)<br>
<li>1200 tokens (~1000 words, ~5000 characters) ~0.1GB VRAM usage; this is approximately one page with a small font</li>
<li>8000 tokens (~6000 words) ~0.8GB VRAM usage</li>
 

...
<br>
# DOC/PDF to TXT<br>
Prepare your documents yourself!<br>
Bad input = bad output!<br>
In most cases, it is not immediately obvious how the document is made available to the embedder.
In nearly all cases, images, tables, page numbers, chapters, formulas, and section/paragraph formatting are not handled well.
You can start by simply saving the PDF as a TXT file; you will then see in the TXT file how the embedding model would see the content.
An easy start is to use a Python-based PDF parser (there are plenty), like the sketch below.<br>
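A minimal sketch, assuming the pypdf package (pip install pypdf); the file names are placeholders:

```python
# Minimal sketch: dump a PDF to plain text so you can see roughly what
# the embedding model will see. "input.pdf"/"input.txt" are placeholders.
from pypdf import PdfReader

reader = PdfReader("input.pdf")
with open("input.txt", "w", encoding="utf-8") as out:
    for page in reader.pages:
        out.write(page.extract_text() or "")
        out.write("\n")
```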
An option only for simple text/table conversion:
<ul style="line-height: 1.05;">
 
<br><br>
A large option to play with many parser types (UI-based):
<ul style="line-height: 1.05;">
<li>Parse my PDF</li>
</ul>
<a href="https://github.com/genieincodebottle/parsemypdf">https://github.com/genieincodebottle/parsemypdf</a><br>
  <br>