Update README.md
---
# <b>This is a collection of more than 25 types of embedding models and a really brief introduction to what you should know about embedding. If you don't keep a few things in mind, you won't be satisfied with the results.</b>
<br>
# <b>All models tested with ALLM (AnythingLLM) with LM-Studio as server; all models should also work with ollama</b>
<b>The setup for local documents described below is almost the same; GPT4All has only one model (nomic), and koboldcpp has no built-in embedder right now, but it is in development</b><br>
(sometimes the results are more truthful if the “chat with document only” option is used)<br>
BTW, the embedder model is only one part of a good RAG (Retrieval-Augmented Generation) setup<br>
<b>⇨</b> give me a ❤️, if you like ;)<br>
<br>
<b>My short impression:</b>
...
# Short usage hints (example for a large context with many expected hits):
Set the context length (Max Tokens) of your main LLM model to 16000t <b>(with LM-Studio as server for ALLM, you must also set this in the LM-Studio settings!)</b>, set your embedder model's (Max Embedding Chunk Length) to 1024t, and set (Max Context Snippets) to 14;
in ALLM, also set (Text Splitting & Chunking Preferences - Text Chunk Size) to 1024-character parts and (Search Preference) to "accuracy", and set 14 snippets in your workspace.
<br>
Hint: with ALLM, configure everything in LM-Studio, start both models there, and both will appear at the top in ALLM.<br>
-> OK, what does that mean?<br>
Your document will be embedded as x chunks (snippets) of 1024t each,<br>
so you can receive 14 snippets of 1024t (~14000t) from your document of ~10000 words (~10 pages), leaving ~2000t (of 16000t) for the answer, ~1000 words (~2 pages).
<br>
You can play with these settings for your needs, e.g. 8 snippets of 2048t, or 28 snippets of 512t ... (every time you change the chunk length, the document must be embedded again). With these settings everything fits best for ONE answer; if you need more room for a conversation, you should set lower values and/or disable the document.
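The snippet arithmetic above can be sketched as a quick back-of-the-envelope check. This is a minimal sketch using the example numbers from this section; the ~0.75 words-per-token ratio for English is a common rule of thumb, not a tool setting:

```python
# Back-of-the-envelope RAG context budget, using the example values above.
# ASSUMPTION: ~0.75 English words per token (rule of thumb, not a tool default).

def context_budget(context_len_t, chunk_len_t, max_snippets, words_per_token=0.75):
    """Return (tokens used by snippets, tokens left, approx. words left for the answer)."""
    retrieved = chunk_len_t * max_snippets   # tokens consumed by the retrieved snippets
    remaining = context_len_t - retrieved    # tokens left for the model's answer
    return retrieved, remaining, int(remaining * words_per_token)

used, left, answer_words = context_budget(16000, 1024, 14)
print(used, left, answer_words)  # 14336 1664 1248 -> roughly the ~14000t / ~2000t split above
```

Any combination where snippets × chunk length stays clearly under the context length leaves room for the answer; the check makes it easy to verify a new chunk/snippet pair before re-embedding.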
<ul style="line-height: 1.05;">
English vs. German differs by about 50% in tokens per word<br>
~5000 characters is one page of a book (no matter whether German or English). But if you calculate with words: German words are longer, which means more tokens per word.<br>
The example is English; for German you can add approx. 50% more tokens per word (1000 words ~1800t)<br>
<li>1200t (~1000 words, ~5000 characters) ~0.1GB VRAM usage; this is approx. one page with a small font</li>
<li>8000t (~6000 words) ~0.8GB VRAM usage</li>
...
<br>
# DOC/PDF to TXT
Prepare your documents yourself!<br>
Bad Input = bad Output!<br>
In most cases, it is not immediately obvious how the document is made available to the embedder.
In nearly all cases, images, tables, page numbers, chapters, formulas, and section/paragraph formatting are not handled well.
You can start by simply saving the PDF as a TXT file; the TXT file then shows you the content roughly as the embedding model will see it.
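To preview how such a TXT export will then be cut up, here is a minimal sketch of naive fixed-size chunking. The 1024-character size mirrors the ALLM example above; real text splitters are usually smarter (e.g. they respect sentence boundaries), so treat this only as a rough preview:

```python
# Naive fixed-size character chunking, as a preview of what the embedder receives.
# ASSUMPTION: plain character splitting; real splitters may respect sentence boundaries.

def preview_chunks(text, chunk_size=1024):
    """Split text into chunk_size-character pieces."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

sample = "lorem ipsum " * 500           # ~6000 characters of stand-in text
chunks = preview_chunks(sample)
print(len(chunks), len(chunks[0]))      # 6 chunks, the first one 1024 characters long
```

Looking at the first and last chunks quickly reveals whether page headers, footers, or broken tables ended up inside the snippets the embedder will see.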
An easy start is to use a Python-based PDF parser (there are plenty).<br>
an option for simple txt/table conversion only:
<ul style="line-height: 1.05;">
...
<br><br>
a larger option with many parser types to play with (UI-based):
<ul style="line-height: 1.05;">
<li>Parse my PDF</li>
</ul>
<a href="https://github.com/genieincodebottle/parsemypdf">https://github.com/genieincodebottle/parsemypdf</a><br>
<br>