You have a txt/pdf file, maybe 90,000 words (~300 pages), a book. You ask the model, let's say, "what is described in the chapter called XYZ in relation to person ZYX".
Now it searches for keywords or semantically similar terms in the document. If it has found them, let's say the words and meaning around “XYZ and ZYX”,
now a piece of text of about 1024 tokens around this word “XYZ/ZYX” is cut out at that point. (In reality, it is all done with encoded numbers per chunk, which is why you cannot search for single numbers or words, but that does not matter for the principle.)<br>
This text snippet is then used for your answer. <br>
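To make the principle concrete, here is a minimal sketch of such a retrieve step. It is illustrative only, not the actual code of any particular tool: the splitter counts words instead of real tokens, and the embedder name "all-MiniLM-L6-v2" and the top-k value are assumptions.

```python
# Minimal sketch: chunk the book, embed the chunks, rank them against the question.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder; any sentence embedder works

def chunk_text(text, chunk_size=1024, overlap=128):
    # naive splitter: ~chunk_size words per snippet with some overlap (real tools count tokens)
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def top_snippets(book_text, question, k=4):
    chunks = chunk_text(book_text)
    chunk_vecs = model.encode(chunks, convert_to_tensor=True)   # the "encoded numbers" per chunk
    query_vec = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(query_vec, chunk_vecs)[0]             # fast ranking against the question
    best = scores.topk(k=min(k, len(chunks)))
    return [chunks[int(i)] for i in best.indices]               # only these snippets go to the LLM
```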
<ul style="line-height: 1.05;">
<li>If, for example, the word “XYZ” occurs 50 times in one file, not all 50 are used for the answer; only a limited number of snippets, chosen by a fast ranking, are used</li>
...
# DOC/PDF to TXT<br>
Prepare your documents by yourself!<br>
Bad Input = bad Output!<br>
In most cases, it is not immediately obvious how the document is made available to the embedder. In ALLM it is "c:\Users\XXX\AppData\Roaming\anythingllm-desktop\storage\documents"; you can open the files there with a text editor to check the quality.
In nearly all cases images, tables, page numbers, chapters, formulas and section/paragraph formatting are not handled well.
You can start by simply saving the PDF as a TXT file, and you will then see in the TXT file how the embedding-model would see the content.
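For such a quick check you can, for example, dump the raw text with pypdf; a minimal sketch, assuming pypdf is installed and your file is called book.pdf:

```python
# Dump a PDF to TXT to see roughly what the embedder would get as input.
from pypdf import PdfReader

reader = PdfReader("book.pdf")
with open("book.txt", "w", encoding="utf-8") as out:
    for page in reader.pages:
        out.write(page.extract_text() or "")  # extract_text() can be empty for image-only pages
        out.write("\n")
```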
An easy start is to use a Python-based PDF parser (there are a lot of them), some of them also OCR-based for images.<br>
Option one, only for simple text/table conversion (a minimal sketch follows after this list):
<ul style="line-height: 1.05;">
<li>pdfplumber</li>
<li>fitz/PyMuPDF</li>
<li>Camelot</li>
</ul>
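As a starting point, a minimal sketch with pdfplumber that writes plain text plus simple tables to a TXT file (the file names are assumptions; Camelot and fitz/PyMuPDF have their own APIs):

```python
# pdfplumber sketch: plain text plus simple tables, written out page by page.
import pdfplumber

with pdfplumber.open("book.pdf") as pdf, open("book_plumber.txt", "w", encoding="utf-8") as out:
    for page in pdf.pages:
        out.write(page.extract_text() or "")
        out.write("\n")
        for table in page.extract_tables():  # each table is a list of rows, each row a list of cells
            for row in table:
                out.write(" | ".join(cell or "" for cell in row))
                out.write("\n")
```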
All in all, you can tune your code a lot, but the difficulties lie in the details.<br>
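Typical "details" are words hyphenated across line breaks, lines that contain only a page number, and runs of blank lines; a small illustrative post-processing sketch whose patterns you would have to adapt to your own documents:

```python
# Illustrative cleanup of extracted text; the regex patterns are examples, not a general solution.
import re

def clean_extracted_text(text: str) -> str:
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)           # rejoin words hyphenated at line breaks
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.M)    # drop lines that are only page numbers
    text = re.sub(r"\n{3,}", "\n\n", text)                 # collapse runs of blank lines
    return text.strip()
```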
My option: one exe for Windows and also Python, plus a second option with OCR:<br>
<a href="https://huggingface.co/kalle07/pdf2txt_parser_converter">https://huggingface.co/kalle07/pdf2txt_parser_converter</a>
<br><br>
Option two, OCR-based, from IBM (open source; a quickstart sketch follows after this list):
<ul style="line-height: 1.05;">
<li>docling - (open source on GitHub)</li>
</ul>
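A quickstart-style sketch along the lines of docling's documented example (check the repo for the current API; the file names are assumptions):

```python
# docling sketch: convert a PDF (with layout/OCR handling) and export it as Markdown.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("book.pdf")
with open("book_docling.md", "w", encoding="utf-8") as out:
    out.write(result.document.export_to_markdown())
```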
...
<br>
# Indexing-only option<br>
One hint for fast search across 10,000s of PDF/TXT/DOC files (it is only indexing, not embedding): you can use it as a simple way to find your top 5-10 articles or books, which you can then make available to an LLM.<br>
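Just to show the difference to embedding: plain indexing only maps words to files and counts hits, with no notion of meaning. A toy sketch of the idea (the tools linked below do this far better and at scale):

```python
# Toy keyword index: map each word to the files it occurs in, then rank files by hit count.
import re
from collections import Counter, defaultdict

def build_index(paths):
    index = defaultdict(Counter)                 # word -> {file: count}
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            for word in re.findall(r"\w+", f.read().lower()):
                index[word][path] += 1
    return index

def search(index, query, top=10):
    hits = Counter()
    for word in re.findall(r"\w+", query.lower()):
        hits.update(index.get(word, Counter()))
    return hits.most_common(top)                 # the top files you could then hand to an LLM
```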
JabRef - https://github.com/JabRef/jabref/tree/v6.0-alpha?tab=readme-ov-file <br>
https://builds.jabref.org/main/ <br>
or<br>