kalle07 committed
Commit a9f7b92 · verified · 1 Parent(s): 6eda2e5

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -96,7 +96,7 @@ and a Vram calculator - (you need the original model link NOT the GGUF)<br>
 
 You have a txt/pdf file of maybe 90,000 words (~300 pages), let's say a book. You ask the model, for example, "what is described in the chapter called XYZ in relation to person ZYX".
 Now it searches for keywords or semantically similar terms in the document. If it has found them, let's say the word and meaning around “XYZ and ZYX”,
- now a piece of text of about 1024 tokens around this word “XYZ/ZYX” is cut out at this point. (In reality, it's all done with coded numbers, but that doesn't matter - that's the principle.)<br>
+ now a piece of text of about 1024 tokens around this word “XYZ/ZYX” is cut out at this point. (In reality, it's all done with coded numbers per chunk, which is why you cannot search for single numbers or words, but that doesn't matter - that's the principle.)<br>
 This text snippet is then used for your answer. <br>
 <ul style="line-height: 1.05;">
 <li>If, for example, the word “XYZ” occurs 50 times in one file, not all 50 are used for the answer; only a limited number of snippets, selected by a fast ranking, are used</li>
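To make the retrieve-and-rank principle above concrete, here is a minimal sketch. It is not what ALLM runs internally; the embedding model, the word-based chunking (a stand-in for the ~1024-token chunks), the file name and the top-k value are all assumptions for illustration.

```python
# Minimal sketch of the retrieval principle described above -- NOT the actual
# ALLM implementation. Model name, chunk size, overlap, file name and top-k
# are illustrative assumptions.  (pip install sentence-transformers)
from sentence_transformers import SentenceTransformer, util

def chunk_text(text: str, size: int = 1024, overlap: int = 128) -> list[str]:
    """Cut the document into overlapping pieces of roughly `size` words
    (a stand-in for the ~1024-token chunks mentioned above)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")        # small embedding model (assumption)

document = open("book.txt", encoding="utf-8").read()   # placeholder file
chunks = chunk_text(document)

question = "What is described in the chapter called XYZ in relation to person ZYX?"
chunk_emb = model.encode(chunks, convert_to_tensor=True)      # the "coded numbers" per chunk
question_emb = model.encode(question, convert_to_tensor=True)

scores = util.cos_sim(question_emb, chunk_emb)[0]             # fast similarity ranking
top = scores.topk(k=min(5, len(chunks)))                      # only the best snippets are used
for score, idx in zip(top.values, top.indices):
    print(f"{float(score):.3f}  {chunks[int(idx)][:80]}...")
```

In a real setup the chunking is token-based and the embeddings sit in a vector store, but the ranking step is the same idea.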
@@ -155,22 +155,22 @@ btw. <b>Jinja</b> templates very new ... the usual templates with usual models a
 # DOC/PDF to TXT<br>
 Prepare your documents yourself!<br>
 Bad input = bad output!<br>
- In most cases, it is not immediately obvious how the document is made available to the embedder.
+ In most cases, it is not immediately obvious how the document is made available to the embedder. In ALLM it is "c:\Users\XXX\AppData\Roaming\anythingllm-desktop\storage\documents"; you can open the files there with a text editor to check the quality.
 In nearly all cases, images, tables, page numbers, chapters, formulas and section/paragraph formatting are not handled well.
 You can start by simply saving the PDF as a TXT file; you will then see in the TXT file how the embedding model would see the content.
- An easy start is to use a Python-based PDF parser (there are plenty).<br>
- Option, only for simple text/table conversion:
+ An easy start is to use a Python-based PDF parser (there are plenty), also OCR-based for images.<br>
+ Option one, only for simple text/table conversion:
 <ul style="line-height: 1.05;">
 <li>pdfplumber</li>
 <li>fitz/PyMuPDF</li>
 <li>Camelot</li>
 </ul>
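For the first of these options, a minimal pdfplumber sketch (file names are placeholders; pdfplumber only reads the text layer, so scanned pages come out empty) that writes each page's text and any detected tables into a TXT file, roughly what the embedder gets to see:

```python
# Minimal sketch: PDF -> TXT with pdfplumber (pip install pdfplumber).
# "document.pdf" and "document.txt" are placeholder names.
import pdfplumber

with pdfplumber.open("document.pdf") as pdf, open("document.txt", "w", encoding="utf-8") as out:
    for page in pdf.pages:
        text = page.extract_text() or ""        # None on pages without a text layer
        out.write(text + "\n\n")                # blank line between pages
        for table in page.extract_tables():     # each table is a list of rows
            for row in table:
                out.write(" | ".join(cell or "" for cell in row) + "\n")
```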
- All in all, you can tune your code a lot, and you can manually add OCR.<br>
- My option:<br>
+ All in all, you can tune your code a lot, but the difficulties lie in the details.<br>
+ My option, one exe for Windows and also Python, plus a second option with OCR:<br>
 <a href="https://huggingface.co/kalle07/pdf2txt_parser_converter">https://huggingface.co/kalle07/pdf2txt_parser_converter</a>

 <br><br>
- Option: an all-in-one solution for the future:
+ Option: OCR from IBM (open source):
 <ul style="line-height: 1.05;">
 <li>docling - (opensource on github)</li>
 </ul>
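For the docling option, a minimal sketch following docling's published quickstart (file names are placeholders; OCR and layout settings are left at the pipeline defaults):

```python
# Minimal docling sketch (pip install docling); "document.pdf"/"document.md"
# are placeholders and the conversion pipeline is left at its defaults.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")          # layout analysis, tables, OCR as configured
markdown = result.document.export_to_markdown()     # structured text for the embedder

with open("document.md", "w", encoding="utf-8") as out:
    out.write(markdown)
```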
@@ -189,7 +189,7 @@ large option to play with many types of (UI-Based)
 ...
 <br>
 # only Indexing option<br>
- One hint for fast search across 10,000s of PDFs (it is only indexing, not embedding): you can use it as a simple way to find your top 5-10 articles or books and then make these available to an LLM.<br>
+ One hint for fast search across 10,000s of PDF/TXT/DOC files (it is only indexing, not embedding): you can use it as a simple way to find your top 5-10 articles or books and then make these available to an LLM.<br>
 Jabref - https://github.com/JabRef/jabref/tree/v6.0-alpha?tab=readme-ov-file <br>
 https://builds.jabref.org/main/ <br>
 or<br>
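JabRef itself is a GUI tool, but to illustrate what plain indexing (as opposed to embedding) looks like, here is a small sketch using SQLite's built-in FTS5 full-text index. The folder name and query are placeholders, and it assumes you have already converted the PDFs to TXT as described above.

```python
# Sketch of plain full-text indexing (no embeddings), a stand-in for what an
# indexing tool does at scale. Uses SQLite FTS5 (included in standard Python builds).
import sqlite3
from pathlib import Path

con = sqlite3.connect("index.db")
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path, body)")

# index all TXT files in a folder (placeholder folder name)
for txt in Path("my_library").glob("*.txt"):
    con.execute("INSERT INTO docs (path, body) VALUES (?, ?)",
                (str(txt), txt.read_text(encoding="utf-8", errors="ignore")))
con.commit()

# keyword search: print the top 10 best-matching files to hand to an LLM
query = "XYZ ZYX"
for (path,) in con.execute(
        "SELECT path FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT 10", (query,)):
    print(path)
```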
 