You have a txt/pdf file, maybe 90,000 words (~300 pages), a book. You ask the model, let's say, "what is described in the chapter called XYZ in relation to person ZYX".
Now it searches for keywords or semantically similar terms in the document. If it has found them, let's say the words and meaning around “XYZ and ZYX”,
now a piece of text of about 1024 tokens around this word “XYZ/ZYX” is cut out at that point. (In reality, it is all done with encoded numbers per chunk, which is why you cannot search for single numbers or words, but that does not matter for the principle.)<br>
This text snippet is then used for your answer. <br>
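To make the principle concrete, here is a minimal sketch of such a retrieve step. It is illustrative only, not the actual code of any particular tool: the splitter counts words instead of real tokens, and the embedder name "all-MiniLM-L6-v2" and the top-k value are assumptions.

```python
# Minimal sketch: chunk the book, embed the chunks, rank them against the question.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder; any sentence embedder works

def chunk_text(text, chunk_size=1024, overlap=128):
    # naive splitter: ~chunk_size words per snippet with some overlap (real tools count tokens)
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def top_snippets(book_text, question, k=4):
    chunks = chunk_text(book_text)
    chunk_vecs = model.encode(chunks, convert_to_tensor=True)   # the "encoded numbers" per chunk
    query_vec = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(query_vec, chunk_vecs)[0]             # fast ranking against the question
    best = scores.topk(k=min(k, len(chunks)))
    return [chunks[int(i)] for i in best.indices]               # only these snippets go to the LLM
```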
<ul style="line-height: 1.05;">
<li>If, for example, the word “XYZ” occurs 50 times in one file, not all 50 are used for the answer; only a limited number of snippets, chosen by a fast ranking, are used</li>
...
# DOC/PDF to TXT<br>
Prepare your documents by yourself!<br>
Bad Input = bad Output!<br>
In most cases, it is not immediately obvious how the document is made available to the embedder. In ALLM it is "c:\Users\XXX\AppData\Roaming\anythingllm-desktop\storage\documents"; you can open the files there with a text editor to check the quality.
In nearly all cases images, tables, page numbers, chapters, formulas and section/paragraph formatting are not handled well.
You can start by simply saving the PDF as a TXT file, and you will then see in the TXT file how the embedding-model would see the content.
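For such a quick check you can, for example, dump the raw text with pypdf; a minimal sketch, assuming pypdf is installed and your file is called book.pdf:

```python
# Dump a PDF to TXT to see roughly what the embedder would get as input.
from pypdf import PdfReader

reader = PdfReader("book.pdf")
with open("book.txt", "w", encoding="utf-8") as out:
    for page in reader.pages:
        out.write(page.extract_text() or "")  # extract_text() can be empty for image-only pages
        out.write("\n")
```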
An easy start is to use a Python-based PDF parser (there are a lot of them), some of them also OCR-based for images.<br>
Option one, only for simple text/table conversion (a minimal sketch follows after this list):
<ul style="line-height: 1.05;">
<li>pdfplumber</li>
<li>fitz/PyMuPDF</li>
<li>Camelot</li>
</ul>
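As a starting point, a minimal sketch with pdfplumber that writes plain text plus simple tables to a TXT file (the file names are assumptions; Camelot and fitz/PyMuPDF have their own APIs):

```python
# pdfplumber sketch: plain text plus simple tables, written out page by page.
import pdfplumber

with pdfplumber.open("book.pdf") as pdf, open("book_plumber.txt", "w", encoding="utf-8") as out:
    for page in pdf.pages:
        out.write(page.extract_text() or "")
        out.write("\n")
        for table in page.extract_tables():  # each table is a list of rows, each row a list of cells
            for row in table:
                out.write(" | ".join(cell or "" for cell in row))
                out.write("\n")
```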
All in all, you can tune your code a lot, but the difficulties lie in the details.<br>
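Typical "details" are words hyphenated across line breaks, lines that contain only a page number, and runs of blank lines; a small illustrative post-processing sketch whose patterns you would have to adapt to your own documents:

```python
# Illustrative cleanup of extracted text; the regex patterns are examples, not a general solution.
import re

def clean_extracted_text(text: str) -> str:
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)           # rejoin words hyphenated at line breaks
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.M)    # drop lines that are only page numbers
    text = re.sub(r"\n{3,}", "\n\n", text)                 # collapse runs of blank lines
    return text.strip()
```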
My option: one exe for Windows and also Python, plus a second option with OCR:<br>
<a href="https://huggingface.co/kalle07/pdf2txt_parser_converter">https://huggingface.co/kalle07/pdf2txt_parser_converter</a>
<br><br>
Option two, OCR-based, from IBM (open source; a quickstart sketch follows after this list):
<ul style="line-height: 1.05;">
<li>docling - (open source on GitHub)</li>
</ul>
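A quickstart-style sketch along the lines of docling's documented example (check the repo for the current API; the file names are assumptions):

```python
# docling sketch: convert a PDF (with layout/OCR handling) and export it as Markdown.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("book.pdf")
with open("book_docling.md", "w", encoding="utf-8") as out:
    out.write(result.document.export_to_markdown())
```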
...
<br>
# Indexing-only option<br>
One hint for fast search across 10,000s of PDF/TXT/DOC files (it is only indexing, not embedding): you can use it as a simple way to find your top 5-10 articles or books, which you can then make available to an LLM.<br>
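Just to show the difference to embedding: plain indexing only maps words to files and counts hits, with no notion of meaning. A toy sketch of the idea (the tools linked below do this far better and at scale):

```python
# Toy keyword index: map each word to the files it occurs in, then rank files by hit count.
import re
from collections import Counter, defaultdict

def build_index(paths):
    index = defaultdict(Counter)                 # word -> {file: count}
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            for word in re.findall(r"\w+", f.read().lower()):
                index[word][path] += 1
    return index

def search(index, query, top=10):
    hits = Counter()
    for word in re.findall(r"\w+", query.lower()):
        hits.update(index.get(word, Counter()))
    return hits.most_common(top)                 # the top files you could then hand to an LLM
```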
JabRef - https://github.com/JabRef/jabref/tree/v6.0-alpha?tab=readme-ov-file <br>
https://builds.jabref.org/main/ <br>
or<br>