Luca Foppiano committed: Fix typo, acknowledge more contributors

README.md (changed)
## Introduction
Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta.
The Streamlit application demonstrates the implementation of RAG (Retrieval Augmented Generation) on scientific documents, which we are developing at NIMS (National Institute for Materials Science) in Tsukuba, Japan.
Unlike most similar projects, we focus on scientific articles.
We target only the full text, using [Grobid](https://github.com/kermitt2/grobid), which provides cleaner results than raw PDF2Text conversion (the approach used by most other solutions).

Additionally, this frontend visualises named entities in the LLM responses: <span style="color:yellow">physical quantities and measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span style="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).

The conversation is kept in a buffered sliding-window memory (the 4 most recent messages), and these messages are injected into the context as "previous messages".
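As a rough illustration, such a buffered sliding-window memory can be sketched as follows (the class and method names are hypothetical, not the application's actual code):

```python
from collections import deque


class SlidingWindowMemory:
    """Keeps only the most recent messages; older ones fall out of the window."""

    def __init__(self, window_size: int = 4):
        # A deque with maxlen silently discards the oldest entry when full.
        self.messages = deque(maxlen=window_size)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def as_context(self) -> str:
        # The retained messages are injected into the prompt as "previous messages".
        return "\n".join(f"{m['role']}: {m['content']}" for m in self.messages)
```

With a window of 4, adding a fifth message evicts the oldest one.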
(The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
## Getting started
- Select the model+embedding combination you want to use
- Enter your API Key ([OpenAI](https://platform.openai.com/account/api-keys) or [Huggingface](https://huggingface.co/docs/hub/security-tokens)).
- Upload a scientific article as a PDF document. You will see a spinner or loading indicator while the processing is in progress.
- Once the spinner stops, you can proceed to ask your questions


With default settings, each question uses around 1000 tokens.

### Chunk size
When uploaded, each document is split into blocks of a fixed size (250 tokens by default).
This setting allows users to modify the size of such blocks.
Smaller blocks will result in a smaller context, yielding more precise sections of the document.
Larger blocks will result in a larger, less constrained context around the question.
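The splitting itself can be sketched as below (using naive whitespace tokenization for illustration, rather than the model tokenizer a real pipeline would use):

```python
def split_into_chunks(text: str, chunk_size: int = 250) -> list[str]:
    """Split text into consecutive blocks of at most chunk_size tokens.

    Whitespace splitting stands in for a real tokenizer here.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]
```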
### Query mode
Indicates whether the question is sent to the LLM (Large Language Model) or to the vector storage.
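A minimal sketch of this routing (the mode names and the `llm`/`vector_store` interfaces are illustrative assumptions, not the application's real API):

```python
def answer(question, mode, llm, vector_store, k=4):
    # Vector-storage mode: return the most similar chunks directly, which is
    # useful for inspecting what the retriever would feed to the LLM.
    if mode == "embeddings":
        return vector_store.similarity_search(question, k=k)
    # LLM mode: retrieve context chunks first, then ask the model.
    context = vector_store.similarity_search(question, k=k)
    return llm.generate(question=question, context=context)
```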

### NER (Named Entity Recognition)
This feature is specifically crafted for people working with scientific documents in materials science.
It runs NER on the LLM response to identify mentions of materials and their properties (quantities, measurements).
This feature leverages both the [grobid-quantities](https://github.com/kermitt2/grobid-quantities) and [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors) external services.
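For illustration, highlighting the recognized spans in a response could look like the following (a hypothetical sketch; in practice the entity character offsets would come from the grobid services' responses):

```python
def highlight_entities(text: str, entities: list[tuple[int, int, str]]) -> str:
    """Wrap each recognized span in a colored <span> tag.

    Spans are applied right-to-left so earlier character offsets stay valid.
    """
    colors = {"quantity": "yellow", "material": "blue"}
    for start, end, etype in sorted(entities, key=lambda e: e[0], reverse=True):
        color = colors.get(etype, "gray")
        text = (
            text[:start]
            + f'<span style="color:{color}">{text[start:end]}</span>'
            + text[end:]
        )
    return text
```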

To use Docker:
- docker run `lfoppiano/document-insights-qa:{latest_version}`
- docker run `lfoppiano/document-insights-qa:latest-develop` for the latest development version

To install the library from PyPI:

## Acknowledgements
This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).
Contributions by Pedro Ortiz Suarez (@pjox) and Tomoya Mato (@t29mato).
Thanks also to [Patrice Lopez](https://www.science-miner.com), the author of [Grobid](https://github.com/kermitt2/grobid).