CYFRAGOVPL/PLLuM-12B-nc-chat

MinistryofDigitalAffairs committed on Mar 11

Commit a3c3d0f · verified · 1 Parent(s): d599530

Update README.md

Files changed (1)

README.md +16 -0

@@ -141,8 +141,24 @@ zrób mi tę przyjemność i przyjdź wreszcie, proszę!
 ```
 Your results may vary depending on model parameters (e.g., temperature, top_k, top_p), hardware, and other settings.
+### 6. Retrieval Augmented Generation (RAG)
+Our Llama-PLLuM models (both chat and instruct versions) were additionally trained to perform well in a Retrieval Augmented Generation (RAG) setting. The prompt is a Jinja template in which `docs` is a list of document texts and `question` is the query to be answered from the provided documents. If the provided documents contain no answer, the model generates "Nie udało mi się odnaleźć odpowiedzi na pytanie" ("I was unable to find an answer to the question").
+Prompt:
+```
+Numerowana lista dokumentów jest poniżej:
+---------------------
+<results>{% for doc in docs %}
+Dokument: {{ loop.index0 }}
+{{ doc }}
+{% endfor %}</results>
+---------------------
+Odpowiedz na pytanie użytkownika wykorzystując tylko informacje znajdujące się w dokumentach, a nie wcześniejszą wiedzę.
+Udziel wysokiej jakości, poprawnej gramatycznie odpowiedzi w języku polskim. Odpowiedź powinna zawierać cytowania do dokumentów, z których pochodzą informacje. Zacytuj dokument za pomocą symbolu [nr_dokumentu] powołując się na fragment np. [0] dla fragmentu z dokumentu 0. Jeżeli w dokumentach nie ma informacji potrzebnych do odpowiedzi na pytanie, zamiast odpowiedzi zwróć tekst: "Nie udało mi się odnaleźć odpowiedzi na pytanie".
+Pytanie: {{ question }}
+```
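A minimal sketch of how the template above could be filled in, and how a reply could be post-processed, written in plain Python rather than with a Jinja engine. The names `build_rag_prompt`, `cited_documents`, and `is_no_answer` are hypothetical helpers, not part of the PLLuM release. The string layout assumes Jinja's default whitespace handling (no `trim_blocks`). How the resulting prompt is passed to the model (for example as a single user message) is not stated in the README and is left out here.

```python
import re

# Fallback sentence the README says the model returns when the documents
# do not contain an answer.
NO_ANSWER = "Nie udało mi się odnaleźć odpowiedzi na pytanie"


def build_rag_prompt(docs, question):
    """Fill the RAG template by hand (hypothetical helper).

    Mirrors the {% for doc in docs %} loop: every document is preceded by
    its 0-based index, matching {{ loop.index0 }} in the template.
    """
    results = "".join(f"\nDokument: {i}\n{doc}\n" for i, doc in enumerate(docs))
    return (
        "Numerowana lista dokumentów jest poniżej:\n"
        "---------------------\n"
        f"<results>{results}</results>\n"
        "---------------------\n"
        "Odpowiedz na pytanie użytkownika wykorzystując tylko informacje "
        "znajdujące się w dokumentach, a nie wcześniejszą wiedzę.\n"
        "Udziel wysokiej jakości, poprawnej gramatycznie odpowiedzi w języku "
        "polskim. Odpowiedź powinna zawierać cytowania do dokumentów, z których "
        "pochodzą informacje. Zacytuj dokument za pomocą symbolu [nr_dokumentu] "
        "powołując się na fragment np. [0] dla fragmentu z dokumentu 0. Jeżeli w "
        "dokumentach nie ma informacji potrzebnych do odpowiedzi na pytanie, "
        f'zamiast odpowiedzi zwróć tekst: "{NO_ANSWER}".\n'
        f"Pytanie: {question}"
    )


def cited_documents(answer):
    """Return the sorted, de-duplicated document indices cited as [n]."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})


def is_no_answer(answer):
    """True if the reply contains the fallback sentence from the prompt."""
    return NO_ANSWER in answer


if __name__ == "__main__":
    docs = ["Warszawa jest stolicą Polski.", "Wisła jest najdłuższą rzeką Polski."]
    print(build_rag_prompt(docs, "Jakie miasto jest stolicą Polski?"))
```

Because the template numbers documents from 0, an index returned by `cited_documents` can be used directly to look up the source text in the original `docs` list. Checking `is_no_answer` first avoids treating a fallback reply as a grounded answer.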
 ## Training Procedure
 - **Datasets**: ~150B tokens from Polish and multilingual sources, with ~28B tokens available for fully open-source commercial use.