Update leaderboard.md
leaderboard.md CHANGED +3 -3
@@ -19,10 +19,10 @@ Last updated on November 1st, 2023
|Google Palm-Chat|72.8 % |27.2 % |88.8 % |221.1|
## Model
-You can find the model used to compute this leaderboard open sourced for commercial use on hugging face
## Data
-See [leaderboard-summaries.csv](
## Methodology
To determine this leaderboard, we trained a model to detect hallucinations in LLM outputs, using various open source datasets from the factual consistency research into summarization models. This model is competitive with the best state-of-the-art models. We then fed 1000 short documents to each of the LLMs above via their public APIs and asked them to summarize each short document, using only the facts presented in the document. Of these 1000 documents, only 831 documents were summarized by every model; the remaining documents were rejected by at least one model due to content restrictions. Using these 831 documents, we then computed the overall accuracy (no hallucinations) and hallucination rate (100 - accuracy) for each model. The rate at which each model responds to the prompt rather than refusing is detailed in the 'Answer Rate' column. None of the content sent to the models contained illicit or 'not safe for work' content, but the presence of trigger words was enough to trigger some of the content filters. The documents were taken primarily from the [CNN / Daily Mail Corpus](https://huggingface.co/datasets/cnn_dailymail/viewer/1.0.0/test).
@@ -32,7 +32,7 @@ We evaluate summarization accuracy instead of overall factual accuracy because i
## Prompt Used
> You are a chat bot answering questions using data. You must stick to the answers provided solely by the text in the passage provided. You are asked the question 'Provide a concise summary of the following passage, covering the core pieces of information described.' <PASSAGE>'
-When calling the API, the <PASSAGE> token was then replaced with the source document (see the 'source' column in [leaderboard-summaries.csv](
## API Details
For GPT 3.5 we used the model name ```gpt-3.5-turbo``` in OpenAI's API, and ```gpt-4``` for GPT4; both were called through the ```ChatCompletion``` endpoint of the Python client library. For the 3 Llama models, we used the Anyscale hosted endpoints for each model. For the Cohere models, we used their ```/generate``` endpoint for *Cohere*, and ```/chat``` for *Cohere-Chat*. For Anthropic, we used the largest ```claude 2``` model they offer through their API. For the Mistral 7B model, we used the [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) model, hosted via Hugging Face's API. For Google Palm we used the ```text-bison-001``` model, and for Google Palm Chat we used ```chat-bison-001```.
@@ -19,10 +19,10 @@ Last updated on November 1st, 2023
|Google Palm-Chat|72.8 % |27.2 % |88.8 % |221.1|
## Model
+You can find the model used to compute this leaderboard, open sourced for commercial use, on [Hugging Face](https://huggingface.co/vectara/hallucination_evaluation_model), along with instructions on how to use the model.
## Data
+See [leaderboard-summaries.csv](leaderboard_summaries.csv) for the generated summaries we used to evaluate the models.
## Methodology
To determine this leaderboard, we trained a model to detect hallucinations in LLM outputs, using various open source datasets from the factual consistency research into summarization models. This model is competitive with the best state-of-the-art models. We then fed 1000 short documents to each of the LLMs above via their public APIs and asked them to summarize each short document, using only the facts presented in the document. Of these 1000 documents, only 831 documents were summarized by every model; the remaining documents were rejected by at least one model due to content restrictions. Using these 831 documents, we then computed the overall accuracy (no hallucinations) and hallucination rate (100 - accuracy) for each model. The rate at which each model responds to the prompt rather than refusing is detailed in the 'Answer Rate' column. None of the content sent to the models contained illicit or 'not safe for work' content, but the presence of trigger words was enough to trigger some of the content filters. The documents were taken primarily from the [CNN / Daily Mail Corpus](https://huggingface.co/datasets/cnn_dailymail/viewer/1.0.0/test).
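The accuracy, hallucination rate, and answer rate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the leaderboard: the input layout (`scores[model][doc_id]`, with `None` marking a refusal), the function name `leaderboard_metrics`, and the 0.5 cutoff for calling a summary consistent are all hypothetical, since the text does not specify them.

```python
from typing import Dict, Optional

# Hypothetical cutoff: the text does not say how a detector score is turned
# into a "consistent" / "hallucinated" label, so 0.5 is an assumption.
THRESHOLD = 0.5


def leaderboard_metrics(
    scores: Dict[str, Dict[str, Optional[float]]],
    threshold: float = THRESHOLD,
) -> Dict[str, Dict[str, float]]:
    """Compute accuracy, hallucination rate and answer rate per model.

    scores[model][doc_id] is a factual-consistency score in [0, 1],
    or None when the model refused to summarize that document.
    """
    # Every document that was sent to at least one model.
    all_docs = set()
    for per_doc in scores.values():
        all_docs.update(per_doc)

    # Keep only the documents that every model actually summarized.
    shared = {
        doc
        for doc in all_docs
        if all(per_doc.get(doc) is not None for per_doc in scores.values())
    }

    results = {}
    for model, per_doc in scores.items():
        answered = sum(1 for doc in all_docs if per_doc.get(doc) is not None)
        consistent = sum(1 for doc in shared if per_doc[doc] >= threshold)
        accuracy = 100.0 * consistent / len(shared) if shared else 0.0
        results[model] = {
            "accuracy": accuracy,
            "hallucination_rate": 100.0 - accuracy,
            "answer_rate": 100.0 * answered / len(all_docs) if all_docs else 0.0,
        }
    return results
```

Restricting accuracy to the shared documents mirrors the 831-document subset in the text, while the answer rate is taken over all documents sent, so a model that refuses often is not penalized twice.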
@@ -32,7 +32,7 @@ We evaluate summarization accuracy instead of overall factual accuracy because i
## Prompt Used
> You are a chat bot answering questions using data. You must stick to the answers provided solely by the text in the passage provided. You are asked the question 'Provide a concise summary of the following passage, covering the core pieces of information described.' <PASSAGE>'
+When calling the API, the <PASSAGE> token was then replaced with the source document (see the 'source' column in [leaderboard-summaries.csv](leaderboard_summaries.csv)).
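A minimal sketch of the token substitution described above, assuming plain string replacement. The constant `PROMPT_TEMPLATE` and the helper `build_prompt` are hypothetical names introduced for illustration; the template text is copied verbatim from the prompt quoted above, including its trailing quote character.

```python
# Prompt text copied verbatim from the "Prompt Used" section.
PROMPT_TEMPLATE = (
    "You are a chat bot answering questions using data. "
    "You must stick to the answers provided solely by the text in the "
    "passage provided. You are asked the question 'Provide a concise "
    "summary of the following passage, covering the core pieces of "
    "information described.' <PASSAGE>'"
)


def build_prompt(source_document: str, template: str = PROMPT_TEMPLATE) -> str:
    """Return the prompt with the <PASSAGE> token replaced by the document."""
    if "<PASSAGE>" not in template:
        raise ValueError("template has no <PASSAGE> token")
    return template.replace("<PASSAGE>", source_document)
```

In practice `source_document` would be one value from the 'source' column of the summaries CSV, and the returned string would be sent as the user message to each model's API.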
## API Details
For GPT 3.5 we used the model name ```gpt-3.5-turbo``` in OpenAI's API, and ```gpt-4``` for GPT4; both were called through the ```ChatCompletion``` endpoint of the Python client library. For the 3 Llama models, we used the Anyscale hosted endpoints for each model. For the Cohere models, we used their ```/generate``` endpoint for *Cohere*, and ```/chat``` for *Cohere-Chat*. For Anthropic, we used the largest ```claude 2``` model they offer through their API. For the Mistral 7B model, we used the [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) model, hosted via Hugging Face's API. For Google Palm we used the ```text-bison-001``` model, and for Google Palm Chat we used ```chat-bison-001```.