---
library_name: transformers
license: cc-by-nc-4.0
datasets:
  - oumi-ai/oumi-anli-subset
  - oumi-ai/oumi-c2d-d2c-subset
  - oumi-ai/oumi-synthetic-claims
  - oumi-ai/oumi-synthetic-document-claims
language:
  - en
base_model:
  - meta-llama/Llama-3.1-8B-Instruct
---


oumi-ai/HallOumi-8B

Introducing HallOumi-8B, a SOTA hallucination detection model, outperforming DeepSeek R1, OpenAI o1, Google Gemini 1.5 Pro, and Claude Sonnet 3.5 at only 8 billion parameters!

Give HallOumi a try now!

| Model                 | Macro F1 Score | Open?             | Model Size |
|-----------------------|----------------|-------------------|------------|
| HallOumi-8B           | 77.2% ± 2.2%   | Truly Open Source | 8B         |
| Claude Sonnet 3.5     | 69.6% ± 2.8%   | Closed            | ??         |
| OpenAI o1-preview     | 65.9% ± 2.3%   | Closed            | ??         |
| DeepSeek R1           | 61.6% ± 2.5%   | Open Weights      | 671B       |
| Llama 3.1 405B        | 58.8% ± 2.4%   | Open Weights      | 405B       |
| Google Gemini 1.5 Pro | 48.2% ± 1.8%   | Closed            | ??         |

HallOumi, the hallucination detection model built with Oumi, enables per-sentence verification of any content (AI- or human-generated) with sentence-level citations and human-readable explanations. For example, given one or more context documents and an AI-generated summary, HallOumi goes through every claim made in the summary and identifies (see the illustrative sketch after this list):

  • A determination of whether the claim is supported or unsupported by the provided context, along with a confidence score.
  • The relevant context sentences associated with the claim, to facilitate human review.
  • An explanation of why the claim is supported or unsupported, to boost human review accuracy. Some hallucinations are nuanced and hard for humans to catch without help.
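As a rough illustration of the information returned for each claim, the record below is a hypothetical container, not the model's raw output format; the field names are assumptions chosen for readability.

from dataclasses import dataclass, field

@dataclass
class ClaimVerification:
    # Hypothetical container for HallOumi's per-claim outputs; for illustration only.
    claim: str                  # the response sentence being checked
    supported: bool             # whether the claim is supported by the provided context
    confidence: float           # confidence score attached to the determination
    citations: list[str] = field(default_factory=list)  # supporting context sentence IDs, e.g. ["s1", "s2"]
    explanation: str = ""       # human-readable rationale for the determination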

Hallucinations

Hallucinations are often cited as the most important obstacle to deploying generative models in commercial and personal applications alike, and for good reason:

It ultimately comes down to trust: generative models are trained to produce outputs that are probabilistically likely, but not necessarily true. Such tools are useful in the right hands, but the inability to trust their outputs prevents AI from being adopted more broadly in settings where it could be used safely and responsibly.

Building Trust with Verifiability

To begin trusting AI systems, we have to be able to verify their outputs. By verification, we specifically mean that we need to:

  • Understand the truthfulness of a particular statement produced by any model.
  • Understand what information supports that statement’s truth (or lack thereof).
  • Have full traceability connecting the statement to that information.

Missing any one of these aspects results in a system that cannot be verified and therefore cannot be trusted. Verification alone is not enough, though: it has to be done in a way that is meticulous, scalable, and human-readable. With explanations, confidence scores, and citations, all at an affordable model size, HallOumi takes us toward a more grounded, trustworthy future for AI.


Uses

Use HallOumi to verify claims and detect hallucinations in scenarios where a known source of truth (one or more context documents) is available.

Demo: https://oumi.ai/halloumi-demo

Example prompt:

EXAMPLE_CONTEXT = """<|context|><|s1|><This is sentence 1 of the document.><end||s><|s2|><This is sentence 2 of the document.><end||s><end||context>"""
EXAMPLE_REQUEST = """<|request|><Make one or more claims about information in the documents.><end||request>"""
EXAMPLE_RESPONSE = """<|response|><|r1|><This is sentence 1 of the claims/response.><end||r><|r2|><This is sentence 2 of the claims/response.><end||r><end||response>"""

messages = [
    {'role': 'user', 'content': f"{EXAMPLE_CONTEXT}{EXAMPLE_REQUEST}{EXAMPLE_RESPONSE}"},
]
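
The snippet below sketches one way to build such a prompt and query the model with Hugging Face transformers. The build_prompt helper and the generation settings are illustrative assumptions rather than an official recipe; see https://oumi.ai/halloumi and the demo for the supported workflow.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_prompt(context_sentences, request, response_sentences):
    # Hypothetical helper: wraps pre-split sentences in the tag format shown above.
    context = "".join(f"<|s{i}|><{s}><end||s>" for i, s in enumerate(context_sentences, 1))
    response = "".join(f"<|r{i}|><{s}><end||r>" for i, s in enumerate(response_sentences, 1))
    return (
        f"<|context|>{context}<end||context>"
        f"<|request|><{request}><end||request>"
        f"<|response|>{response}<end||response>"
    )

model_id = "oumi-ai/HallOumi-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = build_prompt(
    ["This is sentence 1 of the document.", "This is sentence 2 of the document."],
    "Make one or more claims about information in the documents.",
    ["This is sentence 1 of the claims/response.", "This is sentence 2 of the claims/response."],
)
messages = [{"role": "user", "content": prompt}]

# Apply the chat template and generate deterministically (settings are assumptions).
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))

The decoded output should contain, for each response sentence, the citations, supported/unsupported determination, and explanation described above.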

Out-of-Scope Use

Smaller LLMs have limited capabilities and should be used with caution. Avoid using this model for purposes outside of claim verification.

Bias, Risks, and Limitations

This model was fine-tuned from Llama-3.1-8B-Instruct on data generated with Llama-3.1-405B-Instruct, so any biases or risks associated with those models may be present.

Training Details

Training Data

HallOumi-8B was trained on the following datasets:

  • oumi-ai/oumi-anli-subset
  • oumi-ai/oumi-c2d-d2c-subset
  • oumi-ai/oumi-synthetic-claims
  • oumi-ai/oumi-synthetic-document-claims

Training Procedure

For information on training, see https://oumi.ai/halloumi

Evaluation

Follow along with our notebook to see how to evaluate hallucination detection with HallOumi and other popular models:

https://github.com/oumi-ai/oumi/blob/main/configs/projects/halloumi/halloumi_eval_notebook.ipynb

Environmental Impact

  • Hardware Type: H100
  • Hours used: 32 (4 * 8 GPUs)
  • Cloud Provider: Google Cloud Platform
  • Compute Region: us-east5
  • Carbon Emitted: 2.8 kg

Citation

@misc{oumiHalloumi8B,
  author = {Jeremy Greer and Konstantinos Aisopos and Panos Achlioptas and Michael Schuler and Oussama Elachqar and Emmanouil Koukoumidis},
  title = {HallOumi-8B},
  month = {March},
  year = {2025},
  url = {https://huggingface.co/oumi-ai/HallOumi-8B}
}

@software{oumi2025,
  author = {Oumi Community},
  title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models},
  month = {January},
  year = {2025},
  url = {https://github.com/oumi-ai/oumi}
}