LLM Hallucinations: bug or feature? The US Supreme Court 2025 cases experiment

Community Article Published July 8, 2025

This experiment shows how LLMs hallucinate when answering questions about recent events without search access. Built with Hugging Face AISheets and Inference Providers, it compares responses with/without search on 2025 Supreme Court cases, creating an open dataset to showcase hallucination patterns.

With all the recent noise around hallucinations (are they a feature or a bug?) I wanted to contribute something valuable to the community.

The main argument I've seen is that LLM hallucinations are a feature because they let LLMs give creative answers and explore new directions. But what if you're looking for factuality? The consensus is that you need to provide LLMs with access to search.

image/png

My take:

Hallucinations are not a feature if you don't know about them and you're using AI to answer questions, solve real-world issues, etc. Based on my experience, most users don't know about this and use AI as the new Google (just talk to people outside the AI builders/geeks bubble).

Given the above, I wanted to build a simple, actionable resource for the community to see the effects of hallucinations. Something that clearly shows the issue and can be used by AI builders as an example, a playground, and food for thought.

As I'm building dataset tools and datasets, why not create a dataset? So this weekend, I came up with the following idea:

  1. Download very recent information: 2025 US Supreme Court Cases

  2. Use AISheets and Hugging Face Inference Providers to generate 20 questions for each case.

  3. Generate responses with different LLMs, with and without access to real-time search.

  4. Use Llama to analyze the factuality of the model's response without access to search.

image/png

Generate responses using real-time search results with AISheets

image/png

Generate responses using real-time search results with AISheets

Dataset

image/png

The result?

A Hub dataset with:

  • Questions to test hallucinations in a real-world context.

  • Examples of different types of fabricated information and facts. All models I tried responded with false facts 100% of the time when the context was not provided.

image/png

Example of hallucination analysis

Appendix: Prompts

Here are the main prompts I have used in AISheets for building the dataset (with zero setup and no code required)

AISheets configuration for generating summaries and questions

columns:
  AI_Summary:
    modelName: meta-llama/Llama-3.3-70B-Instruct
    modelProvider: nebius
    userPrompt: >-
      Find news and analyses and provide a short summary of the impact in plain
      English:


      {{Case_Name}} / {{Docket_Number}} / {{Court}}
    prompt: "

      You are a rigorous, intelligent data-processing engine. Generate only the
      requested response format, with no explanations following the user
      instruction. You might be provided with positive, accurate examples of how
      the user instruction must be completed.




      # User instruction

      Find news and analyses and provide a short summary of the impact in plain
      English:


      {{Case_Name}} / {{Docket_Number}} / {{Court}}




      # Your response

      \    "
    searchEnabled: true
    columnsReferences:
      - Case_Name
      - Docket_Number
      - Court
  questions:
    modelName: meta-llama/Llama-3.3-70B-Instruct
    modelProvider: nebius
    userPrompt: >
      Craft a list of 20 questions about this case, for someone that really
      knows the case and its impact:


      {{Case_Name}}


      {{Summary}}


      {{Column 8}}
    prompt: "

      You are a rigorous, intelligent data-processing engine. Generate only the
      requested response format, with no explanations following the user
      instruction. You might be provided with positive, accurate examples of how
      the user instruction must be completed.




      # User instruction

      Craft a list of 20 questions about this case, for someone that really
      knows the case and its impact:


      {{Case_Name}}


      {{Summary}}







      # Your response

      \    "
    searchEnabled: false
    columnsReferences:
      - Case_Name
      - AI_Summary
      - Summary

Hallucination assessment

# Hallucination Classification Guide

Given the context, question, incorrect answer and grounded answer, identify which types of hallucination appear in the incorrect answer.

## Input Format:
**Question:** {{question}}  
**Context:** {{AI_Summary}}
**Incorrect answer:** {{llama70B-no-search}}
**Grounded answer:** {{llama70B-search}}

---

## Hallucination Categories:

**Factual Inconsistency**
- The answer states facts that are objectively wrong
- Example: Claiming Paris is the capital of Germany
- Check: Does the incorrect answer contradict established real-world knowledge?

**Factual Fabrication** 
- The answer invents completely non-existent information
- Example: Creating fake historical events, people, or statistics
- Check: Does the incorrect answer contain entirely made-up facts that don't exist?

**Logical Inconsistency**
- The answer contains internal contradictions or logical errors
- Example: Mathematical calculations with wrong results
- Check: Does the incorrect answer contradict itself or contain logical flaws?

**Intrinsic Hallucination**
- The answer directly contradicts the provided context/source
- Example: Context says "FDA approved the vaccine" but answer says "FDA rejected it"
- Check: Does the incorrect answer directly contradict the context?

**Extrinsic Hallucination**  
- The answer adds information that cannot be verified from the context
- Example: Context discusses a meeting but answer adds specific attendee names not mentioned
- Check: Does the incorrect answer add unverifiable information beyond the context?

---

## Instructions:
1. Compare the incorrect answer to the context and grounded answer
2. Identify which categories apply (multiple categories may apply to the same answer)
3. Provide your response as a comma-separated list

## Output Format:
comma-separated list of applicable categories with reason in () for each category

Community

Sign up or log in to comment