Llama-3-8B-Instruct QED Few-Shot (Both Prompts)

Model Description

This model is a fine-tuned version of meta-llama/Meta-Llama-3-8B-Instruct for the QED (Question-Explanation-Data) task.
It was trained using a few-shot approach with both demonstration examples ("Life of Pi" and "Acute hemolytic transfusion reaction") included in the prompt, following the QED instruction format.

  • Base model: Meta-Llama-3-8B-Instruct
  • Fine-tuning method: LoRA (QLoRA, 4-bit)
  • Task: Extracting short answers, supporting sentences, and referential equalities from text passages given a question.
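
For reference, here is a minimal inference-loading sketch using transformers and peft. It assumes the adapter is published as DenisRz/llama3_8b_instruct_qed (the repository this card describes) and that bitsandbytes is available for 4-bit loading:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
ADAPTER_ID = "DenisRz/llama3_8b_instruct_qed"  # assumed adapter repo id

# 4-bit quantized base model, matching the QLoRA setup used for training.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(
    BASE_ID, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_ID)  # attach the LoRA adapter
model.eval()
```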

Intended Uses & Limitations

  • Intended use: Research on explainable QA, entity and span extraction, and referential reasoning.
  • Not intended for: General open-domain QA, medical or legal advice, or production deployment without further validation.

Training Data

  • Dataset: QED (Question-Explanation-Data) dataset
  • Prompt format: Each input includes a title, question, and context passage, with the following instruction and two demonstration examples.

Prompt Format

The model expects prompts in the following format (using Llama-3-Instruct tokens):

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert at extracting answers and structured explanations from text.
Your response MUST be **valid JSON only** (no extra commentary).

Task
====
Given:
• a **title** for the passage,
• a **question** about the passage, and
• the **context passage** itself,

produce an explanation object with three parts:

1. "answer" – the **shortest span** from the passage that fully answers the question.
2. "selected_sentence" – the **single sentence** in the passage that entails or implies the answer.
3. "referential_equalities" – a list of mappings between phrases in the question and phrases in the selected sentence that refer to the **same real-world entity/event**.

   • Each mapping has two keys:
       - "question_reference": the exact phrase from the question (**must be a contiguous substring from the question, not from the context or title**).
       - "sentence_reference": the exact phrase from the selected sentence (**must be a contiguous substring from the selected sentence, not from the question or title**), or "" (empty string if the entire sentence is the referent).

     ▸ Use **""** for "sentence_reference" when the entity/event is not named by any specific phrase in the sentence – i.e. the entire sentence acts as the referent (a *bridge* to the whole sentence).  
       This corresponds to the (start = end = -1) convention in the QED dataset.

Output format
=============
Return **only** JSON in this exact schema:

{
  "answer": "<string from passage>",
  "selected_sentence": "<string from passage>",
  "referential_equalities": [
    {
      "question_reference": "<string from question only>",
      "sentence_reference": "<string from selected_sentence only, or "">",
      "bridge": "<false if not a bridge; otherwise, a string explaining the bridge connection, e.g., 'in', 'for', 'of', 'at', 'on'>"
    }
    ...
  ]
}

Demonstration Example 1:
Title:
Life of Pi

Question:
what is the tigers name in life of pi

Context:
Life of Pi is a Canadian fantasy adventure novel by Yann Martel published in 2001 . The protagonist is Piscine Molitor `` Pi '' Patel , an Indian boy from Pondicherry who explores issues of spirituality and practicality from an early age . He survives 227 days after a shipwreck while stranded on a lifeboat in the Pacific Ocean with a Bengal tiger named Richard Parker .

Expected JSON:
{
  "answer": "Richard Parker",
  "selected_sentence": "He survives 227 days after a shipwreck while stranded on a lifeboat in the Pacific Ocean with a Bengal tiger named Richard Parker .",
  "referential_equalities": [
    {
      "question_reference": "the tiger",
      "sentence_reference": "a Bengal tiger",
      "bridge": false
    },
    {
      "question_reference": "life of pi",
      "sentence_reference": "",
      "bridge": "in"
    }
  ]
}

Demonstration Example 2:
Title:
Acute hemolytic transfusion reaction

Question:
what happens to the rbc in acute hemolytic reaction

Context:
It is also known as an `` immediate hemolytic transfusion reaction '' . This is a medical emergency as it results from rapid destruction of the donor red blood cells by host antibodies ( IgG , IgM ) . It is usually related to ABO blood group incompatibility - the most severe of which often involves group A red cells being given to a patient with group O type blood . Properdin then binds to complement C3 in the donor blood , facilitating the reaction through the alternate pathway cascade . The donor cells also become coated with IgG and are subsequently removed by macrophages in the reticuloendothelial system ( RES ) . Jaundice and disseminated intravascular coagulation ( DIC ) may also occur . The most common cause is clerical error ( i.e. the wrong unit of blood being given to the patient ) .

Expected JSON:
{
  "answer": "rapid destruction of the donor red blood cells by host antibodies ( IgG , IgM )",
  "selected_sentence": "This is a medical emergency as it results from rapid destruction of the donor red blood cells by host antibodies ( IgG , IgM ) .",
  "referential_equalities": [
    {
      "question_reference": "acute hemolytic reaction",
      "sentence_reference": "This",
      "bridge": false
    },
    {
      "question_reference": "the rbc",
      "sentence_reference": "the donor red blood cells",
      "bridge": false
    }
  ]
}
<|eot_id|><|start_header_id|>user<|end_header_id|>

Title: {title}
Question: {question}
Context: {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Training Hyperparameters

  • Model: meta-llama/Meta-Llama-3-8B-Instruct
  • LoRA: enabled (lora_r=32, lora_alpha=64, lora_dropout=0.05)
  • Quantization: 4-bit (QLoRA), CPU offload enabled
  • Epochs: 1
  • Batch size: 1 (gradient accumulation steps: 16)
  • Learning rate: 2e-5
  • Weight decay: 0.001
  • Warmup ratio: 0.1
  • Optimizer: paged_adamw_8bit
  • Precision: bf16
  • Max source length: 3072
  • Max target length: 1024
  • Prompt examples: both (see above)
  • Output dir: models_fine_tuned/llama3_8b_instruct_fewshot_both
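
The list above maps onto a peft/transformers configuration roughly as sketched below; the LoRA target modules are an assumption, since the card does not list them:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    # Assumption: common QLoRA targets; the actual target modules are not stated.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = TrainingArguments(
    output_dir="models_fine_tuned/llama3_8b_instruct_fewshot_both",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    weight_decay=0.001,
    warmup_ratio=0.1,
    optim="paged_adamw_8bit",
    bf16=True,
)
```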

Evaluation Results

Evaluated on 998 validation examples using the official QED metrics, reported at several F1 overlap thresholds (non-strict span matching):

| Overlap | Answer Accuracy | All Mention F1 | Pair F1 |
|---------|-----------------|----------------|---------|
| 0.50    | 82.4%           | 19.6%          | 10.4%   |
| 0.60    | 74.2%           | 19.5%          | 10.3%   |
| 0.70    | 68.2%           | 19.5%          | 10.3%   |
| 0.80    | 63.2%           | 19.5%          | 10.3%   |
| 0.90    | 59.8%           | 19.2%          | 10.0%   |
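
For intuition about the thresholds, a predicted span is typically matched against a gold span by token-level F1, as in the illustrative helper below (this is not the official QED scorer):

```python
from collections import Counter

def span_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two text spans (whitespace tokenization)."""
    p, g = pred.split(), gold.split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# A prediction counts as a match at threshold t when span_f1(pred, gold) >= t.
# Example: precision 1.0, recall 1/3 -> F1 = 0.5
f1 = span_f1("Richard Parker", "a Bengal tiger named Richard Parker")
assert abs(f1 - 0.5) < 1e-9
```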

Limitations & Ethical Considerations

  • The model is trained on a specific dataset and task; it may not generalize to other domains.
  • Outputs are not guaranteed to be factually correct or safe for critical applications.
  • Always validate outputs before use in downstream tasks.

Citation

If you use this model or code, please cite the original Llama 3 paper and the QED dataset paper as appropriate.


Author

  • Denis Rize