OLMoE-1B-7B-Eagle3 Draft Model

This repository provides the EAGLE draft model weights, related code, and training data for the allenai/OLMoE-1B-7B-0125-Instruct base model.


📦 Included Files

  • pytorch_model.bin: Trained EAGLE Draft model weights
  • config.json: Model configuration file (OLMoE architecture)
  • tokenizer_config.json: Tokenizer configuration file
  • modeling_olmoe_kv.py: OLMoE-specific model code (required for EAGLE inference)
  • eagle_data.json: Training dataset (ShareGPT questions + OLMoE-generated answers)
  • .gitattributes: Git LFS settings, etc.
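
All of these files can be fetched in one step with the huggingface_hub client. A minimal sketch (assuming the repository id shown on this page):

from huggingface_hub import snapshot_download

# Downloads every file in this repository (weights, config, tokenizer config,
# modeling code, and eagle_data.json) into the local HuggingFace cache
local_dir = snapshot_download(repo_id="wantsleep/OLMoE_1B_7B_Eagle3")
print(local_dir)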

🦅 What is the EAGLE Draft Model?

EAGLE is a framework designed to dramatically accelerate inference for large language models (LLMs)
by training a draft decoder layer separately.

  • Fully compatible with OLMoE-1B-7B-0125-Instruct architecture
  • The EAGLE Draft layer is structurally similar to the main model’s decoder
  • During inference, the draft layer generates multiple tokens in advance, which are then verified/accepted by the main model
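
The draft/verify loop can be pictured with a small conceptual sketch. This is not the actual EAGLE implementation (EAGLE verifies a whole tree of drafted tokens in a single base-model forward pass); draft_step and base_step are hypothetical callables that return next-token distributions:

import torch

def speculative_decode_step(base_step, draft_step, tokens, num_draft=4):
    # 1) The lightweight draft layer proposes num_draft tokens ahead (greedy here)
    proposals = []
    ctx = tokens
    for _ in range(num_draft):
        next_tok = draft_step(ctx).argmax(dim=-1, keepdim=True)
        proposals.append(next_tok)
        ctx = torch.cat([ctx, next_tok], dim=-1)

    # 2) The base model re-checks the proposals and keeps the longest prefix it
    #    agrees with; the first mismatch is replaced by the base model's own token
    ctx = tokens
    for tok in proposals:
        base_tok = base_step(ctx).argmax(dim=-1, keepdim=True)
        ctx = torch.cat([ctx, base_tok], dim=-1)
        if not torch.equal(base_tok, tok):
            break
    return ctx  # one or more new tokens per call instead of exactly one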

πŸ“ Training Data Description

  • eagle_data.json
    • Only questions (prompts) are extracted from the ShareGPT dataset
    • For each question, the allenai/OLMoE-1B-7B-0125-Instruct model generates its own answer
    • Thus, the model’s self-generated answers are used as ground truth to train the draft layer
    • This keeps the draft layer’s output distribution close to the main model’s decoder,
      raising the acceptance rate of drafted tokens and improving EAGLE inference speed
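
A rough sketch of that data-construction recipe is below. The generation settings and the JSON layout are assumptions (they may differ from those used to build eagle_data.json), and sharegpt_prompts stands in for the questions extracted from ShareGPT:

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMoE-1B-7B-0125-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

sharegpt_prompts = ["Why do we study math?"]  # placeholder for the extracted ShareGPT questions

records = []
for question in sharegpt_prompts:
    chat = [{"role": "user", "content": question}]
    input_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
    answer = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
    # The base model's own answer becomes the training target for the draft layer
    records.append({"question": question, "answer": answer})

with open("eagle_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)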

πŸ› οΈ Usage

1. Using Model Weights/Config Files

  • pytorch_model.bin, config.json, and tokenizer_config.json
    can be used directly with HuggingFace Transformers or EAGLE code.
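
A quick way to check that these files load with standard tooling (a sketch; it only inspects the config and the raw draft checkpoint):

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoConfig

repo_id = "wantsleep/OLMoE_1B_7B_Eagle3"

# config.json describes the OLMoE draft architecture
config = AutoConfig.from_pretrained(repo_id)

# pytorch_model.bin holds the trained EAGLE draft weights
weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")
state_dict = torch.load(weights_path, map_location="cpu")
print(config.model_type, f"- {len(state_dict)} tensors in the draft checkpoint")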

2. Integrating with EAGLE Inference Code

  • Copy modeling_olmoe_kv.py
    into the official EAGLE repo at EAGLE/eagle/model/.
  • In your EAGLE inference script, import it as:
    from eagle.model.modeling_olmoe_kv import OlmoeForCausalLM

3. Example Code

from eagle.model.ea_model import EaModel
from fastchat.model import get_conversation_template
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base model together with the EAGLE draft weights from this repository
model = EaModel.from_pretrained(
    base_model_path='allenai/OLMoE-1B-7B-0125-Instruct',
    ea_model_path='wantsleep/OLMoE_1B_7B_Eagle3',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    total_token=-1
).to(DEVICE)
model.eval()

your_message = "Why do we study math?"
conv = get_conversation_template("vicuna")
conv.append_message(conv.roles[0], your_message)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# EaModel exposes the base model's tokenizer as model.tokenizer
input_ids = model.tokenizer([prompt]).input_ids
input_ids = torch.as_tensor(input_ids).to(DEVICE)

# The draft layer speculates tokens; the base model verifies/accepts them
output_ids = model.eagenerate(input_ids, temperature=0.5, max_new_tokens=512, top_k=8)
output = model.tokenizer.decode(output_ids[0])
print(output)
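
To get a rough feel for the speedup, the same prompt can be timed with and without the draft layer. This is an illustration rather than a benchmark, and it assumes EaModel exposes the underlying HuggingFace model as model.base_model:

import time

def timed(fn):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    out = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.time() - start

# EAGLE speculative decoding (draft layer + verification by the base model)
_, eagle_s = timed(lambda: model.eagenerate(input_ids, temperature=0.5, max_new_tokens=512, top_k=8))

# Plain autoregressive decoding with the base model alone, for comparison
_, base_s = timed(lambda: model.base_model.generate(input_ids, do_sample=True, temperature=0.5, max_new_tokens=512))

print(f"EAGLE: {eagle_s:.1f}s vs. base model: {base_s:.1f}s")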

⚠️ Notes

  • eagle_data.json contains only OLMoE-generated answers for public ShareGPT questions.
  • The EAGLE Draft layer should be designed as close as possible to the main model’s decoder
    for optimal inference efficiency.
  • modeling_olmoe_kv.py must be included in your EAGLE inference code for correct operation.

For questions or feedback, please open an issue!
