LLaMA-3-8B Fine-tuned for BioNLP Named Entity Recognition

This is a fine-tuned version of meta-llama/Meta-Llama-3-8B-Instruct specifically adapted for Named Entity Recognition (NER) in the biomedical domain.

The model was trained using parameter-efficient fine-tuning (PEFT) with QLoRA on the tner/bionlp2004 dataset. The entire training process was accelerated and memory-optimized using Unsloth.

Model Description

This model takes a medical or biological text as input and identifies and extracts the following five entity types:

DNA
RNA
protein
cell_type
cell_line

The output is a machine-readable Python list of (entity_type, entity_text) tuples, for example [('protein', 'p53')].
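Because the response is a string that looks like a Python literal, it can be turned into real Python objects without calling eval. The sketch below is one possible way to do this with the standard library; the helper name parse_entities is hypothetical, and the (entity_type, entity_text) tuple order is assumed from the expected output shown in the usage example.

```python
import ast


def parse_entities(response: str) -> list[tuple[str, str]]:
    """Parse a model response into (entity_type, entity_text) tuples.

    Hypothetical helper: returns an empty list when the response is not
    a well-formed Python list literal.
    """
    try:
        # literal_eval only accepts Python literals, so it is safer than eval
        parsed = ast.literal_eval(response.strip())
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only items shaped like a pair of strings
    return [
        item
        for item in parsed
        if isinstance(item, tuple)
        and len(item) == 2
        and all(isinstance(part, str) for part in item)
    ]


example = "[('protein', 'p53'), ('protein', 'human papillomavirus E6 protein')]"
entities = parse_entities(example)
print(entities)
```

Returning an empty list on malformed output keeps downstream code simple, but it also hides failures; logging the raw response in that case may be worth adding.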

Intended Use

This model is intended for researchers, bioinformaticians, and developers working on applications that require the parsing of biomedical literature. It can be used as a foundation for information extraction systems, knowledge graph population, and data analysis pipelines.

⚠️ Disclaimer: This model is a research tool and should not be used for clinical diagnosis or any real-world medical decision-making.

How to Use

This model was trained with Unsloth, and running inference through Unsloth as well is recommended for best performance.

First, install the necessary libraries:

pip install "unsloth[kaggle-torch] @ git+https://github.com/unslothai/unsloth.git"
pip install "trl>=0.8.6" "peft>=0.10.0" "accelerate>=0.28.0"

Next, use the following Python code to run inference:

from unsloth import FastLanguageModel
from transformers import pipeline
import torch

# Load the fine-tuned model from the Hub
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Arnic/llama-3-8b-bionlp-ner", 
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# Configure the model for inference
FastLanguageModel.for_inference(model)

# The Alpaca prompt template used during training
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# The instruction for the NER task
instruction = "You are an expert in medical text analysis. Your task is to identify and extract specific biological entities from the given text. The entity types to extract are: DNA, RNA, protein, cell_type, and cell_line."

# Your input text
input_text = "Interactions between the N-terminal domains of p53 and the human papillomavirus E6 protein."

# Format the prompt
prompt = alpaca_prompt.format(instruction, input_text, "")

# Use the text-generation pipeline
fast_pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Define terminators to stop generation cleanly
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Get the model's response
outputs = fast_pipe(
    prompt,
    max_new_tokens=128,
    do_sample=False,
    eos_token_id=terminators,
)

# Print the clean response
print(outputs[0]['generated_text'].split("### Response:")[1].strip())
# Expected output: [('protein', 'p53'), ('protein', 'human papillomavirus E6 protein')]

Evaluation

This model has not been formally evaluated, so no precision, recall, or F1 scores on a held-out test set are available. Qualitative inspection of examples from the bionlp2004 test set suggests that it usually identifies and formats the target entities correctly.

For a formal evaluation, one could run predictions on the test set and use a library like seqeval.
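As a rough starting point, the sketch below computes micro-averaged precision, recall, and F1 over exact (entity_type, entity_text) matches using only the standard library. It is a simplified stand-in and not seqeval itself: seqeval scores token-level BIO tag sequences, so using it would require converting the model's tuples back to tags, or converting the dataset's gold tags to tuples before calling this function. The function name entity_prf and the sample data are hypothetical.

```python
from collections import Counter


def entity_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over exact tuple matches.

    gold and pred are lists with one entry per sentence; each entry is a
    list of (entity_type, entity_text) tuples.
    """
    tp = fp = fn = 0
    for gold_entities, pred_entities in zip(gold, pred):
        gold_counts = Counter(gold_entities)
        pred_counts = Counter(pred_entities)
        # Multiset intersection counts each correct prediction once
        overlap = sum((gold_counts & pred_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(gold_counts.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Made-up data for illustration only, not real model results
gold = [[("protein", "p53"), ("protein", "E6")], [("DNA", "IL-2 gene")]]
pred = [[("protein", "p53")], [("DNA", "IL-2 gene"), ("cell_type", "T cells")]]
precision, recall, f1 = entity_prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact string matching is strict: a prediction that differs from the gold entity by a single token counts as both a false positive and a false negative.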

Limitations and Bias

Domain Specificity: The model is highly specialized for the bionlp2004 dataset. Its performance may degrade on biomedical texts from different sub-domains (e.g., clinical patient notes).
Limited Entity Scope: The model can only identify the five entity types it was trained on. It will not recognize other common medical entities like "Disease" or "Symptom."
Hallucination: Like all LLMs, this model can make mistakes or hallucinate entities, especially on ambiguous or out-of-domain text. All outputs should be validated by a human expert if used in a critical workflow.
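Some of these failure modes can be caught automatically before a human review. The sketch below is a minimal example, assuming the output has already been parsed into (entity_type, entity_text) tuples: it flags any entity whose type is outside the five trained labels or whose text does not appear verbatim in the input. The function name split_grounded is hypothetical, and a check like this complements expert validation rather than replacing it.

```python
ALLOWED_TYPES = {"DNA", "RNA", "protein", "cell_type", "cell_line"}


def split_grounded(text, entities):
    """Separate entities into (kept, flagged) lists.

    An entity is kept only if its type is one of the five trained labels
    and its text occurs verbatim in the input text.
    """
    kept, flagged = [], []
    for entity_type, entity_text in entities:
        if entity_type in ALLOWED_TYPES and entity_text in text:
            kept.append((entity_type, entity_text))
        else:
            flagged.append((entity_type, entity_text))
    return kept, flagged


text = (
    "Interactions between the N-terminal domains of p53 and the "
    "human papillomavirus E6 protein."
)
# The last two entities are made up to illustrate flagging
entities = [
    ("protein", "p53"),
    ("protein", "human papillomavirus E6 protein"),
    ("DNA", "BRCA1"),
    ("Disease", "p53"),
]
kept, flagged = split_grounded(text, entities)
print("kept:", kept)
print("flagged:", flagged)
```

The substring test is case-sensitive and will flag entities that the model has normalized or re-spaced, so flagged items are candidates for review and not confirmed errors.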

Uploaded model

Developed by: Arnic
License: apache-2.0
Finetuned from model: unsloth/llama-3-8b-Instruct-bnb-4bit

This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.
