Introduction

This model is a quantized version of Universal-NER/UniNER-7B-all.

Quantization

Quantization was applied using LLM Compressor, calibrated on 512 random examples from the Universal-NER/Pile-NER-definition dataset.

The recipe for quantization:

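from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]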
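
For intuition about the W4A16 scheme named in the recipe: weights are stored as 4-bit integers while activations stay in 16-bit floating point. The sketch below shows plain round-to-nearest symmetric 4-bit quantization of a single weight group in pure Python. It is not GPTQ, which additionally compensates for rounding error using the calibration data, and the function names and the example group are illustrative assumptions rather than LLM Compressor's implementation.

```python
def quantize_group_int4(weights):
    """Symmetric round-to-nearest 4-bit quantization of one weight group.

    Returns (codes, scale) with codes in [-8, 7] so that code * scale ~= weight.
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7; fall back to 1.0 for an all-zero group.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate floating-point weights from the 4-bit codes."""
    return [c * scale for c in codes]


# Hypothetical weight group, used only to illustrate the round trip.
group = [0.7, -0.3, 0.1, 0.0]
codes, scale = quantize_group_int4(group)
restored = dequantize_group(codes, scale)
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the error GPTQ tries to reduce further.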

Inference

We added a chat template to the tokenizer, so, unlike the original model, this model can be used directly with vLLM without any additional preprocessing.

Example:

import json

from vllm import LLM, SamplingParams

# Loading model
llm = LLM(model="daisd-ai/UniNER-W4A16")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# Define the text and the entity types to extract
text = "Some long text with multiple entities"
entities_types = ["entity type 1", "entity type 2"]

# Build one prompt per entity type using the tokenizer's chat template
tokenizer = llm.get_tokenizer()
prompts = []
for entity_type in entities_types:
    messages = [
        {
            "role": "user",
            "content": f"Text: {text}",
        },
        {"role": "assistant", "content": "I've read this text."},
        {"role": "user", "content": f"What describes {entity_type} in the text?"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    prompts.append(prompt)

# Run inference
outputs = llm.generate(prompts, sampling_params)
outputs = [output.outputs[0].text for output in outputs]

# Results are returned in JSON format; parse each one into a Python list
results = []
for lst in outputs:
    try:
        entities = list(set(json.loads(lst)))
    except Exception:
        entities = []

    results.append(entities)
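
The parsing step above can also be wrapped in a small helper. This is a minimal sketch: `parse_entities` is a hypothetical name, and it assumes each output is a JSON array of strings, returning an empty list for anything else. Unlike `list(set(...))`, it keeps entities in first-seen order.

```python
import json


def parse_entities(raw_output):
    """Parse one model output into a de-duplicated list of entity strings."""
    try:
        parsed = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(parsed, list):
        return []
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return list(dict.fromkeys(item for item in parsed if isinstance(item, str)))


# Hypothetical raw outputs, used only to illustrate the behavior.
example_outputs = ['["Paris", "Berlin", "Paris"]', "not json", "{}"]
parsed_results = [parse_entities(raw) for raw in example_outputs]
```

Catching only the decoding errors, rather than a bare `Exception`, keeps unrelated bugs from being silently swallowed.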