llama3.1-8bのAWQ量子化版です。
4GB超のGPUメモリがあれば高速に動かす事ができます。

This is an AWQ-quantized version of llama3.1-8b.
With more than 4GB of GPU memory, it runs at high speed.

量子化時に日本語と中国語を多めに使っているため、hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4より日本語データを使って計測したPerplexityが良い事がわかっています
Because a larger share of Japanese and Chinese text was used during quantization, this model achieves better perplexity on Japanese data than hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4.
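For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below only illustrates the arithmetic. It is not the measurement script used for this model, and the log-probability values are made-up numbers for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp(-mean(log p)) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probabilities that two models assign to the
# same tokenized Japanese sentence (illustrative values, not measurements).
model_a = [-1.2, -0.4, -2.0, -0.9]
model_b = [-1.5, -0.7, -2.3, -1.1]

# The model that assigns higher probability to the text has lower perplexity.
print(perplexity(model_a) < perplexity(model_b))
```

Comparing two models this way is only meaningful when both score the same text with the same tokenizer, which holds for two quantizations of the same base model.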

セットアップ(setup)

pip install transformers==4.43.3 autoawq==0.2.6 accelerate==0.33.0

サンプルスクリプト(sample script)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "dahara1/llama3.1-8b-Instruct-awq"

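quantization_config = AwqConfig(
    bits=4,
    # Fused modules need a maximum sequence length up front.
    # It should cover prompt tokens plus generated tokens; raise it for longer contexts.
    fuse_max_seq_len=512,
    do_fuse=True,
)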

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.float16,
  low_cpu_mem_usage=True,
  device_map="auto",
  quantization_config=quantization_config
)

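prompt = [
  # system: "You are a kind and helpful assistant. Always answer like a pirate."
  {"role": "system", "content": "あなたは親切で役に立つアシスタントです。常に海賊のように返答してください"},
  # user: "What is deep learning?"
  {"role": "user", "content": "ディープラーニングとは何ですか?"},
]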
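# Build the Llama 3.1 chat prompt and move the tensors to the model's device
inputs = tokenizer.apply_chat_template(
  prompt,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)[0])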
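The sample script sets fuse_max_seq_len=512 and max_new_tokens=256. Assuming the fused sequence length has to cover the prompt plus the generated tokens, a small check like the one below can show whether a prompt fits before calling generate. The helper names are hypothetical and are not part of transformers or autoawq.

```python
def max_prompt_tokens(max_new_tokens: int, fuse_max_seq_len: int) -> int:
    """Largest prompt length that still leaves room for max_new_tokens."""
    return max(0, fuse_max_seq_len - max_new_tokens)


def fits_fused_cache(prompt_tokens: int, max_new_tokens: int,
                     fuse_max_seq_len: int) -> bool:
    """True if prompt plus generation stays within fuse_max_seq_len."""
    return prompt_tokens + max_new_tokens <= fuse_max_seq_len


# Values taken from the sample script above.
FUSE_MAX_SEQ_LEN = 512
MAX_NEW_TOKENS = 256

# In the sample script the prompt length would be inputs["input_ids"].shape[1];
# 300 here is an arbitrary illustrative value.
print(max_prompt_tokens(MAX_NEW_TOKENS, FUSE_MAX_SEQ_LEN))
print(fits_fused_cache(300, MAX_NEW_TOKENS, FUSE_MAX_SEQ_LEN))
```

If the check fails, either raise fuse_max_seq_len when loading the model or lower max_new_tokens.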

