# Llama-3.3-70B-Instruct-4bit (LeanQuant)
This is a 4-bit quantized version of `embraceableAI/e2-llama-v3p3-70B-Merged-v1`, quantized with LeanQuant to reduce memory use and speed up inference. It is suitable for instruction following, dialogue, and general-purpose generation on memory-constrained hardware.
## Model Details
- Base model: EmbraceableAI LLaMA-3.3 70B merged checkpoint (`embraceableAI/e2-llama-v3p3-70B-Merged-v1`)
- Quantization: 4-bit via LeanQuant
- File: `Llama-3.3-70B-Instruct-4bit.safetensors` (see the download sketch below)
- Size: ~36 GB
- Format: `safetensors`
- Device support: multi-GPU via `device_map="auto"`
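The quantized checkpoint must be available as a local `.safetensors` file before loading. Below is a minimal sketch of fetching it with `huggingface_hub`; the `repo_id` shown is a placeholder assumption, so substitute the Hub repository that actually hosts this file.

```python
from huggingface_hub import hf_hub_download

# Download the 4-bit LeanQuant checkpoint from the Hub.
# NOTE: the repo_id below is a placeholder; replace it with the repository
# that hosts Llama-3.3-70B-Instruct-4bit.safetensors.
quantized_path = hf_hub_download(
    repo_id="your-namespace/Llama-3.3-70B-Instruct-4bit",
    filename="Llama-3.3-70B-Instruct-4bit.safetensors",
)
print(quantized_path)  # local path to pass to LeanQuantModelForCausalLM.from_pretrained
```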
## Intended Use
- Instruction following (chat-style)
- Dialogue and general-purpose text generation
- Deployment on memory-constrained hardware
## Usage Example
```python
import torch
from leanquant import LeanQuantModelForCausalLM
from transformers import AutoTokenizer

### Load the quantized model and tokenizer
base_model_name = "embraceableAI/e2-llama-v3p3-70B-Merged-v1"
model = LeanQuantModelForCausalLM.from_pretrained(
    base_model_name,
    "./Llama-3.3-70B-Instruct-4bit.safetensors",  # local path to the 4-bit LeanQuant checkpoint
    bits=4,
    device_map="auto",
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

### Build and tokenize a chat-style prompt
prompt = [
    {"role": "system", "content": "You are a helpful assistant that responds as a pirate."},
    {"role": "user", "content": "What is quantization for deep learning models?"},
]
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

### Run generation and decode the generated tokens
with torch.no_grad():
    output = model.generate(**inputs, do_sample=True, max_new_tokens=256)
generated_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(generated_text)
```
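Note that `output[0]` contains the prompt tokens followed by the model's reply. If you only want the newly generated text, you can slice off the prompt before decoding; a minimal sketch reusing the `inputs`, `output`, and `tokenizer` objects from the example above:

```python
# Decode only the tokens generated after the prompt.
prompt_length = inputs["input_ids"].shape[1]
reply_ids = output[0][prompt_length:]
reply_text = tokenizer.decode(reply_ids, skip_special_tokens=True)
print(reply_text)
```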
> **Try the quantization in Colab**:
> [Open in Colab](https://colab.research.google.com/drive/1RGfgqQm4XVmEWQVph5-4D3xmYGbAwEwW)