Instructions to use Snowflake/snowflake-arctic-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Snowflake/snowflake-arctic-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Snowflake/snowflake-arctic-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Snowflake/snowflake-arctic-instruct", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use Snowflake/snowflake-arctic-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Snowflake/snowflake-arctic-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Snowflake/snowflake-arctic-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
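
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the server returns the usual OpenAI-style chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request

MODEL_ID = "Snowflake/snowflake-arctic-instruct"


def build_chat_request(prompt, model=MODEL_ID, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server from the step above running, `chat("What is the capital of France?")` should return the model's reply. Passing `base_url="http://localhost:30000"` would target a server listening on port 30000 instead.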

Use Docker

docker model run hf.co/Snowflake/snowflake-arctic-instruct

SGLang

How to use Snowflake/snowflake-arctic-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Snowflake/snowflake-arctic-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Snowflake/snowflake-arctic-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Snowflake/snowflake-arctic-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Snowflake/snowflake-arctic-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Snowflake/snowflake-arctic-instruct with Docker Model Runner:
```
docker model run hf.co/Snowflake/snowflake-arctic-instruct
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Model Details

Arctic is a dense-MoE Hybrid transformer architecture pre-trained from scratch by the Snowflake AI Research Team. We are releasing model checkpoints for both the base and instruct-tuned versions of Arctic under an Apache-2.0 license. This means you can use them freely in your own research, prototypes, and products. Please see our blog Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly Open for more information on Arctic and links to other relevant resources such as our series of cookbooks covering topics around training your own custom MoE models, how to produce high-quality training data, and much more.

For the latest details about Snowflake Arctic including tutorials, etc., please refer to our GitHub repo:

https://github.com/Snowflake-Labs/snowflake-arctic

Try a live demo with our Streamlit app.

Model developers: Snowflake AI Research Team

License: Apache-2.0

Input: Models input text only.

Output: Models generate text and code only.

Model Release Date: April 24, 2024.

Model Architecture

Arctic combines a 10B dense transformer model with a residual 128x3.66B MoE MLP, resulting in 480B total parameters and 17B active parameters chosen using top-2 gating. For more details about Arctic's model architecture, training process, data, etc., see our series of cookbooks.
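
The parameter counts above can be checked with a little arithmetic. This sketch assumes the dense part is active for every token and that top-2 gating activates exactly two experts per token; the constant names are illustrative only.

```python
# Figures taken from the description above, in billions of parameters.
DENSE_B = 10.0      # dense transformer
NUM_EXPERTS = 128   # experts in the residual MoE MLP
EXPERT_B = 3.66     # parameters per expert
TOP_K = 2           # experts selected per token by top-2 gating

# Total: the dense model plus every expert.
total_b = DENSE_B + NUM_EXPERTS * EXPERT_B

# Active per token: the dense model plus only the selected experts.
active_b = DENSE_B + TOP_K * EXPERT_B

print(f"total  ~ {total_b:.2f}B")
print(f"active ~ {active_b:.2f}B")
```

This gives about 478.48B total and 17.32B active, which round to the 480B and 17B quoted above.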

Usage

Arctic is currently supported in transformers through the custom code feature. To use it, add trust_remote_code=True to your AutoTokenizer and AutoModelForCausalLM calls. We recommend using transformers version 4.39 or higher:

pip install "transformers>=4.39.0"

Arctic leverages several features from DeepSpeed. You will need to install DeepSpeed 0.14.2 or higher to get all of the required features:

pip install "deepspeed>=0.14.2"
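
Before loading a model of this size, it can be worth confirming that the installed packages meet these minimums. The following is a rough sketch using only the standard library; `release_tuple`, `meets_minimum`, and `check_environment` are hypothetical helpers, and the comparison looks only at the leading numeric part of a version string, so a pre-release such as `4.39.0.dev0` is treated as `4.39.0`.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions stated above.
MINIMUMS = {"transformers": "4.39.0", "deepspeed": "0.14.2"}


def release_tuple(v):
    """Leading numeric components, e.g. '0.14.2+cu121' -> (0, 14, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", v)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def meets_minimum(installed, minimum):
    """Compare release tuples after zero-padding them to equal length."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS):
    """Return a {package: status} report for the current environment."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = "ok" if ok else f"too old ({installed} < {minimum})"
    return report


print(check_environment())
```

For stricter version handling, the third-party `packaging` library implements the full version-comparison rules; the helper above is only a dependency-free approximation.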

Inference examples

Due to the model size, we recommend using a single 8xH100 instance from your favorite cloud provider, such as AWS p5.48xlarge or Azure ND96isr_H100_v5.

This example uses FP8 quantization provided by DeepSpeed in the backend. FP6 quantization is also available by specifying q_bits=6 in the QuantizationConfig. The "150GiB" setting for max_memory is required until DeepSpeed's FP quantization is supported natively as an HFQuantizer, which we are actively working on.

import os
# enable hf_transfer for faster ckpt download
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepspeed.linear.config import QuantizationConfig

tokenizer = AutoTokenizer.from_pretrained(
    "Snowflake/snowflake-arctic-instruct",
    trust_remote_code=True
)
quant_config = QuantizationConfig(q_bits=8)

model = AutoModelForCausalLM.from_pretrained(
    "Snowflake/snowflake-arctic-instruct",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",
    ds_quantization_config=quant_config,
    max_memory={i: "150GiB" for i in range(8)},
    torch_dtype=torch.bfloat16)


content = "5x + 35 = 7x - 60 + 10. Solve for x"
messages = [{"role": "user", "content": content}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")

outputs = model.generate(input_ids=input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
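
To see why quantization and an 8-GPU instance go together, here is a rough weight-only estimate. It is a back-of-the-envelope sketch under stated assumptions: 480B total parameters, weights split evenly across 8 GPUs, and no accounting for activations, the KV cache, or quantization metadata. The helper name `per_gpu_gib` is hypothetical.

```python
TOTAL_PARAMS = 480e9   # total parameter count from the model description
NUM_GPUS = 8           # one 8xH100 instance
GIB = 1024 ** 3


def per_gpu_gib(bits_per_param, num_gpus=NUM_GPUS):
    """Weight-only footprint per GPU in GiB, assuming an even split."""
    total_bytes = TOTAL_PARAMS * bits_per_param / 8
    return total_bytes / num_gpus / GIB


for label, bits in [("bf16", 16), ("FP8", 8), ("FP6", 6)]:
    print(f"{label}: ~{per_gpu_gib(bits):.0f} GiB per GPU")
```

Under these assumptions, bf16 weights need roughly 112 GiB per GPU, more than an 80 GB H100 can hold, while FP8 needs about 56 GiB and FP6 about 42 GiB. The bf16 figure sitting below 150 GiB may be why that max_memory value lets device placement succeed before quantization is applied, though the text above does not state the reason.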

The Arctic GitHub page has additional code snippets and examples around running inference:

Example with pure-HF: https://github.com/Snowflake-Labs/snowflake-arctic/blob/main/inference
Tutorial using vLLM: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main/inference/vllm
