Inferless

Serverless GPUs to scale your machine learning inference without the hassle of managing servers, so you can deploy complicated and custom models with ease.

Go through this tutorial to quickly deploy Mixtral-8x7B-v0.1 using Inferless.


Mixtral-8x7B - GPTQ

Description

This repo contains GPTQ model files for Mistral AI's Mixtral-8x7B-v0.1.

About GPTQ

GPTQ is a one-shot post-training quantization method. It compresses a model and accelerates inference by quantizing the weights layer by layer, using a small calibration dataset to choose quantized values that minimize the mean squared error of each layer's output. The result is both a smaller memory footprint and faster inference.
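The description above can be made concrete with a small NumPy sketch. For each linear layer, GPTQ looks for quantized weights Q that minimize ||WX - QX||^2, where X holds the layer's inputs on the calibration data. It quantizes one input column at a time and spreads each column's rounding error onto the columns not yet quantized, using the inverse of the Hessian H = 2XX^T. The code below follows that idea in simplified form (no lazy batch updates, no activation reordering) and compares it with plain round-to-nearest (RTN). The matrix sizes, the synthetic correlated inputs, and the 1% dampening are illustrative assumptions, not the settings used to produce this checkpoint.

```python
import numpy as np


def minmax_params(block, bits):
    """Per-row asymmetric scale and zero point covering a block of columns."""
    maxq = 2**bits - 1
    wmin = np.minimum(block.min(axis=1), 0.0)
    wmax = np.maximum(block.max(axis=1), 0.0)
    scale = (wmax - wmin) / maxq
    scale[scale == 0] = 1.0  # guard all-zero rows against division by zero
    zero = np.round(-wmin / scale)
    return scale, zero


def quantize(w, scale, zero, bits):
    """Round to the integer grid, clip to [0, 2**bits - 1], then dequantize."""
    q = np.clip(np.round(w / scale) + zero, 0, 2**bits - 1)
    return scale * (q - zero)


def rtn(W, bits=4, group_size=128):
    """Baseline: round-to-nearest with one scale/zero per row and column group."""
    Q = np.empty_like(W)
    for start in range(0, W.shape[1], group_size):
        block = W[:, start:start + group_size]
        scale, zero = minmax_params(block, bits)
        Q[:, start:start + group_size] = quantize(block, scale[:, None], zero[:, None], bits)
    return Q


def gptq(W, X, bits=4, group_size=128, damp=0.01):
    """Quantize W (rows x cols) column by column, compensating each column's
    rounding error in the not-yet-quantized columns via the inverse Hessian."""
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    H = 2.0 * X @ X.T                               # Hessian of ||wX - qX||^2 for each row
    H += damp * np.mean(np.diag(H)) * np.eye(cols)  # dampening for numerical stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T      # upper Cholesky factor of H^-1
    Q = np.zeros_like(W)
    for i in range(cols):
        if i % group_size == 0:                     # new group: fit scale/zero on current weights
            scale, zero = minmax_params(W[:, i:i + group_size], bits)
        Q[:, i] = quantize(W[:, i], scale, zero, bits)
        err = (W[:, i] - Q[:, i]) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])  # push the error onto later columns
    return Q


rng = np.random.default_rng(0)
rows, cols, n = 64, 256, 1024
W = rng.standard_normal((rows, cols))
# Synthetic correlated calibration inputs: 16 latent factors plus small noise (an assumption).
X = rng.standard_normal((cols, 16)) @ rng.standard_normal((16, n)) + 0.1 * rng.standard_normal((cols, n))

Q_rtn = rtn(W)
Q_gptq = gptq(W, X)
err_rtn = np.linalg.norm(W @ X - Q_rtn @ X) ** 2
err_gptq = np.linalg.norm(W @ X - Q_gptq @ X) ** 2
print(f"RTN output error: {err_rtn:.1f}   GPTQ output error: {err_gptq:.1f}")
```

Both methods store the same 4-bit, group-wise format; the difference is only in how the integer codes are chosen, which is why GPTQ can reach lower output error at the same size.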

It is supported by:

Provided files and GPTQ parameters

Models are released as sharded safetensors files.
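Sharded checkpoints in the Hugging Face convention come with an index JSON (usually named model.safetensors.index.json) whose "weight_map" maps every tensor name to the shard file that stores it; loaders use it to know which file to open for each tensor. The sketch below inverts such a map. It assumes this repo follows that convention, and the shard and tensor names in the example index are hypothetical, not read from this repository.

```python
import json
from collections import defaultdict

# Hypothetical index in the Hugging Face sharded-checkpoint format.
# Shard names, tensor names, and total_size are made up for illustration.
EXAMPLE_INDEX = json.dumps({
    "metadata": {"total_size": 123456},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.self_attn.q_proj.qweight": "model-00001-of-00002.safetensors",
        "model.layers.31.self_attn.q_proj.qweight": "model-00002-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
})


def tensors_by_shard(index_json: str) -> dict:
    """Invert the index's weight_map: shard filename -> sorted tensor names."""
    weight_map = json.loads(index_json)["weight_map"]
    shards = defaultdict(list)
    for tensor_name, shard_file in weight_map.items():
        shards[shard_file].append(tensor_name)
    return {shard: sorted(names) for shard, names in sorted(shards.items())}


for shard, names in tensors_by_shard(EXAMPLE_INDEX).items():
    print(shard, len(names), "tensors")
```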

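Branch | Bits | GS  | GPTQ Dataset         | Seq Len | Size
main   | 4    | 128 | VMware Open Instruct | 4096    | 5.96 GB

GS is the quantization group size, GPTQ Dataset is the calibration dataset used during quantization, and Seq Len is the length of the calibration sequences.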
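To illustrate what "4 bits, group size 128" means for storage, here is a NumPy sketch that packs eight unsigned 4-bit codes into each 32-bit word and dequantizes them with one scale and zero point per 128 input features. This roughly mirrors the qweight/scales/qzeros tensors found in AutoGPTQ-style checkpoints, but the bit order, the shapes, and keeping zero points as unpacked floats are simplifying assumptions, not a description of this repo's exact layout.

```python
import numpy as np

BITS = 4
GROUP_SIZE = 128
PER_WORD = 32 // BITS  # eight 4-bit codes per 32-bit word


def pack_int4(q):
    """Pack unsigned 4-bit codes of shape [k, n] (k divisible by 8) into uint32 [k // 8, n]."""
    q = q.astype(np.uint32)
    k, n = q.shape
    q = q.reshape(k // PER_WORD, PER_WORD, n)
    packed = np.zeros((k // PER_WORD, n), dtype=np.uint32)
    for j in range(PER_WORD):
        packed |= q[:, j, :] << np.uint32(BITS * j)  # code j occupies bits 4j..4j+3
    return packed


def unpack_int4(packed):
    """Inverse of pack_int4: uint32 [k // 8, n] -> 4-bit codes [k, n]."""
    k8, n = packed.shape
    out = np.empty((k8, PER_WORD, n), dtype=np.uint32)
    for j in range(PER_WORD):
        out[:, j, :] = (packed >> np.uint32(BITS * j)) & np.uint32(0xF)
    return out.reshape(k8 * PER_WORD, n)


def dequantize(packed, scales, zeros):
    """scales, zeros: [k // GROUP_SIZE, n]; returns float weights [k, n]."""
    q = unpack_int4(packed).astype(np.float32)
    g = np.arange(q.shape[0]) // GROUP_SIZE  # group index of each input feature
    return scales[g] * (q - zeros[g])


rng = np.random.default_rng(0)
k, n = 256, 4                                    # 256 input features -> 2 groups of 128
q_demo = rng.integers(0, 2**BITS, size=(k, n))   # unsigned 4-bit codes
packed_demo = pack_int4(q_demo)                  # 8x fewer words than codes
scales_demo = np.full((k // GROUP_SIZE, n), 0.05, dtype=np.float32)
zeros_demo = np.full((k // GROUP_SIZE, n), 8.0, dtype=np.float32)
weights_demo = dequantize(packed_demo, scales_demo, zeros_demo)
print(packed_demo.shape, weights_demo.shape)
```

A smaller group size means more scales and zero points (more overhead) but a tighter fit to the local weight range.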

How to use

The build configuration below lists the CUDA version, system packages, and Python libraries you will need:

build:
  cuda_version: "12.1.1"
  system_packages:
    - "libssl-dev"
  python_packages:
    - "torch==2.1.2"
    - "vllm==0.2.6"
    - "transformers==4.36.2"
    - "accelerate==0.25.0"
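With those packages installed, a plain vLLM script can load the checkpoint and generate text. This is a minimal sketch, assuming the pinned vllm==0.2.6 can load this GPTQ checkpoint (as the package list suggests) and that a GPU with enough memory is available. The repository ID is taken from this model page; the sampling values and prompt are arbitrary choices, and this is not Inferless's deployment interface.

```python
MODEL_ID = "Inferless/Mixtral-8x7B-v0.1-int8-GPTQ"


def load_model():
    """Create a vLLM engine for the GPTQ checkpoint (needs a GPU and vllm installed)."""
    from vllm import LLM  # imported lazily so this file can be loaded without vLLM

    # quantization="gptq" selects vLLM's GPTQ weight loading and kernels;
    # GPTQ in vLLM expects float16, hence the explicit dtype.
    return LLM(model=MODEL_ID, quantization="gptq", dtype="float16")


def generate(llm, prompts, max_tokens=128):
    """Sample one completion per prompt and return the generated texts."""
    from vllm import SamplingParams

    params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=max_tokens)
    return [result.outputs[0].text for result in llm.generate(prompts, params)]


# Example usage (requires a suitable GPU and the packages listed above).
# Mixtral-8x7B-v0.1 is a base model, so give it text to continue:
#   llm = load_model()
#   print(generate(llm, ["A mixture-of-experts language model works by"])[0])
```

Building the engine once and reusing it across calls avoids reloading the weights for every request.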