Model Details

This is an int4 model of MiniMaxAI/MiniMax-Text-01, quantized with group_size 128 and symmetric quantization via the intel/auto-round algorithm. The model is in AutoRound format, which is NOT supported by other serving frameworks such as vLLM.

Please follow the license of the original model.
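
For reference, symmetric quantization with group_size 128 means each contiguous group of 128 weights shares one scale and no zero point, so values map onto the signed int4 grid [-8, 7]. Below is a minimal illustrative sketch of the scheme, not the actual auto-round implementation (which additionally tunes the rounding via signed gradient descent; see the citation at the end of this card):

import torch

def quantize_symmetric_int4(weight: torch.Tensor, group_size: int = 128):
    # Illustrative only: real int4 kernels pack two values per byte, and this
    # sketch assumes the weight's element count is divisible by group_size.
    orig_shape = weight.shape
    groups = weight.reshape(-1, group_size)
    # One scale per group; symmetric quantization centers the grid at zero.
    scales = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.round(groups / scales).clamp(-8, 7)
    dequantized = (q * scales).reshape(orig_shape)  # what the kernel effectively computes with
    return q.to(torch.int8).reshape(orig_shape), scales, dequantized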

INT4 Inference on CUDA (4*80G)

Requirements

pip3 install git+https://github.com/intel/auto-round.git
pip3 install auto-gptq

This model is prone to overflow when run with the int4 kernel using the FP16 computation dtype, and it does not support CPU inference, as the model files explicitly rely on CUDA operations. While we have implemented several workarounds to ensure functionality, some prompts may still produce unexpected or random outputs.
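
The clamp bound used in the workaround is the largest finite FP16 value. A quick standalone illustration, assuming only PyTorch:

import torch

# 65504 is the largest finite value representable in FP16.
print(torch.finfo(torch.float16).max)  # 65504.0

# Casting anything beyond that range to FP16 overflows to inf ...
print(torch.tensor([70000.0]).to(torch.float16))  # tensor([inf], dtype=torch.float16)

# ... while clamping to the FP16 range first keeps the value finite.
print(torch.clamp(torch.tensor([70000.0]), -65504, 65504).to(torch.float16))  # tensor([65504.], dtype=torch.float16)

The full inference example below registers a forward hook that applies exactly this clamp after every linear layer.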

from auto_round import AutoRoundConfig  # must be imported to load the AutoRound format
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

quantized_model_dir = "OPEA/MiniMax-Text-01-int4-sym-inc-preview"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(quantized_model_dir, trust_remote_code=True, torch_dtype=torch.bfloat16,
                                             device_map="auto")


def forward_hook(module, input, output):
    # Clamp activations to the finite FP16 range to avoid overflow, then cast to bf16.
    return torch.clamp(output, -65504, 65504).to(torch.bfloat16)


def register_fp16_hooks(model):
    # Attach the clamping hook to every quantized and full-precision linear layer.
    for name, module in model.named_modules():
        if "QuantLinear" in module.__class__.__name__ or isinstance(module, torch.nn.Linear):
            module.register_forward_hook(forward_hook)


register_fp16_hooks(model)
tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "为什么企鹅没有被北极熊吃掉?",  # Why aren't penguins eaten by polar bears?
    "树枝上有十只鸟,如果你射杀了一只,还剩下几只?请用中文回答",  # There are ten birds on a branch; if you shoot one, how many remain? Answer in Chinese.
    "How many r in strawberry.",
    "There is a girl who likes adventure,",
    "hello"
]

texts = []
for prompt in prompts:
    messages = [
        {"role": "system", "content": [{"type": "text",
                                        "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
        {"role": "user", "content": [{"type": "text", "text": prompt}]},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, padding_side='left')

outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_new_tokens=512,
    num_return_sequences=1,
    do_sample=False,
    eos_token_id=200020,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]

decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)


"""
Prompt: 为什么企鹅没有被北极熊吃掉?
Generated: ### 1. **地理分布差异**
   - **企鹅**:主要生活在**南半球**,例如**南极洲**。在南极洲,企鹅没有天敌,因为这里的环境非常恶劣,食物资源有限,动物数量也有限,企鹅是这里的顶级掠食者之一。
   - **北极熊**:主要生活在**北半球**,例如**北极地区**。北极熊是北极地区的顶级掠食者之一,它们以海豹等动物为食。
   - **结论**:由于**地理分布**的差异,**企鹅和北极熊**在自然界中**无法相遇**,因此**北极熊无法吃掉企鹅**。

### 2. **人为因素**
   - **动物园或水族馆**:在**人为因素**的影响
--------------------------------------------------
Prompt: 树枝上有十只鸟,如果你射杀了一只,还剩下几只?请用中文回答
Generated: 让我一步步思考这个问题:

1. 首先,树枝上有10只鸟
2. 射杀1只后,还剩9只
3. 但实际上,当枪声响起,其他鸟会因惊吓而飞走
4. 所以,当射杀1只后,树上不会剩下任何鸟

因此,答案是:0只

因为鸟会因枪声而飞走,不会继续停留在树上。
--------------------------------------------------
Prompt: How many r in strawberry.
Generated: Let me solve this step by step.

1. First, let me count the r's in "strawberry" as I say it
   * s (not r)
   * t (not r)
   * r (1st r)
   * a (not r)
   * w (not r)
   * b (not r)
   * b (not r)
   * e (not r)
   * r (2nd r)
   * r (3rd r)
   * y (not r)

2. Counting the r's: 3 r's

Therefore, there is 3 r in strawberry.

The answer is 3.
--------------------------------------------------
Prompt: There is a girl who likes adventure,
Generated: There is a girl who likes adventure, and her name is Emily. Emily has always been drawn to the thrill of the unknown, the excitement of stepping into uncharted territory. Here is a story about
--------------------------------------------------
Prompt: hello
Generated: Hello! How can I assist you today?
--------------------------------------------------
"""

Generate the model (2*80G)

pip3 install git+https://github.com/intel/auto-round.git

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MiniMaxAI/MiniMax-Text-01"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16)

fp_layers = [f"model.layers.{i}.block_sparse_moe.gate" for i in range(config.num_hidden_layers)]
layer_config = {}
for fp_layer in fp_layers:
    layer_config[fp_layer] = {"bits": 16}
    
# Distribute the 32 experts of every MoE layer across the two 80G GPUs for
# quantization: experts 0-13 on GPU 0, experts 14-31 on GPU 1.
# The keys are regular expressions matched against module names.
device_map = {}
for i in range(32):
    key = fr"model\.layers\.\d+\.block_sparse_moe\.experts\.{i}\..*$"
    if i < 14:
        device_map[key] = 0
    else:
        device_map[key] = 1


from auto_round import AutoRound

autoround = AutoRound(model=model, tokenizer=tokenizer, layer_config=layer_config, device_map=device_map,
                      batch_size=1, gradient_accumulate_steps=4, seqlen=512)
autoround.quantize()
autoround.save_quantized(format="auto_round", output_dir="tmp_autoround")
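
To sanity-check the exported checkpoint, it can be reloaded the same way as in the inference section above (a minimal sketch; "tmp_autoround" is the output_dir passed to save_quantized):

from auto_round import AutoRoundConfig  # must be imported to load the AutoRound format
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("tmp_autoround", trust_remote_code=True,
                                             torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("tmp_autoround", trust_remote_code=True)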

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

  • Intel Neural Compressor

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

