# Quantized version of LeoLM/leo-hessianai-13b-chat, compatible with MLC / web-llm
This is a web-llm compatible version of LeoLM's German "leo-hessianai" language model: a 4-bit quantized build of LeoLM/leo-hessianai-13b-chat, a 13-billion-parameter chat model based on Llama 2.
It can be used with the MLC LLM project (https://llm.mlc.ai/); in particular, you can run it directly in the browser with web-llm (https://github.com/mlc-ai/web-llm).
Expect some quality loss due to quantization; the model is nevertheless still able to generate good German text.
This model was created by running:

```shell
mlc_chat convert_weight leo-hessianai-13b-chat --quantization q4f16_1 -o leo-hessianai-13b-chat-q4f16_1-MLC
```
where `leo-hessianai-13b-chat` is the directory into which the original model was downloaded. This follows the weight-conversion guide at https://llm.mlc.ai/docs/compilation/convert_weights.html
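The conversion guide linked above also describes a companion `gen_config` step that produces the chat configuration file. A hedged sketch of that step, assuming the same directory layout as the command above (the exact flag names and the `--conv-template` value depend on your mlc_chat version; `chatml` is a guess based on LeoLM's chat models using a ChatML-style prompt):

```shell
# Sketch only: flag names and available template names vary between mlc_chat versions.
mlc_chat gen_config leo-hessianai-13b-chat \
  --quantization q4f16_1 \
  --conv-template chatml \
  -o leo-hessianai-13b-chat-q4f16_1-MLC
```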
The model here includes the mlc-chat-config.json file, which implements the correct prompt format for this particular model, as described in https://llm.mlc.ai/docs/get_started/mlc_chat_config.html
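For illustration, the prompt-format portion of such a config looks roughly like the sketch below. The field names follow the mlc_chat_config documentation linked above; the concrete values are illustrative assumptions (LeoLM's chat models use a ChatML-style prompt), not copied from the shipped file:

```json
{
  "conv_config": {
    "system": "Du bist ein hilfreicher Assistent.",
    "roles": ["<|im_start|>user", "<|im_start|>assistant"],
    "seps": ["<|im_end|>\n"],
    "stop_str": "<|im_end|>",
    "add_bos": true
  }
}
```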
To use this model with the simple-chat example of web-llm (https://github.com/mlc-ai/web-llm/tree/main/examples/simple-chat), edit your copy of gh-config.js to include the following:
```javascript
export default {
  "model_list": [
    {
      "model_url": "https://huggingface.co/poemAI/leo-hessianai-13b-chat-q4f16_1-MLC/resolve/main/",
      "local_id": "leo-hessianai-13b-chat-poemai",
      "model_lib_url": "https://raw.githubusercontent.com/mlc-ai/binary-mlc-llm-libs/main/Llama-2-13b-chat-hf/Llama-2-13b-chat-hf-q4f16_1-ctx4k_cs1k-webgpu.wasm",
      "vram_required_MB": 9109.03,
      "low_resource_required": false,
    },
  ],
  "use_web_worker": true
}
```
As you can see, this quantized model can reuse the prebuilt Llama-2-13b WebGPU wasm library provided by mlc-ai.
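Beyond the simple-chat example, the model can also be loaded programmatically. A minimal sketch, assuming the web-llm 0.2.x `ChatModule` API (class and method names may differ in other releases); this runs only in a browser with WebGPU support, not in plain Node.js:

```javascript
import * as webllm from "@mlc-ai/web-llm";
import appConfig from "./gh-config.js"; // the config shown above

// Create the chat module and report download/compile progress.
const chat = new webllm.ChatModule();
chat.setInitProgressCallback((report) => console.log(report.text));

// "leo-hessianai-13b-chat-poemai" is the local_id from the config above.
await chat.reload("leo-hessianai-13b-chat-poemai", undefined, appConfig);

const reply = await chat.generate("Schreibe einen kurzen Gruß auf Deutsch.");
console.log(reply);
```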
Note that users of this model will need to download 6.8 GB of data to run it in their browser. Hardware requirements for running the model in the browser have not yet been thoroughly tested, so your mileage may vary. It does, however, run well on a MacBook Pro M2 Max with 64 GB RAM, macOS Sonoma, and Chrome 121.0.6167.184 (arm64).
Thanks to the people at LeoLM for providing the original model, and to the MLC team for providing the tools to convert it to a web-llm compatible format.