MiniCPM-V-2_6-rkllm

Run the Powerful MiniCPM-V-2.6 Visual Language Model on RK3588!

  • Inference speed (RK3588): Visual encoder 3.2s (triple core parallel) + LLM prefill 1.7s (92 tokens / 53 tps) + decoding 4.03 tps
  • Memory usage (RK3588, default context length): Visual encoder 1.9GB + LLM 7.8GB = 9.7GB

Usage

  1. Clone or download this repository locally. The model is large, so make sure you have enough disk space.

  2. The RKNPU2 kernel driver version on the development board must be >=0.9.6 to run such a large model. Use the following command with root privileges to check the driver version:

    > cat /sys/kernel/debug/rknpu/version 
    RKNPU driver: v0.9.8
    

    If the version is too low, please update the driver. You may need to update the kernel or refer to official documentation for help.

  3. Install dependencies

pip install "numpy<2" opencv-python

You also need to manually install rknn-toolkit-lite2.
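One way to install it is from the wheels shipped in the airockchip/rknn-toolkit2 GitHub repository. The exact path and wheel name depend on the release and on your Python version, so treat the commands below as a sketch rather than a fixed recipe:

git clone --depth 1 https://github.com/airockchip/rknn-toolkit2
# Pick the wheel matching your Python version (e.g. cp310 for Python 3.10).
pip install rknn-toolkit2/rknn-toolkit-lite2/packages/rknn_toolkit_lite2-*.whl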

  4. Run
python multiprocess_inference.py

If the performance is unsatisfactory, you can switch the CPU frequency governor to performance so the CPUs stay at their highest frequency, and bind the inference program to the big-core cluster (taskset -c 4-7 python multiprocess_inference.py), for example:
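On a typical RK3588 board the governor can be switched through sysfs (the policy paths below are the usual ones but may differ between kernels, so treat this as a sketch):

# Keep all CPU clusters at their maximum frequency (requires root).
for policy in /sys/devices/system/cpu/cpufreq/policy*; do
    echo performance | sudo tee "$policy/scaling_governor"
done

# Pin the inference program to the big-core cluster (cores 4-7 on RK3588).
taskset -c 4-7 python multiprocess_inference.py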

Example image used in the session below: man.jpg

admin@orangepi5:~/MiniCPM-V-2_6-rkllm$ python multiprocess_inference.py
Start loading language model (size: 7810.02 MB)
I rkllm: rkllm-runtime version: 1.1.4, rknpu driver version: 0.9.8, platform: RK3588

W rknn-toolkit-lite2 version: 2.3.0
Start loading vision encoder model (size: 942.29 MB)
Vision encoder loaded in 4.95 seconds
I RKNN: [13:13:11.477] RKNN Runtime Information, librknnrt version: 2.3.0 (c949ad889d@2024-11-07T11:35:33)
I RKNN: [13:13:11.477] RKNN Driver Information, version: 0.9.8
I RKNN: [13:13:11.478] RKNN Model Information, version: 6, toolkit version: 2.2.0(compiler version: 2.2.0 (c195366594@2024-09-14T12:24:14)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: dynamic_shape
Received ready signal: vision_ready
Language model loaded in 30.56 seconds
Received ready signal: llm_ready
All models loaded, starting interactive mode...

Enter your input :

How many people are in the image {{./man.jpg}}?
Describe the person in the image in detail.



Start vision inference...
Vision encoder inference time: 3.35 seconds
In this black and white photograph, we see an older gentleman immersed in his work. He is seated at what appears to be a drafting table or desk laden with various papers and sketches, suggesting he might be an artist or designer of some sort. His hands are actively engaged on the paper; one hand seems to hold it steady while the other may be making adjustments or additions.

The man's attire consists of a light-colored shirt paired with a darker suit jacket, giving him a professional appearance. The image evokes a sense of concentration and creativity as he focuses intently on his work. There is no visible digital interference in this picture; it seems to capture an authentic moment from the past when such manual sketching was more prevalent.

The background features what looks like curtains or drapes, adding depth to the scene but keeping the focus firmly on the man and his task at hand. The absence of other people and modern elements places him squarely within a bygone era, hinting at stories untold about this individual's profession and life during that time.

(finished)

--------------------------------------------------------------------------------------
 Stage         Total Time (ms)  Tokens    Time per Token (ms)      Tokens per Second
--------------------------------------------------------------------------------------
 Prefill       1927.42          107       18.01                    55.51
 Generate      62126.48         210       297.28                   3.36
--------------------------------------------------------------------------------------

Model Conversion

Preparation

  1. Install rknn-toolkit2 v2.1.0 or higher and rkllm-toolkit v1.1.2 or higher (a possible install sketch follows this list).
  2. Download this repository locally, but you don't need to download the model files ending with .rkllm and .rknn.
  3. Download the MiniCPM-V-2.6 Hugging Face model repository locally. (https://huggingface.co/openbmb/MiniCPM-V-2_6)
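A possible way to set up the conversion environment (the conversion toolkits are typically run on an x86-64 Linux PC rather than on the board; package sources and wheel names vary between releases, so this is only a sketch):

# rknn-toolkit2 is published on PyPI.
pip install "rknn-toolkit2>=2.1.0"

# rkllm-toolkit is distributed as a wheel in the airockchip/rknn-llm GitHub repository;
# download the wheel matching your Python version and install it, for example:
pip install rkllm_toolkit-*.whl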

Converting LLM

  1. Copy the rename_tensors.py script from this repository into the root directory of the MiniCPM-V-2.6 Hugging Face model repository and run it. After a short wait it will generate four safetensors files named like model-renamed-00001-of-00004.safetensors, plus a JSON index file.
  2. Ignore the JSON file and move the four renamed safetensors files into the root directory of this repository.
  3. Execute rkllm-convert.py. After a while, it will generate qwen.rkllm, which is the converted model.
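For reference, rkllm-convert.py boils down to a few calls to the rkllm-toolkit API. The sketch below is only an approximation based on the public rkllm-toolkit examples; argument names can differ between toolkit versions, and the rkllm-convert.py in this repository is authoritative.

# Approximate shape of an rkllm-toolkit conversion script (illustrative only).
from rkllm.api import RKLLM

MODEL_DIR = "."   # directory containing the renamed safetensors files

llm = RKLLM()

# Load the Hugging Face checkpoint.
ret = llm.load_huggingface(model=MODEL_DIR)
assert ret == 0, "load_huggingface failed"

# Quantize to w8a8 (see Known Issues for its precision loss) and build for the RK3588 NPU.
# Other build() arguments (quantization algorithm, calibration dataset, ...) are omitted here.
ret = llm.build(do_quantization=True, quantized_dtype="w8a8", target_platform="rk3588")
assert ret == 0, "build failed"

# Export the converted model.
ret = llm.export_rkllm("./qwen.rkllm")
assert ret == 0, "export_rkllm failed"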

Converting Visual Encoder

  1. Copy patched_modeling_navit_siglip.py and patched_resampler.py from this repository into the root directory of the MiniCPM-V-2.6 Hugging Face model repository, renaming them to modeling_navit_siglip.py and resampler.py so that they replace the original files.

  2. Open vision_export_onnx.py and set MODEL_PATH to the path of the MiniCPM-V-2.6 model folder, then run it. After a while it will generate vision_encoder.onnx.

  3. Execute vision_convert_rknn.py. After a while, it will generate vision_encoder.rknn, which is the converted visual encoder.
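For reference, vision_convert_rknn.py is essentially a standard rknn-toolkit2 ONNX conversion. The sketch below is an approximation (whether quantization is applied and which config options are set are assumptions); the vision_convert_rknn.py in this repository is authoritative.

# Approximate shape of the ONNX-to-RKNN conversion (illustrative only).
from rknn.api import RKNN

rknn = RKNN(verbose=True)

# Target the RK3588 NPU; the encoder is assumed to stay in floating point here.
rknn.config(target_platform="rk3588")

ret = rknn.load_onnx(model="vision_encoder.onnx")
assert ret == 0, "load_onnx failed"

ret = rknn.build(do_quantization=False)
assert ret == 0, "build failed"

ret = rknn.export_rknn("vision_encoder.rknn")
assert ret == 0, "export_rknn failed"

rknn.release()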

Known Issues

  • Due to a suspected issue in RKLLM, this model previously could not perform inference normally. (Fixed)
  • Due to an issue in RKLLM, the visual encoder and the LLM previously could not be loaded at the same time: the visual encoder had to be unloaded and the LLM reloaded for every inference, which was very slow. (Fixed)
  • Due to a suspected issue in RKLLM, if the visual encoder and the LLM are loaded into the same Python process, LLM inference crashes with a segmentation fault. This can be worked around with multiprocessing; see multiprocess_inference.py and the sketch after this list.
  • Due to an issue in RKLLM, LLM inference will segfault with long input sequences. See https://github.com/airockchip/rknn-llm/issues/123
  • Due to a limitation of RKLLM's multimodal input, only one image can be loaded in the entire conversation. This could be solved by using embedding input, but that is not implemented yet.
  • Multi-turn chat is not implemented.
  • There is a significant precision loss in RKLLM's w8a8 quantization.
  • The code for converting the visual encoder to ONNX is taken from https://github.com/sophgo/LLM-TPU/tree/main/models/MiniCPM-V-2_6 (thanks to Sophgo for providing it). However, this conversion method seems to remove the adaptive image partitioning algorithm from the original model, which may reduce accuracy.
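For the segmentation fault noted above, the idea behind multiprocess_inference.py is to keep the RKNN vision encoder and the RKLLM runtime in separate processes and pass only the image embeddings between them. The sketch below shows that pattern in a minimal form; the queue protocol and names are illustrative, not this repository's actual interface.

# Minimal process-separation sketch: only the child process loads librknnrt, so the
# RKLLM runtime in the parent never shares an address space with the vision encoder.
import multiprocessing as mp

def vision_worker(task_q, result_q, model_path="vision_encoder.rknn"):
    from rknnlite.api import RKNNLite   # imported here so only this process loads it
    rknn = RKNNLite()
    rknn.load_rknn(model_path)
    rknn.init_runtime()
    for pixels in iter(task_q.get, None):   # None is the shutdown sentinel
        result_q.put(rknn.inference(inputs=[pixels]))
    rknn.release()

if __name__ == "__main__":
    task_q, result_q = mp.Queue(), mp.Queue()
    mp.Process(target=vision_worker, args=(task_q, result_q), daemon=True).start()
    # The parent process would load the RKLLM model here, push preprocessed images into
    # task_q, and feed the returned embeddings to the LLM as multimodal input.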
