MiniCPM-V-2_6-rkllm
Run the Powerful MiniCPM-V-2.6 Visual Language Model on RK3588!
- Inference speed (RK3588): Visual encoder 3.2s (triple core parallel) + LLM prefill 1.7s (92 tokens / 53 tps) + decoding 4.03 tps
- Memory usage (RK3588, default context length): Visual encoder 1.9GB + LLM 7.8GB = 9.7GB
Usage
Clone or download this repository locally. The model is large, so make sure you have enough disk space.
The RKNPU2 kernel driver version on the development board must be >=0.9.6 to run such a large model. Use the following command with root privileges to check the driver version:
```
> cat /sys/kernel/debug/rknpu/version
RKNPU driver: v0.9.8
```
If the version is too low, please update the driver. You may need to update the kernel or refer to official documentation for help.
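If you prefer to check this from Python (for example at the top of a launcher script), a minimal sketch that reads the same debugfs file could look like the following; it still needs root because the file lives under `/sys/kernel/debug`:

```python
# Minimal sketch: read the RKNPU driver version from debugfs (root required).
from pathlib import Path

version = Path("/sys/kernel/debug/rknpu/version").read_text().strip()
print(version)  # e.g. "RKNPU driver: v0.9.8"
```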
Install dependencies
```
pip install "numpy<2" opencv-python
```
You also need to manually install rknn-toolkit-lite2.
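Once it is installed, a quick sanity check is to import it; the `rknnlite` package is what the rknn-toolkit-lite2 wheel provides:

```python
# If this import succeeds, the rknn-toolkit-lite2 runtime bindings are available.
from rknnlite.api import RKNNLite

print("rknn-toolkit-lite2 OK:", RKNNLite.__name__)
```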
Run
```
python multiprocess_inference.py
```
If the performance is not satisfactory, you can switch the CPU frequency governor to `performance` so the CPU stays at its highest frequency, and bind the inference program to the big-core cluster (`taskset -c 4-7 python multiprocess_inference.py`).
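Alternatively, the process can pin itself from inside Python instead of being launched through taskset. The sketch below assumes the usual RK3588 layout where CPUs 4-7 are the Cortex-A76 big cores (verify on your board, e.g. with `lscpu`); on Linux, child processes created afterwards, such as those spawned by `multiprocess_inference.py`, inherit the affinity mask.

```python
# Sketch: pin the current process to CPUs 4-7 (the RK3588 big-core cluster).
import os

os.sched_setaffinity(0, {4, 5, 6, 7})  # pid 0 = the calling process
print("CPU affinity:", sorted(os.sched_getaffinity(0)))
```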
```
admin@orangepi5:~/MiniCPM-V-2_6-rkllm$ python multiprocess_inference.py
Start loading language model (size: 7810.02 MB)
I rkllm: rkllm-runtime version: 1.1.4, rknpu driver version: 0.9.8, platform: RK3588
W rknn-toolkit-lite2 version: 2.3.0
Start loading vision encoder model (size: 942.29 MB)
Vision encoder loaded in 4.95 seconds
I RKNN: [13:13:11.477] RKNN Runtime Information, librknnrt version: 2.3.0 (c949ad889d@2024-11-07T11:35:33)
I RKNN: [13:13:11.477] RKNN Driver Information, version: 0.9.8
I RKNN: [13:13:11.478] RKNN Model Information, version: 6, toolkit version: 2.2.0(compiler version: 2.2.0 (c195366594@2024-09-14T12:24:14)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: dynamic_shape
Received ready signal: vision_ready
Language model loaded in 30.56 seconds
Received ready signal: llm_ready
All models loaded, starting interactive mode...
Enter your input :
How many people are in the image {{./man.jpg}}?
Describe the person in the image in detail.
Start vision inference...
Vision encoder inference time: 3.35 seconds
In this black and white photograph, we see an older gentleman immersed in his work. He is seated at what appears to be a drafting table or desk laden with various papers and sketches, suggesting he might be an artist or designer of some sort. His hands are actively engaged on the paper; one hand seems to hold it steady while the other may be making adjustments or additions.
The man's attire consists of a light-colored shirt paired with a darker suit jacket, giving him a professional appearance. The image evokes a sense of concentration and creativity as he focuses intently on his work. There is no visible digital interference in this picture; it seems to capture an authentic moment from the past when such manual sketching was more prevalent.
The background features what looks like curtains or drapes, adding depth to the scene but keeping the focus firmly on the man and his task at hand. The absence of other people and modern elements places him squarely within a bygone era, hinting at stories untold about this individual's profession and life during that time.
(finished)
--------------------------------------------------------------------------------------
Stage        Total Time (ms)   Tokens   Time per Token (ms)   Tokens per Second
--------------------------------------------------------------------------------------
Prefill              1927.42      107                 18.01               55.51
Generate            62126.48      210                297.28                3.36
--------------------------------------------------------------------------------------
```
Model Conversion
Preparation
- Install rknn-toolkit2 v2.1.0 or higher, and rkllm-toolkit v1.1.2 or higher.
- Download this repository locally, but you don't need to download the model files ending with `.rkllm` and `.rknn`.
- Download the MiniCPM-V-2.6 Hugging Face model repository locally (https://huggingface.co/openbmb/MiniCPM-V-2_6); one way to do this is shown in the sketch below.
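If you use `huggingface_hub` for the download (git with git-lfs works just as well), a sketch could look like this; the `local_dir` value is only a placeholder:

```python
# Sketch: download the MiniCPM-V-2.6 weights with huggingface_hub.
# local_dir is a placeholder; pick a location with enough free disk space.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openbmb/MiniCPM-V-2_6",
    local_dir="./MiniCPM-V-2_6",
)
```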
Converting LLM
- Copy the `rename_tensors.py` file from this repository to the root directory of the MiniCPM-V-2.6 Hugging Face model repository and run it. After a moment, it will generate four safetensors files such as `model-renamed-00001-of-00004.safetensors`, plus a JSON file.
- Ignore the JSON file and move the four safetensors files to the root directory of this repository.
- Execute `rkllm-convert.py`. After a while, it will generate `qwen.rkllm`, which is the converted model. (A rough sketch of what such a conversion script does is shown below.)
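For orientation only, an rkllm-toolkit conversion script typically follows the pattern below. This is a hedged sketch assuming the `rkllm.api` interface from Rockchip's examples, not the contents of `rkllm-convert.py`; argument names can differ between rkllm-toolkit versions.

```python
# Illustrative rkllm-toolkit conversion sketch (not a copy of rkllm-convert.py).
# Assumes the rkllm.api interface from Rockchip's examples; check your toolkit
# version for the exact build() arguments.
from rkllm.api import RKLLM

llm = RKLLM()

# Load the Hugging Face format model directory (with the renamed safetensors files).
ret = llm.load_huggingface(model=".")
assert ret == 0, "failed to load the Hugging Face model"

# Quantize to w8a8 for the RK3588 NPU.
ret = llm.build(do_quantization=True, quantized_dtype="w8a8", target_platform="rk3588")
assert ret == 0, "build failed"

# Write the converted model.
ret = llm.export_rkllm("./qwen.rkllm")
assert ret == 0, "export failed"
```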
Converting Visual Encoder
- Copy `patched_modeling_navit_siglip.py` and `patched_resampler.py` from this repository to the root directory of the MiniCPM-V-2.6 Hugging Face model repository, and rename them to `modeling_navit_siglip.py` and `resampler.py`, replacing the original files.
- Open `vision_export_onnx.py`, modify `MODEL_PATH` to point to the MiniCPM-V-2.6 model folder, then execute it. After a while, it will generate `vision_encoder.onnx`.
- Execute `vision_convert_rknn.py`. After a while, it will generate `vision_encoder.rknn`, which is the converted visual encoder. (A rough sketch of this kind of conversion is shown below.)
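For orientation only, converting an ONNX model to RKNN with rknn-toolkit2 generally follows the pattern below. This is a hedged sketch, not the contents of `vision_convert_rknn.py`, which may additionally configure dynamic input shapes, normalization, or multi-core settings.

```python
# Illustrative ONNX -> RKNN conversion sketch with rknn-toolkit2
# (vision_convert_rknn.py in this repository is the authoritative version).
from rknn.api import RKNN

rknn = RKNN()
rknn.config(target_platform="rk3588")

ret = rknn.load_onnx(model="vision_encoder.onnx")
assert ret == 0, "failed to load the ONNX model"

# The vision encoder is kept in floating point here; no quantization dataset needed.
ret = rknn.build(do_quantization=False)
assert ret == 0, "build failed"

ret = rknn.export_rknn("vision_encoder.rknn")
assert ret == 0, "export failed"

rknn.release()
```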
Known Issues
- Due to a suspected issue in RKLLM, this model currently cannot perform inference normally. (Fixed)
- Due to an issue in RKLLM, the visual encoder and LLM cannot be loaded simultaneously; the visual encoder must be unloaded first, then the LLM loaded. If multiple inferences are required, the unloading and loading have to be repeated, which is very slow. (Fixed)
- Due to a suspected issue in RKLLM, if the visual encoder and LLM are loaded into the same Python process, LLM inference crashes with a segmentation fault. You can use multiprocessing to work around this; see `multiprocess_inference.py`.
- Due to an issue in RKLLM, LLM inference will segfault on long input sequences. See https://github.com/airockchip/rknn-llm/issues/123
- Due to the limitation of RKLLM's multimodal input, only one image can be loaded in the entire conversation. This can be solved by using embedding input, but I haven't implemented it yet.
- I haven't implemented multi-turn chat.
- There is a significant precision loss in RKLLM's w8a8 quantization.
- The code for converting the visual encoder to ONNX is taken from https://github.com/sophgo/LLM-TPU/tree/main/models/MiniCPM-V-2_6, thanks to Sophgo for providing the code. However, this conversion method seems to have removed the adaptive image partitioning algorithm from the original model, which may lead to a decrease in accuracy.