Could you share more information about the on device demo?
Your demo looks impressive! To achieve the performance shown in your on-device demo, it seems that a processing speed of at least one frame per second is required.
However, real-time streaming input at a high frame rate on an iPad (even with the M4 chip) seems unlikely: according to the llama.cpp performance data (https://github.com/ggerganov/llama.cpp/discussions/4167), llama.cpp on the M4 chip achieves only 230.18 tokens/s.
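To make that concern concrete, here is a rough back-of-envelope check. The 230.18 tokens/s figure is the one reported in the linked discussion; the tokens-per-frame values are purely my assumptions, since the number of visual tokens each sampled frame adds to the context is not stated anywhere.

```cpp
// Rough back-of-envelope check of the frame-rate concern above.
// The 230.18 tok/s figure comes from the linked llama.cpp discussion;
// the tokens-per-frame values are assumptions, since the demo does not
// state how many tokens each sampled frame contributes to the context.
#include <cstdio>

int main() {
    const double decode_tok_per_s = 230.18;                 // reported M4 throughput
    const int assumed_tokens_per_frame[] = {64, 128, 256};  // hypothetical visual-token budgets

    for (int tokens : assumed_tokens_per_frame) {
        // Ignoring generation and any prefill/decode asymmetry, this is the
        // ceiling on how many frames per second the model could ingest.
        const double max_fps = decode_tok_per_s / tokens;
        std::printf("%3d tokens/frame -> at most %.2f frames/s\n", tokens, max_fps);
    }
    return 0;
}
```

Even under those optimistic assumptions there is little headroom left for actually generating a response, which is where my doubt about real-time streaming comes from.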
Is your demo running entirely on the iPad? If so, could you share which backend your on-device demo uses (MLC, llama.cpp, GPU, CPU, or a proprietary closed-source solution)? Also, what is the sampling frame rate of the streaming input in your demo video?
Thank you very much!
Can you tell what this video is?
What does the video above show?
Yes, the code shown in our video runs entirely on the iPad in airplane mode. The frame rate of the video input is one frame per second.
As you said, the original llama.cpp running on an iPad cannot achieve the effect shown in the video. Over the past few months our team has deeply modified the llama.cpp code so that different modules run on the NPU/GPU/CPU simultaneously, which is how the result in the video is achieved.
I will submit a PR to the official llama.cpp repository in the future so that everyone can experience the omni mode for themselves.
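For readers wondering what "different modules on different compute units" might look like, here is a minimal, illustrative-only sketch of a pipelined setup where a vision-encoder stage and a decoder stage run concurrently on separate (conceptual) units. None of these types or functions come from llama.cpp/ggml or the team's actual code; the `Unit` enum, `Channel`, `encode_vision`, and `decode_tokens` are hypothetical placeholders.

```cpp
// Illustrative sketch only: a two-stage pipeline where each stage could, in
// principle, be bound to a different compute unit (NPU/GPU/CPU). All names
// here are hypothetical placeholders, not llama.cpp/ggml APIs.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

enum class Unit { NPU, GPU, CPU };

struct Frame  { int id; };
struct Embeds { int frame_id; };

// Thread-safe queue connecting the pipeline stages.
template <typename T>
class Channel {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        T v = std::move(q_.front()); q_.pop();
        return v;
    }
private:
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

// Hypothetical per-module work; real code would call into Core ML / Metal /
// CPU kernels here. The Unit argument only documents where the work would run.
Embeds encode_vision(const Frame& f, Unit)  { return {f.id}; }
void   decode_tokens(const Embeds& e, Unit) { std::printf("frame %d decoded\n", e.frame_id); }

int main() {
    Channel<Frame>  frames;
    Channel<Embeds> embeds;

    // Vision-encoder stage, conceptually pinned to the NPU.
    std::thread encoder([&] {
        for (;;) {
            Frame f = frames.pop();
            if (f.id < 0) { embeds.push({-1}); break; }  // sentinel: shut down
            embeds.push(encode_vision(f, Unit::NPU));
        }
    });

    // Language-model decode stage, conceptually pinned to the GPU.
    std::thread decoder([&] {
        for (;;) {
            Embeds e = embeds.pop();
            if (e.frame_id < 0) break;
            decode_tokens(e, Unit::GPU);
        }
    });

    // CPU thread feeds one frame per second, matching the stated sampling rate.
    for (int i = 0; i < 3; ++i) {
        frames.push({i});
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    frames.push({-1});

    encoder.join();
    decoder.join();
    return 0;
}
```

The point of the sketch is only the overlap: while the decoder works on frame N, the encoder can already process frame N+1, so the per-frame latency is bounded by the slowest stage rather than the sum of all stages.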
Thanks a lot! Currently llama.cpp/ggml does not have on-device NPU backend support. I hope everyone can benefit from your PR.