OmniNeural — World’s First NPU-aware Multimodal Model
Overview
OmniNeural is the first fully multimodal model designed specifically for Neural Processing Units (NPUs). It natively understands text, images, and audio, and runs across PCs, mobile devices, automobiles, IoT, and robotics.
Demos
📱 Mobile Phone NPU - Demo on Samsung S25 Ultra
The first-ever fully local, multimodal, and conversational AI assistant that hears you and sees what you see, running natively on Snapdragon NPU for long battery life and low latency.
✨ PC NPU - Capabilities Highlights
🖼️ Multi-Image Reasoning | 🤖 Image + Text → Function Call | 🎶 Multi-Audio Comparison
Key Features
- Multimodal Intelligence – Processes text, image, and audio in a unified model for richer reasoning and perception.
- NPU-Optimized Architecture – Uses ReLU ops, sparse tensors, convolutional layers, and static graph execution for maximum throughput; 20% faster than non-NPU-aware models.
- Hardware-Aware Attention – Attention patterns tuned for the NPU, lowering compute and memory demand.
- Native Static Graph – Supports variable-length multimodal inputs with stable, predictable latency (see the padding sketch after this list).
- Performance Gains – 9× faster audio processing and 3.5× faster image processing on NPUs compared to baseline encoders.
- Privacy-First Inference – All computation stays local: private, offline-capable, and cost-efficient.
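To make the static-graph point concrete, the sketch below shows one common way to serve variable-length inputs through fixed compiled shapes: pad each request up to the nearest pre-compiled bucket. This is a minimal illustration under assumed bucket sizes; `BUCKETS` and `pad_to_bucket` are hypothetical names, not OmniNeural internals.

```python
# Illustrative only: one standard way a static-graph runtime can accept
# variable-length inputs is to pad each request to the nearest bucket of
# pre-compiled shapes. BUCKETS and pad_to_bucket are hypothetical names.
import numpy as np

BUCKETS = (128, 256, 512)  # assumed compile-time sequence lengths

def pad_to_bucket(tokens: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Right-pad a 1-D token array to the smallest bucket that fits.

    Returns the padded array plus a boolean mask of the real positions,
    which attention layers can use to ignore the padding.
    """
    length = tokens.shape[0]
    bucket = next(b for b in BUCKETS if b >= length)  # StopIteration if too long
    padded = np.zeros(bucket, dtype=tokens.dtype)
    padded[:length] = tokens
    mask = np.zeros(bucket, dtype=bool)
    mask[:length] = True
    return padded, mask

padded, mask = pad_to_bucket(np.arange(200))
print(padded.shape, int(mask.sum()))  # (256,) 200
```

Because every request resolves to one of a few fixed shapes, the compiled NPU graph never re-specializes at run time, which is what keeps latency stable and predictable.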
Performance / Benchmarks
Human Evaluation (vs baselines)
- Vision: Wins or ties on ~75% of prompts against Apple Foundation, Gemma-3n-E4B, and Qwen2.5-Omni-3B.
- Audio: Clear lead over baselines, notably ahead of Gemma-3n and the Apple foundation model.
- Text: Matches or outperforms leading multimodal baselines.
Nexa Attention Speedups
- 9× faster audio encoding (vs Whisper encoder).
- 3.5× faster image encoding (vs SigLIP encoder).
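The encoder speedups above are attributed to attention patterns tuned for the NPU. This card does not publish the exact pattern, so the sketch below illustrates just one plausible example of such a pattern, a fixed-width sliding window that caps per-token compute and memory; treat it as an assumption, not Nexa Attention itself.

```python
# Assumed example of a hardware-friendly attention pattern: a fixed-width
# sliding window. Illustration only; the exact Nexa Attention pattern is
# not published on this card.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where token i may attend to tokens (i - window, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.astype(int))
# Each row has at most `window` ones, so attention cost grows linearly
# with sequence length instead of quadratically.
```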
Architecture Overview
OmniNeural’s design is tightly coupled with NPU hardware (a minimal sketch follows this list):
- NPU-friendly ops (ReLU preferred over GELU/SiLU).
- Sparse + small tensor multiplications for efficiency.
- Convolutional layers favored over linear for better NPU parallelization.
- Hardware-aware attention patterns to cut compute cost.
- Static graph execution for predictable latency.
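As a rough illustration of the first three bullets, the sketch below contrasts an NPU-friendly block (convolution plus ReLU, fixed input shapes) with the linear-plus-GELU block common in transformers. It is a minimal PyTorch sketch under assumed dimensions, not OmniNeural's actual layers.

```python
# Minimal PyTorch sketch (assumed shapes, not OmniNeural's real layers):
# conv + ReLU with fixed shapes, the NPU-friendly choice, versus the
# linear + GELU block typical of transformers.
import torch
import torch.nn as nn

class NPUFriendlyBlock(nn.Module):
    """Conv1d + ReLU: parallelizes well on NPUs and quantizes cheaply."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()  # cheaper on NPUs than GELU/SiLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv(x))

class GenericBlock(nn.Module):
    """Linear + GELU: the common transformer choice this design avoids."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.fc(x))

# Static-graph style: shapes are fixed up front so a compiled NPU graph
# never has to re-specialize at run time.
x = torch.randn(1, 128, 64)  # (batch, channels, sequence), fixed
print(NPUFriendlyBlock(128)(x).shape)              # torch.Size([1, 128, 64])
print(GenericBlock(128)(x.transpose(1, 2)).shape)  # torch.Size([1, 64, 128])
```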
Production Use Cases
PC & Mobile – On-device AI agents combine voice, vision, and text for natural, accurate responses.
- Examples: summarize slides into an email (PC), extract action items from chat (mobile).
- Benefits: Private, offline, battery-efficient.
Automotive – In-car assistants handle voice control, cabin safety, and environment awareness.
- Examples: detect risks (child unbuckled, pet left behind, loose objects) and road conditions (fog, construction).
- Benefits: Decisions run locally in milliseconds.
IoT & Robotics – Multimodal sensing for factories, AR/VR, drones, and robots.
- Examples: Defect detection, technician overlays, hazard spotting mid-flight, natural robot interaction.
- Benefits: Works without network connectivity.
How to use
⚠️ Hardware requirement: OmniNeural-4B currently runs only on Qualcomm NPUs (e.g., Snapdragon-powered AI PCs).
Apple NPU support is planned next.
1) Install Nexa-SDK
- Download and follow the steps under the "Deploy" section on Nexa's model page: Download Windows arm64 SDK
- (Other platforms coming soon)
2) Get an access token
Create a token in the Model Hub, then log in:
nexa config set license '<access_token>'
3) Run the model
Running:
nexa infer NexaAI/OmniNeural-4B
Mic mode: once the model is running, type the command below to record your voice directly in the terminal.
> /mic
For images and audio, simply drag your files into the command line. Separate multiple file paths with a space.
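For example, a hypothetical multimodal turn might look like this (the paths are placeholders, not real files):
> What changed between these two photos? C:\photos\before.jpg C:\photos\after.jpg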
Links & Community
- Issues / Feedback: Use the HF Discussions tab, or submit an issue on our Discord or the nexa-sdk GitHub.
- Roadmap & updates: Follow us on X and Discord.
If you want to see more NPU-first, multimodal releases on HF, please give our model a like ❤️.
Citation
@misc{omnineural4b,
  title={OmniNeural: World’s First NPU-aware Multimodal Model},
  author={Nexa AI},
  year={2025},
  url={https://huggingface.co/NexaAI/OmniNeural-4B},
}