aquif-moe-800m

aquif-moe-800m is our first Mixture of Experts (MoE) model, with only 800 million active parameters. Despite its compact size, it delivers exceptional performance per gigabyte of VRAM compared to larger models.

Model Overview

  • Name: aquif-moe-800m
  • Parameters: 800 million active parameters (3.3 billion total)
  • Context Window: 128,000 tokens
  • Architecture: Mixture of Experts (MoE)
  • Type: General-purpose LLM
  • Tensor type: BF16 (Safetensors)
  • Hosted on: Ollama

Key Features

  • Extremely efficient VRAM utilization (57.8 average benchmark points per GB)
  • Expansive 128K-token context window for handling long documents (see the configuration sketch after this list)
  • Competitive performance against models with more parameters
  • Optimized for local inference on consumer hardware
  • Ideal for resource-constrained environments
  • Supports high-throughput concurrent sessions
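
Per the context-window note above, Ollama's default context size can be raised per request. Below is a minimal sketch, assuming the ollama Python client (pip install ollama) and a local server with the model already pulled; the file name report.txt and the num_ctx value of 32768 are illustrative assumptions, and num_ctx can be raised toward the 128K limit at the cost of extra memory.

# Sketch: summarizing a long document with a larger context window.
import ollama

with open("report.txt", "r", encoding="utf-8") as f:  # hypothetical long document
    document = f.read()

response = ollama.chat(
    model="aquiffoo/aquif-moe-800m",
    messages=[{"role": "user", "content": f"Summarize this document:\n\n{document}"}],
    options={"num_ctx": 32768},  # assumed value; raise toward 128K as memory allows
)
print(response["message"]["content"])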

Performance Benchmarks

aquif-moe-800m performs competitively across multiple benchmarks, particularly once its parameter count is taken into account:

Benchmark    aquif-moe (0.8B)    Llama 3.2 (1B)    Gemma 3 (4B)
MMLU         52.2                49.3              59.6
HumanEval    37.5                22.6              36.0
GSM8K        49.0                44.4              38.4
Average      46.2                38.8              44.7

VRAM Efficiency

One of aquif-moe-800m's standout features is its exceptional VRAM efficiency:

Model        Average Performance    VRAM (GB)    Performance per GB
aquif-moe    46.2                   0.8          57.8
Llama 3.2    38.8                   1.2          32.3
Gemma 3      44.7                   4.3          10.4
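
The derived figures in both tables follow from simple arithmetic: each Average is the mean of the three benchmark scores, and Performance per GB divides that average by the VRAM footprint. A short Python check using the numbers above:

# Recompute the Average and Performance-per-GB columns from the raw scores.
scores = {
    "aquif-moe (0.8B)": ([52.2, 37.5, 49.0], 0.8),
    "Llama 3.2 (1B)":   ([49.3, 22.6, 44.4], 1.2),
    "Gemma 3 (4B)":     ([59.6, 36.0, 38.4], 4.3),
}
for model, (benchmarks, vram_gb) in scores.items():
    avg = sum(benchmarks) / len(benchmarks)
    print(f"{model}: average {avg:.1f}, {avg / vram_gb:.1f} points per GB")
# aquif-moe (0.8B): average 46.2, 57.8 points per GB
# Llama 3.2 (1B): average 38.8, 32.3 points per GB
# Gemma 3 (4B): average 44.7, 10.4 points per GB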

Use Cases

  • Edge computing and resource-constrained environments
  • Mobile and embedded applications
  • Local development environments
  • Quick prototyping and testing
  • Personal assistants on consumer hardware
  • Enterprise deployment with multiple concurrent sessions
  • Long document analysis and summarization
  • High-throughput production environments

Limitations

  • No reasoning ("thinking") mode
  • May hallucinate in some domains
  • May struggle with more complex reasoning tasks
  • Not optimized for specialized domains

Getting Started

To run via Ollama:

ollama run aquiffoo/aquif-moe-800m
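
Once pulled, the model can also be called programmatically. A minimal sketch, assuming the ollama Python client (pip install ollama) and a local Ollama server on its default port:

# Send a single chat request to the locally running model.
import ollama

response = ollama.chat(
    model="aquiffoo/aquif-moe-800m",
    messages=[{"role": "user", "content": "Explain Mixture of Experts in two sentences."}],
)
print(response["message"]["content"])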

Technical Details

aquif-moe-800m uses a Mixture of Experts architecture to achieve high parameter efficiency. Although the model contains 3.3 billion parameters in total, only 800 million are activated for each token during inference, which significantly reduces VRAM requirements while maintaining competitive performance.
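
To make the idea concrete, here is a toy sketch of top-k expert routing. This is not aquif-moe-800m's actual implementation; the expert count, k, and dimensions are made-up values chosen only to show why far fewer parameters run per token than exist in total.

# Toy MoE layer: a router scores experts per token, and only the top-k
# experts actually execute, so most parameters stay idle on any given token.
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, d_model = 8, 2, 64                 # hypothetical configuration
router = rng.standard_normal((d_model, n_experts))
experts = rng.standard_normal((n_experts, d_model, d_model))  # one simplified FFN per expert

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router                          # score every expert for this token
    top = np.argsort(logits)[-k:]                # keep only the k best experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                         # softmax over the selected experts
    # Only k of n_experts networks execute; analogously, ~800M of 3.3B
    # parameters are active per token in aquif-moe-800m.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

print(moe_layer(rng.standard_normal(d_model)).shape)  # (64,)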

Enterprise Deployment

The model's exceptional VRAM efficiency makes it particularly valuable for enterprise deployments:

  • Concurrent Sessions: Run multiple model instances on a single GPU
  • High Throughput: Serve more users with the same hardware footprint
  • Cost Efficiency: Lower infrastructure costs for production deployments
  • Scalability: Easier horizontal scaling across available resources

The 128K context window enables comprehensive document analysis while maintaining the model's efficient resource utilization, making it suitable for enterprises dealing with large documents or extended conversations.
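
To illustrate the concurrent-sessions point above, the sketch below fans placeholder prompts out to one local Ollama server through its documented REST endpoint; whether requests are truly processed in parallel depends on server configuration (e.g., the OLLAMA_NUM_PARALLEL environment variable).

# Serve several sessions against a single local Ollama instance.
from concurrent.futures import ThreadPoolExecutor
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(prompt: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "aquiffoo/aquif-moe-800m", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [
    "Summarize the benefits of MoE models.",          # placeholder prompts
    "Write a haiku about GPUs.",
    "List three uses for a 128K context window.",
]

with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])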

Note: All performance metrics are approximate estimates based on internal evaluations.
