aquif-moe-800m

aquif-moe-800m is our first Mixture of Experts (MoE) model, with only 800 million active parameters. Despite its compact size, it delivers exceptional performance per gigabyte of VRAM compared to larger models.

Model Overview

  • Name: aquif-moe-800m
  • Parameters: 800 million active parameters (3.3 billion total)
  • Context Window: 128,000 tokens
  • Architecture: Mixture of Experts (MoE)
  • Type: General-purpose LLM
  • Tensor type: BF16 (Safetensors)
  • Hosted on: Ollama

Key Features

  • Extremely efficient VRAM utilization (57.8 average benchmark points per GB)
  • Expansive 128K-token context window for handling long documents (see the configuration sketch after this list)
  • Competitive performance against models with more parameters
  • Optimized for local inference on consumer hardware
  • Ideal for resource-constrained environments
  • Supports high-throughput concurrent sessions
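
Per the context-window note above, Ollama's default context size can be raised per request. Below is a minimal sketch, assuming the ollama Python client (pip install ollama) and a local server with the model already pulled; the file name report.txt and the num_ctx value of 32768 are illustrative assumptions, and num_ctx can be raised toward the 128K limit at the cost of extra memory.

# Sketch: summarizing a long document with a larger context window.
import ollama

with open("report.txt", "r", encoding="utf-8") as f:  # hypothetical long document
    document = f.read()

response = ollama.chat(
    model="aquiffoo/aquif-moe-800m",
    messages=[{"role": "user", "content": f"Summarize this document:\n\n{document}"}],
    options={"num_ctx": 32768},  # assumed value; raise toward 128K as memory allows
)
print(response["message"]["content"])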

Performance Benchmarks

aquif-moe-800m performs competitively across multiple benchmarks, particularly once its parameter count is taken into account:

Benchmark    aquif-moe (0.8B)    Llama 3.2 (1B)    Gemma 3 (4B)
MMLU         52.2                49.3              59.6
HumanEval    37.5                22.6              36.0
GSM8K        49.0                44.4              38.4
Average      46.2                38.8              44.7

VRAM Efficiency

One of aquif-moe-800m's standout features is its exceptional VRAM efficiency:

Model        Average Performance    VRAM (GB)    Performance per GB
aquif-moe    46.2                   0.8          57.8
Llama 3.2    38.8                   1.2          32.3
Gemma 3      44.7                   4.3          10.4
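
The derived figures in both tables follow from simple arithmetic: each Average is the mean of the three benchmark scores, and Performance per GB divides that average by the VRAM footprint. A short Python check using the numbers above:

# Recompute the Average and Performance-per-GB columns from the raw scores.
scores = {
    "aquif-moe (0.8B)": ([52.2, 37.5, 49.0], 0.8),
    "Llama 3.2 (1B)":   ([49.3, 22.6, 44.4], 1.2),
    "Gemma 3 (4B)":     ([59.6, 36.0, 38.4], 4.3),
}
for model, (benchmarks, vram_gb) in scores.items():
    avg = sum(benchmarks) / len(benchmarks)
    print(f"{model}: average {avg:.1f}, {avg / vram_gb:.1f} points per GB")
# aquif-moe (0.8B): average 46.2, 57.8 points per GB
# Llama 3.2 (1B): average 38.8, 32.3 points per GB
# Gemma 3 (4B): average 44.7, 10.4 points per GB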

Use Cases

  • Edge computing and resource-constrained environments
  • Mobile and embedded applications
  • Local development environments
  • Quick prototyping and testing
  • Personal assistants on consumer hardware
  • Enterprise deployment with multiple concurrent sessions
  • Long document analysis and summarization
  • High-throughput production environments

Limitations

  • No reasoning ("thinking") mode
  • May hallucinate in some domains
  • May struggle with more complex reasoning tasks
  • Not optimized for specialized domains

Getting Started

To run via Ollama:

ollama run aquiffoo/aquif-moe-800m
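
Once pulled, the model can also be called programmatically. A minimal sketch, assuming the ollama Python client (pip install ollama) and a local Ollama server on its default port:

# Send a single chat request to the locally running model.
import ollama

response = ollama.chat(
    model="aquiffoo/aquif-moe-800m",
    messages=[{"role": "user", "content": "Explain Mixture of Experts in two sentences."}],
)
print(response["message"]["content"])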

Technical Details

aquif-moe-800m uses a Mixture of Experts architecture to achieve high parameter efficiency. Although the model contains 3.3 billion parameters in total, only 800 million are activated for each token during inference, which significantly reduces VRAM requirements while maintaining competitive performance.
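
To make the idea concrete, here is a toy sketch of top-k expert routing. This is not aquif-moe-800m's actual implementation; the expert count, k, and dimensions are made-up values chosen only to show why far fewer parameters run per token than exist in total.

# Toy MoE layer: a router scores experts per token, and only the top-k
# experts actually execute, so most parameters stay idle on any given token.
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, d_model = 8, 2, 64                 # hypothetical configuration
router = rng.standard_normal((d_model, n_experts))
experts = rng.standard_normal((n_experts, d_model, d_model))  # one simplified FFN per expert

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router                          # score every expert for this token
    top = np.argsort(logits)[-k:]                # keep only the k best experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                         # softmax over the selected experts
    # Only k of n_experts networks execute; analogously, ~800M of 3.3B
    # parameters are active per token in aquif-moe-800m.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

print(moe_layer(rng.standard_normal(d_model)).shape)  # (64,)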

Enterprise Deployment

The model's exceptional VRAM efficiency makes it particularly valuable for enterprise deployments:

  • Concurrent Sessions: Run multiple model instances on a single GPU
  • High Throughput: Serve more users with the same hardware footprint
  • Cost Efficiency: Lower infrastructure costs for production deployments
  • Scalability: Easier horizontal scaling across available resources

The 128K context window enables comprehensive document analysis while maintaining the model's efficient resource utilization, making it suitable for enterprises dealing with large documents or extended conversations.
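
To illustrate the concurrent-sessions point above, the sketch below fans placeholder prompts out to one local Ollama server through its documented REST endpoint; whether requests are truly processed in parallel depends on server configuration (e.g., the OLLAMA_NUM_PARALLEL environment variable).

# Serve several sessions against a single local Ollama instance.
from concurrent.futures import ThreadPoolExecutor
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(prompt: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "aquiffoo/aquif-moe-800m", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [
    "Summarize the benefits of MoE models.",          # placeholder prompts
    "Write a haiku about GPUs.",
    "List three uses for a 128K context window.",
]

with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])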

Note: All performance metrics are approximate estimates based on internal evaluations.
