# Model Performance Testing Methodology

This document outlines the methodology used for testing various LLM models through Ollama on a "GPU-poor" setup.

## Hardware Specifications

### GPU

- Model: AMD Radeon RX 7600 XT 16GB
- Note: Currently the most affordable ("GPU-poorest") graphics card on the market with 16GB of VRAM, making it an excellent choice for budget-conscious AI enthusiasts

### System Specifications

- CPU: AMD Ryzen 7 5700X (8 cores / 16 threads) @ 4.66 GHz
- Motherboard: B550 Pro4
- RAM: 64GB
- OS: Debian 12 Bookworm
- Kernel: Linux 6.8.12-8
- Testing Environment: Ollama with ROCm backend

## Testing Methodology

Each model is tested using a consistent creative writing prompt designed to evaluate both the model's performance and its creative capabilities. The testing process includes:

1. Model Loading: Each model is loaded fresh before testing
2. Initial Warmup: A small test prompt is run to ensure the model is properly loaded
3. Main Test: A comprehensive creative writing prompt is processed
4. Performance Metrics Collection: Various metrics are gathered during generation

### Test Prompt

The following creative writing prompt is used to test all models:

```
You are a creative writing assistant. Write a short story about a futuristic city where:
1. The city is powered by a mysterious energy source
2. The inhabitants have developed unique abilities
3. There's a hidden conflict between different factions
4. The protagonist discovers a shocking truth about the city's origins

Make the story engaging and include vivid descriptions of the city's architecture and technology.
```

This prompt was chosen because it:

- Requires creative thinking and complex reasoning
- Generates substantial output (typically 500-1000 tokens)
- Tests both context understanding and generation capabilities
- Produces consistent-length outputs for fair comparison

## Metrics Collected

For each model, we collect and analyze:
1. Performance Metrics:
   - Tokens per second (overall)
   - Generation tokens per second
   - Total response time
   - Total tokens generated
2. Resource Usage:
   - VRAM usage
   - Model size
   - Parameter count
3. Model Information:
   - Quantization level
   - Model format
   - Model family

## Testing Parameters

All tests are run with consistent generation parameters:

- Temperature: 0.7
- Top P: 0.9
- Top K: 40
- Max Tokens: 1000
- Repetition Penalty: 1.0
- Seed: 42 (for reproducibility)

## Notes

- Tests are run sequentially to ensure no resource contention
- A 3-second cooldown period is maintained between tests
- Models are unloaded after each test to ensure a clean state
- Results are saved in both detailed and summary formats
- The testing script automatically handles model pulling and cleanup
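The loop described above (warmup, main test, metrics collection, unload, cooldown) can be sketched as a small Python script. This is a minimal illustration, not the actual testing script: it assumes a local Ollama server at the default `http://localhost:11434` and uses the public Ollama REST API (`/api/generate` with `stream: false`, whose response includes the `total_duration`, `eval_count`, and `eval_duration` fields in nanoseconds). The option names match the parameters listed under "Testing Parameters".

```python
"""Sketch of the per-model benchmark loop (illustrative, not the real script)."""
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

# Generation parameters from the "Testing Parameters" section.
OPTIONS = {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "num_predict": 1000,    # max tokens
    "repeat_penalty": 1.0,
    "seed": 42,             # for reproducibility
}

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed: tokens generated / generation time in seconds."""
    return eval_count / (eval_duration_ns / 1e9)

def run_test(model: str, prompt: str, keep_alive="5m") -> dict:
    """Run one prompt against one model and return the collected metrics."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": OPTIONS,
        "keep_alive": keep_alive,  # 0 unloads the model right after the request
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return {
        "total_time_s": data["total_duration"] / 1e9,
        "tokens_generated": data["eval_count"],
        "gen_tokens_per_s": tokens_per_second(
            data["eval_count"], data["eval_duration"]
        ),
    }

def benchmark(models: list[str], prompt: str, cooldown: float = 3.0) -> dict:
    """Test models sequentially, with a cooldown and unload between runs."""
    results = {}
    for model in models:
        run_test(model, "Hi")  # warmup: confirm the model is loaded
        # Main test: keep_alive=0 unloads the model for a clean state.
        results[model] = run_test(model, prompt, keep_alive=0)
        time.sleep(cooldown)   # 3-second cooldown between tests
    return results
```

Sequential execution plus `keep_alive=0` mirrors the notes above: only one model occupies VRAM at a time, so runs do not contend for resources.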