# Model Performance Testing Methodology

This document outlines the methodology used for testing various LLM models through Ollama on a "GPU-poor" setup.

## Hardware Specifications

### GPU

- Model: AMD Radeon RX 7600 XT 16GB
- Note: Currently the most affordable ("GPU-poorest") graphics card on the market with 16GB of VRAM, making it an excellent choice for budget-conscious AI enthusiasts

### System Specifications

- CPU: AMD Ryzen 7 5700X (8 cores / 16 threads) @ 4.66 GHz
- Motherboard: B550 Pro4
- RAM: 64GB
- OS: Debian 12 Bookworm
- Kernel: Linux 6.8.12-8
- Testing Environment: Ollama with ROCm backend

## Testing Methodology

Each model is tested using a consistent creative writing prompt designed to evaluate both the model's performance and its creative capabilities. The testing process includes:

1. Model Loading: Each model is loaded fresh before testing
2. Initial Warmup: A small test prompt is run to ensure the model is properly loaded
3. Main Test: A comprehensive creative writing prompt is processed
4. Performance Metrics Collection: Various metrics are gathered during generation

### Test Prompt

The following creative writing prompt is used to test all models:

```
You are a creative writing assistant. Write a short story about a futuristic city where:
1. The city is powered by a mysterious energy source
2. The inhabitants have developed unique abilities
3. There's a hidden conflict between different factions
4. The protagonist discovers a shocking truth about the city's origins

Make the story engaging and include vivid descriptions of the city's architecture and technology.
```

This prompt was chosen because it:

- Requires creative thinking and complex reasoning
- Generates substantial output (typically 500-1000 tokens)
- Tests both context understanding and generation capabilities
- Produces consistent-length outputs for fair comparison

## Metrics Collected

For each model, we collect and analyze:
1. Performance Metrics:
   - Tokens per second (overall)
   - Generation tokens per second
   - Total response time
   - Total tokens generated
2. Resource Usage:
   - VRAM usage
   - Model size
   - Parameter count
3. Model Information:
   - Quantization level
   - Model format
   - Model family

## Testing Parameters

All tests are run with consistent generation parameters:

- Temperature: 0.7
- Top P: 0.9
- Top K: 40
- Max Tokens: 1000
- Repetition Penalty: 1.0
- Seed: 42 (for reproducibility)

## Notes

- Tests are run sequentially to ensure no resource contention
- A 3-second cooldown period is maintained between tests
- Models are unloaded after each test to ensure a clean state
- Results are saved in both detailed and summary formats
- The testing script automatically handles model pulling and cleanup
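The loop described above (warmup, main test, metrics collection, unload, cooldown) can be sketched as a small Python script. This is a minimal illustration, not the actual testing script: it assumes a local Ollama server at the default `http://localhost:11434` and uses the public Ollama REST API (`/api/generate` with `stream: false`, whose response includes the `total_duration`, `eval_count`, and `eval_duration` fields in nanoseconds). The option names match the parameters listed under "Testing Parameters".

```python
"""Sketch of the per-model benchmark loop (illustrative, not the real script)."""
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

# Generation parameters from the "Testing Parameters" section.
OPTIONS = {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "num_predict": 1000,    # max tokens
    "repeat_penalty": 1.0,
    "seed": 42,             # for reproducibility
}

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed: tokens generated / generation time in seconds."""
    return eval_count / (eval_duration_ns / 1e9)

def run_test(model: str, prompt: str, keep_alive="5m") -> dict:
    """Run one prompt against one model and return the collected metrics."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": OPTIONS,
        "keep_alive": keep_alive,  # 0 unloads the model right after the request
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return {
        "total_time_s": data["total_duration"] / 1e9,
        "tokens_generated": data["eval_count"],
        "gen_tokens_per_s": tokens_per_second(
            data["eval_count"], data["eval_duration"]
        ),
    }

def benchmark(models: list[str], prompt: str, cooldown: float = 3.0) -> dict:
    """Test models sequentially, with a cooldown and unload between runs."""
    results = {}
    for model in models:
        run_test(model, "Hi")  # warmup: confirm the model is loaded
        # Main test: keep_alive=0 unloads the model for a clean state.
        results[model] = run_test(model, prompt, keep_alive=0)
        time.sleep(cooldown)   # 3-second cooldown between tests
    return results
```

Sequential execution plus `keep_alive=0` mirrors the notes above: only one model occupies VRAM at a time, so runs do not contend for resources.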