Ollama Worker Setup Guide

Overview

The Ollama worker runs locally and uses your GPU for video processing, which reduces cloud compute costs and avoids API quota issues.

Prerequisites

1. Install Ollama

# Windows (PowerShell)
winget install Ollama.Ollama

# Or download from https://ollama.ai/download

2. Start Ollama Service

ollama serve

3. Pull Required Models

# Pull the main LLM model
ollama pull llama3.2:latest

# Pull the whisper model for transcription
ollama pull whisper:latest

# Optional: Pull other models
ollama pull llama3.2:13b  # For better quality (requires more VRAM)
ollama pull codellama:latest  # For code analysis

Configuration

Environment Variables

Create a .env file or set these variables:

# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2:latest
OLLAMA_WHISPER_MODEL=whisper:latest

# Worker Configuration
OLLAMA_POLL_INTERVAL_SECONDS=120
OLLAMA_MAX_VIDEOS_PER_CYCLE=1
OLLAMA_BACKOFF_SECONDS=300
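
The worker reads these values at startup; below is a minimal sketch of that lookup with os.getenv (the exact handling in worker\ollama_daemon.py may differ):

import os

# Ollama endpoint and models (defaults mirror the values above)
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.2:latest")
OLLAMA_WHISPER_MODEL = os.getenv("OLLAMA_WHISPER_MODEL", "whisper:latest")

# Worker loop tuning
POLL_INTERVAL_SECONDS = int(os.getenv("OLLAMA_POLL_INTERVAL_SECONDS", "120"))
MAX_VIDEOS_PER_CYCLE = int(os.getenv("OLLAMA_MAX_VIDEOS_PER_CYCLE", "1"))
BACKOFF_SECONDS = int(os.getenv("OLLAMA_BACKOFF_SECONDS", "300"))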

Model Selection

  • llama3.2:latest - Good balance of speed and quality
  • llama3.2:13b - Better quality, needs more VRAM
  • llama3.2:70b - Best quality, needs 40GB+ VRAM
  • whisper:latest - For transcription (works well on GPU)

Running the Worker

Option 1: Windows Batch File

# Double-click or run:
start-ollama-worker.bat

Option 2: Manual Start

# Activate virtual environment
aienv\Scripts\activate

# Set environment variables
set OLLAMA_POLL_INTERVAL_SECONDS=120
set OLLAMA_MAX_VIDEOS_PER_CYCLE=1
set OLLAMA_BASE_URL=http://localhost:11434
set OLLAMA_MODEL=llama3.2:latest
set OLLAMA_WHISPER_MODEL=whisper:latest

# Start worker
python worker\ollama_daemon.py

Option 3: Docker with Ollama

# Add to your Dockerfile
RUN curl -fsSL https://ollama.ai/install.sh | sh

# ollama pull needs the Ollama server running, so start it in the background for this build step
RUN ollama serve & sleep 5 && ollama pull llama3.2:latest && ollama pull whisper:latest

Features

1. Local GPU Processing

  • Uses your local GPU for all AI tasks
  • No cloud API calls for transcription/analysis
  • Reduces quota usage and costs

2. Video Processing Pipeline

  1. Transcription: Ollama Whisper model
  2. Summarization: Ollama LLM model (see the sketch after this list)
  3. Enhanced Analysis: Topics, sentiment, insights
  4. PDF Generation: Local report creation
  5. S3 Upload: Cloud storage for PDFs
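
As an illustration of the summarization step, the worker can send the transcript to Ollama's /api/generate endpoint using the configured model. This is a sketch, not the worker's exact code; the prompt wording and the summarize helper name are assumptions:

import os
import requests

OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.2:latest")

def summarize(transcript: str) -> str:
    # Single non-streaming request to the local Ollama server
    resp = requests.post(
        f"{OLLAMA_BASE_URL}/api/generate",
        json={
            "model": OLLAMA_MODEL,
            "prompt": f"Summarize this video transcript:\n\n{transcript}",
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]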

3. Error Handling

  • Health checks for Ollama service
  • Automatic backoff on errors (sketched below)
  • Graceful fallbacks
  • Comprehensive logging
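
The backoff behavior can be pictured as a loop that sleeps for the backoff period after a failed cycle instead of the normal poll interval; a rough sketch, not the daemon's exact implementation:

import time

def run_loop(process_cycle, poll_interval=120, backoff_seconds=300):
    # Poll for pending videos; back off after errors instead of retrying immediately
    while True:
        try:
            process_cycle()
            time.sleep(poll_interval)
        except Exception as exc:
            print(f"Cycle failed: {exc}; backing off for {backoff_seconds}s")
            time.sleep(backoff_seconds)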

4. Configuration Options

  • Adjustable poll intervals
  • Max videos per cycle
  • Model selection
  • Timeout settings

Monitoring

Logs

  • Worker logs: ollama_worker.log (example setup below)
  • Console output for real-time monitoring
  • Error tracking and backoff notifications
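
A minimal logging setup that writes to ollama_worker.log and mirrors output to the console might look like the following sketch; the worker's real logging configuration may differ:

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.FileHandler("ollama_worker.log"),  # persistent worker log
        logging.StreamHandler(),                   # real-time console output
    ],
)
logging.getLogger("ollama_worker").info("Worker started")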

Health Checks

  • Ollama service availability (see the check sketched below)
  • Model loading status
  • Processing success rates
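
One way to implement the availability check is to query Ollama's /api/tags endpoint (the same endpoint used under Troubleshooting) and confirm the required models are present. The function below is a sketch; the required model names are taken from the configuration above:

import requests

def ollama_healthy(base_url="http://localhost:11434",
                   required_models=("llama3.2:latest", "whisper:latest")):
    # Return True if Ollama responds and all required models are pulled
    try:
        resp = requests.get(f"{base_url}/api/tags", timeout=5)
        resp.raise_for_status()
    except requests.RequestException:
        return False
    available = {m["name"] for m in resp.json().get("models", [])}
    return all(model in available for model in required_models)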

Troubleshooting

Common Issues

  1. Ollama not running

    # Check if Ollama is running
    curl http://localhost:11434/api/tags
    
    # Start Ollama
    ollama serve
    
  2. Model not found

    # List available models
    ollama list
    
    # Pull missing model
    ollama pull llama3.2:latest
    
  3. GPU memory issues

    • Use smaller models (llama3.2:latest instead of 13b/70b)
    • Reduce batch sizes
    • Check GPU memory usage
  4. Slow processing

    • Ensure GPU is being used (check Ollama logs)
    • Use faster models
    • Increase timeout values

Performance Tuning

  1. GPU Optimization

    # Pin Ollama to a specific GPU before starting the server
    export CUDA_VISIBLE_DEVICES=0
    ollama serve
    
  2. Model Optimization

    • Use quantized models for faster inference
    • Adjust temperature and top_p parameters
    • Monitor VRAM usage
  3. Worker Configuration

    • Increase OLLAMA_MAX_VIDEOS_PER_CYCLE for batch processing
    • Decrease OLLAMA_POLL_INTERVAL_SECONDS so pending videos are picked up sooner
    • Adjust timeout values based on your hardware

Integration

With Existing System

  • Runs alongside the cloud worker
  • Processes the same pending videos
  • Uses the same database and S3 storage
  • Can be used as primary or backup worker

Switching Between Workers

  1. Cloud-only: Stop the Ollama worker and let the cloud worker handle all videos
  2. Local-only: Stop the cloud worker (or set WORKER_MAX_VIDEOS_PER_CYCLE=0), run the Ollama worker
  3. Hybrid: Run both with different priorities

Cost Benefits

  • No cloud API costs for transcription/analysis
  • Uses local GPU resources
  • Reduces quota usage on cloud platforms
  • Better privacy (data stays local)

Security

  • All processing happens locally
  • No data sent to external APIs (except S3 for storage)
  • Full control over models and data
  • Can run completely offline (except for S3 uploads)