PPO-Pyramids Unity ML-Agents Model

Model Description

This model is a Proximal Policy Optimization (PPO) agent trained on the Pyramids environment from Unity ML-Agents, a 3D navigation and puzzle-solving task in which the agent must explore an arena, press a switch to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick on top while avoiding obstacles along the way.

Model Details

Model Architecture

  • Algorithm: Proximal Policy Optimization (PPO)
  • Framework: Unity ML-Agents with PyTorch backend
  • Policy Type: Actor-Critic with shared feature extraction
  • Network Architecture:
    • Hidden Units: 512 per layer
    • Number of Layers: 2
    • Activation: ReLU (default)
    • Normalization: Disabled
    • Visual Encoding: simple encoder (only used when camera observations are present; the default Pyramids build uses ray-cast vector observations)
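
For orientation, the architecture described above corresponds roughly to the PyTorch sketch below. This is illustrative only, assuming a flat vector observation; the actual ML-Agents networks add observation encoders, action distributions, and other details not shown here.

import torch
import torch.nn as nn

class ActorCriticSketch(nn.Module):
    """Rough sketch of a 2-layer, 512-unit shared trunk with policy and value heads."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 512):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, act_dim)  # action logits
        self.value_head = nn.Linear(hidden, 1)         # state-value estimate

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features)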

Environment: Pyramids

The Pyramids environment is one of Unity ML-Agents' example environments featuring:

  • Objective: Press a switch to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick that lands on top
  • Setting: A walled 3D arena containing the switch, immovable stone pyramids that act as obstacles, and the spawnable block pyramid
  • Complexity: Sparse rewards and heavy exploration requirements, which is why the reference configuration pairs PPO with an intrinsic (RND) reward signal
  • Parallelism: The example scene duplicates the training area so several agent instances collect experience simultaneously
  • Observations: Ray-cast based vector observations by default; a VisualPyramids variant with camera observations also exists

Training Configuration

PPO Hyperparameters

batch_size: 128
buffer_size: 2048
learning_rate: 0.0003
beta: 0.01                    # Entropy regularization
epsilon: 0.2                  # PPO clipping parameter
lambd: 0.95                   # GAE parameter (the ML-Agents config spells this key "lambd")
num_epoch: 3                  # Training epochs per update
learning_rate_schedule: linear

Network Settings

normalize: false              # Input normalization
hidden_units: 512            # Units per hidden layer
num_layers: 2                # Number of hidden layers
vis_encode_type: simple      # Visual encoder type

Reward Structure

  • Extrinsic Rewards:

    • Gamma: 0.99 (discount factor)
    • Strength: 1.0
    • Sparse reward (+2) for reaching the gold brick
    • Small per-step penalty (-0.001) to keep episodes short
  • Intrinsic Rewards (RND):

    • Random Network Distillation for exploration
    • Gamma: 0.99
    • Strength: 0.01
    • Separate network: 64 units, 3 layers
    • Learning rate: 0.0001

Training Process

  • Max Steps: 1,000,000 training steps
  • Time Horizon: 128 steps per trajectory
  • Checkpoints: Keep the 5 most recent checkpoints (keep_checkpoints: 5)
  • Summary Frequency: Every 30,000 steps
  • Training Time: Approximately 4-8 hours on a modern GPU
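
Taken together, the settings above map onto a trainer configuration of roughly the following shape. It is assembled here for convenience in the standard mlagents-learn YAML layout (behavior name Pyramids assumed); the configuration.yaml shipped with this model is the authoritative version.

behaviors:
  Pyramids:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      rnd:
        gamma: 0.99
        strength: 0.01
        network_settings:
          hidden_units: 64
          num_layers: 3
        learning_rate: 0.0001
    keep_checkpoints: 5
    max_steps: 1000000
    time_horizon: 128
    summary_freq: 30000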

Observation Space

The agent receives:

  • Vector Observations: Ray-cast perception of the switch, blocks, the gold brick, and walls, plus a flag indicating whether the switch has been pressed
  • Visual Observations: Camera input only in the VisualPyramids variant, processed by the simple visual encoder
  • Goal Information: The gold brick and switch are sensed through the same ray-casts rather than as explicit coordinates

The exact shapes and action branches for a given build can be inspected from its behavior spec, as sketched after the Action Space section below.

Action Space

  • Action Type: Discrete (single action branch) in the default Pyramids build
  • Actions:
    • Forward/backward movement
    • Left/right rotation (yaw)
  • Note: There is no strafe or jump action in the standard environment
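
Because the exact observation and action layout depends on the Unity build being used, it is worth querying the behavior spec directly. A minimal sketch with the mlagents_envs low-level API, assuming a local Pyramids build:

from mlagents_envs.environment import UnityEnvironment

# Connect to a local Pyramids build and print its observation/action specs.
env = UnityEnvironment(file_name="Pyramids")
env.reset()

behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

print("Behavior:", behavior_name)
for i, obs_spec in enumerate(spec.observation_specs):
    print(f"  observation {i}: shape={obs_spec.shape}")
print("  action spec:", spec.action_spec)

env.close()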

Performance Metrics

Expected Performance

  • Goal Reaching Success Rate: 80-95%
  • Average Episode Length: Decreases toward near-optimal paths as training progresses
  • Training Convergence: Stable improvement over 1M steps
  • Exploration Efficiency: Balanced exploration vs exploitation

Key Metrics Tracked

  • Cumulative Reward: Total reward per episode
  • Success Rate: Percentage of episodes reaching goal
  • Episode Length: Steps to complete episode
  • Policy Entropy: Measure of action diversity
  • Value Function Accuracy: Critic network performance

Technical Implementation

PPO Algorithm Features

  • Policy Clipping: Prevents destructive policy updates (ε=0.2)
  • Generalized Advantage Estimation: GAE with λ=0.95
  • Entropy Regularization: Encourages exploration (β=0.01)
  • Value Function Learning: Shared network with policy
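
As a reminder of what the clipping term does, below is an illustrative PyTorch version of the clipped surrogate loss with the ε = 0.2 used here; it is a sketch, not the ML-Agents implementation.

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective, negated so it can be minimized."""
    ratio = torch.exp(new_log_probs - old_log_probs)                    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()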

Random Network Distillation (RND)

  • Purpose: Intrinsic motivation for exploration
  • Implementation: Separate predictor and target networks
  • Benefit: Encourages visiting novel states
  • Balance: Low strength (0.01) to avoid overwhelming extrinsic rewards
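
Conceptually, RND works as sketched below: a randomly initialized target network is kept fixed, a predictor network is trained to match its outputs, and the prediction error on a state serves as the intrinsic reward (large for rarely visited states). Network sizes mirror the config above; the observation dimension is a placeholder.

import torch
import torch.nn as nn

def make_net(obs_dim, hidden=64, layers=3, out_dim=64):
    mods, dim = [], obs_dim
    for _ in range(layers):
        mods += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    mods.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*mods)

obs_dim = 148  # placeholder observation size for illustration
target = make_net(obs_dim)
for p in target.parameters():
    p.requires_grad_(False)    # target network stays fixed
predictor = make_net(obs_dim)  # trained to match the target's outputs

obs = torch.randn(32, obs_dim)
intrinsic_reward = (predictor(obs) - target(obs)).pow(2).mean(dim=1)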

Unity ML-Agents Integration

  • Training Interface: Python mlagents-learn command
  • Environment Communication: Unity-Python API
  • Parallel Training: Multiple environment instances
  • Real-time Monitoring: TensorBoard integration
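
For example, parallel data collection can be requested with the --num-envs flag when launching training:

mlagents-learn config.yaml --env=Pyramids --run-id=pyramids_run_01 --num-envs=4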

Files and Structure

├── Pyramids.onnx              # Trained policy network
├── Pyramids/
│   ├── checkpoint-{step}.onnx # Training checkpoints
│   ├── configuration.yaml     # Training configuration
│   └── run_logs/              # Training metrics
└── results/
    ├── training_summary.json  # Training statistics
    └── tensorboard_logs/      # TensorBoard data

Usage

Loading the Model

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

# Load the environment (file_name should point to your Pyramids build)
channel = EngineConfigurationChannel()
env = UnityEnvironment(file_name="Pyramids", side_channels=[channel])
env.reset()

# When training with mlagents-learn, the policy network is created and
# restored automatically; the low-level API above is only needed for
# custom evaluation or data-collection loops.

Training Command

mlagents-learn config.yaml --env=Pyramids --run-id=pyramids_run_01

Resuming Training

mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume

Inference

# The trained model can be used directly in Unity builds
# or through the ML-Agents Python API for evaluation
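
For evaluation outside Unity, one option (not part of the official workflow) is to open the exported ONNX file with onnxruntime and inspect its input/output tensors, which are named differently across ML-Agents releases, before wiring it into a custom loop:

import onnxruntime as ort

session = ort.InferenceSession("Pyramids.onnx")

# Tensor names vary across ML-Agents versions, so discover them at runtime
# instead of hardcoding them.
for inp in session.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)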

Limitations and Considerations

  1. Environment Specific: Trained specifically for Pyramids environment layout
  2. Visual Dependency: Performance tied to visual observation quality
  3. Exploration Balance: RND parameters may need tuning for different scenarios
  4. Computational Requirements: Requires GPU for efficient training
  5. Generalization: May not transfer well to significantly different navigation tasks

Optimization Suggestions

For improved performance, consider:

  • Enable normalization: normalize: true
  • Increase network capacity: hidden_units: 768
  • Longer time horizon: time_horizon: 256
  • Higher batch size: batch_size: 256
  • More training steps: max_steps: 2000000
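
In configuration terms, these suggestions amount to overriding the following keys in the trainer config shown earlier:

behaviors:
  Pyramids:
    hyperparameters:
      batch_size: 256
    network_settings:
      normalize: true
      hidden_units: 768
    time_horizon: 256
    max_steps: 2000000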

Applications

  • Game AI: Intelligent NPC navigation in 3D games
  • Robotics Research: Transfer learning for robot navigation
  • Pathfinding: Advanced pathfinding algorithm development
  • Educational: Demonstration of RL in complex 3D environments

Ethical Considerations

This model represents a benign navigation task with no ethical concerns:

  • Content: Abstract geometric environment
  • Purpose: Educational and research applications
  • Safety: No real-world safety implications

System Requirements

Training

  • OS: Windows 10+, macOS 10.14+, Ubuntu 18.04+
  • GPU: NVIDIA GPU with CUDA support (recommended)
  • RAM: 8GB minimum, 16GB recommended
  • Storage: 2GB for environment and model files

Dependencies

mlagents>=0.28.0
mlagents-envs>=0.28.0
torch>=1.8.0
tensorboard
numpy

Citation

If you use this model, please cite:

@misc{ppo-pyramids-2024,
  title={PPO-Pyramids: Navigation Agent for Unity ML-Agents},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/ppo-pyramids}
}

References

  • Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
  • Burda, Y., et al. (2018). Exploration by Random Network Distillation. arXiv:1810.12894
  • Unity Technologies. ML-Agents Toolkit Documentation
  • Juliani, A., et al. (2018). Unity: A General Platform for Intelligent Agents. arXiv:1809.02627

Training Logs and Monitoring

Monitor training progress through:

  • TensorBoard: Real-time training metrics
  • Console Output: Episode rewards and statistics
  • Checkpoint Analysis: Model performance over time
  • Success Rate Tracking: Goal completion percentage
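
By default, mlagents-learn writes run data under results/<run-id>, so TensorBoard can be pointed at that directory (adjust the path if your runs are stored elsewhere):

tensorboard --logdir results --port 6006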

For optimal results, consider using the improved configuration with normalization enabled and increased network capacity. 🏗️🎯
