PPO-Pyramids Unity ML-Agents Model

Model Description

This model is a Proximal Policy Optimization (PPO) agent trained on the Pyramids environment from Unity ML-Agents, a 3D navigation and puzzle-solving task in which the agent must explore an arena, press a switch to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick on top while avoiding obstacles along the way.

Model Details

Model Architecture

  • Algorithm: Proximal Policy Optimization (PPO)
  • Framework: Unity ML-Agents with PyTorch backend
  • Policy Type: Actor-Critic with shared feature extraction
  • Network Architecture:
    • Hidden Units: 512 per layer
    • Number of Layers: 2
    • Activation: ReLU (default)
    • Normalization: Disabled
    • Visual Encoding: simple encoder (only used when camera observations are present; the default Pyramids build uses ray-cast vector observations)
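
For orientation, the architecture described above corresponds roughly to the PyTorch sketch below. This is illustrative only, assuming a flat vector observation; the actual ML-Agents networks add observation encoders, action distributions, and other details not shown here.

import torch
import torch.nn as nn

class ActorCriticSketch(nn.Module):
    """Rough sketch of a 2-layer, 512-unit shared trunk with policy and value heads."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 512):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, act_dim)  # action logits
        self.value_head = nn.Linear(hidden, 1)         # state-value estimate

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features)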

Environment: Pyramids

The Pyramids environment is one of Unity ML-Agents' example environments featuring:

  • Objective: Press a switch to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick that lands on top
  • Setting: A walled 3D arena containing the switch, immovable stone pyramids that act as obstacles, and the spawnable block pyramid
  • Complexity: Sparse rewards and heavy exploration requirements, which is why the reference configuration pairs PPO with an intrinsic (RND) reward signal
  • Parallelism: The example scene duplicates the training area so several agent instances collect experience simultaneously
  • Observations: Ray-cast based vector observations by default; a VisualPyramids variant with camera observations also exists

Training Configuration

PPO Hyperparameters

batch_size: 128
buffer_size: 2048
learning_rate: 0.0003
beta: 0.01                    # Entropy regularization
epsilon: 0.2                  # PPO clipping parameter
lambd: 0.95                   # GAE parameter (the ML-Agents config spells this key "lambd")
num_epoch: 3                  # Training epochs per update
learning_rate_schedule: linear

Network Settings

normalize: false              # Input normalization
hidden_units: 512            # Units per hidden layer
num_layers: 2                # Number of hidden layers
vis_encode_type: simple      # Visual encoder type

Reward Structure

  • Extrinsic Rewards:

    • Gamma: 0.99 (discount factor)
    • Strength: 1.0
    • Sparse reward (+2) for reaching the gold brick
    • Small per-step penalty (-0.001) to keep episodes short
  • Intrinsic Rewards (RND):

    • Random Network Distillation for exploration
    • Gamma: 0.99
    • Strength: 0.01
    • Separate network: 64 units, 3 layers
    • Learning rate: 0.0001

Training Process

  • Max Steps: 1,000,000 training steps
  • Time Horizon: 128 steps per trajectory
  • Checkpoints: Keep the 5 most recent checkpoints (keep_checkpoints: 5)
  • Summary Frequency: Every 30,000 steps
  • Training Time: Approximately 4-8 hours on a modern GPU
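
Taken together, the settings above map onto a trainer configuration of roughly the following shape. It is assembled here for convenience in the standard mlagents-learn YAML layout (behavior name Pyramids assumed); the configuration.yaml shipped with this model is the authoritative version.

behaviors:
  Pyramids:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      rnd:
        gamma: 0.99
        strength: 0.01
        network_settings:
          hidden_units: 64
          num_layers: 3
        learning_rate: 0.0001
    keep_checkpoints: 5
    max_steps: 1000000
    time_horizon: 128
    summary_freq: 30000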

Observation Space

The agent receives:

  • Vector Observations: Ray-cast perception of the switch, blocks, the gold brick, and walls, plus a flag indicating whether the switch has been pressed
  • Visual Observations: Camera input only in the VisualPyramids variant, processed by the simple visual encoder
  • Goal Information: The gold brick and switch are sensed through the same ray-casts rather than as explicit coordinates

The exact shapes and action branches for a given build can be inspected from its behavior spec, as sketched after the Action Space section below.

Action Space

  • Action Type: Discrete (single action branch) in the default Pyramids build
  • Actions:
    • Forward/backward movement
    • Left/right rotation (yaw)
  • Note: There is no strafe or jump action in the standard environment
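
Because the exact observation and action layout depends on the Unity build being used, it is worth querying the behavior spec directly. A minimal sketch with the mlagents_envs low-level API, assuming a local Pyramids build:

from mlagents_envs.environment import UnityEnvironment

# Connect to a local Pyramids build and print its observation/action specs.
env = UnityEnvironment(file_name="Pyramids")
env.reset()

behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

print("Behavior:", behavior_name)
for i, obs_spec in enumerate(spec.observation_specs):
    print(f"  observation {i}: shape={obs_spec.shape}")
print("  action spec:", spec.action_spec)

env.close()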

Performance Metrics

Expected Performance

  • Goal Reaching Success Rate: 80-95%
  • Average Episode Length: Decreases toward near-optimal paths as training progresses
  • Training Convergence: Stable improvement over 1M steps
  • Exploration Efficiency: Balanced exploration vs exploitation

Key Metrics Tracked

  • Cumulative Reward: Total reward per episode
  • Success Rate: Percentage of episodes reaching goal
  • Episode Length: Steps to complete episode
  • Policy Entropy: Measure of action diversity
  • Value Function Accuracy: Critic network performance

Technical Implementation

PPO Algorithm Features

  • Policy Clipping: Prevents destructive policy updates (ε=0.2)
  • Generalized Advantage Estimation: GAE with λ=0.95
  • Entropy Regularization: Encourages exploration (β=0.01)
  • Value Function Learning: Shared network with policy
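
As a reminder of what the clipping term does, below is an illustrative PyTorch version of the clipped surrogate loss with the ε = 0.2 used here; it is a sketch, not the ML-Agents implementation.

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective, negated so it can be minimized."""
    ratio = torch.exp(new_log_probs - old_log_probs)                    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()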

Random Network Distillation (RND)

  • Purpose: Intrinsic motivation for exploration
  • Implementation: Separate predictor and target networks
  • Benefit: Encourages visiting novel states
  • Balance: Low strength (0.01) to avoid overwhelming extrinsic rewards
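
Conceptually, RND works as sketched below: a randomly initialized target network is kept fixed, a predictor network is trained to match its outputs, and the prediction error on a state serves as the intrinsic reward (large for rarely visited states). Network sizes mirror the config above; the observation dimension is a placeholder.

import torch
import torch.nn as nn

def make_net(obs_dim, hidden=64, layers=3, out_dim=64):
    mods, dim = [], obs_dim
    for _ in range(layers):
        mods += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    mods.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*mods)

obs_dim = 148  # placeholder observation size for illustration
target = make_net(obs_dim)
for p in target.parameters():
    p.requires_grad_(False)    # target network stays fixed
predictor = make_net(obs_dim)  # trained to match the target's outputs

obs = torch.randn(32, obs_dim)
intrinsic_reward = (predictor(obs) - target(obs)).pow(2).mean(dim=1)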

Unity ML-Agents Integration

  • Training Interface: Python mlagents-learn command
  • Environment Communication: Unity-Python API
  • Parallel Training: Multiple environment instances
  • Real-time Monitoring: TensorBoard integration
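
For example, parallel data collection can be requested with the --num-envs flag when launching training:

mlagents-learn config.yaml --env=Pyramids --run-id=pyramids_run_01 --num-envs=4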

Files and Structure

├── Pyramids.onnx              # Trained policy network
├── Pyramids/
│   ├── checkpoint-{step}.onnx # Training checkpoints
│   ├── configuration.yaml     # Training configuration
│   └── run_logs/              # Training metrics
└── results/
    ├── training_summary.json  # Training statistics
    └── tensorboard_logs/      # TensorBoard data

Usage

Loading the Model

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

# Load the environment (file_name should point to your Pyramids build)
channel = EngineConfigurationChannel()
env = UnityEnvironment(file_name="Pyramids", side_channels=[channel])
env.reset()

# When training with mlagents-learn, the policy network is created and
# restored automatically; the low-level API above is only needed for
# custom evaluation or data-collection loops.

Training Command

mlagents-learn config.yaml --env=Pyramids --run-id=pyramids_run_01

Resuming Training

mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume

Inference

# The trained model can be used directly in Unity builds
# or through the ML-Agents Python API for evaluation
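
For evaluation outside Unity, one option (not part of the official workflow) is to open the exported ONNX file with onnxruntime and inspect its input/output tensors, which are named differently across ML-Agents releases, before wiring it into a custom loop:

import onnxruntime as ort

session = ort.InferenceSession("Pyramids.onnx")

# Tensor names vary across ML-Agents versions, so discover them at runtime
# instead of hardcoding them.
for inp in session.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)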

Limitations and Considerations

  1. Environment Specific: Trained specifically for Pyramids environment layout
  2. Visual Dependency: Performance tied to visual observation quality
  3. Exploration Balance: RND parameters may need tuning for different scenarios
  4. Computational Requirements: Requires GPU for efficient training
  5. Generalization: May not transfer well to significantly different navigation tasks

Optimization Suggestions

For improved performance, consider:

  • Enable normalization: normalize: true
  • Increase network capacity: hidden_units: 768
  • Longer time horizon: time_horizon: 256
  • Higher batch size: batch_size: 256
  • More training steps: max_steps: 2000000
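
In configuration terms, these suggestions amount to overriding the following keys in the trainer config shown earlier:

behaviors:
  Pyramids:
    hyperparameters:
      batch_size: 256
    network_settings:
      normalize: true
      hidden_units: 768
    time_horizon: 256
    max_steps: 2000000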

Applications

  • Game AI: Intelligent NPC navigation in 3D games
  • Robotics Research: Transfer learning for robot navigation
  • Pathfinding: Advanced pathfinding algorithm development
  • Educational: Demonstration of RL in complex 3D environments

Ethical Considerations

This model represents a benign navigation task with no ethical concerns:

  • Content: Abstract geometric environment
  • Purpose: Educational and research applications
  • Safety: No real-world safety implications

System Requirements

Training

  • OS: Windows 10+, macOS 10.14+, Ubuntu 18.04+
  • GPU: NVIDIA GPU with CUDA support (recommended)
  • RAM: 8GB minimum, 16GB recommended
  • Storage: 2GB for environment and model files

Dependencies

mlagents>=0.28.0
mlagents-envs>=0.28.0
torch>=1.8.0
tensorboard
numpy

Citation

If you use this model, please cite:

@misc{ppo-pyramids-2024,
  title={PPO-Pyramids: Navigation Agent for Unity ML-Agents},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/ppo-pyramids}
}

References

  • Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
  • Burda, Y., et al. (2018). Exploration by Random Network Distillation. arXiv:1810.12894
  • Unity Technologies. ML-Agents Toolkit Documentation
  • Juliani, A., et al. (2018). Unity: A General Platform for Intelligent Agents. arXiv:1809.02627

Training Logs and Monitoring

Monitor training progress through:

  • TensorBoard: Real-time training metrics
  • Console Output: Episode rewards and statistics
  • Checkpoint Analysis: Model performance over time
  • Success Rate Tracking: Goal completion percentage
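
By default, mlagents-learn writes run data under results/<run-id>, so TensorBoard can be pointed at that directory (adjust the path if your runs are stored elsewhere):

tensorboard --logdir results --port 6006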

For optimal results, consider using the improved configuration with normalization enabled and increased network capacity. 🏗️🎯
