PPO-Pyramids Unity ML-Agents Model
Model Description
This model is a Proximal Policy Optimization (PPO) agent trained on the Pyramids environment from Unity ML-Agents. Pyramids is a 3D navigation and puzzle-solving task with sparse rewards: the agent must explore its area, activate a switch to spawn a pyramid of blocks, knock the pyramid over, and collect the gold brick that falls from its top.
Model Details
Model Architecture
- Algorithm: Proximal Policy Optimization (PPO)
- Framework: Unity ML-Agents with PyTorch backend
- Policy Type: Actor-Critic with shared feature extraction
- Network Architecture:
- Hidden Units: 512 per layer
- Number of Layers: 2
- Activation: ReLU (default)
- Normalization: Disabled
- Visual Encoding: `simple` encoder (only applied when visual observations are present; the default Pyramids build uses vector/ray-cast observations)
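For orientation, here is a minimal PyTorch sketch of a network with the shape these settings imply: a shared trunk of two 512-unit ReLU layers feeding separate policy and value heads. It is only illustrative; the actual ML-Agents trainer builds its encoders and action model internally, and `OBS_SIZE` / `NUM_ACTIONS` below are placeholders.

```python
# Illustrative sketch only: approximates the shape implied by network_settings
# (2 hidden layers of 512 units, ReLU, no normalization). The real ML-Agents
# network differs in details such as observation encoders and action model.
import torch
import torch.nn as nn

OBS_SIZE = 172      # placeholder observation size
NUM_ACTIONS = 5     # placeholder size of a discrete action branch

class ActorCritic(nn.Module):
    def __init__(self, obs_size: int, num_actions: int, hidden_units: int = 512):
        super().__init__()
        # Shared feature extractor: 2 x 512 ReLU layers
        self.body = nn.Sequential(
            nn.Linear(obs_size, hidden_units), nn.ReLU(),
            nn.Linear(hidden_units, hidden_units), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden_units, num_actions)  # action logits
        self.value_head = nn.Linear(hidden_units, 1)              # state value

    def forward(self, obs: torch.Tensor):
        features = self.body(obs)
        return self.policy_head(features), self.value_head(features)

model = ActorCritic(OBS_SIZE, NUM_ACTIONS)
logits, value = model(torch.randn(1, OBS_SIZE))
```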
Environment: Pyramids
The Pyramids environment is one of Unity ML-Agents' example environments featuring:
- Objective: Activate a switch to spawn a pyramid of blocks, knock the pyramid over, and collect the gold brick that falls from the top
- Setting: Enclosed 3D training areas containing a switch, scattered stone pyramids that act as obstacles, and (once the switch is pressed) a pyramid of movable blocks
- Complexity: Sparse rewards and long horizons make exploration the central challenge, which is why an intrinsic reward signal is used; many training areas run in parallel within one scene
- Observations: The default build uses ray-cast (vector) observations; a separate VisualPyramids variant provides camera observations
Training Configuration
PPO Hyperparameters
```yaml
batch_size: 128
buffer_size: 2048
learning_rate: 0.0003
beta: 0.01                      # entropy regularization
epsilon: 0.2                    # PPO clipping parameter
lambd: 0.95                     # GAE parameter (ML-Agents spells this key "lambd")
num_epoch: 3                    # training epochs per update
learning_rate_schedule: linear
```
Network Settings
```yaml
normalize: false        # input normalization
hidden_units: 512       # units per hidden layer
num_layers: 2           # number of hidden layers
vis_encode_type: simple # visual encoder type
```
Reward Structure
Extrinsic Rewards:
- Gamma: 0.99 (discount factor)
- Strength: 1.0
- Sparse success reward when the gold brick is collected (+2 in the standard environment)
- Small per-step penalty (-0.001 per step) to encourage efficient solutions
Intrinsic Rewards (RND):
- Random Network Distillation for exploration
- Gamma: 0.99
- Strength: 0.01
- Separate network: 64 units, 3 layers
- Learning rate: 0.0001
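As a convenience, the sketch below assembles the hyperparameters, network settings, and reward signals listed above into a trainer configuration and writes it to `config.yaml` in the layout `mlagents-learn` expects (a `behaviors` section keyed by the behavior name). Values such as `max_steps`, `time_horizon`, and `summary_freq` come from the Training Process section below; treat the exact schema as a best-effort reconstruction and compare it against your installed ML-Agents version.

```python
# Best-effort reconstruction of the trainer config described in this card;
# verify key names against your ML-Agents version before use. Requires PyYAML.
import yaml

config = {
    "behaviors": {
        "Pyramids": {
            "trainer_type": "ppo",
            "hyperparameters": {
                "batch_size": 128,
                "buffer_size": 2048,
                "learning_rate": 3.0e-4,
                "beta": 0.01,
                "epsilon": 0.2,
                "lambd": 0.95,
                "num_epoch": 3,
                "learning_rate_schedule": "linear",
            },
            "network_settings": {
                "normalize": False,
                "hidden_units": 512,
                "num_layers": 2,
                "vis_encode_type": "simple",
            },
            "reward_signals": {
                "extrinsic": {"gamma": 0.99, "strength": 1.0},
                "rnd": {
                    "gamma": 0.99,
                    "strength": 0.01,
                    "network_settings": {"hidden_units": 64, "num_layers": 3},
                    "learning_rate": 1.0e-4,
                },
            },
            "keep_checkpoints": 5,
            "max_steps": 1000000,
            "time_horizon": 128,
            "summary_freq": 30000,
        }
    }
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```

The resulting file can be passed directly to the training command shown later in this card.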
Training Process
- Max Steps: 1,000,000 training steps
- Time Horizon: 128 steps per trajectory
- Checkpoints: Keep the 5 most recent checkpoints (`keep_checkpoints: 5`)
- Summary Frequency: Every 30,000 steps
- Training Time: Approximately 4-8 hours on a modern GPU
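A quick back-of-the-envelope check of how these settings interact (the numbers below simply restate the configuration above; they are not measured from a training run):

```python
# How often the policy is updated with this configuration.
buffer_size, batch_size, num_epoch = 2048, 128, 3
max_steps, time_horizon = 1_000_000, 128

minibatches_per_update = buffer_size // batch_size          # 16
grad_steps_per_update = minibatches_per_update * num_epoch  # 48
policy_updates = max_steps // buffer_size                   # ~488 over the run

print(grad_steps_per_update, policy_updates)
# With a linear learning_rate_schedule, the learning rate decays from 3e-4
# toward 0 across those ~488 updates; time_horizon=128 only caps how long a
# trajectory grows before its return is bootstrapped from the value estimate.
```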
Observation Space
The agent receives:
- Ray-cast Observations: Banks of ray sensors detecting the switch, walls, stone pyramids, the movable blocks, and the gold brick
- Vector Observations: A small state vector that includes whether the switch has been activated
- No Camera Input: The default Pyramids build does not use RGB observations (the separate VisualPyramids variant does), so the `simple` visual encoder is effectively unused here
Action Space
- Action Type: Discrete (single action branch)
- Actions: Forward/backward movement and left/right rotation
- Note: The standard Pyramids agent does not use continuous control or a jump action
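To confirm the exact observation shapes and action layout of your build rather than relying on the description above, the low-level `mlagents_envs` API can print them directly; the `file_name` path below is a placeholder for a local Pyramids executable.

```python
# Print the observation and action specs exposed by a local Pyramids build.
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="Pyramids", no_graphics=True)  # path is a placeholder
env.reset()

for behavior_name, spec in env.behavior_specs.items():
    print("Behavior:", behavior_name)
    for i, obs_spec in enumerate(spec.observation_specs):
        print(f"  observation {i}: shape={obs_spec.shape}")
    print("  action spec:", spec.action_spec)  # discrete branches / continuous size

env.close()
```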
Performance Metrics
Expected Performance
- Goal Reaching Success Rate: 80-95%
- Average Episode Length: Decreases as the agent learns more direct solutions
- Training Convergence: Stable improvement over 1M steps
- Exploration Efficiency: Balanced exploration vs exploitation
Key Metrics Tracked
- Cumulative Reward: Total reward per episode
- Success Rate: Percentage of episodes reaching goal
- Episode Length: Steps to complete episode
- Policy Entropy: Measure of action diversity
- Value Function Accuracy: Critic network performance
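These metrics are written as TensorBoard scalars during training. As a sketch (the event-file path and tag name below are assumptions; check the tags your run actually produces), they can be read back programmatically with TensorBoard's `EventAccumulator`:

```python
# Read the cumulative-reward curve back out of a run's TensorBoard logs.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("results/pyramids_run_01/Pyramids")  # assumed log directory
acc.Reload()

tag = "Environment/Cumulative Reward"  # assumed ML-Agents scalar tag
if tag in acc.Tags().get("scalars", []):
    for event in acc.Scalars(tag):
        print(event.step, event.value)
```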
Technical Implementation
PPO Algorithm Features
- Policy Clipping: Prevents destructive policy updates (ε = 0.2)
- Generalized Advantage Estimation: GAE with λ = 0.95
- Entropy Regularization: Encourages exploration (β = 0.01)
- Value Function Learning: Shared network with policy
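The two core pieces, the clipped surrogate objective and GAE, are small enough to sketch directly with the values used here (ε = 0.2, λ = 0.95, β = 0.01). This illustrates the math, not the ML-Agents implementation itself:

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation; `values` has one extra bootstrap entry."""
    advantages = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, entropy,
                    epsilon=0.2, beta=0.01):
    ratio = torch.exp(new_log_probs - old_log_probs)           # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    surrogate = -torch.min(unclipped, clipped).mean()          # clipped objective
    return surrogate - beta * entropy.mean()                   # entropy bonus
```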
Random Network Distillation (RND)
- Purpose: Intrinsic motivation for exploration
- Implementation: Separate predictor and target networks
- Benefit: Encourages visiting novel states
- Balance: Low strength (0.01) to avoid overwhelming extrinsic rewards
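A minimal sketch of the RND idea with the settings above (64 hidden units, 3 layers, learning rate 1e-4, strength 0.01). The observation size is a placeholder, and the real ML-Agents implementation differs in details such as observation normalization:

```python
# Minimal RND sketch: a frozen random "target" network and a trained
# "predictor"; squared prediction error on a state is the intrinsic reward,
# scaled by strength = 0.01 before being combined with extrinsic rewards.
import torch
import torch.nn as nn

def mlp(in_dim, hidden=64, layers=3, out_dim=64):
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)

obs_dim = 172                      # placeholder observation size
target = mlp(obs_dim)              # fixed, randomly initialized network
predictor = mlp(obs_dim)           # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs, strength=0.01):
    with torch.no_grad():
        err = (predictor(obs) - target(obs)).pow(2).mean(dim=-1)
    return strength * err            # high error => novel state => bonus

def update_predictor(obs):
    loss = (predictor(obs) - target(obs)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```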
Unity ML-Agents Integration
- Training Interface: Python mlagents-learn command
- Environment Communication: Unity-Python API
- Parallel Training: Multiple environment instances
- Real-time Monitoring: TensorBoard integration
Files and Structure
```
├── Pyramids.onnx                 # Trained policy network
├── Pyramids/
│   ├── checkpoint-{step}.onnx    # Training checkpoints
│   ├── configuration.yaml        # Training configuration
│   └── run_logs/                 # Training metrics
└── results/
    ├── training_summary.json     # Training statistics
    └── tensorboard_logs/         # TensorBoard data
```
Usage
Loading the Model
```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

# Launch the Pyramids environment (file_name points at a local build)
channel = EngineConfigurationChannel()
env = UnityEnvironment(file_name="Pyramids", side_channels=[channel])

# During training, mlagents-learn creates and updates the policy itself;
# the exported .onnx model is what gets embedded in Unity for inference.
```
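Continuing from the snippet above, a short random-action rollout shows the evaluation loop the Python API expects (decision steps in, an `ActionTuple` out). This sketch uses only documented `mlagents_envs` calls and does not load the trained policy:

```python
# Drive the environment with random actions for a few steps.
env.reset()  # populates env.behavior_specs
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

for _ in range(10):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    if len(decision_steps) > 0:
        action = spec.action_spec.random_action(len(decision_steps))
        env.set_actions(behavior_name, action)
    env.step()

env.close()
```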
Training Command
```bash
mlagents-learn config.yaml --env=Pyramids --run-id=pyramids_run_01
```
To resume an interrupted run:
```bash
mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
```
Inference
The trained model can be used directly in Unity builds (by assigning the exported .onnx file to the agent's Behavior Parameters component) or evaluated through the ML-Agents Python API.
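For offline inspection of the exported policy, the ONNX file can also be opened with `onnxruntime` (not an ML-Agents dependency, so treat this as an optional sketch). Input and output names vary between ML-Agents versions, so print them rather than assuming them:

```python
# Inspect the exported policy graph; onnxruntime must be installed separately.
import onnxruntime as ort

session = ort.InferenceSession("Pyramids.onnx")
print("inputs: ", [(i.name, i.shape) for i in session.get_inputs()])
print("outputs:", [(o.name, o.shape) for o in session.get_outputs()])
```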
Limitations and Considerations
- Environment Specific: Trained specifically for Pyramids environment layout
- Visual Dependency: Performance tied to visual observation quality
- Exploration Balance: RND parameters may need tuning for different scenarios
- Computational Requirements: Requires GPU for efficient training
- Generalization: May not transfer well to significantly different navigation tasks
Optimization Suggestions
For improved performance, consider:
- Enable input normalization: `normalize: true`
- Increase network capacity: `hidden_units: 768`
- Use a longer time horizon: `time_horizon: 256`
- Use a larger batch size: `batch_size: 256`
- Train for more steps: `max_steps: 2000000`
Applications
- Game AI: Intelligent NPC navigation in 3D games
- Robotics Research: Transfer learning for robot navigation
- Pathfinding: Advanced pathfinding algorithm development
- Educational: Demonstration of RL in complex 3D environments
Ethical Considerations
This model represents a benign navigation task with no ethical concerns:
- Content: Abstract geometric environment
- Purpose: Educational and research applications
- Safety: No real-world safety implications
System Requirements
Training
- OS: Windows 10+, macOS 10.14+, Ubuntu 18.04+
- GPU: NVIDIA GPU with CUDA support (recommended)
- RAM: 8GB minimum, 16GB recommended
- Storage: 2GB for environment and model files
Dependencies
```
mlagents>=0.28.0
torch>=1.8.0
tensorboard
numpy
```
Citation
If you use this model, please cite:
```bibtex
@misc{ppo-pyramids-2024,
  title     = {PPO-Pyramids: Navigation Agent for Unity ML-Agents},
  author    = {Adilbai},
  year      = {2024},
  publisher = {Hugging Face Hub},
  url       = {https://huggingface.co/Adilbai/ppo-pyramids}
}
```
References
- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
- Burda, Y., et al. (2018). Exploration by Random Network Distillation. arXiv:1810.12894
- Unity Technologies. ML-Agents Toolkit Documentation
- Juliani, A., et al. (2018). Unity: A General Platform for Intelligent Agents. arXiv:1809.02627
Training Logs and Monitoring
Monitor training progress through:
- TensorBoard: Real-time training metrics
- Console Output: Episode rewards and statistics
- Checkpoint Analysis: Model performance over time
- Success Rate Tracking: Goal completion percentage
For optimal results, consider using the improved configuration with normalization enabled and increased network capacity.