---
language: en
tags:
- evolutionary-strategy
- cma-es
- gymnasium
- cartpole
- optimization
library_name: custom
datasets:
- gymnasium/CartPole-v1
metrics:
- mean_episode_length
model-index:
- name: CartPole-CMA-ES
  results:
  - task:
      type: optimization
      name: CartPole-v1
    dataset:
      name: gymnasium/CartPole-v1
      type: gymnasium
    metrics:
    - type: mean_episode_length
      value: 500
      name: Mean Episode Length
license: mit
pipeline_tag: reinforcement-learning
---

# CartPole-v1 CMA-ES Solution

This model provides a solution to the CartPole-v1 environment using CMA-ES (Covariance Matrix Adaptation Evolution Strategy), achieving perfect performance with a simple linear policy. The implementation demonstrates how evolutionary strategies can effectively solve classic control problems with minimal architectural complexity.

### Video Preview

<video controls width="480">
  <source src="https://huggingface.co/bniladridas/cartpole-cmaes/resolve/main/preview.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

### Training Convergence

![CMA-ES Training Convergence](training_plot.png)

*Figure: Training convergence showing the mean fitness (episode length) across generations. The model achieves optimal performance (500 steps) within 3 generations.*

## Model Details

### Model Description

This is a linear policy model for the CartPole-v1 environment that:
- Uses a simple weight matrix to map the 4D state to scores for the 2 discrete actions
- Achieves optimal performance (500/500 steps) consistently
- Was optimized using CMA-ES, requiring only 3 generations for convergence
- Demonstrates sample-efficient learning for the CartPole balancing task

The core action-selection logic is a single matrix multiplication:

```python
def get_action(self, observation):
    observation = np.array(observation, dtype=np.float32)
    action_scores = np.dot(observation, self.weights)
    action_scores += np.random.randn(*action_scores.shape) * 1e-5  # tiny noise to break ties
    return int(np.argmax(action_scores))
```
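
The Gaussian noise added to the action scores has a scale of 1e-5, far below the typical magnitude of the scores themselves; it appears to serve only as a tie-breaker when the two scores are nearly equal, so the policy remains effectively deterministic.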

- **Developed by:** Niladri Das
- **Model type:** Linear Policy
- **Language:** Python
- **License:** MIT
- **Finetuned from model:** None (trained from scratch)

### Model Sources

- **Repository:** https://github.com/bniladridas/cmaes-rl
- **Hugging Face:** https://huggingface.co/bniladridas/cartpole-cmaes
- **Website:** https://bniladridas.github.io/cmaes-rl/

## Uses

### Direct Use

The model is designed for:
1. Solving the CartPole-v1 environment from Gymnasium
2. Demonstrating CMA-ES optimization for RL tasks
3. Serving as a baseline for comparison with other algorithms
4. Educational purposes in evolutionary strategies

### Out-of-Scope Use

The model should not be used for:
1. Complex control tasks beyond CartPole
2. Real-world robotics applications
3. Tasks requiring non-linear policies
4. Environments with partial observability

## Bias, Risks, and Limitations

### Technical Limitations

- Limited to the CartPole-v1 environment
- Requires full state observation
- Linear policy architecture
- No transfer learning capability
- Environment-specific solution

### Performance Limitations

- May not handle significant environment variations
- No adaptation to changing dynamics
- Limited by linear policy capacity
- Requires precise state information

### Recommendations

Users should:
1. Only use the model for the CartPole-v1 environment
2. Ensure full state observability
3. Understand the limitations of linear policies
4. Consider more complex architectures for other tasks
5. Validate performance in their specific setup

## How to Get Started with the Model

### Method 1: Using the CMAESAgent Class

```python
from model import CMAESAgent

# Load the model
agent = CMAESAgent.from_pretrained("bniladridas/cartpole-cmaes")

# Evaluate over 5 episodes
mean_reward, std_reward = agent.evaluate(num_episodes=5)
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")
```

### Method 2: Manual Implementation

```python
import numpy as np
import gymnasium as gym

# Load the trained 4x2 weight matrix
weights = np.load('model_weights.npy')

# Create the environment
env = gym.make('CartPole-v1')

# Greedy linear policy: pick the action with the highest score
def get_action(observation):
    logits = observation @ weights
    return int(np.argmax(logits))

# Run a single episode
observation, _ = env.reset()
while True:
    action = get_action(observation)
    observation, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        break
env.close()
```
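
Note that Gymnasium's `env.step` returns five values: `terminated` flags a failure (pole fallen or cart out of bounds), while `truncated` flags the 500-step time limit, so the loop above exits on either condition.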

## Training Details

### Training Data

- **Environment:** Gymnasium CartPole-v1
- **State Space:** 4D continuous (cart position, cart velocity, pole angle, pole angular velocity)
- **Action Space:** Discrete(2): push left, push right (both spaces are verified in the snippet below)
- **Reward:** +1 per step, up to a maximum of 500 steps
- **Episode End:** Termination when the pole angle exceeds ±12° or the cart position leaves ±2.4; truncation at 500 steps
- **Training Approach:** Direct environment interaction (no pre-collected dataset)
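
A quick check of these spaces straight from Gymnasium:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space.shape)  # (4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)           # 2: push left (0), push right (1)
env.close()
```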

### Training Procedure

#### Training Hyperparameters

- **Algorithm:** CMA-ES (see the training-loop sketch after this list)
- **Population size:** 16
- **Number of generations:** up to 100 (early convergence by generation 3)
- **Initial step size (sigma):** 0.5
- **Parameters:** 8 (4×2 weight matrix)
- **Training regime:** Single precision (fp32)
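
A minimal sketch of how these hyperparameters fit together, assuming the `cma` package (pycma); the repository's actual training script may differ in details such as averaging fitness over multiple episodes:

```python
import cma
import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")

def episode_length(flat_weights):
    """Run one greedy episode with the given flattened 4x2 weight matrix."""
    weights = flat_weights.reshape(4, 2)
    observation, _ = env.reset()
    steps = 0
    while True:
        action = int(np.argmax(observation @ weights))
        observation, _, terminated, truncated, _ = env.step(action)
        steps += 1
        if terminated or truncated:
            return steps

# 8 parameters, initial step size 0.5, population size 16, at most 100 generations
es = cma.CMAEvolutionStrategy(np.zeros(8), 0.5, {"popsize": 16, "maxiter": 100})
while not es.stop():
    candidates = es.ask()
    # CMA-ES minimizes, so negate the episode length to use it as fitness
    es.tell(candidates, [-episode_length(np.asarray(c)) for c in candidates])
es.result_pretty()
```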

#### Hardware Requirements

- **CPU:** Single core sufficient
- **Memory:** <100 MB RAM
- **GPU:** Not required
- **Training time:** ~5 minutes on a standard CPU

### Evaluation

#### Testing Data & Metrics

- **Environment:** Same as training (CartPole-v1)
- **Episodes:** 100 test episodes
- **Metrics:** Episode length, success rate (a sketch of the evaluation loop follows)
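
A minimal sketch of such an evaluation loop, reusing the manual policy from Method 2 above and counting an episode as a success when it reaches the 500-step limit (the repository's `evaluate` method may differ):

```python
import numpy as np
import gymnasium as gym

weights = np.load("model_weights.npy")  # 4x2 matrix
env = gym.make("CartPole-v1")

lengths = []
for _ in range(100):  # 100 test episodes
    observation, _ = env.reset()
    steps = 0
    while True:
        action = int(np.argmax(observation @ weights))
        observation, _, terminated, truncated, _ = env.step(action)
        steps += 1
        if terminated or truncated:
            break
    lengths.append(steps)

print(f"Mean episode length: {np.mean(lengths):.1f} ± {np.std(lengths):.1f}")
print(f"Success rate: {np.mean([l == 500 for l in lengths]):.0%}")
```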

#### Results

- **Average Episode Length:** 500.0 ± 0.0
- **Success Rate:** 100%
- **Convergence:** Achieved in 3 generations
- **Final Population Mean:** 500.00
- **Best Performance:** 500/500 consistently

## Implementation Details

The implementation employs a straightforward linear policy:

```python
import gymnasium as gym
import numpy as np

class CMAESAgent:
    def __init__(self, env_name):
        self.env = gym.make(env_name)
        self.observation_space = self.env.observation_space.shape[0]  # 4 for CartPole
        self.action_space = self.env.action_space.n                   # 2 for CartPole
        self.num_params = self.observation_space * self.action_space  # 8 total parameters
        self.weights = None  # set by the optimizer (4x2 matrix)

    def get_action(self, observation):
        observation = np.array(observation, dtype=np.float32)
        action_scores = np.dot(observation, self.weights)
        action_scores += np.random.randn(*action_scores.shape) * 1e-5  # tiny noise to break ties
        return int(np.argmax(action_scores))
```

The model's simplicity demonstrates that CartPole's optimal control policy is approximately linear in the state variables.

## Environmental Impact

- **Training time:** ~5 minutes
- **Hardware:** Standard CPU
- **Energy consumption:** Negligible (<0.001 kWh)
- **CO2 emissions:** Minimal (<0.001 kg)

## Citation

**BibTeX:**
```bibtex
@misc{das2025cartpole,
  author       = {Niladri Das},
  title        = {CartPole-v1 CMA-ES Solution},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {https://huggingface.co/bniladridas/cartpole-cmaes},
  url          = {https://github.com/bniladridas/cmaes-rl}
}
```