---
language: en
tags:
- evolutionary-strategy
- cma-es
- gymnasium
- cartpole
- optimization
library_name: custom
datasets:
- gymnasium/CartPole-v1
metrics:
- mean_episode_length
model-index:
- name: CartPole-CMA-ES
  results:
  - task:
      type: optimization
      name: CartPole-v1
    dataset:
      name: gymnasium/CartPole-v1
      type: gymnasium
    metrics:
    - type: mean_episode_length
      value: 500
      name: Mean Episode Length
license: mit
pipeline_tag: reinforcement-learning
---

# CartPole-v1 CMA-ES Solution

This model solves the CartPole-v1 environment using CMA-ES (Covariance Matrix Adaptation Evolution Strategy), reaching the maximum episode length of 500 steps with a simple linear policy. The implementation demonstrates how evolutionary strategies can effectively solve classic control problems with minimal architectural complexity.

### Video Preview

<video controls width="480">
  <source src="https://huggingface.co/bniladridas/cartpole-cmaes/resolve/main/preview.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>


### Training Convergence
![Training Convergence](assets/training_convergence.png)
*Figure: Training convergence showing the mean fitness (episode length) across generations. The model achieves optimal performance (500 steps) within 3 generations.*

## Model Details

### Model Description

This is a linear policy model for the CartPole-v1 environment that:
- Uses a simple weight matrix to map 4D state inputs to 2D action outputs
- Achieves optimal performance (500/500 steps) consistently
- Was optimized using CMA-ES, requiring only 3 generations for convergence
- Demonstrates sample-efficient learning for the CartPole balancing task

```python
def get_action(self, observation):
    # Linear policy: score each action as a weighted sum of the 4 state variables
    observation = np.array(observation, dtype=np.float32)
    action_scores = np.dot(observation, self.weights)  # (4,) @ (4, 2) -> (2,)
    # Tiny Gaussian noise breaks ties between equal action scores
    action_scores += np.random.randn(*action_scores.shape) * 1e-5
    return int(np.argmax(action_scores))
```

- **Developed by:** Niladri Das
- **Model type:** Linear Policy
- **Language:** Python
- **License:** MIT
- **Finetuned from model:** No (trained from scratch)

### Model Sources

- **Repository:** https://github.com/bniladridas/cmaes-rl
- **Hugging Face:** https://huggingface.co/bniladridas/cartpole-cmaes
- **Website:** https://bniladridas.github.io/cmaes-rl/

## Uses

### Direct Use

The model is designed for:
1. Solving the CartPole-v1 environment from Gymnasium
2. Demonstrating CMA-ES optimization for RL tasks
3. Serving as a baseline for comparison with other algorithms
4. Educational purposes in evolutionary strategies

### Out-of-Scope Use

The model should not be used for:
1. Complex control tasks beyond CartPole
2. Real-world robotics applications
3. Tasks requiring non-linear policies
4. Environments with partial observability

## Bias, Risks, and Limitations

### Technical Limitations
- Limited to CartPole-v1 environment
- Requires full state observation
- Linear policy architecture
- No transfer learning capability
- Environment-specific solution

### Performance Limitations
- May not handle significant environment variations
- No adaptation to changing dynamics
- Limited by linear policy capacity
- Requires precise state information

### Recommendations

Users should:
1. Only use for CartPole-v1 environment
2. Ensure full state observability
3. Understand the limitations of linear policies
4. Consider more complex architectures for other tasks
5. Validate performance in their specific setup

## How to Get Started with the Model

### Method 1: Using the CMAESAgent Class

```python
from model import CMAESAgent

# Load the model
agent = CMAESAgent.from_pretrained("bniladridas/cartpole-cmaes")

# Evaluate
mean_reward, std_reward = agent.evaluate(num_episodes=5)
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")
```

### Method 2: Manual Implementation

```python
import numpy as np
from gymnasium import make

# Load model weights
weights = np.load('model_weights.npy')  # 4x2 matrix

# Create environment
env = make('CartPole-v1')

# Run inference
def get_action(observation):
    logits = observation @ weights  # (4,) @ (4, 2) -> (2,)
    return int(np.argmax(logits))

observation, _ = env.reset()
while True:
    action = get_action(observation)
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```

## Training Details

### Training Data

- **Environment:** Gymnasium CartPole-v1
- **State Space:** 4D continuous (cart position, velocity, pole angle, angular velocity)
- **Action Space:** 2D discrete (left, right)
- **Reward:** +1 for each step, max 500 steps
- **Episode Termination:** Pole angle beyond ±12°, cart position beyond ±2.4, or episode truncated at 500 steps
- **Training Approach:** Direct environment interaction (no pre-collected dataset)
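
For reference, the observation and action spaces described above can be inspected directly:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): 0 = push cart left, 1 = push cart right
```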

### Training Procedure

#### Training Hyperparameters

- **Algorithm:** CMA-ES
- **Population size:** 16
- **Number of generations:** 100 (early convergence by generation 3)
- **Initial step size:** 0.5
- **Parameters:** 8 (4x2 weight matrix)
- **Training regime:** Single precision (fp32)
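
The training script itself is not reproduced on this card; the following is a minimal sketch of how a CMA-ES loop with the hyperparameters above could look using the `cma` (pycma) package. The `episode_length` helper and the exact option names are illustrative assumptions, not the repository's code.

```python
import cma
import gymnasium as gym
import numpy as np

def episode_length(flat_weights, env, max_steps=500):
    """Roll out one episode with an 8-parameter linear policy; return its length."""
    w = flat_weights.reshape(4, 2)
    obs, _ = env.reset()
    for t in range(max_steps):
        action = int(np.argmax(obs @ w))
        obs, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            return t + 1
    return max_steps

env = gym.make("CartPole-v1")
# 8 parameters (4x2 weight matrix), initial step size 0.5, population size 16
es = cma.CMAEvolutionStrategy(8 * [0.0], 0.5, {"popsize": 16, "maxiter": 100})
while not es.stop():
    candidates = es.ask()
    # CMA-ES minimizes, so negate episode length to maximize it
    es.tell(candidates, [-episode_length(c, env) for c in candidates])
weights = es.result.xbest.reshape(4, 2)  # final 4x2 policy weights
```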

#### Hardware Requirements

- **CPU:** Single core sufficient
- **Memory:** <100MB RAM
- **GPU:** Not required
- **Training time:** ~5 minutes on standard CPU

### Evaluation

#### Testing Data & Metrics

- **Environment:** Same as training (CartPole-v1)
- **Episodes:** 100 test episodes
- **Metrics:** Episode length, success rate
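
A simple evaluation loop consistent with these metrics might look as follows; this is an illustrative sketch, not the repository's `evaluate` implementation.

```python
import gymnasium as gym
import numpy as np

def evaluate(weights, num_episodes=100):
    """Return mean/std episode length and the fraction of episodes hitting the 500-step cap."""
    env = gym.make("CartPole-v1")
    lengths = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        steps = 0
        while True:
            action = int(np.argmax(obs @ weights))
            obs, _, terminated, truncated, _ = env.step(action)
            steps += 1
            if terminated or truncated:
                break
        lengths.append(steps)
    env.close()
    lengths = np.asarray(lengths)
    return lengths.mean(), lengths.std(), float(np.mean(lengths == 500))
```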

#### Results

- **Average Episode Length:** 500.0 ± 0.0
- **Success Rate:** 100%
- **Convergence:** Achieved in 3 generations
- **Final Population Mean:** 500.00
- **Best Performance:** 500/500 consistently

## Implementation Details

The implementation employs a straightforward linear policy:

```python
import gymnasium as gym
import numpy as np

class CMAESAgent:
    def __init__(self, env_name):
        self.env = gym.make(env_name)
        self.observation_space = self.env.observation_space.shape[0]  # 4 for CartPole
        self.action_space = self.env.action_space.n  # 2 for CartPole
        self.num_params = self.observation_space * self.action_space  # 8 total parameters
        self.weights = None  # set after optimization; shape (4, 2)

    def get_action(self, observation):
        observation = np.array(observation, dtype=np.float32)
        action_scores = np.dot(observation, self.weights)
        action_scores += np.random.randn(*action_scores.shape) * 1e-5  # tiny noise breaks ties between equal scores
        return int(np.argmax(action_scores))
```

The model's simplicity demonstrates that CartPole's optimal control policy is approximately linear in the state variables.

## Environmental Impact

- **Training time:** ~5 minutes
- **Hardware:** Standard CPU
- **Energy consumption:** Negligible (<0.001 kWh)
- **CO2 emissions:** Minimal (<0.001 kg)

## Citation

**BibTeX:**
```bibtex
@misc{das2025cartpole,
  author = {Niladri Das},
  title = {CartPole-v1 CMA-ES Solution},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {https://huggingface.co/bniladridas/cartpole-cmaes},
  url = {https://github.com/bniladridas/cmaes-rl}
}
```