---
language: en
tags:
- evolutionary-strategy
- cma-es
- gymnasium
- cartpole
- optimization
library_name: custom
datasets:
- gymnasium/CartPole-v1
metrics:
- mean_episode_length
model-index:
- name: CartPole-CMA-ES
results:
- task:
type: optimization
name: CartPole-v1
dataset:
name: gymnasium/CartPole-v1
type: gymnasium
metrics:
- type: mean_episode_length
value: 500
name: Mean Episode Length
license: mit
pipeline_tag: reinforcement-learning
---
# CartPole-v1 CMA-ES Solution
This model solves the CartPole-v1 environment with CMA-ES (Covariance Matrix Adaptation Evolution Strategy), reaching the
maximum score of 500 steps using a simple linear policy. The implementation demonstrates how evolution strategies can
solve classic control problems with minimal architectural complexity.
### Video Preview
<video controls width="480">
<source src="https://huggingface.co/bniladridas/cartpole-cmaes/resolve/main/preview.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
### Training Convergence
![Training Convergence](assets/training_convergence.png)
*Figure: Training convergence showing the mean fitness (episode length) across generations. The model achieves optimal performance (500 steps) within 3 generations.*
## Model Details
### Model Description
This is a linear policy model for the CartPole-v1 environment that:
- Uses a simple weight matrix to map 4D state inputs to 2D action outputs
- Achieves optimal performance (500/500 steps) consistently
- Was optimized using CMA-ES, requiring only 3 generations for convergence
- Demonstrates sample-efficient learning for the CartPole balancing task
```python
def get_action(self, observation):
    observation = np.array(observation, dtype=np.float32)
    action_scores = np.dot(observation, self.weights)              # linear map: 4D state -> 2 action scores
    action_scores += np.random.randn(*action_scores.shape) * 1e-5  # tiny noise breaks ties between equal scores
    return int(np.argmax(action_scores))
```
- **Developed by:** Niladri Das
- **Model type:** Linear Policy
- **Language:** Python
- **License:** MIT
- **Finetuned from model:** No (trained from scratch)
### Model Sources
- **Repository:** https://github.com/bniladridas/cmaes-rl
- **Hugging Face:** https://huggingface.co/bniladridas/cartpole-cmaes
- **Website:** https://bniladridas.github.io/cmaes-rl/
## Uses
### Direct Use
The model is designed for:
1. Solving the CartPole-v1 environment from Gymnasium
2. Demonstrating CMA-ES optimization for RL tasks
3. Serving as a baseline for comparison with other algorithms
4. Educational purposes in evolutionary strategies
### Out-of-Scope Use
The model should not be used for:
1. Complex control tasks beyond CartPole
2. Real-world robotics applications
3. Tasks requiring non-linear policies
4. Environments with partial observability
## Bias, Risks, and Limitations
### Technical Limitations
- Limited to CartPole-v1 environment
- Requires full state observation
- Linear policy architecture
- No transfer learning capability
- Environment-specific solution
### Performance Limitations
- May not handle significant environment variations
- No adaptation to changing dynamics
- Limited by linear policy capacity
- Requires precise state information
### Recommendations
Users should:
1. Only use for CartPole-v1 environment
2. Ensure full state observability
3. Understand the limitations of linear policies
4. Consider more complex architectures for other tasks
5. Validate performance in their specific setup
## How to Get Started with the Model
### Method 1: Using the CMAESAgent Class
```python
from model import CMAESAgent
# Load the model
agent = CMAESAgent.from_pretrained("bniladridas/cartpole-cmaes")
# Evaluate
mean_reward, std_reward = agent.evaluate(num_episodes=5)
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")
```
### Method 2: Manual Implementation
```python
import numpy as np
from gymnasium import make

# Load model weights (4x2 matrix)
weights = np.load('model_weights.npy')

# Create environment
env = make('CartPole-v1')

# Greedy linear policy: pick the action with the highest score
def get_action(observation):
    logits = observation @ weights
    return int(np.argmax(logits))

# Run one episode
observation, _ = env.reset()
while True:
    action = get_action(observation)
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```
## Training Details
### Training Data
- **Environment:** Gymnasium CartPole-v1
- **State Space:** 4D continuous (cart position, velocity, pole angle, angular velocity)
- **Action Space:** Discrete with 2 actions (push left, push right)
- **Reward:** +1 for each step, max 500 steps
- **Episode Termination:** Pole angle beyond ±12°, cart position beyond ±2.4, or 500 steps reached
- **Training Approach:** Direct environment interaction with no pre-collected dataset (see the rollout sketch below)
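Because training interacts with the environment directly, each candidate weight matrix can be scored by rolling out episodes and using the mean episode length as fitness. The following is a minimal sketch of such a rollout; the function name `evaluate_weights` and its exact structure are illustrative assumptions, not code from the repository.

```python
import numpy as np
import gymnasium as gym

def evaluate_weights(weights, num_episodes=3):
    """Illustrative fitness function: mean episode length of a 4x2 linear policy."""
    env = gym.make('CartPole-v1')
    total_reward = 0.0
    for _ in range(num_episodes):
        observation, _ = env.reset()
        while True:
            action = int(np.argmax(observation @ weights))  # greedy linear policy
            observation, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
    env.close()
    return total_reward / num_episodes  # mean episode length (max 500)
```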
### Training Procedure
#### Training Hyperparameters
- **Algorithm:** CMA-ES
- **Population size:** 16
- **Number of generations:** 100 (early convergence by generation 3)
- **Initial step size:** 0.5
- **Parameters:** 8 (4x2 weight matrix)
- **Training regime:** Single precision (fp32)
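A rough sketch of how the hyperparameters above could drive the optimization loop, using the open-source `cma` package and the hypothetical `evaluate_weights` rollout from the previous sketch (the repository's actual training script may differ):

```python
import cma
import numpy as np

# 8 parameters (4x2 weight matrix), initial step size 0.5, population size 16
es = cma.CMAEvolutionStrategy(np.zeros(8), 0.5, {'popsize': 16})

for generation in range(100):
    candidates = es.ask()                        # sample 16 candidate parameter vectors
    fitness = [evaluate_weights(np.asarray(c).reshape(4, 2)) for c in candidates]
    es.tell(candidates, [-f for f in fitness])   # CMA-ES minimizes, so negate episode length
    print(f"Generation {generation}: mean fitness {np.mean(fitness):.1f}")
    if np.mean(fitness) >= 500:                  # stop once the population solves the task
        break

best_weights = np.asarray(es.result.xbest).reshape(4, 2)
np.save('model_weights.npy', best_weights)
```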
#### Hardware Requirements
- **CPU:** Single core sufficient
- **Memory:** <100MB RAM
- **GPU:** Not required
- **Training time:** ~5 minutes on standard CPU
### Evaluation
#### Testing Data & Metrics
- **Environment:** Same as training (CartPole-v1)
- **Episodes:** 100 test episodes
- **Metrics:** Episode length, success rate
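A minimal evaluation loop along these lines, assuming the weights are stored as a 4x2 `model_weights.npy` as in the manual example above (the repository's own evaluation script may differ):

```python
import numpy as np
import gymnasium as gym

weights = np.load('model_weights.npy')
env = gym.make('CartPole-v1')

episode_lengths = []
for _ in range(100):
    observation, _ = env.reset()
    steps = 0
    while True:
        action = int(np.argmax(observation @ weights))
        observation, _, terminated, truncated, _ = env.step(action)
        steps += 1
        if terminated or truncated:
            break
    episode_lengths.append(steps)
env.close()

print(f"Mean episode length: {np.mean(episode_lengths):.1f}")
print(f"Success rate: {np.mean([length == 500 for length in episode_lengths]):.0%}")
```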
#### Results
- **Average Episode Length:** 500.0 ± 0.0
- **Success Rate:** 100%
- **Convergence:** Achieved in 3 generations
- **Final Population Mean:** 500.00
- **Best Performance:** 500/500 consistently
## Implementation Details
The implementation employs a straightforward linear policy:
```python
import numpy as np
import gymnasium as gym

class CMAESAgent:
    def __init__(self, env_name):
        self.env = gym.make(env_name)
        self.observation_space = self.env.observation_space.shape[0]  # 4 for CartPole
        self.action_space = self.env.action_space.n                   # 2 for CartPole
        self.num_params = self.observation_space * self.action_space  # 8 total parameters
        self.weights = None  # 4x2 matrix set by the optimizer

    def get_action(self, observation):
        observation = np.array(observation, dtype=np.float32)
        action_scores = np.dot(observation, self.weights)
        action_scores += np.random.randn(*action_scores.shape) * 1e-5  # small noise for stability
        return int(np.argmax(action_scores))
```
The model's simplicity demonstrates that CartPole's optimal control policy is approximately linear in the state variables.
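One way to see this: with only two actions, taking the argmax of two linear scores is equivalent to thresholding a single linear combination of the state, so the learned policy is a linear decision rule on the 4D observation. A small illustrative check (variable names here are for exposition only):

```python
import numpy as np

weights = np.load('model_weights.npy')           # 4x2 weight matrix
decision_vector = weights[:, 1] - weights[:, 0]  # push right iff this dot product is positive

observation = np.array([0.0, 0.0, 0.05, 0.0], dtype=np.float32)  # example state with a small pole angle
action_argmax = int(np.argmax(observation @ weights))
action_threshold = int(observation @ decision_vector > 0)
assert action_argmax == action_threshold  # same decision rule (ignoring exact ties)
```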
## Environmental Impact
- **Training time:** ~5 minutes
- **Hardware:** Standard CPU
- **Energy consumption:** Negligible (<0.001 kWh)
- **CO2 emissions:** Minimal (<0.001 kg)
## Citation
**BibTeX:**
```bibtex
@misc{das2025cartpole,
  author       = {Niladri Das},
  title        = {CartPole-v1 CMA-ES Solution},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/bniladridas/cartpole-cmaes}},
  url          = {https://github.com/bniladridas/cmaes-rl}
}
```