Original Tutorial: Getting Started With OpenAI Gym: The Basic Building Blocks by DigitalOcean
Updated for Gymnasium by: Claude (Anthropic), drawing on course materials from the ADS-AI program at Breda University of Applied Sciences
Last Updated: December 2024
This tutorial provides an updated guide to using Gymnasium (the successor to OpenAI Gym) for reinforcement learning. All code examples use the modern Gymnasium API and have been tested with the latest library versions. The original tutorial focused on the deprecated gym library; this version updates all content for gymnasium while preserving the pedagogical structure and adding practical insights from real-world testing.
- Installation and Setup
- Understanding the Environment Class
- Interacting with the Environment
- Rendering the Environment
- Running a Complete Episode
- Understanding Space Types
- Introduction to Wrappers
- Training with Stable-Baselines3
- Vectorized Environments
Gymnasium is the actively maintained successor to OpenAI Gym. Install it using pip:
pip install gymnasium
pip install pygame
The pygame package is needed for rendering environments visually.
The fundamental building block of Gymnasium is the Env class. It's a Python class that implements a simulator for the environment where you train your agent. Gymnasium comes with many environments: moving a car up a hill, balancing a swinging pendulum, playing Atari games, etc.
We'll start with MountainCar, where the objective is to drive a car up a mountain. The car sits on a one-dimensional track between two mountains. The goal is to reach the flag on the right mountain, but the engine isn't strong enough to scale it directly. You must drive back and forth to build momentum.
import gymnasium as gym
env = gym.make('MountainCar-v0')
The environment structure is described by two key attributes:
observation_space - defines the structure and legitimate values for observing the environment's state. For MountainCar, this is a vector of position and velocity.
action_space - defines the numerical structure of legitimate actions. For MountainCar, this is a discrete set of three actions: push left, do nothing, push right.
obs_space = env.observation_space
action_space = env.action_space
print("The observation space: {}".format(obs_space))
print("The action space: {}".format(action_space))Output:
The observation space: Box([-1.2 -0.07], [0.6 0.07], (2,), float32)
The action space: Discrete(3)
The observation space is a Box, representing a 2-dimensional continuous space. The action space is Discrete with 3 possible values: 0 (push left), 1 (do nothing), or 2 (push right).
There are two critical functions for interacting with environments:
reset() - Resets the environment to its initial state and returns the starting observation and info dictionary.
step(action) - Applies an action to the environment and returns five values: the new observation, reward, terminated flag, truncated flag, and info dictionary.
import gymnasium as gym
env = gym.make('MountainCar-v0')
# Reset the environment and see the initial observation
obs, info = env.reset()
print("The initial observation is {}".format(obs))
# Sample a random action from the entire action space
random_action = env.action_space.sample()
# Take the action and get the new observation
new_obs, reward, terminated, truncated, info = env.step(random_action)
print("The new observation is {}".format(new_obs))Output:
The initial observation is [-0.48235664 0. ]
The new observation is [-0.48366517 -0.00130853]
The observation is a vector with two values: position and velocity. The middle point between the mountains is the origin, with right being positive and left being negative.
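Continuing the snippet above, you can unpack the observation into named variables to make the two components explicit (purely illustrative):
# new_obs comes from the env.step() call above
position, velocity = new_obs
print(f"position = {position:.3f}, velocity = {velocity:.4f}")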
Key concept: The step() function returns five values in gymnasium:
- observation: the new state
- reward: the reward for that action
- terminated: True if the episode ended naturally (reached goal or failed)
- truncated: True if the episode was cut off artificially (time limit, out of bounds)
- info: additional diagnostic information
The split between terminated and truncated helps distinguish between episodes that completed naturally versus those cut off by time limits or other constraints.
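As a minimal sketch of how the two flags are typically handled (assuming MountainCar-v0's default 200-step time limit), a rollout loop treats either flag as the end of the episode:
import gymnasium as gym

env = gym.make('MountainCar-v0')
obs, info = env.reset()
episode_over = False
while not episode_over:
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated:
        print("Episode ended naturally: the car reached the flag.")
    elif truncated:
        print("Episode was cut off by the 200-step time limit.")
    episode_over = terminated or truncated
env.close()
With purely random actions you will almost always hit the truncated branch; a trained agent ends episodes by reaching the flag instead.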
Note for macOS users: The standard render_mode='human' has issues on macOS where pygame windows don't close properly. We use matplotlib-based rendering instead, which works reliably in Jupyter notebooks.
If you want to see what the environment looks like visually, you need to specify the render mode when creating the environment:
import gymnasium as gym
import matplotlib.pyplot as plt
from IPython import display
env = gym.make('MountainCar-v0', render_mode='rgb_array')
obs, info = env.reset()
plt.figure(figsize=(6, 4))
for _ in range(100):
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
frame = env.render()
plt.clf()
plt.imshow(frame)
plt.axis('off')
display.clear_output(wait=True)
display.display(plt.gcf())
env.close()
plt.close()
This displays an animated view of the environment that updates smoothly in Jupyter notebooks.
Alternative: Using matplotlib.animation for smoother rendering
For more sophisticated visualization, you can use matplotlib's animation capabilities:
import gymnasium as gym
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np
env = gym.make('MountainCar-v0', render_mode='rgb_array')
obs, info = env.reset()
frames = []
for _ in range(200):
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
frames.append(env.render())
if terminated or truncated:
obs, info = env.reset()
env.close()
# Create animation
fig, ax = plt.subplots(figsize=(6, 4))
ax.axis('off')
img = ax.imshow(frames[0])
def update(frame_idx):
img.set_data(frames[frame_idx])
return [img]
ani = animation.FuncAnimation(fig, update, frames=len(frames), interval=50, blit=True)
plt.show()
When to use different render modes:
- render_mode='rgb_array': When you want to record frames, process them programmatically, train without display overhead, or work on systems without display capabilities
- matplotlib rendering: Best for Jupyter notebooks on macOS
- No rendering: Most efficient for training - you don't need to see every step (see the sketch below)
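For the no-rendering case, simply omit render_mode and never call env.render() - a minimal sketch of a headless rollout:
import gymnasium as gym

# No render_mode: the environment only runs the physics and never draws
# frames, which keeps training loops as fast as possible.
env = gym.make('MountainCar-v0')
obs, info = env.reset()
for _ in range(1000):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        obs, info = env.reset()
env.close()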
Let's put together everything we've learned into a complete simulation. We'll run the agent for multiple steps, taking random actions:
import gymnasium as gym
import matplotlib.pyplot as plt
from IPython import display
import time
env = gym.make('MountainCar-v0', render_mode='rgb_array')
num_steps = 1500
obs, info = env.reset()
plt.figure(figsize=(6, 4))
for step in range(num_steps):
# Take random action
action = env.action_space.sample()
# Apply the action
obs, reward, terminated, truncated, info = env.step(action)
# Render the environment
frame = env.render()
plt.clf()
plt.imshow(frame)
plt.axis('off')
display.clear_output(wait=True)
display.display(plt.gcf())
time.sleep(0.001)
# If the episode is done, start another one
if terminated or truncated:
obs, info = env.reset()
plt.close()
Notice the structure: reset the environment, loop through steps taking actions, check if the episode ended, and reset if needed.
In this code, we're taking random actions with env.action_space.sample(). To make the agent actually intelligent, you would replace this with a function that looks at the observation and chooses actions based on what it has learned maximizes reward. That's where reinforcement learning algorithms like Q-learning or policy gradients come in.
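As a hand-rolled illustration (not a learned policy - the momentum_policy helper is our own name, not part of Gymnasium), here is what swapping out the random action for a simple rule looks like: always push in the direction the car is already moving.
import gymnasium as gym

def momentum_policy(obs):
    position, velocity = obs
    # 2 = push right, 0 = push left: push in the direction of motion
    return 2 if velocity > 0 else 0

env = gym.make('MountainCar-v0')
obs, info = env.reset()
for step in range(200):
    action = momentum_policy(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated:
        print(f"Reached the flag in {step} steps")
        break
env.close()
Even this simple rule beats random actions; a reinforcement learning algorithm has to discover a comparable strategy from the reward signal alone.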
The observation_space for MountainCar was Box(2,) and the action_space was Discrete(3). These are data structures called Spaces that describe valid values for observations and actions. All spaces inherit from the gym.Space base class.
import gymnasium as gym
env = gym.make('MountainCar-v0')
print(type(env.observation_space))
Output:
<class 'gymnasium.spaces.box.Box'>
The Box(n,) space represents an n-dimensional continuous space. For MountainCar, n=2, so it's a 2-dimensional continuous space with position and velocity.
Box spaces are bounded with upper and lower limits:
import gymnasium as gym
env = gym.make('MountainCar-v0')
print("Upper Bound for Env Observation:", env.observation_space.high)
print("Lower Bound for Env Observation:", env.observation_space.low)Output:
Upper Bound for Env Observation: [0.6 0.07]
Lower Bound for Env Observation: [-1.2 -0.07]
Position ranges from -1.2 to 0.6, and velocity ranges from -0.07 to 0.07.
The Discrete(n) space describes a discrete space with values from 0 to n-1. For MountainCar, n=3, so actions can be 0, 1, or 2.
env.step(2) # Works fine
env.step(4) # Raises an error - 4 is not in [0, 1, 2]
If you were designing an environment for a robot arm with 3 joints, each able to rotate continuously, you would use a Box space for the action space:
from gymnasium import spaces
import numpy as np
# 3 joints, each can rotate from -180 to 180 degrees
action_space = spaces.Box(
low=np.array([-180, -180, -180]),
high=np.array([180, 180, 180]),
dtype=np.float32
)
This gives you continuous control over all three joints simultaneously.
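Continuing that sketch, a Box space can be sampled and validated just like the Discrete action space earlier:
import numpy as np
from gymnasium import spaces

action_space = spaces.Box(
    low=np.array([-180, -180, -180]),
    high=np.array([180, 180, 180]),
    dtype=np.float32
)
random_command = action_space.sample()        # three floats, each in [-180, 180]
print("Random joint command:", random_command)
print("Is it valid?", action_space.contains(random_command))  # True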
Before we dive into Wrappers, let's switch to a more complex environment: the Atari game Breakout. This will help us understand why Wrappers are useful.
First, install the Atari components:
pip install "gymnasium[atari]"
pip install "gymnasium[accept-rom-license]"Alternatively:
pip install gymnasium
pip install ale-py
pip install "autorom[accept-rom-license]"Now let's run Breakout with random actions:
import gymnasium as gym
import matplotlib.pyplot as plt
from IPython import display
import time
env = gym.make("ALE/Breakout-v5", render_mode='rgb_array')
print("Observation Space:", env.observation_space)
print("Action Space:", env.action_space)
obs, info = env.reset()
plt.figure(figsize=(6, 8))
for i in range(1000):
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
frame = env.render()
plt.clf()
plt.imshow(frame)
plt.axis('off')
display.clear_output(wait=True)
display.display(plt.gcf())
time.sleep(0.01)
if terminated or truncated:
obs, info = env.reset()
plt.close()
Output:
Observation Space: Box(0, 255, (210, 160, 3), uint8)
Action Space: Discrete(4)
The observation space is a 210x160 RGB image. The action space has 4 discrete actions: NOOP (do nothing), FIRE (launch the ball), RIGHT, and LEFT.
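If you want to confirm what each discrete action means, ALE-based environments expose get_action_meanings() on the unwrapped environment (provided by ale-py; treat the output below as indicative, since it may vary by version):
import gymnasium as gym

env = gym.make("ALE/Breakout-v5")
print(env.unwrapped.get_action_meanings())
# Expected something like: ['NOOP', 'FIRE', 'RIGHT', 'LEFT']
env.close()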
We might want to modify this environment before training because:
- The 210x160x3 image is large and contains redundant information
- We may want to normalize pixel values
- We may want to stack multiple frames together to capture motion
- We may want to clip or reshape rewards
This is where Wrappers become essential. They allow us to systematically modify environments without rewriting the core environment code.
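As a first taste of what a wrapper looks like, here is a minimal sketch of a hand-written observation wrapper (our own GrayscaleWrapper, not a built-in Gymnasium class) that collapses Breakout's RGB frames to grayscale by overriding observation():
import gymnasium as gym
import numpy as np

class GrayscaleWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        h, w, _ = env.observation_space.shape
        # Advertise the new observation shape: a single-channel image
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(h, w), dtype=np.uint8)

    def observation(self, observation):
        # Average the three colour channels into one grayscale channel
        return observation.mean(axis=2).astype(np.uint8)

env = GrayscaleWrapper(gym.make("ALE/Breakout-v5"))
obs, info = env.reset()
print(obs.shape)  # (210, 160) instead of (210, 160, 3)
env.close()
Gymnasium also ships ready-made wrappers for common preprocessing steps like this; we cover the wrapper types in detail later in the tutorial.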
Let's train an agent to actually solve MountainCar using Stable-Baselines3. We'll use the DQN (Deep Q-Network) algorithm, which works particularly well for discrete action space problems like MountainCar.
First, install Stable-Baselines3:
pip install stable-baselines3
DQN is specifically designed for discrete action spaces and learns MountainCar reliably:
import gymnasium as gym
from stable_baselines3 import DQN
# Create the environment
env = gym.make('MountainCar-v0')
# Create the DQN model with tuned hyperparameters
model = DQN(
"MlpPolicy",
env,
learning_rate=1e-3,
buffer_size=50000,
learning_starts=1000,
batch_size=128,
tau=1.0,
gamma=0.99,
train_freq=4,
gradient_steps=1,
target_update_interval=250,
exploration_fraction=0.2,
exploration_final_eps=0.05,
verbose=1
)
# Train - DQN typically solves MountainCar in 200k timesteps
print("Training DQN on MountainCar...")
model.learn(total_timesteps=200000)
# Save the trained model
model.save("dqn_mountaincar")
print("Training complete!")Output:
Training DQN on MountainCar...
# You'll see training progress with episode rewards improving
---------------------------------
| rollout/ | |
| ep_len_mean | 200 |
| ep_rew_mean | -200 |
# ... training continues ...
# Episode rewards improve from -200 to around -100 as agent learns
---------------------------------
Training complete!
Now let's test the trained agent with visualization:
import gymnasium as gym
from stable_baselines3 import DQN
import matplotlib.pyplot as plt
from IPython import display
# Load the trained model
model = DQN.load("dqn_mountaincar")
# Create environment with rendering
env = gym.make('MountainCar-v0', render_mode='rgb_array')
obs, info = env.reset()
plt.figure(figsize=(6, 4))
for step in range(200):
# Use the trained model to predict the action
action, _states = model.predict(obs, deterministic=True)
obs, reward, terminated, truncated, info = env.step(action)
frame = env.render()
plt.clf()
plt.imshow(frame)
plt.axis('off')
plt.title(f"Step: {step}, Position: {obs[0]:.3f}")
display.clear_output(wait=True)
display.display(plt.gcf())
if terminated:
print(f"Success! Reached goal in {step} steps")
break
env.close()
plt.close()
Output:
Success! Reached goal in 89 steps
# You'll see the car successfully reach the flag
To create a shareable visualization of your trained agent:
from stable_baselines3 import DQN
import gymnasium as gym
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from pathlib import Path
# Create assets directory if it doesn't exist
Path("./assets").mkdir(exist_ok=True)
# Load the trained model
model = DQN.load("dqn_mountaincar")
env = gym.make('MountainCar-v0', render_mode='rgb_array')
# Collect frames
obs, info = env.reset()
frames = []
for step in range(200):
action, _states = model.predict(obs, deterministic=True)
obs, reward, terminated, truncated, info = env.step(action)
frames.append(env.render())
if terminated:
print(f"Success! Reached goal in {step} steps")
break
env.close()
# Create animation and save as GIF
fig, ax = plt.subplots(figsize=(6, 4))
ax.axis('off')
img = ax.imshow(frames[0])
def update(frame_idx):
img.set_data(frames[frame_idx])
return [img]
ani = animation.FuncAnimation(fig, update, frames=len(frames), interval=50, blit=True)
ani.save('./assets/mountain_car_trained.gif', writer='pillow', fps=20)
plt.close()
print(f"GIF saved to ./assets/mountain_car_trained.gif with {len(frames)} frames")Output:
Success! Reached goal in 89 steps
GIF saved to ./assets/mountain_car_trained.gif with 89 frames
MountainCar is actually a challenging problem because:
- The reward is -1 for every timestep until you reach the goal
- The agent needs to learn the counterintuitive strategy of going backwards first to build momentum
- The sparse reward structure makes it difficult for policy gradient methods
DQN excels at this task because:
- It's specifically designed for discrete action spaces
- The experience replay buffer helps with sparse rewards
- Q-learning is more sample-efficient for this type of problem
While DQN is recommended for MountainCar, you can also use PPO (Proximal Policy Optimization). However, PPO typically requires more timesteps and careful hyperparameter tuning:
import gymnasium as gym
from stable_baselines3 import PPO
env = gym.make('MountainCar-v0')
model = PPO(
"MlpPolicy",
env,
verbose=1,
learning_rate=0.001,
n_steps=2048,
batch_size=64,
n_epochs=10
)
# PPO needs significantly more training time for MountainCar
model.learn(total_timesteps=500000)
model.save("ppo_mountaincar")For learning purposes and quick success, stick with DQN for MountainCar.
For simple environments like MountainCar with small observations (just 2 numbers), CPU training is actually faster than GPU. The overhead of moving data to the GPU outweighs any computational benefits.
GPU acceleration (using Metal on M-series Macs with device="mps") becomes beneficial when you have:
- Image observations (like Atari - 210x160x3 pixels)
- Large neural networks
- Batch processing many environments in parallel
For MountainCar and similar simple environments, stick with CPU (the default).
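In Stable-Baselines3 this is controlled by the device argument. A minimal sketch (the "mps" line assumes an Apple Silicon Mac with a recent PyTorch build, and atari_env is a hypothetical image-based environment):
import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make('MountainCar-v0')
# CPU is the sensible choice for this tiny 2-input MLP policy
model = DQN("MlpPolicy", env, device="cpu")
# For image-based environments you might instead request the GPU:
# model = DQN("CnnPolicy", atari_env, device="mps")  # use "cuda" on NVIDIA hardware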
Wrappers allow you to modify environments systematically without changing the core environment code. The accompanying course materials emphasize using wrappers to tweak rewards and simplify complex environments.
Gymnasium provides several wrapper types:
- gym.Wrapper - General purpose, can modify any aspect
- gym.ObservationWrapper - Modifies observations (override the observation() method)
- gym.RewardWrapper - Modifies rewards (override the reward() method)
- gym.ActionWrapper - Modifies actions (override the action() method)
This wrapper simplifies MountainCar's 3 discrete actions into just "left" or "right":
import gymnasium as gym
import numpy as np
class SimplifiedActionWrapper(gym.ActionWrapper):
def __init__(self, env):
super().__init__(env)
# Original env has 3 actions: [0=push_left, 1=no_action, 2=push_right]
# We simplify to binary: 0=left, 1=right
self.action_space = gym.spaces.Discrete(2)
def action(self, action):
# Map binary action to original 3-action space
# 0 -> 0 (push left)
# 1 -> 2 (push right)
# We never use "do nothing"
if action == 0:
return 0 # Push left
else:
return 2 # Push right
Test the wrapper:
env = gym.make('MountainCar-v0')
wrapped_env = SimplifiedActionWrapper(env)
print("Original action space:", env.action_space)
print("Wrapped action space:", wrapped_env.action_space)
obs, info = wrapped_env.reset()
# Now we can use just 0 or 1 as actions
obs, reward, terminated, truncated, info = wrapped_env.step(0)
print("Action 0 (left) executed successfully")Output:
Original action space: Discrete(3)
Wrapped action space: Discrete(2)
Action 0 (left) executed successfully
Important lesson: Reward shaping can easily backfire. Let's look at a failed attempt to understand why:
class NaiveRewardShaping(gym.RewardWrapper):
def __init__(self, env):
super().__init__(env)
def reward(self, reward):
# Get current position (height on the hill)
position = self.env.unwrapped.state[0]
# Original reward is -1 per step
# Add bonus for being higher up: position ranges from -1.2 to 0.6
height_bonus = (position + 1.2) * 0.5
return reward + height_bonus
This seems logical - reward the car for gaining height. But let's see what happens when we train:
import gymnasium as gym
from stable_baselines3 import DQN
# Save baseline model first for comparison
import shutil
from pathlib import Path
Path("./models").mkdir(exist_ok=True)
shutil.copy("dqn_mountaincar.zip", "./models/dqn_mountaincar_baseline.zip")
# Create wrapped environment with naive reward shaping
env = gym.make('MountainCar-v0')
wrapped_env = NaiveRewardShaping(env)
# Train with shaped rewards
model = DQN(
"MlpPolicy",
wrapped_env,
learning_rate=1e-3,
buffer_size=50000,
learning_starts=1000,
batch_size=128,
verbose=1
)
print("Training with naive reward shaping...")
model.learn(total_timesteps=200000)
model.save("./models/dqn_mountaincar_naive_shaped")Now let's compare the baseline vs shaped model:
from stable_baselines3 import DQN
import gymnasium as gym
import numpy as np
def test_model(model_path, num_episodes=10):
model = DQN.load(model_path)
env = gym.make('MountainCar-v0')
episode_lengths = []
successes = 0
for episode in range(num_episodes):
obs, info = env.reset()
for step in range(200):
action, _states = model.predict(obs, deterministic=True)
obs, reward, terminated, truncated, info = env.step(action)
if terminated:
episode_lengths.append(step)
successes += 1
break
env.close()
avg_length = np.mean(episode_lengths) if episode_lengths else 200
success_rate = (successes / num_episodes) * 100
return avg_length, success_rate
# Compare models
baseline_length, baseline_success = test_model("./models/dqn_mountaincar_baseline")
shaped_length, shaped_success = test_model("./models/dqn_mountaincar_naive_shaped")
print("\n--- Model Comparison ---")
print(f"Baseline - Avg steps: {baseline_length:.1f}, Success rate: {baseline_success:.0f}%")
print(f"Naive Shaped - Avg steps: {shaped_length:.1f}, Success rate: {shaped_success:.0f}%")Output:
--- Model Comparison ---
Baseline - Avg steps: 147.8, Success rate: 100%
Naive Shaped - Avg steps: 200.0, Success rate: 0%
What went wrong? The naive reward shaping rewards the car for being high on either hill. The agent learns to stay high on the left hill and collect rewards without actually reaching the goal on the right. This is called "reward hacking" - the agent found a way to maximize reward without solving the intended task.
A better approach only rewards rightward progress toward the goal:
class BetterRewardShaping(gym.RewardWrapper):
def __init__(self, env):
super().__init__(env)
self.best_position = -1.2 # Track the rightmost position reached
def reset(self, **kwargs):
self.best_position = -1.2
return self.env.reset(**kwargs)
def reward(self, reward):
position = self.env.unwrapped.state[0]
# Only reward new rightward progress
if position > self.best_position:
bonus = (position - self.best_position) * 10
self.best_position = position
return reward + bonus
return reward
Train with improved shaping:
import gymnasium as gym
from stable_baselines3 import DQN
env = gym.make('MountainCar-v0')
wrapped_env = BetterRewardShaping(env)
model = DQN(
"MlpPolicy",
wrapped_env,
learning_rate=1e-3,
buffer_size=50000,
learning_starts=1000,
batch_size=128,
verbose=1
)
print("Training with improved reward shaping...")
model.learn(total_timesteps=200000)
model.save("./models/dqn_mountaincar_better_shaped")This wrapper only gives bonuses for reaching new rightward positions, preventing the agent from being rewarded for staying on the left hill.
Let's test all three models to see the full picture:
from stable_baselines3 import DQN
import gymnasium as gym
import numpy as np
def test_model(model_path, num_episodes=10):
model = DQN.load(model_path)
env = gym.make('MountainCar-v0')
episode_lengths = []
successes = 0
for episode in range(num_episodes):
obs, info = env.reset()
for step in range(200):
action, _states = model.predict(obs, deterministic=True)
obs, reward, terminated, truncated, info = env.step(action)
if terminated:
episode_lengths.append(step)
successes += 1
break
env.close()
avg_length = np.mean(episode_lengths) if episode_lengths else 200
success_rate = (successes / num_episodes) * 100
return avg_length, success_rate
# Compare all three models
baseline_length, baseline_success = test_model("./models/dqn_mountaincar_baseline")
naive_length, naive_success = test_model("./models/dqn_mountaincar_naive_shaped")
better_length, better_success = test_model("./models/dqn_mountaincar_better_shaped")
print("\n--- Model Comparison ---")
print(f"Baseline - Avg steps: {baseline_length:.1f}, Success rate: {baseline_success:.0f}%")
print(f"Naive Shaped - Avg steps: {naive_length:.1f}, Success rate: {naive_success:.0f}%")
print(f"Better Shaped - Avg steps: {better_length:.1f}, Success rate: {better_success:.0f}%")Output:
--- Model Comparison ---
Baseline - Avg steps: 147.3, Success rate: 100%
Naive Shaped - Avg steps: 200.0, Success rate: 0%
Better Shaped - Avg steps: 121.6, Success rate: 70%
Analysis:
- Baseline: Reliable and consistent - 100% success rate with 147 steps average
- Naive Shaped: Complete failure - learned the wrong behavior (reward hacking)
- Better Shaped: Faster when it works (121 steps) but less reliable (70% success)
The better reward shaping achieves faster episode completion when successful, but hasn't fully solved the consistency problem. This shows that even improved reward shaping doesn't guarantee better performance than the baseline. The baseline's simplicity often wins.
You can stack multiple wrappers together:
import gymnasium as gym
# Create base environment
env = gym.make('MountainCar-v0')
# Apply multiple wrappers in sequence
wrapped_env = SimplifiedActionWrapper(env)
wrapped_env = BetterRewardShaping(wrapped_env)
print("Final action space:", wrapped_env.action_space)
print("Original environment preserved:", wrapped_env.unwrapped)Output:
Final action space: Discrete(2)
Original environment preserved: <MountainCarEnv instance>
Key lessons:
- Reward shaping is powerful but dangerous - Poorly designed rewards can mislead the agent
- Always test against a baseline - Compare wrapped vs unwrapped performance
- Reward hacking is common - Agents find unexpected ways to maximize reward
- Simple is often better - The original reward structure may work best
- Be specific about goals - Reward exactly what you want, not proxies
In many cases, the baseline environment without reward shaping works perfectly well, as we saw with MountainCar.
Vectorized environments allow you to run multiple environment instances in parallel, which can significantly speed up training by collecting experience from many environments simultaneously.
- Faster training - Collect more experience in less time
- Better exploration - Multiple environments explore different states simultaneously
- Smoother learning - Averages out randomness across environments
Stable-Baselines3 provides the make_vec_env utility for creating vectorized environments:
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
import gymnasium as gym
# Create 4 parallel environments
num_envs = 4
vec_env = make_vec_env('MountainCar-v0', n_envs=num_envs)
print(f"Created {num_envs} parallel environments")
print(f"Observation space: {vec_env.observation_space}")
print(f"Action space: {vec_env.action_space}")Output:
Created 4 parallel environments
Observation space: Box([-1.2 -0.07], [0.6 0.07], (2,), float32)
Action space: Discrete(3)The interface is similar to regular environments, but actions and observations are batched:
from stable_baselines3.common.env_util import make_vec_env
import numpy as np
vec_env = make_vec_env('MountainCar-v0', n_envs=4)
# Reset returns observations for all 4 environments
obs = vec_env.reset()
print("Observations shape:", obs.shape)
# Actions for all 4 environments
actions = np.array([0, 1, 2, 1])
obs, rewards, dones, infos = vec_env.step(actions)
print("Rewards:", rewards)
print("Dones:", dones)Output:
Observations shape: (4, 2)
Rewards: [-1. -1. -1. -1.]
Dones: [False False False False]
The observations are stacked into a (4, 2) array because you have 4 parallel environments, each with 2 observations (position and velocity).
Training with vectorized environments is straightforward - just pass the vectorized environment to your algorithm:
from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_vec_env
# Create 8 parallel environments
vec_env = make_vec_env('MountainCar-v0', n_envs=8)
# Train with vectorized environments
model = DQN("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=200000)
model.save("./models/dqn_mountaincar_vectorized")
print("Training complete with 8 parallel environments")The algorithm automatically handles collecting experience from all environments in parallel. This typically speeds up training significantly.
There are two types of vectorized environments:
DummyVecEnv - Runs environments sequentially in the same process:
from stable_baselines3.common.vec_env import DummyVecEnv
import gymnasium as gym
def make_env():
return gym.make('MountainCar-v0')
# Create 4 environments in same process
envs = [make_env for _ in range(4)]
vec_env = DummyVecEnv(envs)
SubprocVecEnv - Runs environments in separate processes (true parallelization):
from stable_baselines3.common.vec_env import SubprocVecEnv
import gymnasium as gym
def make_env():
return gym.make('MountainCar-v0')
# Create 4 environments in separate processes
envs = [make_env for _ in range(4)]
vec_env = SubprocVecEnv(envs)
Contrary to intuition, SubprocVecEnv is NOT always faster. Let's compare:
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
import gymnasium as gym
import time
def make_env():
return gym.make('MountainCar-v0')
# Test DummyVecEnv
start = time.time()
vec_env_dummy = DummyVecEnv([make_env for _ in range(8)])
vec_env_dummy.reset() # Must reset before stepping
for _ in range(1000):
vec_env_dummy.step([vec_env_dummy.action_space.sample() for _ in range(8)])
dummy_time = time.time() - start
vec_env_dummy.close()
# Test SubprocVecEnv
start = time.time()
vec_env_subproc = SubprocVecEnv([make_env for _ in range(8)])
vec_env_subproc.reset() # Must reset before stepping
for _ in range(1000):
vec_env_subproc.step([vec_env_subproc.action_space.sample() for _ in range(8)])
subproc_time = time.time() - start
vec_env_subproc.close()
print(f"DummyVecEnv time: {dummy_time:.2f}s")
print(f"SubprocVecEnv time: {subproc_time:.2f}s")
print(f"Speedup: {dummy_time/subproc_time:.2f}x")Output:
DummyVecEnv time: 0.10s
SubprocVecEnv time: 5.15s
Speedup: 0.02x
For simple environments like MountainCar, DummyVecEnv is 50x faster! The overhead of managing separate processes far outweighs any benefit from parallelization.
When to use SubprocVecEnv:
- Each environment step is computationally expensive (complex physics simulations)
- Processing images or running neural networks in the environment
- Environment steps take significantly longer than inter-process communication overhead
- You're training on Atari games with image observations
When to use DummyVecEnv:
- Simple environments with fast physics (MountainCar, CartPole, Pendulum)
- Small observation spaces (vectors rather than images)
- Debugging (easier to track errors in single process)
- The computation is so fast that communication overhead dominates
The easiest way is to use make_vec_env, which handles the details:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
# Automatically uses DummyVecEnv by default
vec_env = make_vec_env('MountainCar-v0', n_envs=8)
# Or explicitly choose SubprocVecEnv for computationally intensive environments
vec_env = make_vec_env('MountainCar-v0', n_envs=8, vec_env_cls=SubprocVecEnv)
Key Takeaway: For simple environments like MountainCar, always use DummyVecEnv (the default). Only use SubprocVecEnv when each environment step is expensive enough to justify the inter-process communication overhead.
You've learned:
- How to install and set up Gymnasium
- The core concepts: environments, observations, actions, rewards
- How to interact with environments using reset() and step()
- Different rendering modes (especially important for macOS users)
- Understanding Box and Discrete spaces
- The basics of environment wrappers
- How to train an agent using Stable-Baselines3
- That MountainCar requires patience and possibly extended training
The key difference between Gymnasium and the older Gym library:
- Import name: import gymnasium as gym
- reset() returns: obs, info = env.reset() (two values instead of one)
- step() returns: obs, reward, terminated, truncated, info = env.step(action) (five values, with "done" split into "terminated" and "truncated")
- Rendering: you must specify render_mode when creating the environment
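Here is a minimal side-by-side sketch of the migration; the old gym calls are shown only as comments for comparison:
import gymnasium as gym

env = gym.make('MountainCar-v0')
# Old gym:  obs = env.reset()
obs, info = env.reset()
action = env.action_space.sample()
# Old gym:  obs, reward, done, info = env.step(action)
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated  # equivalent of the old single "done" flag
env.close()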
Next steps would include learning more about wrappers, custom environments, and comparing different RL algorithms for your specific tasks.

