
Getting Started with Gymnasium: Updated Tutorial

Original Tutorial: Getting Started With OpenAI Gym: The Basic Building Blocks by DigitalOcean

Updated for Gymnasium by: Claude (Anthropic), drawing on course materials from the ADS-AI program at Breda University of Applied Sciences

Last Updated: December 2024


This tutorial provides an updated guide to using Gymnasium (the successor to OpenAI Gym) for reinforcement learning. All code examples use the modern Gymnasium API and have been tested with the latest library versions. The original tutorial focused on the deprecated gym library; this version updates all content for gymnasium while preserving the pedagogical structure and adding practical insights from real-world testing.


Table of Contents

  1. Installation and Setup
  2. Understanding the Environment Class
  3. Interacting with the Environment
  4. Rendering the Environment
  5. Running a Complete Episode
  6. Understanding Space Types
  7. Introduction to Wrappers
  8. Training MountainCar with Stable-Baselines3
  9. Creating Custom Wrappers
  10. Vectorized Environments

Lesson 1: Installation and Setup

Gymnasium is the actively maintained successor to OpenAI Gym. Install it using pip:

pip install gymnasium
pip install pygame

The pygame package is needed for rendering environments visually.
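
A quick way to confirm the install worked is to import the library and print its version (any recent Gymnasium release should run the code in this tutorial):

import gymnasium as gym

# Print the installed Gymnasium version as a sanity check
print(gym.__version__)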


Lesson 2: Understanding the Environment Class

The fundamental building block of Gymnasium is the Env class. It's a Python class that implements a simulator for the environment where you train your agent. Gymnasium comes with many environments: moving a car up a hill, balancing a swinging pendulum, playing Atari games, etc.

We'll start with MountainCar, where the objective is to drive a car up a mountain. The car sits on a one-dimensional track between two mountains. The goal is to reach the flag on the right mountain, but the engine isn't strong enough to scale it directly. You must drive back and forth to build momentum.

import gymnasium as gym
env = gym.make('MountainCar-v0')

The environment structure is described by two key attributes:

observation_space - defines the structure and legitimate values for observing the environment's state. For MountainCar, this is a vector of position and velocity.

action_space - defines the numerical structure of legitimate actions. For MountainCar, this is a discrete set of three actions: push left, do nothing, push right.

obs_space = env.observation_space
action_space = env.action_space
print("The observation space: {}".format(obs_space))
print("The action space: {}".format(action_space))

Output:

The observation space: Box([-1.2  -0.07], [0.6  0.07], (2,), float32)
The action space: Discrete(3)

The observation space is a Box, representing a 2-dimensional continuous space. The action space is Discrete with 3 possible values: 0 (push left), 1 (do nothing), or 2 (push right).
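
Both attributes are Space objects, so you can sample random valid values from them directly, which is handy for quick experiments (a small sketch, continuing from the code above):

# Draw a random (position, velocity) pair from the observation space
print(obs_space.sample())

# Draw a random action and check that it is valid for this environment
sampled_action = action_space.sample()
print(sampled_action, action_space.contains(sampled_action))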


Lesson 3: Interacting with the Environment

There are two critical functions for interacting with environments:

reset() - Resets the environment to its initial state and returns the starting observation and info dictionary.

step(action) - Applies an action to the environment and returns five values: the new observation, reward, terminated flag, truncated flag, and info dictionary.

import gymnasium as gym
env = gym.make('MountainCar-v0')

# Reset the environment and see the initial observation
obs, info = env.reset()
print("The initial observation is {}".format(obs))

# Sample a random action from the entire action space
random_action = env.action_space.sample()

# Take the action and get the new observation
new_obs, reward, terminated, truncated, info = env.step(random_action)
print("The new observation is {}".format(new_obs))

Output:

The initial observation is [-0.48235664  0.        ]
The new observation is [-0.48366517 -0.00130853]

The observation is a vector with two values: position and velocity. Position is measured along the track and increases to the right: the car starts near the bottom of the valley (around -0.5), and the goal flag sits at position 0.5.

Key concept: The step() function returns five values in gymnasium:

  • observation: the new state
  • reward: the reward for that action
  • terminated: True if the episode ended naturally (reached goal or failed)
  • truncated: True if the episode was cut off artificially (time limit, out of bounds)
  • info: additional diagnostic information

The split between terminated and truncated helps distinguish between episodes that completed naturally versus those cut off by time limits or other constraints.
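
In practice, you usually treat either flag as the end of the episode but keep them separate for logging; a minimal sketch of that pattern:

import gymnasium as gym

env = gym.make('MountainCar-v0')
obs, info = env.reset()

steps = 0
while True:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    steps += 1
    if terminated or truncated:
        # terminated: the car reached the flag; truncated: the 200-step time limit was hit
        print(f"Episode ended after {steps} steps (terminated={terminated}, truncated={truncated})")
        break

env.close()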


Lesson 4: Rendering the Environment

Note for macOS users: The standard render_mode='human' has issues on macOS where pygame windows don't close properly. We use matplotlib-based rendering instead, which works reliably in Jupyter notebooks.

If you want to see what the environment looks like visually, you need to specify the render mode when creating the environment:

import gymnasium as gym
import matplotlib.pyplot as plt
from IPython import display

env = gym.make('MountainCar-v0', render_mode='rgb_array')

obs, info = env.reset()
plt.figure(figsize=(6, 4))

for _ in range(100):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    
    frame = env.render()
    plt.clf()
    plt.imshow(frame)
    plt.axis('off')
    display.clear_output(wait=True)
    display.display(plt.gcf())

env.close()
plt.close()

This displays an animated view of the environment that updates smoothly in Jupyter notebooks.

MountainCar Environment Sample

Alternative: Using matplotlib.animation for smoother rendering

For more sophisticated visualization, you can use matplotlib's animation capabilities:

import gymnasium as gym
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np

env = gym.make('MountainCar-v0', render_mode='rgb_array')
obs, info = env.reset()

frames = []
for _ in range(200):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    frames.append(env.render())
    if terminated or truncated:
        obs, info = env.reset()

env.close()

# Create animation
fig, ax = plt.subplots(figsize=(6, 4))
ax.axis('off')
img = ax.imshow(frames[0])

def update(frame_idx):
    img.set_data(frames[frame_idx])
    return [img]

ani = animation.FuncAnimation(fig, update, frames=len(frames), interval=50, blit=True)
plt.show()

When to use different render modes:

  • render_mode='rgb_array': When you want to record frames, process them programmatically, train without display overhead, or work on systems without display capabilities
  • matplotlib rendering: Best for Jupyter notebooks on macOS
  • No rendering: Most efficient for training - you don't need to see every step

Lesson 5: Running a Complete Episode

Let's put together everything we've learned into a complete simulation. We'll run the agent for multiple steps, taking random actions:

import gymnasium as gym
import matplotlib.pyplot as plt
from IPython import display
import time

env = gym.make('MountainCar-v0', render_mode='rgb_array')
num_steps = 1500

obs, info = env.reset()
plt.figure(figsize=(6, 4))

for step in range(num_steps):
    # Take random action
    action = env.action_space.sample()
    
    # Apply the action
    obs, reward, terminated, truncated, info = env.step(action)
    
    # Render the environment
    frame = env.render()
    plt.clf()
    plt.imshow(frame)
    plt.axis('off')
    display.clear_output(wait=True)
    display.display(plt.gcf())
    time.sleep(0.001)
    
    # If the episode is done, start another one
    if terminated or truncated:
        obs, info = env.reset()

plt.close()

Notice the structure: reset the environment, loop through steps taking actions, check if the episode ended, and reset if needed.

In this code, we're taking random actions with env.action_space.sample(). To make the agent actually intelligent, you would replace this with a function that looks at the observation and chooses actions based on what it has learned maximizes reward. That's where reinforcement learning algorithms like Q-learning or policy gradients come in.
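
For intuition, even a simple hand-written rule beats random actions on MountainCar: always push in the direction the car is already moving, so each swing adds momentum. A minimal sketch (no learning involved; it usually reaches the flag well within the 200-step limit):

import gymnasium as gym

env = gym.make('MountainCar-v0')
obs, info = env.reset()

for step in range(200):
    position, velocity = obs
    # Push right (2) when moving right, push left (0) when moving left
    action = 2 if velocity > 0 else 0
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated:
        print(f"Reached the flag in {step + 1} steps")
        break

env.close()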


Lesson 6: Understanding Space Types

The observation_space for MountainCar was Box(2,) and the action_space was Discrete(3). These are data structures called Spaces that describe valid values for observations and actions. All spaces inherit from the gym.Space base class.

import gymnasium as gym

env = gym.make('MountainCar-v0')
print(type(env.observation_space))

Output:

<class 'gymnasium.spaces.box.Box'>

Box Space

The Box(n,) space represents an n-dimensional continuous space. For MountainCar, n=2, so it's a 2-dimensional continuous space with position and velocity.

Box spaces are bounded with upper and lower limits:

import gymnasium as gym

env = gym.make('MountainCar-v0')
print("Upper Bound for Env Observation:", env.observation_space.high)
print("Lower Bound for Env Observation:", env.observation_space.low)

Output:

Upper Bound for Env Observation: [0.6  0.07]
Lower Bound for Env Observation: [-1.2  -0.07]

Position ranges from -1.2 to 0.6, and velocity ranges from -0.07 to 0.07.

Discrete Space

The Discrete(n) space describes a discrete space with values from 0 to n-1. For MountainCar, n=3, so actions can be 0, 1, or 2.

env.step(2)  # Works fine
env.step(4)  # Raises an error - 4 is not in [0, 1, 2]

Example: Robot Arm with Continuous Joints

If you were designing an environment for a robot arm with 3 joints, each able to rotate continuously, you would use a Box space for the action space:

from gymnasium import spaces
import numpy as np

# 3 joints, each can rotate from -180 to 180 degrees
action_space = spaces.Box(
    low=np.array([-180, -180, -180]), 
    high=np.array([180, 180, 180]), 
    dtype=np.float32
)

This gives you continuous control over all three joints simultaneously.
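
Sampling from this space returns a length-3 float32 array of joint angles, and contains() lets you validate a command before applying it (a quick check):

# A random joint command: three angles between -180 and 180 degrees
command = action_space.sample()
print(command, command.shape)
print(action_space.contains(command))  # True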


Lesson 7: Introduction to Wrappers

Before we dive into Wrappers, let's switch to a more complex environment: the Atari game Breakout. This will help us understand why Wrappers are useful.

First, install the Atari components:

pip install "gymnasium[atari]"
pip install "gymnasium[accept-rom-license]"

Alternatively:

pip install gymnasium
pip install ale-py
pip install "autorom[accept-rom-license]"

Now let's run Breakout with random actions:

import gymnasium as gym
import matplotlib.pyplot as plt
from IPython import display
import time

env = gym.make("ALE/Breakout-v5", render_mode='rgb_array')
print("Observation Space:", env.observation_space)
print("Action Space:", env.action_space)

obs, info = env.reset()
plt.figure(figsize=(6, 8))

for i in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    
    frame = env.render()
    plt.clf()
    plt.imshow(frame)
    plt.axis('off')
    display.clear_output(wait=True)
    display.display(plt.gcf())
    time.sleep(0.01)
    
    if terminated or truncated:
        obs, info = env.reset()

plt.close()

Output:

Observation Space: Box(0, 255, (210, 160, 3), uint8)
Action Space: Discrete(4)

The observation space is a 210x160 RGB image. The action space has 4 discrete actions: NOOP (do nothing), FIRE, RIGHT, and LEFT.

We might want to modify this environment before training because:

  • The 210x160x3 image is large and contains redundant information
  • We may want to normalize pixel values
  • We may want to stack multiple frames together to capture motion
  • We may want to clip or reshape rewards

This is where Wrappers become essential. They allow us to systematically modify environments without rewriting the core environment code.
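
As a preview, Gymnasium already ships built-in wrappers for several of these preprocessing steps. The sketch below grayscales and resizes the Breakout frames; the wrapper names follow recent Gymnasium releases, and older versions spell some of them slightly differently (e.g. GrayScaleObservation):

import gymnasium as gym
from gymnasium.wrappers import GrayscaleObservation, ResizeObservation

env = gym.make("ALE/Breakout-v5")
print("Before:", env.observation_space)  # Box(0, 255, (210, 160, 3), uint8)

# Drop the color channels, then shrink each frame to 84x84
env = GrayscaleObservation(env)
env = ResizeObservation(env, (84, 84))
print("After: ", env.observation_space)

env.close()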


Lesson 8: Training MountainCar with Stable-Baselines3

Let's train an agent to actually solve MountainCar using Stable-Baselines3. We'll use the DQN (Deep Q-Network) algorithm, which works particularly well for discrete action space problems like MountainCar.

First, install Stable-Baselines3:

pip install stable-baselines3

Training with DQN

DQN is specifically designed for discrete action spaces and learns MountainCar reliably:

import gymnasium as gym
from stable_baselines3 import DQN

# Create the environment
env = gym.make('MountainCar-v0')

# Create the DQN model with tuned hyperparameters
model = DQN(
    "MlpPolicy",
    env,
    learning_rate=1e-3,
    buffer_size=50000,
    learning_starts=1000,
    batch_size=128,
    tau=1.0,
    gamma=0.99,
    train_freq=4,
    gradient_steps=1,
    target_update_interval=250,
    exploration_fraction=0.2,
    exploration_final_eps=0.05,
    verbose=1
)

# Train - DQN typically solves MountainCar in 200k timesteps
print("Training DQN on MountainCar...")
model.learn(total_timesteps=200000)

# Save the trained model
model.save("dqn_mountaincar")
print("Training complete!")

Output:

Training DQN on MountainCar...
# You'll see training progress with episode rewards improving
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 200      |
|    ep_rew_mean     | -200     |
# ... training continues ...
# Episode rewards improve from -200 to around -100 as agent learns
---------------------------------
Training complete!

Testing the Trained Agent

Now let's test the trained agent with visualization:

import gymnasium as gym
from stable_baselines3 import DQN
import matplotlib.pyplot as plt
from IPython import display

# Load the trained model
model = DQN.load("dqn_mountaincar")

# Create environment with rendering
env = gym.make('MountainCar-v0', render_mode='rgb_array')
obs, info = env.reset()

plt.figure(figsize=(6, 4))

for step in range(200):
    # Use the trained model to predict the action
    action, _states = model.predict(obs, deterministic=True)
    
    obs, reward, terminated, truncated, info = env.step(action)
    
    frame = env.render()
    plt.clf()
    plt.imshow(frame)
    plt.axis('off')
    plt.title(f"Step: {step}, Position: {obs[0]:.3f}")
    display.clear_output(wait=True)
    display.display(plt.gcf())
    
    if terminated:
        print(f"Success! Reached goal in {step} steps")
        break

env.close()
plt.close()

Output:

Success! Reached goal in 89 steps
# You'll see the car successfully reach the flag

Trained MountainCar Agent
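
The manual loop above gives a qualitative check. For a quantitative score, Stable-Baselines3 provides an evaluate_policy helper that averages the return over several episodes (a short sketch reusing the saved model):

import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy

model = DQN.load("dqn_mountaincar")
env = gym.make('MountainCar-v0')

# Average undiscounted episode reward over 20 deterministic evaluation episodes
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20, deterministic=True)
print(f"Mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")

env.close()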

Saving Training Results as a GIF

To create a shareable visualization of your trained agent:

from stable_baselines3 import DQN
import gymnasium as gym
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from pathlib import Path

# Create assets directory if it doesn't exist
Path("./assets").mkdir(exist_ok=True)

# Load the trained model
model = DQN.load("dqn_mountaincar")
env = gym.make('MountainCar-v0', render_mode='rgb_array')

# Collect frames
obs, info = env.reset()
frames = []

for step in range(200):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    frames.append(env.render())
    
    if terminated:
        print(f"Success! Reached goal in {step} steps")
        break

env.close()

# Create animation and save as GIF
fig, ax = plt.subplots(figsize=(6, 4))
ax.axis('off')
img = ax.imshow(frames[0])

def update(frame_idx):
    img.set_data(frames[frame_idx])
    return [img]

ani = animation.FuncAnimation(fig, update, frames=len(frames), interval=50, blit=True)
ani.save('./assets/mountain_car_trained.gif', writer='pillow', fps=20)
plt.close()

print(f"GIF saved to ./assets/mountain_car_trained.gif with {len(frames)} frames")

Output:

Success! Reached goal in 89 steps
GIF saved to ./assets/mountain_car_trained.gif with 89 frames

Why DQN Works Better for MountainCar

MountainCar is actually a challenging problem because:

  1. The reward is -1 for every timestep until you reach the goal
  2. The agent needs to learn the counterintuitive strategy of going backwards first to build momentum
  3. The sparse reward structure makes it difficult for policy gradient methods

DQN excels at this task because:

  • It's specifically designed for discrete action spaces
  • The experience replay buffer helps with sparse rewards
  • Q-learning is more sample-efficient for this type of problem

Alternative: Using PPO

While DQN is recommended for MountainCar, you can also use PPO (Proximal Policy Optimization). However, PPO typically requires more timesteps and careful hyperparameter tuning:

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make('MountainCar-v0')

model = PPO(
    "MlpPolicy", 
    env, 
    verbose=1,
    learning_rate=0.001,
    n_steps=2048,
    batch_size=64,
    n_epochs=10
)

# PPO needs significantly more training time for MountainCar
model.learn(total_timesteps=500000)
model.save("ppo_mountaincar")

For learning purposes and quick success, stick with DQN for MountainCar.

Note on GPU Usage

For simple environments like MountainCar with small observations (just 2 numbers), CPU training is actually faster than GPU. The overhead of moving data to the GPU outweighs any computational benefits.

GPU acceleration (using Metal on M-series Macs with device="mps") becomes beneficial when you have:

  • Image observations (like Atari - 210x160x3 pixels)
  • Large neural networks
  • Batch processing many environments in parallel

For MountainCar and similar simple environments, stick with CPU (the default).
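
Stable-Baselines3 picks a device automatically, but you can pin it explicitly with the device argument if you want to force CPU (or experiment with "mps" or "cuda" on image-based tasks):

import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make('MountainCar-v0')

# Force CPU training for this small, vector-observation environment
model = DQN("MlpPolicy", env, device="cpu", verbose=0)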


Lesson 9: Creating Custom Wrappers

Wrappers allow you to modify environments systematically without changing the core environment code. Your course materials emphasize using wrappers to tweak rewards and simplify complex environments.

Understanding Wrappers

Gymnasium provides several wrapper types:

  1. gym.Wrapper - General purpose, can modify any aspect
  2. gym.ObservationWrapper - Modifies observations (override observation() method; see the short sketch after this list)
  3. gym.RewardWrapper - Modifies rewards (override reward() method)
  4. gym.ActionWrapper - Modifies actions (override action() method)
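
The worked examples below use ActionWrapper and RewardWrapper; for completeness, here is a minimal ObservationWrapper sketch that rescales MountainCar's observations to the range [-1, 1] (the scaling is illustrative, not required for training):

import gymnasium as gym
import numpy as np

class NormalizedObservation(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        self.low = env.observation_space.low
        self.high = env.observation_space.high
        # Advertise the new, rescaled observation space
        self.observation_space = gym.spaces.Box(
            low=-1.0, high=1.0, shape=env.observation_space.shape, dtype=np.float32
        )

    def observation(self, observation):
        # Map [low, high] linearly onto [-1, 1], element-wise
        scaled = 2.0 * (observation - self.low) / (self.high - self.low) - 1.0
        return scaled.astype(np.float32)

env = NormalizedObservation(gym.make('MountainCar-v0'))
obs, info = env.reset()
print(obs)  # position and velocity, both within [-1, 1]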

Example 1: Simplified Action Wrapper

This wrapper simplifies MountainCar's 3 discrete actions into just "left" or "right":

import gymnasium as gym
import numpy as np

class SimplifiedActionWrapper(gym.ActionWrapper):
    def __init__(self, env):
        super().__init__(env)
        # Original env has 3 actions: [0=push_left, 1=no_action, 2=push_right]
        # We simplify to binary: 0=left, 1=right
        self.action_space = gym.spaces.Discrete(2)
    
    def action(self, action):
        # Map binary action to original 3-action space
        # 0 -> 0 (push left)
        # 1 -> 2 (push right)
        # We never use "do nothing"
        if action == 0:
            return 0  # Push left
        else:
            return 2  # Push right

Test the wrapper:

env = gym.make('MountainCar-v0')
wrapped_env = SimplifiedActionWrapper(env)

print("Original action space:", env.action_space)
print("Wrapped action space:", wrapped_env.action_space)

obs, info = wrapped_env.reset()
# Now we can use just 0 or 1 as actions
obs, reward, terminated, truncated, info = wrapped_env.step(0)
print("Action 0 (left) executed successfully")

Output:

Original action space: Discrete(3)
Wrapped action space: Discrete(2)
Action 0 (left) executed successfully

Example 2: Reward Shaping Wrapper (Failed Attempt)

Important lesson: Reward shaping can easily backfire. Let's look at a failed attempt to understand why:

class NaiveRewardShaping(gym.RewardWrapper):
    def __init__(self, env):
        super().__init__(env)
    
    def reward(self, reward):
        # Get current position (height on the hill)
        position = self.env.unwrapped.state[0]
        
        # Original reward is -1 per step
        # Add bonus for being higher up: position ranges from -1.2 to 0.6
        height_bonus = (position + 1.2) * 0.5
        
        return reward + height_bonus

This seems logical - reward the car for gaining height. But let's see what happens when we train:

import gymnasium as gym
from stable_baselines3 import DQN

# Save baseline model first for comparison
import shutil
from pathlib import Path
Path("./models").mkdir(exist_ok=True)
shutil.copy("dqn_mountaincar.zip", "./models/dqn_mountaincar_baseline.zip")

# Create wrapped environment with naive reward shaping
env = gym.make('MountainCar-v0')
wrapped_env = NaiveRewardShaping(env)

# Train with shaped rewards
model = DQN(
    "MlpPolicy",
    wrapped_env,
    learning_rate=1e-3,
    buffer_size=50000,
    learning_starts=1000,
    batch_size=128,
    verbose=1
)

print("Training with naive reward shaping...")
model.learn(total_timesteps=200000)
model.save("./models/dqn_mountaincar_naive_shaped")

Now let's compare the baseline vs shaped model:

from stable_baselines3 import DQN
import gymnasium as gym
import numpy as np

def test_model(model_path, num_episodes=10):
    model = DQN.load(model_path)
    env = gym.make('MountainCar-v0')
    
    episode_lengths = []
    successes = 0
    
    for episode in range(num_episodes):
        obs, info = env.reset()
        for step in range(200):
            action, _states = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            
            if terminated:
                episode_lengths.append(step)
                successes += 1
                break
    
    env.close()
    
    avg_length = np.mean(episode_lengths) if episode_lengths else 200
    success_rate = (successes / num_episodes) * 100
    
    return avg_length, success_rate

# Compare models
baseline_length, baseline_success = test_model("./models/dqn_mountaincar_baseline")
shaped_length, shaped_success = test_model("./models/dqn_mountaincar_naive_shaped")

print("\n--- Model Comparison ---")
print(f"Baseline - Avg steps: {baseline_length:.1f}, Success rate: {baseline_success:.0f}%")
print(f"Naive Shaped - Avg steps: {shaped_length:.1f}, Success rate: {shaped_success:.0f}%")

Output:

--- Model Comparison ---
Baseline - Avg steps: 147.8, Success rate: 100%
Naive Shaped - Avg steps: 200.0, Success rate: 0%

What went wrong? The naive reward shaping rewards the car for being high on either hill. The agent learns to stay high on the left hill and collect rewards without actually reaching the goal on the right. This is called "reward hacking" - the agent found a way to maximize reward without solving the intended task.

Example 3: Better Reward Shaping

A better approach only rewards rightward progress toward the goal:

class BetterRewardShaping(gym.RewardWrapper):
    def __init__(self, env):
        super().__init__(env)
        self.best_position = -1.2  # Track the rightmost position reached
    
    def reset(self, **kwargs):
        self.best_position = -1.2
        return self.env.reset(**kwargs)
    
    def reward(self, reward):
        position = self.env.unwrapped.state[0]
        
        # Only reward new rightward progress
        if position > self.best_position:
            bonus = (position - self.best_position) * 10
            self.best_position = position
            return reward + bonus
        
        return reward

Train with improved shaping:

import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make('MountainCar-v0')
wrapped_env = BetterRewardShaping(env)

model = DQN(
    "MlpPolicy",
    wrapped_env,
    learning_rate=1e-3,
    buffer_size=50000,
    learning_starts=1000,
    batch_size=128,
    verbose=1
)

print("Training with improved reward shaping...")
model.learn(total_timesteps=200000)
model.save("./models/dqn_mountaincar_better_shaped")

This wrapper only gives bonuses for reaching new rightward positions, preventing the agent from being rewarded for staying on the left hill.

Comparing All Three Approaches

Let's test all three models to see the full picture:

from stable_baselines3 import DQN
import gymnasium as gym
import numpy as np

def test_model(model_path, num_episodes=10):
    model = DQN.load(model_path)
    env = gym.make('MountainCar-v0')
    
    episode_lengths = []
    successes = 0
    
    for episode in range(num_episodes):
        obs, info = env.reset()
        for step in range(200):
            action, _states = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            
            if terminated:
                episode_lengths.append(step)
                successes += 1
                break
    
    env.close()
    
    avg_length = np.mean(episode_lengths) if episode_lengths else 200
    success_rate = (successes / num_episodes) * 100
    
    return avg_length, success_rate

# Compare all three models
baseline_length, baseline_success = test_model("./models/dqn_mountaincar_baseline")
naive_length, naive_success = test_model("./models/dqn_mountaincar_naive_shaped")
better_length, better_success = test_model("./models/dqn_mountaincar_better_shaped")

print("\n--- Model Comparison ---")
print(f"Baseline      - Avg steps: {baseline_length:.1f}, Success rate: {baseline_success:.0f}%")
print(f"Naive Shaped  - Avg steps: {naive_length:.1f}, Success rate: {naive_success:.0f}%")
print(f"Better Shaped - Avg steps: {better_length:.1f}, Success rate: {better_success:.0f}%")

Output:

--- Model Comparison ---
Baseline      - Avg steps: 147.3, Success rate: 100%
Naive Shaped  - Avg steps: 200.0, Success rate: 0%
Better Shaped - Avg steps: 121.6, Success rate: 70%

Analysis:

  • Baseline: Reliable and consistent - 100% success rate with 147 steps average
  • Naive Shaped: Complete failure - learned the wrong behavior (reward hacking)
  • Better Shaped: Faster when it works (121 steps) but less reliable (70% success)

The better reward shaping achieves faster episode completion when successful, but hasn't fully solved the consistency problem. This shows that even improved reward shaping doesn't guarantee better performance than the baseline. The baseline's simplicity often wins.

Combining Multiple Wrappers

You can stack multiple wrappers together:

import gymnasium as gym

# Create base environment
env = gym.make('MountainCar-v0')

# Apply multiple wrappers in sequence
wrapped_env = SimplifiedActionWrapper(env)
wrapped_env = BetterRewardShaping(wrapped_env)

print("Final action space:", wrapped_env.action_space)
print("Original environment preserved:", wrapped_env.unwrapped)

Output:

Final action space: Discrete(2)
Original environment preserved: <MountainCarEnv instance>

Key Lessons About Reward Shaping

  1. Reward shaping is powerful but dangerous - Poorly designed rewards can mislead the agent
  2. Always test against a baseline - Compare wrapped vs unwrapped performance
  3. Reward hacking is common - Agents find unexpected ways to maximize reward
  4. Simple is often better - The original reward structure may work best
  5. Be specific about goals - Reward exactly what you want, not proxies

In many cases, the baseline environment without reward shaping works perfectly well, as we saw with MountainCar.


Lesson 10: Vectorized Environments

Vectorized environments allow you to run multiple environment instances in parallel, which can significantly speed up training by collecting experience from many environments simultaneously.

Why Use Vectorized Environments?

  1. Faster training - Collect more experience in less time
  2. Better exploration - Multiple environments explore different states simultaneously
  3. Smoother learning - Averages out randomness across environments

Creating Vectorized Environments

Stable-Baselines3 provides the make_vec_env utility for this:

from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
import gymnasium as gym

# Create 4 parallel environments
num_envs = 4
vec_env = make_vec_env('MountainCar-v0', n_envs=num_envs)

print(f"Created {num_envs} parallel environments")
print(f"Observation space: {vec_env.observation_space}")
print(f"Action space: {vec_env.action_space}")

Output:

Created 4 parallel environments
Observation space: Box([-1.2  -0.07], [0.6  0.07], (2,), float32)
Action space: Discrete(3)

Using Vectorized Environments

The interface is similar to regular environments, but actions and observations are batched:

from stable_baselines3.common.env_util import make_vec_env
import numpy as np

vec_env = make_vec_env('MountainCar-v0', n_envs=4)

# Reset returns observations for all 4 environments
obs = vec_env.reset()
print("Observations shape:", obs.shape)

# Actions for all 4 environments
actions = np.array([0, 1, 2, 1])
obs, rewards, dones, infos = vec_env.step(actions)

print("Rewards:", rewards)
print("Dones:", dones)

Output:

Observations shape: (4, 2)
Rewards: [-1. -1. -1. -1.]
Dones: [False False False False]

The observations are stacked into a (4, 2) array because there are 4 parallel environments, each producing a 2-dimensional observation (position and velocity).

Training with Vectorized Environments

Training with vectorized environments is straightforward - just pass the vectorized environment to your algorithm:

from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_vec_env

# Create 8 parallel environments
vec_env = make_vec_env('MountainCar-v0', n_envs=8)

# Train with vectorized environments
model = DQN("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=200000)
model.save("./models/dqn_mountaincar_vectorized")

print("Training complete with 8 parallel environments")

The algorithm automatically handles collecting experience from all environments in parallel. This typically speeds up training significantly.

DummyVecEnv vs SubprocVecEnv

There are two types of vectorized environments:

DummyVecEnv - Runs environments sequentially in the same process:

from stable_baselines3.common.vec_env import DummyVecEnv
import gymnasium as gym

def make_env():
    return gym.make('MountainCar-v0')

# Create 4 environments in same process
envs = [make_env for _ in range(4)]
vec_env = DummyVecEnv(envs)

SubprocVecEnv - Runs environments in separate processes (true parallelization):

from stable_baselines3.common.vec_env import SubprocVecEnv
import gymnasium as gym

def make_env():
    return gym.make('MountainCar-v0')

# Create 4 environments in separate processes
envs = [make_env for _ in range(4)]
vec_env = SubprocVecEnv(envs)

Performance Comparison

Contrary to intuition, SubprocVecEnv is NOT always faster. Let's compare:

from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
import gymnasium as gym
import time

def make_env():
    return gym.make('MountainCar-v0')

# Test DummyVecEnv
start = time.time()
vec_env_dummy = DummyVecEnv([make_env for _ in range(8)])
vec_env_dummy.reset()  # Must reset before stepping
for _ in range(1000):
    vec_env_dummy.step([vec_env_dummy.action_space.sample() for _ in range(8)])
dummy_time = time.time() - start
vec_env_dummy.close()

# Test SubprocVecEnv
start = time.time()
vec_env_subproc = SubprocVecEnv([make_env for _ in range(8)])
vec_env_subproc.reset()  # Must reset before stepping
for _ in range(1000):
    vec_env_subproc.step([vec_env_subproc.action_space.sample() for _ in range(8)])
subproc_time = time.time() - start
vec_env_subproc.close()

print(f"DummyVecEnv time: {dummy_time:.2f}s")
print(f"SubprocVecEnv time: {subproc_time:.2f}s")
print(f"Speedup: {dummy_time/subproc_time:.2f}x")

Output:

DummyVecEnv time: 0.10s
SubprocVecEnv time: 5.15s
Speedup: 0.02x

For simple environments like MountainCar, DummyVecEnv is 50x faster! The overhead of managing separate processes far outweighs any benefit from parallelization.

When to use SubprocVecEnv:

  • Each environment step is computationally expensive (complex physics simulations)
  • Processing images or running neural networks in the environment
  • Environment steps take significantly longer than inter-process communication overhead
  • You're training on Atari games with image observations

When to use DummyVecEnv:

  • Simple environments with fast physics (MountainCar, CartPole, Pendulum)
  • Small observation spaces (vectors rather than images)
  • Debugging (easier to track errors in single process)
  • The computation is so fast that communication overhead dominates

Using make_vec_env (Recommended)

The easiest way is to use make_vec_env, which handles the details:

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# Automatically uses DummyVecEnv by default
vec_env = make_vec_env('MountainCar-v0', n_envs=8)

# Or explicitly choose SubprocVecEnv for computationally intensive environments
vec_env = make_vec_env('MountainCar-v0', n_envs=8, vec_env_cls=SubprocVecEnv)

Key Takeaway: For simple environments like MountainCar, always use DummyVecEnv (the default). Only use SubprocVecEnv when each environment step is expensive enough to justify the inter-process communication overhead.


Summary

You've learned:

  1. How to install and set up Gymnasium
  2. The core concepts: environments, observations, actions, rewards
  3. How to interact with environments using reset() and step()
  4. Different rendering modes (especially important for macOS users)
  5. Understanding Box and Discrete spaces
  6. The basics of environment wrappers
  7. How to train an agent using Stable-Baselines3
  8. That MountainCar requires patience and possibly extended training

The key differences between Gymnasium and the older Gym library (all of which appear in the short loop after this list):

  • Import name: import gymnasium as gym
  • reset() returns: obs, info = env.reset() (two values instead of one)
  • step() returns: obs, reward, terminated, truncated, info = env.step(action) (five values, with "done" split into "terminated" and "truncated")
  • Rendering: Must specify render_mode when creating environment
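
Put together, a minimal Gymnasium episode loop that exercises all of these changes looks like this:

import gymnasium as gym

env = gym.make('MountainCar-v0', render_mode='rgb_array')

obs, info = env.reset()  # reset() now returns two values
done = False
while not done:
    # step() now returns five values
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    # the old "done" flag is split into terminated and truncated
    done = terminated or truncated

env.close()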

Next steps would include learning more about wrappers, custom environments, and comparing different RL algorithms for your specific tasks.
