| Feature | PPO | TRPO | DDPG | A2C (Advantage Actor-Critic) | SAC (Soft Actor-Critic) |
|---|---|---|---|---|---|
| Algorithm Type | On-policy | On-policy | Off-policy | On-policy | Off-policy |
| Core Idea | Clipped surrogate objective (sketched below) | Trust region constraint (KL divergence) | Actor-critic + Q-learning for continuous actions | Synchronous advantage estimation | Maximum entropy objective (exploration) + off-policy learning |
| Stability | Very stable | Very stable | Can be unstable | Stable, but can be sensitive to hyperparameters | Very stable |
| Sample Efficiency | Moderate | Moderate | High (replay buffer) | Moderate (on-policy) | High (off-policy, replay buffer) |
| Complexity | Simple to implement | Complex (requires conjugate gradient) | Moderate to complex | Simple to implement | Moderate to complex |
| Action Space | Discrete & continuous | Discrete & continuous | Continuous only | Discrete & continuous | Discrete & continuous (especially strong for continuous) |
| Computation | Multiple epochs per batch | Solves a constrained optimization per update | Requires target networks | Simpler: single update per batch | More complex, but very powerful |
| Use Case | General-purpose, good baseline | High-stakes settings where stability is critical | Continuous control (robotics, autonomous driving) | Simpler on-policy tasks | Complex continuous control, real-world robotics |
| Hyperparameter Sensitivity | Low | Low | High | Medium | Medium to high |
| Exploration | Entropy bonus or action noise | Entropy bonus or action noise | Action noise (e.g., Ornstein-Uhlenbeck) | Entropy bonus | Intrinsic (via entropy maximization) |
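To make the PPO "clipped surrogate objective" row concrete, here is a minimal sketch of that loss in PyTorch. The function name, argument names, and the 0.2 clip range are illustrative assumptions rather than any particular library's API; it assumes the per-timestep log-probabilities under the new and old policies and the advantage estimates have already been computed.

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Sketch of PPO's clipped surrogate objective (names are illustrative).

    The probability ratio r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t) is computed
    in log space. The objective takes the minimum of the unclipped and clipped
    terms, which removes the incentive to move the ratio outside
    [1 - clip_eps, 1 + clip_eps]. Returning the negative mean turns gradient
    ascent on the objective into gradient descent on a loss.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

TRPO works with the same probability ratio but enforces a hard KL-divergence constraint solved with conjugate gradient instead of clipping, which is why the table lists it as more complex to implement than PPO.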