| Feature | PPO | TRPO | DDPG | A2C (Advantage Actor-Critic) | SAC (Soft Actor-Critic) |
|---|---|---|---|---|---|
| Algorithm Type | On-policy | On-policy | Off-policy | On-policy | Off-policy |
| Core Idea | Clipped surrogate objective (sketched below) | Trust region constraint (KL divergence) | Actor-critic + Q-learning for continuous actions | Synchronous advantage estimation | Maximum entropy objective (exploration) + off-policy learning |
| Stability | Very stable | Very stable | Can be unstable | Stable, but can be sensitive to hyperparameters | Very stable |
| Sample Efficiency | Moderate | Moderate | High (replay buffer) | Moderate (on-policy) | High (off-policy, replay buffer) |
| Complexity | Simple to implement | Complex (requires conjugate gradient) | Moderate to complex | Simple to implement | Moderate to complex |
| Action Space | Discrete & continuous | Discrete & continuous | Continuous only | Discrete & continuous | Discrete & continuous (especially strong for continuous) |
| Computation | Multiple epochs per batch | Solves a constrained optimization per update | Requires target networks | Simpler: single update per batch | More complex, but very powerful |
| Use Case | General-purpose, good baseline | High-stakes settings where stability is critical | Continuous control (robotics, autonomous driving) | Simpler on-policy tasks | Complex continuous control, real-world robotics |
| Hyperparameter Sensitivity | Low | Low | High | Medium | Medium to high |
| Exploration | Entropy bonus or action noise | Entropy bonus or action noise | Action noise (e.g., Ornstein-Uhlenbeck) | Entropy bonus | Intrinsic (via entropy maximization) |
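To make the PPO "clipped surrogate objective" row concrete, here is a minimal sketch of that loss in PyTorch. The function name, argument names, and the 0.2 clip range are illustrative assumptions rather than any particular library's API; it assumes the per-timestep log-probabilities under the new and old policies and the advantage estimates have already been computed.

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Sketch of PPO's clipped surrogate objective (names are illustrative).

    The probability ratio r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t) is computed
    in log space. The objective takes the minimum of the unclipped and clipped
    terms, which removes the incentive to move the ratio outside
    [1 - clip_eps, 1 + clip_eps]. Returning the negative mean turns gradient
    ascent on the objective into gradient descent on a loss.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

TRPO works with the same probability ratio but enforces a hard KL-divergence constraint solved with conjugate gradient instead of clipping, which is why the table lists it as more complex to implement than PPO.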