The Evolution of Policy Optimization: Understanding GRPO, DAPO, and Dr. GRPO's Theoretical Foundations
This article is the theoretical companion to "Bridging Theory and Practice: Understanding GRPO Implementation Details in Hugging Face's TRL Library." While the companion piece focuses on implementation specifics, here I explore the mathematical foundations and conceptual evolution of these reinforcement learning algorithms for language models.
I'll examine three key algorithms that represent the rapid advancement in this field:
- GRPO (Group Relative Policy Optimization): The pioneering approach from DeepSeek that established a new paradigm for training reasoning capabilities in LLMs
- DAPO (Decouple Clip and Dynamic sAmpling Policy Optimization): An open-source system that scaled reinforcement learning for LLMs while addressing key limitations in GRPO