
Young 🔜 WM🌍|Oct 09, 2025 10:01
GRPO is like PPO, but instead of chasing absolute rewards, it learns from relative performance within a group of samples.
For each prompt, the model generates several outputs → scores them → and optimizes based on who did better relative to others, not the raw reward.
@akshay_pachaar has brought us a more intuitive display📺(Young 🔜 WM🌍)
Share To
Timeline
HotFlash
APP
X
Telegram
CopyLink