🤖

GRPO CodeReviewEnv

Reinforcement Learning · Bug-Fix Agent · Auto Difficulty Escalation

Qwen2.5-Coder-32B · HF Router · GRPO Training · Exec + LLM Judge Rewards
🎯 Current Level: HARD
📊 Total Episodes: 10
🔥 Win Streak: 0
⚡ Last Reward: 0.830

📋 Training Stats

Extreme | 3 | 0.938 | 0.83 | 0.69 | ✅ Mastered

📡 Live Episode Feed (last 20)

10 | Medium | 0.830 | ████████░░
GRPO CodeReviewEnv: Training Architecture
LLM Agent · Execution Reward · LLM-as-Judge · Auto Difficulty Escalation

1. BugGenerator: an LLM produces a buggy challenge at one of four levels (easy/medium/hard/extreme).
   → challenge
2. RL Agent: Qwen2.5-Coder-32B via HF Router; agent_fix(challenge) returns the fixed code.
   → fixed code
3. HFRewardEvaluator: ExecScore 60% (run tests), JudgeScore 40% (LLM eval); Final = 0.6×exec + 0.4×judge.
   → reward
4. DifficultyEscalator: rolling avg (window=5) and win/lose streaks decide escalate / stay / drop; the result is written to progress.json.
   → next difficulty level
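The reward blend in the HFRewardEvaluator step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `blend_reward` and the assumption that both scores lie in [0, 1] are hypothetical; only the 0.6 and 0.4 weights come from the diagram.

```python
# Weights taken from the architecture diagram: Final = 0.6*exec + 0.4*judge.
EXEC_WEIGHT = 0.6
JUDGE_WEIGHT = 0.4


def blend_reward(exec_score: float, judge_score: float) -> float:
    """Combine the test-execution score and the LLM-judge score.

    Assumption (not stated in the source): both inputs are in [0, 1].
    """
    for name, value in (("exec_score", exec_score), ("judge_score", judge_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return EXEC_WEIGHT * exec_score + JUDGE_WEIGHT * judge_score


if __name__ == "__main__":
    # Hypothetical episode: all tests pass, the judge rates the fix 0.5.
    print(round(blend_reward(1.0, 0.5), 3))
```

Because execution carries the larger weight, a fix that fails every test can score at most 0.4 under this blend, however well the judge rates it.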

Auto-refreshes every 3s | Escalate threshold: 0.8 | Window: 5
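One way the escalation rule could work with these settings is sketched below. The escalate threshold (0.8), the window (5), and the four level names come from the dashboard. Everything else is an assumption for illustration: the class layout, the `drop_threshold` of 0.4, and clearing the window after a level change. Streak tracking and writing progress.json are left out.

```python
from collections import deque

LEVELS = ["easy", "medium", "hard", "extreme"]


class DifficultyEscalator:
    """Rolling-average difficulty controller (illustrative sketch)."""

    def __init__(self, threshold=0.8, window=5, drop_threshold=0.4, level="easy"):
        self.threshold = threshold            # from the dashboard footer
        self.window = window                  # from the dashboard footer
        self.drop_threshold = drop_threshold  # hypothetical value
        self.index = LEVELS.index(level)
        self.rewards = deque(maxlen=window)   # keeps only the last `window` rewards

    @property
    def level(self) -> str:
        return LEVELS[self.index]

    def record(self, reward: float) -> str:
        """Add one episode reward and return 'escalate', 'stay', or 'drop'."""
        self.rewards.append(reward)
        if len(self.rewards) < self.window:
            return "stay"  # not enough episodes at this level yet
        avg = sum(self.rewards) / len(self.rewards)
        if avg >= self.threshold and self.index < len(LEVELS) - 1:
            self.index += 1
            self.rewards.clear()  # assumption: start a fresh window per level
            return "escalate"
        if avg < self.drop_threshold and self.index > 0:
            self.index -= 1
            self.rewards.clear()
            return "drop"
        return "stay"


if __name__ == "__main__":
    escalator = DifficultyEscalator()
    for reward in [0.9, 0.85, 0.8, 0.95, 0.9]:
        decision = escalator.record(reward)
    print(decision, escalator.level)
```

Clearing the window after each change means the agent needs five fresh episodes at the new level before it can move again, which avoids escalating twice on the same run of rewards.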