Post-training methods (RLVR, On-policy distillation) are Episode-local Language models are getting better at learning from feedback during post-training. In reinforcement learning with verifiable rewards (RLVR), a model tries a problem, a verifier checks…
Travel has always been a story of how much friction we’re willing to tolerate to get somewhere. A century ago, a long-distance journey began at a train station thick with smoke and shouting…
What if launching an RL training run felt less like operating a GPU cluster and more like talking to a sharp research engineer in Slack? Training AI models is still strangely artisanal, involving…