Blog
Sporadic posts about machine learning, math, and things I find interesting.
-
Mar 27, 2026
Length-controlled Baselines in Reinforcement Learning with Language Models
We propose a modification to the GRPO reward baseline that may lead to more stable training of reasoning models, supported by initial empirical evidence.
-
Mar 26, 2026
From Decision Theories to Multi-Agent Reinforcement Learning
Notes on rational decision-making in predictor environments and how it relates to RL.
-
Mar 24, 2026
Newcomb's Paradox, Free Will and Superintelligence
What could Newcomb's Paradox, with its premise of near-perfect prediction of human behavior, mean for alignment?