Blog
Sporadic posts about machine learning, math, and things I find interesting.
-
Mar 27, 2026
Length-controlled Baselines in Reinforcement Learning with Language Models
We propose a modification to the GRPO reward baseline that may lead to more stable training of reasoning models, supported by initial empirical evidence.
-
Mar 26, 2026
From Decision Theories to Multi-Agent Reinforcement Learning
Notes on rational decision-making in predictor environments and how it relates to RL.
-
Mar 24, 2026
Newcomb's Paradox, Free Will and Superintelligence
What could Newcomb's Paradox, with its premise of near-perfect prediction of human behavior, mean for alignment?