Test-time training (TTT) adapts model parameters on unlabeled test instances to extend reasoning capabilities beyond the original training distribution. Despite initial gains, existing TTT methods for large reasoning models (LRMs) consistently plateau and fail to scale with additional test-time compute. We trace this failure to a single structural cause: prior methods optimize the policy against a self-generated reward signal that drifts without external calibration, resulting in both performance plateaus and output diversity collapse.
We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a small labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods are incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the ELBO and enables sustained improvement.
TEMPO consistently outperforms all baselines across model scales and reasoning benchmarks, while maintaining high output diversity where baselines collapse.
[Figure: (a) Mean@16 accuracy over training steps (OLMO3-7B); (b) TEMPO breaks beyond the converged RLVR baseline (OLMO3-7B); (c) TEMPO preserves diversity while baselines collapse; (d) OLMO3-7B-Base average accuracy across 3 benchmarks.]
TEMPO treats response correctness as a latent variable and maximizes the Evidence Lower Bound (ELBO) by alternating between an E-step and M-step — an actor-critic design grounded in the EM algorithm.
\(\mathcal{L}(q,\theta) = \sum_{x}\sum_y q(y|x)\log\frac{P(\text{Correct}|y,x)\,\pi_\theta(y|x)}{q(y|x)}\)
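The alternating schedule can be sketched in a few lines. This is an illustrative skeleton, not the paper's implementation: `policy_update` and `critic_update` are hypothetical placeholders standing in for the actual PPO-style M-step and the supervised E-step, and `recalib_every` is an assumed hyperparameter controlling how often the critic is re-grounded.

```python
def policy_update(policy, critic, batch):
    # Placeholder M-step: one policy-gradient refinement pass on unlabeled data.
    return policy

def critic_update(critic, batch):
    # Placeholder E-step: supervised recalibration on the small labeled set.
    return critic

def tempo_loop(policy, critic, unlabeled, labeled, steps, recalib_every):
    """Alternate M-steps (policy refinement) with periodic E-steps (critic recalibration)."""
    schedule = []
    for t in range(steps):
        # M-step: refine the policy against critic-derived rewards on D_u.
        policy = policy_update(policy, critic, unlabeled)
        schedule.append("M")
        # E-step: periodically recalibrate the critic on D_L, tightening the
        # ELBO and preventing the reward signal from drifting.
        if (t + 1) % recalib_every == 0:
            critic = critic_update(critic, labeled)
            schedule.append("E")
    return policy, critic, schedule
```

Omitting the `if` branch entirely recovers the degenerate M-step-only loop that the analysis attributes to prior methods.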
Since the critic is trained to predict outcome correctness, its last-token value \(V_\phi(x, y_T)\) directly reflects the likelihood of a correct response. Periodic recalibration on \(\mathcal{D}_L\) grounds the reward signal in external supervision, preventing drift as the policy evolves.
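One plausible form of that recalibration objective (an assumption, not the paper's exact loss) treats the critic's last-token value as a probability of correctness and fits it with binary cross-entropy against the labels in \(\mathcal{D}_L\):

```python
import math

def recalibration_loss(values, labels, eps=1e-8):
    """Mean binary cross-entropy between critic values v in (0, 1) at the final
    token and correctness labels c in {0, 1} from the labeled set."""
    total = 0.0
    for v, c in zip(values, labels):
        total += -(c * math.log(v + eps) + (1 - c) * math.log(1 - v + eps))
    return total / len(values)
```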
The actor generates reasoning trajectories on unlabeled test questions \(\mathcal{D}_u\) and is optimized via policy gradient using critic-derived token-level advantages. Tokens that raise the critic's predicted value receive positive reinforcement; tokens that lower it are penalized.
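One simple way to realize such token-level credit assignment (a sketch under assumed choices, not necessarily the paper's estimator) is a one-step bootstrapped difference of critic values, i.e. GAE with \(\gamma = \lambda = 1\): each token's advantage is how much it changed the critic's predicted success probability.

```python
def token_advantages(values, final_reward):
    """values[t] is the critic's estimate after token t; final_reward is the
    critic-derived terminal reward in [0, 1]. Each token is credited with the
    change in predicted value it induced."""
    targets = values[1:] + [final_reward]  # bootstrap each step from the next value
    # Positive advantage: the token raised the predicted chance of a correct answer.
    return [v_next - v for v, v_next in zip(values, targets)]
```

Because the estimate telescopes, the advantages sum to `final_reward - values[0]`, so total credit is conserved across the trajectory.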
Viewed through the EM lens, prior TTT methods are degenerate variants that execute only the M-step:
Self-generated reward signals are bounded by the model's initial capability. As the model grows confident in a narrow set of patterns, signals become self-reinforcing and stop providing new gradient information.
TTRL and EMPO favor the most common reasoning path regardless of quality. The model collapses to a single mode — pass@k degrades even as mean accuracy slightly improves.
Without periodic grounding, the critic's evaluations drift from true correctness as the policy evolves. The resulting mismatch introduces noise into policy gradients and stalls further improvement.
| Method | AIME24 avg@16 | AIME24 pass@8 | AIME25 avg@16 | AIME25 pass@8 | BeyondAIME |
|---|---|---|---|---|---|
| **OLMO3-7B-Base** | | | | | |
| Zero-RL (PPO) | 33.0 | 56.1 | 26.3 | 41.1 | 17.6 |
| TTRL | 40.8 | 45.6 | 27.1 | 30.7 | 21.8 |
| EMPO | 41.6 | 43.3 | 26.7 | 29.5 | 21.3 |
| TEMPO (Ours) | 51.1 | 61.6 | 37.0 | 52.5 | 24.5 |
| **Qwen3-14B-Base** | | | | | |
| Zero-RL (PPO) | 42.3 | 69.1 | 37.1 | 59.0 | 24.9 |
| TTRL | 53.1 | 56.7 | 40.8 | 45.8 | 25.5 |
| EMPO | 55.6 | 59.7 | 44.6 | 46.7 | 27.6 |
| TEMPO (Ours) | 65.8 | 73.3 | 44.6 | 60.0 | 29.3 |
TEMPO surpasses strong domain-specific frontier models on general reasoning tasks.
| Method | BigBench Hard | AGI Eval | Zebra Logic | Average |
|---|---|---|---|---|
| **Frontier Models (reference)** | | | | |
| Olmo-3-7B-RL-Zero-General | 56.5 | 51.9 | 25.7 | 44.7 |
| MiMo-Zero-RL-7B | 61.4 | 53.6 | 30.3 | 48.4 |
| **OLMO3-7B-Base** | | | | |
| Zero-RL (PPO) | 46.8 | 37.9 | 22.2 | 35.6 |
| TTRL | 45.4 | 38.2 | 22.2 | 35.3 |
| EMPO | 52.9 | 50.2 | 23.5 | 42.2 |
| TEMPO (Ours) | 68.2 | 62.4 | 35.1 | 55.2 |
| Gain via TTT | +21.4 | +24.5 | +12.9 | +19.6 |
We propose TEMPO, a test-time training framework that achieves sustained performance gains through alternating actor-critic optimization, avoiding the diversity collapse and performance plateaus of prior LRM self-training methods.
We provide a unified EM analysis characterizing existing TTT methods (TTRL, EMPO) as incomplete EM procedures that omit the crucial posterior recalibration. Identifying this missing E-step as the root cause of scalability failures yields a principled remedy.
We conduct extensive experiments across three model families (7B, 8B, 14B) and five reasoning benchmarks spanning math, logic puzzles, and STEM — demonstrating both superior accuracy and preserved output diversity.
@article{zhang2026tempo,
  title   = {TEMPO: Scaling Test-time Training for Large Reasoning Models},
  author  = {Zhang, Qingyang and Kong, Xinke and Wu, Haitao and Hu, Qinghua and Wu, Minghao and Yang, Baosong and Cheng, Yu and Luo, Yun and Cui, Ganqu and Zhang, Changqing},
  journal = {arXiv preprint arXiv:2604.19295},
  year    = {2026}
}