Test-time training (TTT) adapts model parameters on unlabeled test instances to extend reasoning capabilities beyond the original training distribution. Despite initial gains, existing TTT methods for large reasoning models (LRMs) consistently plateau and fail to scale with additional test-time compute. We trace this failure to a single structural cause: prior methods optimize the policy against a self-generated reward signal that drifts without external calibration, resulting in both performance plateaus and output diversity collapse.
We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a small labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods are incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the ELBO and enables sustained improvement.
TEMPO consistently outperforms all baselines across model scales and reasoning benchmarks, while maintaining high output diversity where baselines collapse.
[Figure: (a) Mean@16 accuracy over training steps (OLMO3-7B); (b) TEMPO breaks beyond the converged RLVR baseline (OLMO3-7B); (c) TEMPO preserves diversity while baselines collapse; (d) OLMO3-7B-Base average accuracy across 3 benchmarks.]
TEMPO treats response correctness as a latent variable and maximizes the Evidence Lower Bound (ELBO) by alternating between an E-step and M-step — an actor-critic design grounded in the EM algorithm.
\(\mathcal{L}(q,\theta) = \sum_{x}\sum_y q(y|x)\log\frac{P(\text{Correct}|y,x)\,\pi_\theta(y|x)}{q(y|x)}\)
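The alternating schedule can be sketched in a few lines. This is an illustrative skeleton, not the paper's implementation: `policy_update` and `critic_update` are hypothetical placeholders standing in for the actual PPO-style M-step and the supervised E-step, and `recalib_every` is an assumed hyperparameter controlling how often the critic is re-grounded.

```python
def policy_update(policy, critic, batch):
    # Placeholder M-step: one policy-gradient refinement pass on unlabeled data.
    return policy

def critic_update(critic, batch):
    # Placeholder E-step: supervised recalibration on the small labeled set.
    return critic

def tempo_loop(policy, critic, unlabeled, labeled, steps, recalib_every):
    """Alternate M-steps (policy refinement) with periodic E-steps (critic recalibration)."""
    schedule = []
    for t in range(steps):
        # M-step: refine the policy against critic-derived rewards on D_u.
        policy = policy_update(policy, critic, unlabeled)
        schedule.append("M")
        # E-step: periodically recalibrate the critic on D_L, tightening the
        # ELBO and preventing the reward signal from drifting.
        if (t + 1) % recalib_every == 0:
            critic = critic_update(critic, labeled)
            schedule.append("E")
    return policy, critic, schedule
```

Omitting the `if` branch entirely recovers the degenerate M-step-only loop that the analysis attributes to prior methods.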
Since the critic is trained to predict outcome correctness, its last-token value \(V_\phi(x, y_T)\) directly reflects the likelihood of a correct response. Periodic recalibration on \(\mathcal{D}_L\) grounds the reward signal in external supervision, preventing drift as the policy evolves.
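One plausible form of that recalibration objective (an assumption, not the paper's exact loss) treats the critic's last-token value as a probability of correctness and fits it with binary cross-entropy against the labels in \(\mathcal{D}_L\):

```python
import math

def recalibration_loss(values, labels, eps=1e-8):
    """Mean binary cross-entropy between critic values v in (0, 1) at the final
    token and correctness labels c in {0, 1} from the labeled set."""
    total = 0.0
    for v, c in zip(values, labels):
        total += -(c * math.log(v + eps) + (1 - c) * math.log(1 - v + eps))
    return total / len(values)
```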
The actor generates reasoning trajectories on unlabeled test questions \(\mathcal{D}_u\) and is optimized via policy gradient using critic-derived token-level advantages. Tokens that raise the critic's predicted value receive positive reinforcement; tokens that lower it are penalized.
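One simple way to realize such token-level credit assignment (a sketch under assumed choices, not necessarily the paper's estimator) is a one-step bootstrapped difference of critic values, i.e. GAE with \(\gamma = \lambda = 1\): each token's advantage is how much it changed the critic's predicted success probability.

```python
def token_advantages(values, final_reward):
    """values[t] is the critic's estimate after token t; final_reward is the
    critic-derived terminal reward in [0, 1]. Each token is credited with the
    change in predicted value it induced."""
    targets = values[1:] + [final_reward]  # bootstrap each step from the next value
    # Positive advantage: the token raised the predicted chance of a correct answer.
    return [v_next - v for v, v_next in zip(values, targets)]
```

Because the estimate telescopes, the advantages sum to `final_reward - values[0]`, so total credit is conserved across the trajectory.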
Viewed through the EM lens, prior TTT methods are degenerate variants that execute only the M-step:
Self-generated reward signals are bounded by the model's initial capability. As the model grows confident in a narrow set of patterns, signals become self-reinforcing and stop providing new gradient information.
TTRL and EMPO favor the most common reasoning path regardless of quality. The model collapses to a single mode — pass@k degrades even as mean accuracy slightly improves.
Without periodic grounding, the critic's evaluations drift from true correctness as the policy evolves. The resulting mismatch introduces noise into policy gradients and stalls further improvement.
| Method | AIME24 avg@16 | AIME24 pass@8 | AIME25 avg@16 | AIME25 pass@8 | BeyondAIME |
|---|---|---|---|---|---|
| **OLMO3-7B-Base** | | | | | |
| Zero-RL (PPO) | 33.0 | 56.1 | 26.3 | 41.1 | 17.6 |
| TTRL | 40.8 | 45.6 | 27.1 | 30.7 | 21.8 |
| EMPO | 41.6 | 43.3 | 26.7 | 29.5 | 21.3 |
| TEMPO (Ours) | 51.1 | 61.6 | 37.0 | 52.5 | 24.5 |
| **Qwen3-14B-Base** | | | | | |
| Zero-RL (PPO) | 42.3 | 69.1 | 37.1 | 59.0 | 24.9 |
| TTRL | 53.1 | 56.7 | 40.8 | 45.8 | 25.5 |
| EMPO | 55.6 | 59.7 | 44.6 | 46.7 | 27.6 |
| TEMPO (Ours) | 65.8 | 73.3 | 44.6 | 60.0 | 29.3 |
TEMPO surpasses strong domain-specific frontier models on general reasoning tasks.
| Method | BigBench Hard | AGI Eval | Zebra Logic | Average |
|---|---|---|---|---|
| **Frontier Models (reference)** | | | | |
| Olmo-3-7B-RL-Zero-General | 56.5 | 51.9 | 25.7 | 44.7 |
| MiMo-Zero-RL-7B | 61.4 | 53.6 | 30.3 | 48.4 |
| **OLMO3-7B-Base** | | | | |
| Zero-RL (PPO) | 46.8 | 37.9 | 22.2 | 35.6 |
| TTRL | 45.4 | 38.2 | 22.2 | 35.3 |
| EMPO | 52.9 | 50.2 | 23.5 | 42.2 |
| TEMPO (Ours) | 68.2 | 62.4 | 35.1 | 55.2 |
| Gain via TTT | +21.4 | +24.5 | +12.9 | +19.6 |
We propose TEMPO, a test-time training framework that achieves sustained performance gains through alternating actor-critic optimization, avoiding the diversity collapse and performance plateaus of prior LRM self-training methods.
We provide a unified EM analysis characterizing existing TTT methods (TTRL, EMPO) as incomplete EM procedures that omit the crucial posterior recalibration. Identifying this missing E-step as the root cause of scalability failures yields a principled remedy.
We conduct extensive experiments across three model families (7B, 8B, 14B) and five reasoning benchmarks spanning math, logic puzzles, and STEM — demonstrating both superior accuracy and preserved output diversity.
@article{zhang2026tempo,
  title   = {TEMPO: Scaling Test-time Training for Large Reasoning Models},
  author  = {Zhang, Qingyang and Kong, Xinke and Wu, Haitao and Hu, Qinghua and Wu, Minghao and Yang, Baosong and Cheng, Yu and Luo, Yun and Cui, Ganqu and Zhang, Changqing},
  journal = {arXiv preprint arXiv:2604.19295},
  year    = {2026}
}