Preprint · 2026.04

TEMPO: Scaling Test-time Training for LRMs

Tianjin University  ·  Tongyi Lab  ·  Shanghai AI Lab

Work done during an internship at Shanghai AI Lab

Why do existing TTT methods plateau?

TL;DR: Prior test-time training methods for LRMs plateau because they skip critic calibration. TEMPO fixes this by alternating between critic recalibration (E-step) and policy refinement (M-step) under an EM framework, achieving sustained scaling with additional test-time compute.

Test-time training (TTT) adapts model parameters on unlabeled test instances to extend reasoning capabilities beyond the original training distribution. Despite initial gains, existing TTT methods for large reasoning models (LRMs) consistently plateau and fail to scale with additional test-time compute. We trace this failure to a single structural cause: prior methods optimize the policy against a self-generated reward signal that drifts without external calibration, resulting in both performance plateaus and output diversity collapse.

We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a small labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods are incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the ELBO and enables sustained improvement.

State-of-the-art across all benchmarks

TEMPO consistently outperforms all baselines across model scales and reasoning benchmarks, while maintaining high output diversity where baselines collapse.

65.8%  ↑ +23.5 pp
Qwen3-14B on AIME 2024 (from 42.3%)

53.9%  ↑ +21.4 pp
OLMO3-7B on AIME 2024 (from 32.5%)

55.2%  ↑ +19.6 pp
OLMO3-7B average on General Reasoning

73.3 pass@8  vs. 56.7 (TTRL)
Qwen3-14B on AIME 2024

Scalability on AIME 2024

Mean@16 accuracy over training steps (OLMO3-7B)

RLVR Ceiling vs. Test-time Training

TEMPO breaks beyond the converged RLVR baseline (OLMO3-7B)

Output Diversity: pass@8 Comparison

TEMPO preserves diversity; baselines collapse

Generalization to Non-Math Reasoning

OLMO3-7B-Base average accuracy across 3 benchmarks

The TEMPO Framework

TEMPO treats response correctness as a latent variable and maximizes the Evidence Lower Bound (ELBO) by alternating between an E-step and M-step — an actor-critic design grounded in the EM algorithm.

\(\mathcal{L}(q,\theta) = \sum_{x}\sum_y q(y|x)\log\frac{P(\text{Correct}|y,x)\,\pi_\theta(y|x)}{q(y|x)}\)
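The ELBO above becomes tight exactly when \(q\) matches the posterior over responses given correctness. This is the standard EM identity, written here in the notation above; in TEMPO's E-step, the critic \(V_\phi\) plays the role of a learned estimate of \(P(\text{Correct}\mid y,x)\):

```latex
% E-step: setting q to the normalized posterior makes the bound tight,
% so the ELBO equals the log-evidence exactly:
q^*(y \mid x) = \frac{P(\text{Correct}\mid y,x)\,\pi_\theta(y\mid x)}
                     {\sum_{y'} P(\text{Correct}\mid y',x)\,\pi_\theta(y'\mid x)},
\qquad
\mathcal{L}(q^*,\theta) = \sum_{x} \log \sum_{y} P(\text{Correct}\mid y,x)\,\pi_\theta(y\mid x).
```

Skipping this step leaves a growing Jensen gap as \(\pi_\theta\) drifts away from the distribution the reward signal was calibrated on.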

STAGE 1 · ONCE
RLVR pre-training: PPO on labeled data 𝒟_L.

STAGE 2 · REPEATING
E-step (critic recalibration): q(y|x) ∝ V_φ · π_θ₀, using labeled data 𝒟_L.
M-step (policy refinement): A_t = V_φ(y_T) − V_φ(y_{1:t}), using unlabeled data 𝒟_u.

Alternating E/M steps tighten the ELBO, enabling sustained self-improvement beyond the RLVR ceiling.
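The two-stage procedure can be sketched as a plain control loop. This is an illustrative skeleton only; all function names are hypothetical placeholders for the paper's stages, not the authors' API:

```python
def tempo_loop(rlvr_pretrain, recalibrate_critic, refine_policy,
               labeled, unlabeled, rounds=3):
    """Stage 1 runs once; Stage 2 alternates E- and M-steps.

    `rlvr_pretrain`, `recalibrate_critic`, and `refine_policy` are
    caller-supplied training routines (hypothetical names)."""
    rlvr_pretrain(labeled)              # Stage 1: RLVR (PPO) on D_L
    for _ in range(rounds):             # Stage 2: repeating
        recalibrate_critic(labeled)     # E-step: ground the critic in D_L
        refine_policy(unlabeled)        # M-step: refine the policy on D_u
```

The point of the skeleton is the ordering: every M-step on unlabeled data is preceded by an E-step that re-grounds the critic, which is precisely the step prior TTT methods omit.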
E-Step

Critic Recalibration

Since the critic is trained to predict outcome correctness, its last-token value \(V_\phi(x, y_T)\) directly reflects the likelihood of a correct response. Periodic recalibration on \(\mathcal{D}_L\) grounds the reward signal in external supervision, preventing drift as the policy evolves.
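Since \(V_\phi(x, y_T)\) is read as the probability of a correct response, one plausible instantiation of the recalibration objective is binary cross-entropy against the verified 0/1 labels in \(\mathcal{D}_L\). A minimal sketch, assuming this loss choice (the paper's exact objective may differ):

```python
import math

def recalibration_loss(values, labels):
    """Mean binary cross-entropy between the critic's last-token value
    V_phi(x, y_T), interpreted as P(Correct | y, x), and the verified
    0/1 correctness label from the labeled set D_L.

    Assumption: BCE is one plausible E-step objective, not a confirmed
    implementation detail of TEMPO."""
    eps = 1e-12
    total = 0.0
    for v, c in zip(values, labels):
        v = min(max(v, eps), 1.0 - eps)   # clamp for numerical safety
        total += -(c * math.log(v) + (1 - c) * math.log(1 - v))
    return total / len(values)
```

A well-calibrated critic (high value on correct responses, low on incorrect ones) drives this loss toward zero, which is what keeps the reward signal anchored to external supervision.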

M-Step

Policy Refinement

The actor generates reasoning trajectories on unlabeled test questions \(\mathcal{D}_u\) and is optimized via policy gradient with critic-derived token-level advantages. Tokens that raise the critic's predicted value receive positive reinforcement; tokens that lower it are penalized.
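The advantage formula from the overview, \(A_t = V_\phi(y_T) - V_\phi(y_{1:t})\), reduces to a one-liner over the critic's per-prefix values. A minimal sketch, assuming `values[t]` holds the critic's value for the prefix \(y_{1:t+1}\) (an indexing convention chosen here for illustration):

```python
def token_advantages(values):
    """Token-level advantages A_t = V(y_T) - V(y_{1:t}), where
    `values` lists the critic's value after each generated token and
    `values[-1]` is the outcome value V(y_T). Illustrative sketch of
    the formula in the overview, not the authors' training code."""
    v_final = values[-1]
    return [v_final - v_t for v_t in values]
```

Prefix values below the final outcome value yield positive advantages (those tokens moved the trajectory toward a higher-value outcome), and the final token's advantage is zero by construction.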

Root Cause: the Missing E-Step

Viewed through the EM lens, prior TTT methods are degenerate variants that execute only the M-step:

📉

Performance Plateau

Self-generated reward signals are bounded by the model's initial capability. As the model grows confident in a narrow set of patterns, signals become self-reinforcing and stop providing new gradient information.

🔄

Diversity Collapse

TTRL and EMPO favor the most common reasoning path regardless of quality. The model collapses to a single mode — pass@k degrades even as mean accuracy slightly improves.

🧩

Reward Drift

Without periodic grounding, the critic's evaluations drift from true correctness as the policy evolves. The resulting mismatch introduces noise into policy gradients and stalls further improvement.

Main Results on Mathematical Reasoning

Method           AIME24 avg@16   AIME24 pass@8   AIME25 avg@16   AIME25 pass@8   BeyondAIME
OLMO3-7B-Base
Zero-RL (PPO)         33.0            56.1            26.3            41.1           17.6
TTRL                  40.8            45.6            27.1            30.7           21.8
EMPO                  41.6            43.3            26.7            29.5           21.3
TEMPO (Ours)          51.1            61.6            37.0            52.5           24.5
Qwen3-14B-Base
Zero-RL (PPO)         42.3            69.1            37.1            59.0           24.9
TTRL                  53.1            56.7            40.8            45.8           25.5
EMPO                  55.6            59.7            44.6            46.7           27.6
TEMPO (Ours)          65.8            73.3            44.6            60.0           29.3

Generalization to Non-Math Reasoning

TEMPO surpasses strong domain-specific frontier models on general reasoning tasks.

Method                        BigBench Hard   AGIEval   ZebraLogic   Average
Frontier Models (reference)
Olmo-3-7B-RL-Zero-General          56.5         51.9       25.7        44.7
MiMo-Zero-RL-7B                    61.4         53.6       30.3        48.4
OLMO3-7B-Base
Zero-RL (PPO)                      46.8         37.9       22.2        35.6
TTRL                               45.4         38.2       22.2        35.3
EMPO                               52.9         50.2       23.5        42.2
TEMPO (Ours)                       68.2         62.4       35.1        55.2
Gain via TTT                      +21.4        +24.5      +12.9       +19.6

What we contribute

A diagnosis: existing TTT methods for LRMs plateau and lose output diversity because they optimize against a self-generated reward signal that drifts without external calibration.

An EM formalization of test-time training that reveals prior methods as incomplete variants omitting the E-step (critic recalibration), and shows that reintroducing it tightens the ELBO.

TEMPO, an alternating E/M framework that sustains scaling with additional test-time compute, achieving state-of-the-art results on mathematical and general reasoning benchmarks while preserving pass@k diversity.

BibTeX

@article{zhang2026tempo,
  title   = {TEMPO: Scaling Test-time Training for Large Reasoning Models},
  author  = {Zhang, Qingyang and Kong, Xinke and Wu, Haitao and Hu, Qinghua and Wu, Minghao and Yang, Baosong and Cheng, Yu and Luo, Yun and Cui, Ganqu and Zhang, Changqing},
  journal = {arXiv preprint arXiv:2604.19295},
  year    = {2026}
}