Parallel Token Prediction for Language Models

ICLR '26
*Equal Contribution
AR vs PTP comparison

By the time our PTP model (right) has generated an entire function, its autoregressive teacher (left) is still busy with the function header. Green tokens are generated during one PTP call; gray tokens show the remaining steps the autoregressive model would still need.

TL;DR

PTP is a general-purpose framework for predicting multiple tokens in a single model call. Instead of sampling post-hoc, PTP feeds the source of randomness (auxiliary variables) directly into the model, making future tokens deterministic and thus jointly predictable.

Abstract

Autoregressive decoding in language models is inherently slow, generating only one token per forward pass. We propose Parallel Token Prediction (PTP), a general-purpose framework for predicting multiple tokens in a single model call. PTP moves the source of randomness from post-hoc sampling to random input variables, making future tokens deterministic functions of those inputs and thus jointly predictable in a single forward pass. We prove that a single PTP call can represent arbitrary dependencies between tokens. PTP is trained by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, PTP achieves a 2.4× wall-clock speedup on a diverse-task speculative decoding benchmark.

How It Works

Parallel Token Prediction

Classical autoregressive decoding generates one token per forward pass: each token must be sampled before the next can be predicted. PTP breaks this dependency by introducing auxiliary uniform random variables u1, …, un as additional model inputs. Since each ui uniquely determines the sampled token ti, the model can predict all future tokens jointly in a single forward pass. We prove that this holds for arbitrary dependencies between tokens, making PTP as expressive as any autoregressive model.
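The key mechanism is standard inverse-CDF sampling: a uniform variate u picks exactly one token from a distribution. A minimal sketch (not the paper's code; `sample_token` is a hypothetical helper) of why fixing u makes the sampled token deterministic:

```python
import numpy as np

def sample_token(probs, u):
    """Inverse-CDF sampling: the uniform u in [0, 1) uniquely
    selects one token index from the distribution `probs`."""
    cdf = np.cumsum(probs)
    return int(np.searchsorted(cdf, u))

# With u fixed, the sampled token is a deterministic function of the
# distribution -- so a model that receives u as an input can, in
# principle, predict the token directly instead of sampling post-hoc.
probs = np.array([0.1, 0.6, 0.3])
u = 0.5
assert sample_token(probs, u) == sample_token(probs, u)  # deterministic given u
```

PTP exploits exactly this determinism: feeding u1, …, un into the model turns every future token into a function of the inputs, so all of them can be predicted in one call.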

PTP sampling diagram

Partial Quadratic Decoding

To produce text identical to a base model, PTP is used as a draft model in speculative decoding: it proposes several tokens in one call and uses the base model to verify them. We introduce Partial Quadratic Decoding — a strategy that allows for token proposal and verification in parallel with a single model call, leveraging PTP’s confidence estimates to allocate compute efficiently and maximize accepted tokens per step.
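For context, the verification step follows the standard speculative-decoding acceptance rule; the sketch below shows that rule only, not the paper's Partial Quadratic Decoding, which additionally overlaps proposal and verification in a single call. All names here are illustrative assumptions:

```python
from typing import Dict, List

def verify_draft(
    draft_tokens: List[int],
    draft_probs: List[Dict[int, float]],   # q: draft model's distribution per step
    target_probs: List[Dict[int, float]],  # p: base model's distribution per step
    u_vals: List[float],                   # one uniform variate per drafted token
) -> List[int]:
    """Accept drafted tokens left to right, stopping at the first rejection.
    Accepting token t with probability min(1, p[t]/q[t]) makes the final
    output distribution match the base model exactly."""
    accepted = []
    for t, q, p, u in zip(draft_tokens, draft_probs, target_probs, u_vals):
        if u < min(1.0, p.get(t, 0.0) / q[t]):
            accepted.append(t)
        else:
            break
    return accepted
```

The more tokens survive verification per step, the larger the wall-clock speedup, which is why the paper's strategy uses PTP's confidence estimates to decide where to spend draft compute.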

Partial Quadratic Decoding

Why Auxiliary Variables Matter

Prior approaches that predict tokens independently are fundamentally limited: later tokens are averaged over incompatible earlier choices, producing spurious combinations like def numpy or import find. By conditioning on auxiliary variables, PTP natively coordinates all predictions and yields coherent token sequences.
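A toy illustration (not from the paper) of the failure mode and the fix. Independent per-position marginals produce the cross product of choices, including invalid pairs like the `def numpy` example above; a shared random input selects one joint sequence, so only valid pairs appear:

```python
import itertools

# Toy two-token language: the only valid continuations are
# "def main" and "import numpy".
valid = {("def", "main"), ("import", "numpy")}

# Independent (factorized) prediction: each position's marginal is
# sampled separately, so spurious cross products appear.
first_marginal = ["def", "import"]
second_marginal = ["main", "numpy"]
independent = set(itertools.product(first_marginal, second_marginal))
spurious = independent - valid  # e.g. ("def", "numpy")

# Coordinated prediction: one shared auxiliary input u selects a
# *joint* sequence, so every output is a valid combination.
def coordinated(u):
    return ("def", "main") if u < 0.5 else ("import", "numpy")

assert all(coordinated(u) in valid for u in (0.1, 0.9))
```

Conditioning on the auxiliary variables plays the role of `u` here: it couples the positions, which is what lets PTP emit coherent multi-token sequences.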

Independent vs coordinated prediction

Results

We finetune a Vicuna-7B student using a gated LoRA adapter on ShareGPT data and evaluate on SpecBench, a diverse benchmark spanning multi-turn conversation, translation, summarization, question answering, math, and retrieval-augmented generation. Our O-PTP variant achieves the best overall wall-clock speedup of 2.4×, accepting 4.2 tokens per step.

@inproceedings{draxler2026parallel,
  title={Parallel Token Prediction for Language Models},
  author={Felix Draxler and Justus Will and Farrin Marouf Sofian and Theofanis Karaletsos and Sameer Singh and Stephan Mandt},
  booktitle={International Conference on Learning Representations},
  year={2026}
}