PTP is a general-purpose framework for predicting multiple tokens in a single model call. Instead of sampling post-hoc, PTP feeds the source of randomness (auxiliary variables) directly into the model, making future tokens deterministic and thus jointly predictable.
Autoregressive decoding in language models is inherently slow, generating only one token per forward pass. We propose Parallel Token Prediction (PTP), a general-purpose framework for predicting multiple tokens in a single model call. PTP moves the source of randomness from post-hoc sampling to random input variables, making future tokens deterministic functions of those inputs and thus jointly predictable in a single forward pass. We prove that a single PTP call can represent arbitrary dependencies between tokens. PTP is trained by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, PTP achieves a 2.4× wall-clock speedup on a diverse-task speculative decoding benchmark.
Classical autoregressive decoding generates one token per forward pass: each token must be sampled before the next can be predicted. PTP breaks this dependency by introducing auxiliary uniform random variables u1, …, un as additional model inputs. Since each ui, together with the context, uniquely determines the sampled token ti, the model can predict all future tokens jointly in a single forward pass. We prove that this holds for arbitrary dependencies between tokens: PTP is as expressive as any autoregressive model.
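A minimal sketch of how an auxiliary uniform variable can pin down a token. The paper's exact construction is not given here; this assumes the standard inverse-CDF mapping, where a fixed u always selects the same token from a fixed next-token distribution:

```python
import numpy as np

def inverse_cdf_sample(probs, u):
    """Map a uniform u in [0, 1) to a token index via the inverse CDF.

    For a fixed next-token distribution, the same u always yields the
    same token, so feeding u as an input makes the sampled token a
    deterministic function of the model's inputs.
    """
    cdf = np.cumsum(probs)
    return int(np.searchsorted(cdf, u, side="right"))

probs = np.array([0.1, 0.6, 0.3])  # toy next-token distribution
print(inverse_cdf_sample(probs, 0.05))  # falls in the first bucket -> 0
print(inverse_cdf_sample(probs, 0.50))  # falls in the second bucket -> 1
print(inverse_cdf_sample(probs, 0.95))  # falls in the third bucket -> 2
```

Sampling a fresh u per position recovers ordinary ancestral sampling; fixing u1, …, un up front is what lets all n tokens be predicted jointly.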
To produce text identical to a base model, PTP is used as a draft model in speculative decoding: it proposes several tokens in one call and uses the base model to verify them. We introduce Partial Quadratic Decoding, a strategy that proposes and verifies tokens in parallel within a single model call, leveraging PTP's confidence estimates to allocate compute efficiently and maximize accepted tokens per step.
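For context, the standard speculative-decoding verification step (not Partial Quadratic Decoding itself, which additionally overlaps proposal and verification in one call) can be sketched as below. All names are illustrative; `q` and `p` are the draft and base-model distributions at each position:

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(draft_tokens, draft_probs, target_probs):
    """Standard speculative-sampling verification (a sketch).

    draft_probs[i] / target_probs[i]: vocabulary distributions at
    position i under the draft and base model. Accepts each drafted
    token with probability min(1, p[t]/q[t]); on the first rejection,
    resamples from the residual distribution and stops. The accepted
    sequence is distributed exactly as base-model samples.
    """
    accepted = []
    for t, q, p in zip(draft_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break
    return accepted
```

The more tokens the draft gets right, the more forward passes of the base model are amortized per verification call, which is where PTP's joint (rather than independent) proposals pay off.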
Prior approaches that predict tokens independently are fundamentally limited: later tokens are averaged over incompatible earlier choices, producing spurious combinations like def numpy or import find. By conditioning on auxiliary variables, PTP natively coordinates all predictions and yields coherent token sequences.
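The failure mode above can be reproduced with a two-token toy example (the joint distribution here is invented for illustration): predicting each position from its marginal assigns probability to combinations the true joint never produces.

```python
# Toy joint distribution over two code tokens: the intended outputs are
# "import numpy" or "def find", never a mix of the two.
joint = {("import", "numpy"): 0.5, ("def", "find"): 0.5}

# Independent per-position prediction only sees the marginals ...
p_first = {w: sum(p for (a, _), p in joint.items() if a == w)
           for w in ("import", "def")}
p_second = {w: sum(p for (_, b), p in joint.items() if b == w)
            for w in ("numpy", "find")}

# ... so a spurious combination gets substantial probability:
p_def_numpy = p_first["def"] * p_second["numpy"]
print(joint.get(("def", "numpy"), 0.0))  # 0.0 under the true joint
print(p_def_numpy)                       # 0.25 under independent prediction
```

Conditioning every position on the same auxiliary variables removes this averaging: a given draw of u1, …, un commits all positions to one consistent continuation.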
We finetune a Vicuna-7B student using a gated LoRA adapter on ShareGPT data and evaluate on SpecBench, a diverse benchmark spanning multi-turn conversation, translation, summarization, question answering, math, and retrieval-augmented generation. O-PTP achieves the best overall wall-clock speedup (2.4×), accepting 4.2 tokens per step.
@inproceedings{draxler2026parallel,
  title={Parallel Token Prediction for Language Models},
  author={Felix Draxler and Justus Will and Farrin Marouf Sofian and Theofanis Karaletsos and Sameer Singh and Stephan Mandt},
  booktitle={International Conference on Learning Representations},
  year={2026}
}