Parallel Token Prediction for Language Models

ICLR '26
*Equal Contribution
AR vs PTP comparison

By the time our PTP model (right) has generated an entire function, its autoregressive teacher (left) is still busy with the function header. Green tokens are generated during one PTP call; gray tokens show the remaining steps the autoregressive model would still need.

TL;DR

PTP is a general-purpose framework for predicting multiple tokens in a single model call. Instead of sampling post-hoc, PTP feeds the source of randomness (auxiliary variables) directly into the model, making future tokens deterministic and thus jointly predictable.

Abstract

Autoregressive decoding in language models is inherently slow, generating only one token per forward pass. We propose Parallel Token Prediction (PTP), a general-purpose framework for predicting multiple tokens in a single model call. PTP moves the source of randomness from post-hoc sampling to random input variables, making future tokens deterministic functions of those inputs and thus jointly predictable in a single forward pass. We prove that a single PTP call can represent arbitrary dependencies between tokens. PTP is trained by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, PTP achieves a 2.4× wall-clock speedup on a diverse-task speculative decoding benchmark.

How It Works

Parallel Token Prediction

Classical autoregressive decoding generates one token per forward pass: each token must be sampled before the next can be predicted. PTP breaks this dependency by introducing auxiliary uniform random variables u1, …, un as additional model inputs. Since each ui uniquely determines the sampled token ti, the model can predict all future tokens jointly in a single forward pass. We prove that this holds for arbitrary dependencies between tokens, making PTP as expressive as any autoregressive model.
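The key mechanism is standard inverse-CDF sampling: a uniform variate u picks exactly one token from a distribution. A minimal sketch (not the paper's code; `sample_token` is a hypothetical helper) of why fixing u makes the sampled token deterministic:

```python
import numpy as np

def sample_token(probs, u):
    """Inverse-CDF sampling: the uniform u in [0, 1) uniquely
    selects one token index from the distribution `probs`."""
    cdf = np.cumsum(probs)
    return int(np.searchsorted(cdf, u))

# With u fixed, the sampled token is a deterministic function of the
# distribution -- so a model that receives u as an input can, in
# principle, predict the token directly instead of sampling post-hoc.
probs = np.array([0.1, 0.6, 0.3])
u = 0.5
assert sample_token(probs, u) == sample_token(probs, u)  # deterministic given u
```

PTP exploits exactly this determinism: feeding u1, …, un into the model turns every future token into a function of the inputs, so all of them can be predicted in one call.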

PTP sampling diagram

Partial Quadratic Decoding

To produce text identical to a base model, PTP is used as a draft model in speculative decoding: it proposes several tokens in one call and uses the base model to verify them. We introduce Partial Quadratic Decoding — a strategy that allows for token proposal and verification in parallel with a single model call, leveraging PTP’s confidence estimates to allocate compute efficiently and maximize accepted tokens per step.
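For context, the verification step follows the standard speculative-decoding acceptance rule; the sketch below shows that rule only, not the paper's Partial Quadratic Decoding, which additionally overlaps proposal and verification in a single call. All names here are illustrative assumptions:

```python
from typing import Dict, List

def verify_draft(
    draft_tokens: List[int],
    draft_probs: List[Dict[int, float]],   # q: draft model's distribution per step
    target_probs: List[Dict[int, float]],  # p: base model's distribution per step
    u_vals: List[float],                   # one uniform variate per drafted token
) -> List[int]:
    """Accept drafted tokens left to right, stopping at the first rejection.
    Accepting token t with probability min(1, p[t]/q[t]) makes the final
    output distribution match the base model exactly."""
    accepted = []
    for t, q, p, u in zip(draft_tokens, draft_probs, target_probs, u_vals):
        if u < min(1.0, p.get(t, 0.0) / q[t]):
            accepted.append(t)
        else:
            break
    return accepted
```

The more tokens survive verification per step, the larger the wall-clock speedup, which is why the paper's strategy uses PTP's confidence estimates to decide where to spend draft compute.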

Partial Quadratic Decoding

Why Auxiliary Variables Matter

Prior approaches that predict tokens independently are fundamentally limited: later tokens are averaged over incompatible earlier choices, producing spurious combinations like def numpy or import find. By conditioning on auxiliary variables, PTP natively coordinates all predictions and yields coherent token sequences.
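A toy illustration (not from the paper) of the failure mode and the fix. Independent per-position marginals produce the cross product of choices, including invalid pairs like the `def numpy` example above; a shared random input selects one joint sequence, so only valid pairs appear:

```python
import itertools

# Toy two-token language: the only valid continuations are
# "def main" and "import numpy".
valid = {("def", "main"), ("import", "numpy")}

# Independent (factorized) prediction: each position's marginal is
# sampled separately, so spurious cross products appear.
first_marginal = ["def", "import"]
second_marginal = ["main", "numpy"]
independent = set(itertools.product(first_marginal, second_marginal))
spurious = independent - valid  # e.g. ("def", "numpy")

# Coordinated prediction: one shared auxiliary input u selects a
# *joint* sequence, so every output is a valid combination.
def coordinated(u):
    return ("def", "main") if u < 0.5 else ("import", "numpy")

assert all(coordinated(u) in valid for u in (0.1, 0.9))
```

Conditioning on the auxiliary variables plays the role of `u` here: it couples the positions, which is what lets PTP emit coherent multi-token sequences.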

Independent vs coordinated prediction

Results

We finetune a Vicuna-7B student using a gated LoRA adapter on ShareGPT data and evaluate on SpecBench, a diverse benchmark spanning multi-turn conversation, translation, summarization, question answering, math, and retrieval-augmented generation. Our O-PTP variant achieves the best overall wall-clock speedup of 2.4×, accepting 4.2 tokens per step.

@inproceedings{draxler2026parallel,
  title={Parallel Token Prediction for Language Models},
  author={Felix Draxler and Justus Will and Farrin Marouf Sofian and Theofanis Karaletsos and Sameer Singh and Stephan Mandt},
  booktitle={International Conference on Learning Representations},
  year={2026}
}