Token Superposition Training: How Nous Research Cuts LLM Pre-Training Time by 2.5x Without Changing Architecture

The Growing Cost of Training Large Language Models

Pre-training large language models (LLMs) has become a resource-intensive endeavor. Even modest improvements in efficiency can translate into significant savings in both time and money. Nous Research has unveiled a novel method, Token Superposition Training (TST), that dramatically reduces the wall-clock time required for pre-training without altering the underlying architecture, optimizer, tokenizer, parallelism strategy, or training data. This breakthrough achieves up to a 2.5x reduction in pre-training time at a fixed compute budget, as demonstrated across models ranging from 270 million to 10 billion parameters.

The Efficiency Bottleneck in LLM Pre-Training

Modern LLM pre-training is heavily driven by data volume. Recent training regimes routinely overtrain far beyond compute-optimal estimates, making raw text throughput a crucial leverage point. Subword tokenizers like BPE already improve throughput by compressing sequences into shorter token streams, allowing models to process more text per FLOP. However, research suggests that much of BPE’s advantage over byte-level models stems simply from shorter sequences—meaning the model sees more text per unit of compute. TST asks whether this throughput lever can be pulled even further during training, independently of the tokenizer and without permanently changing the model.

Introducing Token Superposition Training (TST)

TST modifies the standard pre-training loop in two sequential phases, each designed to maximize text throughput without increasing the compute spent per step. The method rests on a simple but powerful insight: by grouping tokens into “superposed” representations, the model can ingest significantly more text per training step at the same FLOP cost. Crucially, TST requires no special hardware, custom kernels, or changes to the model architecture; it works with existing training infrastructure.

Phase 1 – Superposition: Collapsing Tokens into S-Bags

During the first fraction r of total training steps (typically r between 0.2 and 0.4), the model no longer receives individual tokens. Instead, the input sequence of length L is segmented into non-overlapping bags of s contiguous tokens. In the embedding layer, each bag is collapsed into a single latent “s-token” by averaging the embeddings of the s tokens. The transformer then processes a sequence of length L/s. To keep each TST step equal in FLOPs to a standard training step, the data sequence length is increased by a factor of s during the superposition phase. Because each latent position corresponds to s source tokens, the model ingests s times as much text per unit of compute—this is the primary driver of throughput gains.
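
To make the embedding-side collapse concrete, here is a minimal PyTorch sketch of the averaging step described above. It is an illustration of this section's description, not the authors' implementation; the function name and tensor shapes are assumptions.

```python
import torch

def superpose_embeddings(token_embeddings: torch.Tensor, s: int) -> torch.Tensor:
    """Collapse non-overlapping bags of s contiguous token embeddings into
    single latent "s-token" embeddings by averaging.

    token_embeddings: (batch, L, d_model), with L divisible by s.
    Returns:          (batch, L // s, d_model).
    """
    batch, seq_len, d_model = token_embeddings.shape
    assert seq_len % s == 0, "sequence length must be a multiple of the bag size s"
    bags = token_embeddings.view(batch, seq_len // s, s, d_model)
    return bags.mean(dim=2)  # one averaged embedding per bag

# With s = 4, a 4,096-token input becomes 1,024 latent positions, so the
# transformer blocks process a 4x shorter sequence for that step.
x = torch.randn(2, 4096, 512)
print(superpose_embeddings(x, s=4).shape)  # torch.Size([2, 1024, 512])
```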

On the output side, each latent position predicts the next bag of s tokens rather than a single next token. The standard cross-entropy loss is replaced with a multi-hot cross-entropy (MCE) loss, which assigns equal probability mass 1/s to each token in the target bag. The MCE loss reduces to a simple mean of standard cross-entropy terms over the s targets—it can be implemented using the existing fused CE kernels already present in any major pre-training library, without writing new code or adding an auxiliary head.
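
Since the MCE loss reduces to a mean of ordinary cross-entropy terms, it can be sketched in a few lines of PyTorch. Again, this is an illustrative reading of the description above rather than the paper's code; the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_hot_cross_entropy(logits: torch.Tensor, target_bags: torch.Tensor) -> torch.Tensor:
    """Mean of standard cross-entropy terms over the s tokens in each target bag,
    equivalent to a soft target placing mass 1/s on each bag member.

    logits:      (batch, L/s, vocab)  -- one prediction per latent position.
    target_bags: (batch, L/s, s)      -- the next bag of s token ids per position.
    """
    batch, num_bags, vocab = logits.shape
    s = target_bags.shape[-1]
    # Reuse each position's logits for every token in its target bag, then let
    # the ordinary (fused) cross-entropy kernel average over all of them.
    expanded = logits.unsqueeze(2).expand(batch, num_bags, s, vocab)
    return F.cross_entropy(expanded.reshape(-1, vocab), target_bags.reshape(-1))

# Tiny shape check: batch of 2, 8 latent positions, bag size 4, vocab 32k.
logits = torch.randn(2, 8, 32_000)
bags = torch.randint(0, 32_000, (2, 8, 4))
print(multi_hot_cross_entropy(logits, bags))  # scalar loss
```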

Phase 2 – Recovery: Returning to Standard Next-Token Prediction

After the superposition phase, training resumes from the superposition-phase checkpoint with standard next-token prediction for the remaining 1 - r fraction of steps. At this boundary the TST-specific logic is removed entirely, and the model continues training exactly as it would in a conventional pipeline. The recovery phase ensures that the final model matches the quality of a conventionally trained model, while still benefiting from the extra throughput gained during superposition.
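
Put together, the two phases amount to a single switch in the training schedule. The sketch below shows how such a schedule might be expressed; the helper name and the settings it yields are hypothetical stand-ins for whatever the existing training loop exposes.

```python
def tst_schedule(total_steps: int, r: float = 0.3, s: int = 4):
    """Yield per-step settings for a two-phase TST run (illustrative only).

    Steps before int(r * total_steps) use superposed inputs and the MCE loss;
    everything after is plain next-token prediction, identical to a
    conventional pre-training loop.
    """
    switch_step = int(r * total_steps)
    for step in range(total_steps):
        if step < switch_step:
            # Phase 1: bags of s tokens, s-times longer raw text per step, MCE loss.
            yield step, {"bag_size": s, "loss": "mce"}
        else:
            # Phase 2: standard next-token prediction with the usual CE loss.
            yield step, {"bag_size": 1, "loss": "ce"}

# With r = 0.3 and 100k steps, the switch back to standard training happens at step 30,000.
settings = dict(tst_schedule(100_000))
assert settings[29_999]["loss"] == "mce" and settings[30_000]["loss"] == "ce"
```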

Empirical Results: Up to 2.5x Faster Pre-Training

Nous Research tested TST across a range of model scales and architectures. At the 10B-A1B mixture-of-experts (MoE) scale, TST reached a lower final training loss than a matched-FLOPs baseline while consuming only 4,768 B200-GPU-hours—compared to the baseline’s 12,311 B200-GPU-hours. This represents roughly a 2.5x reduction in total pre-training time. Similar gains were observed across models from 270M to 10B parameters, demonstrating the method’s scalability.

The optimal fraction r was found to be between 0.2 and 0.4 across all tested scales, with higher values providing greater speedups but requiring careful tuning to avoid degradation in final model quality. The bag size s can be adjusted to balance throughput and recovery efficiency; typical values range from 2 to 4.
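
As a back-of-envelope way to see how r and s trade off, the raw text seen at a fixed FLOP budget is roughly s times the baseline during the superposition fraction r and unchanged afterwards. The small helper below computes that multiplier for the ranges reported above; it is a rough simplification of this article's description, not a figure taken from the paper.

```python
def effective_text_multiplier(r: float, s: int) -> float:
    """Approximate factor by which a TST run's raw text exceeds a standard run
    at the same FLOP budget: s-times text for the first fraction r of steps,
    1x for the remaining 1 - r (a simplification of the article's description)."""
    return r * s + (1.0 - r)

# Settings within the ranges the article reports (r in 0.2-0.4, s in 2-4):
for r in (0.2, 0.3, 0.4):
    for s in (2, 4):
        print(f"r={r:.1f}, s={s}: ~{effective_text_multiplier(r, s):.2f}x text per FLOP")
```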

Implications for Future LLM Development

TST offers a practical, drop-in solution for reducing the cost and carbon footprint of training large language models. By accelerating pre-training without necessitating changes to architectures or hyperparameters, it lowers the barrier to experimentation and deployment. The method is particularly valuable for organizations operating at scale, where even a 20% reduction in training time can translate into millions of dollars in savings. Moreover, because TST is orthogonal to other efficiency improvements—such as better parallelism strategies or more efficient attention mechanisms—it can be combined with existing optimizations for even greater gains.

As the demand for larger and more capable LLMs continues to grow, innovations like Token Superposition Training will play a crucial role in making pre-training more accessible and sustainable. The paper, available on arXiv (ID 2605.06546), provides full details for implementation and reproducibility.
