
DeepSeek's R2 and SPCT: Scaling LLM Inference with Reward Models

2026-05-03 02:27:40

DeepSeek AI has recently shared insights into a new method for scaling generalist reward models (GRMs) at inference time, while also hinting at the upcoming R2 model. This development highlights a shift from pre-training scaling toward post-training and inference-time optimization in large language models (LLMs). Below, we explore the key aspects of this work through a series of questions and answers.

What is DeepSeek's novel approach to scaling reward models during inference?

DeepSeek introduced a technique called Self-Principled Critique Tuning (SPCT) for generalist reward models (GRMs). Instead of relying on static, pre-defined reward criteria, an SPCT-trained GRM dynamically generates principles and critiques tailored to each query and response. Training combines rejective fine-tuning with rule-based online reinforcement learning so that the generated principles and critiques lead to accurate rewards; at inference time, reward quality can then be scaled further by sampling multiple sets of principles and critiques and aggregating their scores. This inference-time scaling is a departure from traditional approaches that focus on model size or training data volume, aiming instead to improve the evaluation of complex reasoning tasks through additional computational effort at test time.

Image source: syncedreview.com
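To make this concrete, here is a minimal sketch of the sampling-and-voting idea in Python. The `query_grm` function, its output format, and the score parsing are hypothetical stand-ins for illustration, not DeepSeek's actual interface or prompt format.

```python
# Minimal sketch of inference-time scaling for a generative reward model (GRM):
# sample several principle+critique generations for the same (question, answer)
# pair, parse a score from each, and aggregate by majority vote.
# `query_grm` is a hypothetical stand-in for a real model call.
import random
import re
from collections import Counter

def query_grm(question: str, answer: str) -> str:
    """Stand-in for a GRM call that generates principles, a critique, and a score."""
    score = random.randint(1, 10)  # placeholder for sampled model output
    return f"Principles: correctness, clarity.\nCritique: ...\nScore: {score}"

def parse_score(grm_output: str):
    """Extract the numeric score from the GRM's text output (format assumed)."""
    match = re.search(r"Score:\s*(\d+)", grm_output)
    return int(match.group(1)) if match else None

def scaled_reward(question: str, answer: str, k: int = 8) -> float:
    """Sample k critiques and take a simple majority vote over the parsed scores."""
    scores = [s for s in (parse_score(query_grm(question, answer)) for _ in range(k))
              if s is not None]
    if not scores:
        return 0.0
    most_common_score, _ = Counter(scores).most_common(1)[0]
    return float(most_common_score)

print(scaled_reward("What is 2 + 2?", "4", k=8))
```

Increasing `k` spends more compute per evaluation; in DeepSeek's framing, this is where extra test-time computation buys better reward quality rather than a bigger model.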

How does the shift from pre-training to inference-time scaling benefit LLMs?

Historically, LLM performance gains came from increasing model size and training data. However, models like OpenAI's o1 demonstrated that scaling at inference time—by allowing the model more thinking time and iterative reasoning—can yield significant improvements. DeepSeek's SPCT leverages this paradigm: instead of just expanding pre-training, the model uses extra computational effort during inference to refine its reasoning, explore alternatives, and correct mistakes. This approach helps LLMs overcome their inherent short-sightedness (caused by next-token prediction) by simulating long-term outcomes, leading to more robust and systematic problem-solving.
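A simple way to picture this trade-off is best-of-N sampling: generate several candidate reasoning traces and let a reward model choose among them. The sketch below is a generic illustration with hypothetical `generate_candidate` and `reward` stand-ins; it is not the exact mechanism used by o1 or DeepSeek.

```python
# Generic best-of-N sketch: more samples (n) means more inference-time compute
# spent searching over reasoning paths. Both functions below are hypothetical
# placeholders for an LLM sampling call and a reward-model score.
import random

def generate_candidate(prompt: str) -> str:
    """Stand-in for sampling one reasoning trace + answer from an LLM."""
    return f"candidate reasoning path #{random.randint(1, 10_000)} for: {prompt}"

def reward(prompt: str, candidate: str) -> float:
    """Stand-in for a reward model scoring a candidate solution."""
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    """Sample n candidates and return the one the reward model rates highest."""
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

print(best_of_n("Prove that the sum of two even numbers is even.", n=16))
```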

What is the significance of DeepSeek's R2 model announcement?

DeepSeek has signaled that the next-generation R2 model is forthcoming, building anticipation in the AI community. While details remain limited, R2 is expected to incorporate the SPCT methodology and inference-time scaling advances. This suggests that DeepSeek is moving beyond pure pre-training scaling to focus on post-training and inference optimization, aligning with the broader trend seen in models like o1 and DeepSeek's own R1 series. R2 likely aims to achieve stronger reasoning capabilities by combining a robust pre-trained foundation with enhanced reinforcement learning during inference, potentially setting a new standard for generalist LLMs.

How does reinforcement learning complement large language models?

Reinforcement learning (RL) provides LLMs with an internal world model that helps them simulate the consequences of different reasoning paths. While LLMs excel at breadth of knowledge via next-token prediction, they often lack deep planning and foresight. RL fills this gap by evaluating the quality of potential reasoning steps and selecting superior solutions. This synergy enables long-term planning and more systematic problem-solving. As Tsinghua professor Wu Yi noted, the relationship is multiplicative: RL amplifies the value of a strong pre-trained model, as decision-making optimization depends on the quality of the foundational understanding built during pre-training.
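As a rough illustration (not DeepSeek's specific method), the sketch below shows how a learned value or reward signal can add foresight to step-by-step generation: at each step the model proposes several continuations and the highest-valued one is kept. `propose_steps` and `value` are hypothetical placeholders; real systems use far more sophisticated search and training.

```python
# Value-guided reasoning sketch: an LLM proposes candidate next steps and a
# learned value model (the RL component) estimates which continuation leads to
# the best long-term outcome. Both functions are hypothetical placeholders.
import random

def propose_steps(partial_solution: list, k: int = 4) -> list:
    """Stand-in for an LLM proposing k candidate next reasoning steps."""
    return [f"candidate step {i} (after {len(partial_solution)} steps)" for i in range(k)]

def value(partial_solution: list, step: str) -> float:
    """Stand-in for a value/reward model estimating long-term solution quality."""
    return random.random()

def guided_reasoning(problem: str, max_steps: int = 5) -> list:
    """Greedy search over reasoning steps, guided by the value model."""
    solution = [f"problem: {problem}"]
    for _ in range(max_steps):
        candidates = propose_steps(solution)
        solution.append(max(candidates, key=lambda s: value(solution, s)))
    return solution

for line in guided_reasoning("schedule three tasks on two machines"):
    print(line)
```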

What did Wu Yi say about the relationship between LLMs and reinforcement learning?

Wu Yi, an assistant professor at Tsinghua's IIIS, described the relationship as a multiplicative one in a recent podcast. He explained that RL excels at decision-making but lacks inherent understanding. That understanding must come from the pre-trained model's vast knowledge. Only when a solid foundation of reasoning, memory, and logic is established during pre-training can RL fully unlock its potential to create a complete intelligent agent. This means that scaling RL without a strong base model yields limited gains. DeepSeek's SPCT and R2 are likely designed to strengthen this base, ensuring that RL can operate effectively at inference time.

How does the o1 model by OpenAI relate to DeepSeek's approach?

OpenAI's o1 model pioneered the idea of inference-time scaling by generating a lengthy internal chain of thought before responding—refining reasoning, exploring strategies, and detecting errors. DeepSeek's SPCT and R2 adopt a similar philosophy but focus more on reward model generalization. Where o1 emphasizes self-reflection through chain-of-thought, DeepSeek's method uses dynamic reward generation to guide reasoning. Both represent a shift from pre-training-centric scaling to post-training and inference optimization. This convergence suggests that the future of LLM advancement lies not just in bigger models, but in smarter use of compute during the reasoning process.
