Hi Folks,
I couldn’t wait until next week to release the newsletter because I’m too excited about DeepSeek’s new paper and have to share it with you now :)
The team at DeepSeek has developed a revolutionary approach to teaching AI systems how to reason. In this blog post, I’ll dive into what they did and how they achieved it—from the core algorithms to the final results.
Let's begin with some context.
The State of AI Reasoning
The AI world has been rapidly evolving, with companies like Anthropic, Google, and OpenAI pushing the boundaries of what's possible. One crucial aspect of this evolution has been "post-training" - refining AI models after their initial training. Think of it like graduate school for AI: after learning the basics (pre-training), these systems undergo specialized training to develop specific skills.
OpenAI made a breakthrough in this area with their "o1" series of models. Their key insight was to let the AI spend more time thinking through problems step by step - similar to how a student might work through a complex math problem by writing out each step. While this approach worked well, it left an important question unanswered: what's the best way to teach an AI system how to do this reasoning in the first place?
The Technical Innovation: Group Relative Policy Optimization (GRPO)
Before diving into DeepSeek's complete solution, we need to understand the core algorithm they developed. They call it GRPO, and it's quite clever. Here's how it works:
The Basic Setup: When the AI faces a question, it doesn't just generate one answer. Instead, it creates a group of different potential answers. Think of it like a student writing down several possible ways to solve a problem.
Scoring and Comparison: The system then looks at all these answers as a group and calculates how much better or worse each answer is compared to the average of the group. This is called the "advantage" of each answer. The formula looks like this:
Advantage = (Answer's Score - Average Group Score) / Standard Deviation of Scores
This normalization helps the system understand which approaches are truly better, regardless of how difficult the problem is.
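To make the normalization concrete, here's a minimal Python sketch. The function name and the small epsilon guard against division by zero are my own additions; the paper just defines the formula above:

```python
# A minimal sketch of GRPO-style advantage normalization, assuming
# `scores` holds one scalar reward per sampled answer in the group.
import numpy as np

def group_advantages(scores, eps=1e-8):
    scores = np.asarray(scores, dtype=np.float64)
    mean, std = scores.mean(), scores.std()
    # Each answer is judged relative to its own group:
    # better-than-average answers get positive advantages.
    return (scores - mean) / (std + eps)

# Example: four sampled answers to the same question,
# two correct (score 1.0) and two incorrect (score 0.0).
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [ 1., -1.,  1., -1.]
```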
Learning from the Group: The system then updates its approach based on these comparisons, but with some clever safeguards (sketched in code after this list):
It won't change its approach too drastically at once (using what they call a "clip" function)
It keeps track of how much it's changing from its original behavior (using something called KL divergence)
It balances between exploring new approaches and sticking to what works
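Putting those safeguards together, here's a hedged sketch of what a clipped, KL-regularized update of this kind can look like in PyTorch. The tensor names, hyperparameter values, and function signature are illustrative, not taken from the paper:

```python
# A sketch of a clipped, KL-regularized policy loss in the spirit of GRPO.
# logp_new / logp_old / logp_ref are per-token log-probabilities under the
# updated policy, the sampling policy, and the frozen reference model.
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              clip_eps=0.2, kl_coef=0.04):
    # Probability ratio between the updated policy and the one that
    # actually sampled the answers.
    ratio = torch.exp(logp_new - logp_old)
    # The "clip" safeguard: cap how far a single update can move the policy.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # KL penalty: keep the policy close to the original reference model,
    # using a non-negative estimator of the per-token KL divergence.
    log_ratio_ref = logp_ref - logp_new
    kl = log_ratio_ref.exp() - log_ratio_ref - 1
    return policy_loss + kl_coef * kl.mean()
```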
This group-based approach has two huge advantages:
It's more efficient than methods like PPO that require training a separate "critic" model alongside the main one
It provides more stable learning because it's comparing answers within the same context
The Two-Stage Breakthrough
Using this GRPO algorithm as their foundation, DeepSeek developed their solution in two major stages:
Stage 1: DeepSeek-R1-Zero - The Pure Learning Approach
The first stage was ambitious: could they teach an AI to reason using only reinforcement learning, without any examples to follow? Here's how they set it up:
The Training Template: They created a simple format for the AI to follow:
A conversation between User and Assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags.
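To make this concrete, here's how such a template might be wired up as a prompt string. The variable names and the exact "User: ... Assistant:" suffix are my assumptions about the plumbing, not a quote from the paper:

```python
# A minimal sketch of the training template as a prompt string.
TEMPLATE = (
    "A conversation between User and Assistant. The assistant first thinks "
    "about the reasoning process in the mind and then provides the user "
    "with the answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags. User: {question} Assistant:"
)

prompt = TEMPLATE.format(question="Solve sqrt(a - sqrt(a + x)) = x for x.")
print(prompt)
```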
The Reward System: They developed two types of rewards:
Accuracy rewards: Points for getting the right answer, verified through automated checking
Format rewards: Points for properly organizing the response with the correct tags
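As a toy illustration, here's what such rule-based checks might look like in Python. The regex and the exact-match comparison are simplifications I'm assuming; the real system uses more careful verification (for instance, checking answers in a fixed format or running test cases for code):

```python
# A toy sketch of the two rule-based rewards, assuming answers are plain
# strings and the ground truth is known.
import re

THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.S)

def format_reward(response: str) -> float:
    # Reward the <think>...</think><answer>...</answer> structure itself.
    return 1.0 if THINK_ANSWER.search(response) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    match = THINK_ANSWER.search(response)
    if not match:
        return 0.0
    # Simple string comparison stands in for real answer verification here.
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```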
The Learning Process: As the system worked through problems, several fascinating behaviors emerged naturally:
It learned to spend more time on harder problems
It developed self-verification strategies
It would pause and restart its reasoning when stuck
It discovered the value of breaking problems into smaller pieces
Here's an example of what they call an "aha moment" in the system's learning. During training, the AI was solving a math problem and suddenly stopped itself:
<think> To solve the equation √(a - √(a + x)) = x, let's start by squaring both.... (some other calculations)
Wait, wait. Wait. That's an aha moment I can flag here. Let's reevaluate this step-by-step to identify if the correct sum can be...
</think>
The system had learned to recognize when its initial approach might not be optimal and to step back and reconsider - without ever being explicitly taught to do this!
Stage 2: DeepSeek-R1 - Refining the Diamond
While DeepSeek-R1-Zero proved that AI could learn to reason through pure trial and error, it was like a brilliant student who had developed their own unique way of solving problems - effective, but sometimes hard for others to follow. The team knew they needed to refine this raw talent into something more polished and practical.
Their solution was an elegant four-phase process that transformed their initial breakthrough into a more sophisticated system. Let's walk through how they did it.
Phase 1: The Cold Start
First came the cold start phase. Instead of letting the AI figure everything out from scratch, they gave it a foundation of good examples. Imagine learning a new programming language: while you could figure it out through trial and error, having some well-written example code helps tremendously. The team gathered thousands of clear, well-explained solutions using various methods: they had AI systems generate detailed explanations, selected the best outputs from R1-Zero, and had human experts refine the examples. This gave their system a strong foundation in clear, effective reasoning.
Phase 2: Intensive Reasoning Training
With this foundation in place, they moved to intensive reasoning practice. Using their GRPO algorithm, they let the system solve countless problems across mathematics, coding, and scientific reasoning. But this time, they added an interesting twist: the system wasn't just rewarded for getting the right answer - it also earned points for keeping its reasoning readable and in a single, consistent language rather than drifting between languages mid-thought. Think of it like a math teacher who grades not just the final answer, but also the clarity of the student's work.
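A crude sketch of how such a combined reward could be wired up, reusing the `accuracy_reward` from the earlier sketch. The ASCII-based language proxy and the 0.1 weight are my own simplifications of the paper's "proportion of target-language words" idea:

```python
# Hedged sketch: accuracy plus a language-consistency bonus.
def target_lang_fraction(text: str) -> float:
    words = text.split()
    if not words:
        return 0.0
    # Crude proxy for "share of English words": count ASCII-only tokens.
    return sum(w.isascii() for w in words) / len(words)

def combined_reward(response: str, ground_truth: str) -> float:
    # accuracy_reward is defined in the earlier reward sketch.
    return accuracy_reward(response, ground_truth) + 0.1 * target_lang_fraction(response)
```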
Phase 3: Data Generation and Filtering
The next phase was all about quality control and expansion. The team had their system generate hundreds of thousands of solutions to various problems. For each problem, they generated multiple candidates and kept only the best ones - a technique known as rejection sampling, much like a writer drafting several versions of a story and keeping the strongest. They ended up with about 600,000 examples of high-quality reasoning across different types of problems. To ensure the system maintained its general capabilities, they also included 200,000 examples of other tasks like writing and general question-answering.
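In code, rejection sampling can be as simple as a best-of-n loop. Here, `generate` and `score` are hypothetical stand-ins for the model call and the quality check, and n is illustrative:

```python
# A best-of-n rejection-sampling sketch for building a fine-tuning dataset.
def best_of_n(prompt, generate, score, n=8):
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    # Keep only samples that actually pass the quality bar.
    return best if best_score > 0 else None
```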
Phase 4: Balancing Capabilities
The final phase was about balance. Using reinforcement learning one last time, they trained the system to maintain its powerful reasoning abilities while also being helpful, safe, and clear in its communication. It's like teaching a brilliant mathematician to also be a good teacher - technical excellence combined with clear communication.
From Breakthrough to Benchmark
The results of this refined approach were remarkable. When tested on the American Invitational Mathematics Examination (AIME), one of the most challenging high school math competitions in the United States, DeepSeek-R1 solved nearly 80% of the problems correctly - performance on par with OpenAI's flagship o1 model.
But mathematics was just the beginning. In competitive programming, the system achieved a Codeforces rating that would place it in the top 4% of human competitors. On tests of general knowledge, it showed deep understanding across various subjects, with particular strength in STEM fields, scoring over 90% on the MMLU benchmark.
Making Intelligence Efficient
Perhaps the most exciting part of their research came next. The team discovered that they could create smaller, more efficient versions of their system through a process called distillation. Think of it like creating a concentrated essence of the original model's capabilities.
They created several versions, ranging from very small (1.5 billion parameters) to very large (70 billion parameters). What's fascinating is that even their smaller models maintained impressive capabilities. Their 32 billion parameter model, for instance, could still solve over 72% of AIME problems correctly and handle complex coding tasks with remarkable accuracy.
This wasn't just about making smaller copies of their system. The team discovered something surprising: trying to train these smaller models directly with reinforcement learning didn't work as well as teaching them to mimic the larger model's reasoning patterns. It's like the difference between trying to become a chess master through trial and error versus studying the games of grandmasters.
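Conceptually, the distillation recipe is just supervised fine-tuning on the teacher's outputs. Here's a minimal sketch, with all names illustrative (the paper fine-tunes open Qwen and Llama models on roughly 800,000 R1-generated samples):

```python
# Hedged sketch of building a distillation dataset from a teacher model.
# `teacher_generate` is a hypothetical stand-in for calling the large model.
def build_distillation_set(prompts, teacher_generate):
    data = []
    for p in prompts:
        trace = teacher_generate(p)  # full <think>... reasoning included
        data.append({"prompt": p, "target": trace})
    return data
```

The student is then trained with an ordinary next-token prediction loss on these targets - no reinforcement learning involved, which is exactly the shortcut the team found works better for smaller models.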
The Road Ahead
While the results are impressive, the research team is clear about the challenges that remain. The system still struggles with certain types of tasks - particularly those requiring back-and-forth interaction or working with multiple languages. It's like a brilliant scholar who can solve complex problems but sometimes struggles with casual conversation.
The technical challenges are significant: the system needs better ways to handle ongoing conversations, more efficient training methods for software engineering tasks, and improved abilities to work with languages beyond English and Chinese. These aren't just technical limitations - they represent the next frontiers in AI development.
A Glimpse of the Future
This research represents more than just technical achievements - it shows us a new way of thinking about AI development. Instead of just teaching AI systems to follow examples, we're developing ways for them to learn through exploration and refinement, much like human learning.
The success of this approach, combined with the ability to create more efficient versions through distillation, suggests a future where sophisticated AI reasoning could become both more powerful and more accessible. It's a future where AI systems don't just process information, but truly reason about it - and do so in ways that humans can understand and work with effectively.