Meta researchers distill System 2 thinking into LLMs, improving performance on complex reasoning


Large language models (LLMs) are very good at answering simple questions but require special prompting techniques to handle complex tasks that need reasoning and planning. Often referred to as “System 2” techniques, these prompting schemes enhance the reasoning capabilities of LLMs by forcing them to generate intermediate steps toward solving a problem.

While effective, System 2 techniques make LLM applications slow and computationally expensive. In a new paper, researchers at Meta FAIR present “System 2 distillation,” a technique that teaches LLMs complex tasks without requiring intermediate steps. 

System 1 and System 2 in cognitive science and LLMs

In cognitive science, System 1 and System 2 refer to two distinct modes of thinking. System 1 thinking is fast, intuitive and automatic. It is what we use when recognizing patterns, making quick judgments, or understanding familiar symbols. For example, we use System 1 thinking to identify traffic signs, recognize faces, and associate basic symbols with their meanings.

System 2 thinking, on the other hand, is slow, deliberate and analytical. It requires conscious effort and is used for complex problem-solving, such as manipulating abstract symbols, solving mathematical equations or planning a trip. 

LLMs are usually considered analogous to System 1 thinking. They can generate text very quickly, but they struggle with tasks that require deliberate reasoning and planning. 

In recent years, AI researchers have shown that LLMs can be made to mimic System 2 thinking by prompting them to generate intermediate reasoning steps before providing their final answer. For example, “Chain of Thought” is a prompting technique that instructs the LLM to explain its reasoning process step by step, which often leads to more accurate results for logical reasoning tasks. Other System 2 prompting techniques are tailored to specific kinds of tasks.

“Many of these methods are shown to produce more accurate results due to this explicit reasoning, but typically do so at much higher inference cost and latency for a response,” the Meta AI researchers write. “Due to the latter, many of these approaches are not used in production systems, which mostly use System 1 generations.”

System 2 distillation

An interesting observation about System 2 thinking in humans is that when we repeatedly perform a task that requires deliberate effort, it gradually becomes ingrained in our System 1. For example, when you learn to drive, you use a lot of conscious effort to control the car, follow traffic rules and navigate. But as you gain more experience, driving becomes second nature. You no longer need to think about each step, and you can perform them intuitively and automatically.

This phenomenon inspired the Meta AI researchers to develop “System 2 distillation” for LLMs. 

Distillation is a common technique in machine learning (ML), where a larger model, referred to as the “teacher,” is used to train a smaller model, or the “student.” For example, developers often use frontier models such as GPT-4 and Claude to generate training examples for smaller models such as Llama-2 7B.

However, System 2 distillation does not use a separate teacher model. Instead, the researchers found a way to distill the knowledge gained from the model’s own System 2 reasoning capabilities into its fast-paced and compute-efficient System 1 generation.

Figure: System 2 distillation (source: arXiv)

The process starts by prompting the LLM to solve a problem using System 2 prompting techniques. The responses are then verified for correctness through an unsupervised mechanism. For example, the researchers use “self-consistency,” in which the model is given the same prompt multiple times. Its answers are then compared, and the one that shows up most often is treated as the correct answer and added to the distillation dataset. If the answers are too inconsistent, the example and its answers are discarded.

Next, the researchers discard the intermediate steps generated by System 2 reasoning and keep only the final answers. Finally, they fine-tune the model on the initial question and the answer. This allows the model to skip the reasoning steps and jump straight to the answer.
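The data-building steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the `sample_system2` callable (which runs one System 2 prompt and returns only the final answer), the number of samples, and the 0.75 agreement threshold are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Optional


def majority_answer(answers: list[str], min_agreement: float = 0.75) -> Optional[str]:
    """Return the most frequent answer, or None if agreement is too low.

    The 0.75 threshold is an illustrative assumption, not a value from the paper.
    """
    if not answers:
        return None
    top, count = Counter(answers).most_common(1)[0]
    return top if count / len(answers) >= min_agreement else None


def build_distillation_set(
    questions: list[str],
    sample_system2: Callable[[str], str],  # hypothetical: returns the final answer only
    n_samples: int = 8,
    min_agreement: float = 0.75,
) -> list[dict]:
    """Build (question, answer) pairs with no intermediate reasoning kept."""
    dataset = []
    for question in questions:
        # Sample the same prompt several times and collect the final answers.
        finals = [sample_system2(question) for _ in range(n_samples)]
        answer = majority_answer(finals, min_agreement)
        # Discard examples whose answers are too inconsistent.
        if answer is not None:
            dataset.append({"prompt": question, "completion": answer})
    return dataset
```

The resulting list contains only questions paired with final answers, which is the form the article describes for fine-tuning; the fine-tuning step itself is outside the scope of this sketch.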

System 2 distillation in action

The researchers evaluated their method on a range of reasoning tasks and four different System 2 prompting techniques. For the base model, they used Llama-2-70B, which is large enough to have the capacity for internalizing new knowledge.

The System 2 approaches they used in their experiments include Chain-of-Thought, System 2 Attention, Rephrase and Respond and Branch-Solve-Merge. Some of these techniques require the model to be prompted several times, which makes them both slow and expensive. For example, Rephrase and Respond first prompts the model to rephrase the original query with elaboration, and then it re-prompts the model with the rephrased question. Branch-Solve-Merge is even more complicated and requires multiple back-and-forths with the model.
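To show why multi-call techniques cost more, here is a rough sketch of the two-step Rephrase and Respond flow as the article describes it. The `generate` callable stands in for any text-generation function, and the prompt wording is invented for illustration; neither comes from the paper.

```python
from typing import Callable


def rephrase_and_respond(question: str, generate: Callable[[str], str]) -> str:
    """Answer a question with two model calls instead of one.

    `generate` is a hypothetical function that maps a prompt to model output.
    """
    # Call 1: ask the model to rephrase the query with elaboration.
    rephrased = generate(
        "Rephrase and expand the following question so it is easier to answer:\n"
        f"{question}"
    )
    # Call 2: re-prompt the model with the rephrased question.
    return generate(f"{rephrased}\nAnswer:")
```

Every question triggers two generations here, whereas a distilled model would answer the original question in a single call.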

The results show that System 2 distillation can significantly improve the performance of LLMs on complex reasoning tasks, often matching or exceeding the accuracy of the original System 2 methods. Additionally, the distilled models can generate responses much faster and with less compute because they don’t have to go through the intermediate reasoning steps.

For example, they found that distillation was successful for tasks that use System 2 Attention to deal with biased opinions or irrelevant information. It also showed impressive results in some reasoning tasks, where Rephrase and Respond is used to clarify and improve responses, and for fine-grained evaluation and processing of tasks through Branch-Solve-Merge.

“We have shown that in many cases it is possible to distill this System 2 reasoning into the outputs of the LLM without intermediate generations while maintaining, or sometimes even improving, performance,” the researchers write. 

However, the researchers also found that, like humans, LLMs can’t distill all types of reasoning skills into their fast-paced inference mechanism. For example, they were unable to successfully distill complex math reasoning tasks that required Chain-of-Thought prompting. This suggests that some tasks might always require deliberate reasoning.

There is much more to be learned about System 2 distillation, such as how well it works on smaller models and how distillation affects the model’s broader performance on tasks that were not included in the distillation training dataset. It is also worth noting that LLM benchmarks are often prone to contamination, where the model has already seen the test examples in some form during training, resulting in inflated scores on test sets.

However, distillation will surely be a powerful optimization tool for mature LLM pipelines that perform specific tasks at each step.

“Looking forward, systems that can distill useful tasks in this way free up more time to spend on reasoning about the tasks that they cannot yet do well, just as humans do,” the researchers write.