Key takeaways:
- OpenAI’s new model, “o1” (code-named “Strawberry”), delivers unprecedented performance on STEM, coding, and logic tasks, surpassing state-of-the-art benchmark results.
- o1 leverages “chain-of-thought reasoning,” enhancing accuracy by breaking down complex problems step by step.
- The o1 model is not a replacement for OpenAI’s GPT-4o but rather a complement to it. Enterprises should evaluate use cases that benefit from o1’s precision, particularly in research and problem-solving scenarios.
What is OpenAI’s new model “o1”?
Following months of speculation about powerful, secretive AI under the codenames “q*” and “Strawberry,” OpenAI recently announced its new large language model, “o1.”
The o1 model was trained with a new reinforcement learning approach that effectively taught the model how to “think.” The result is a set of capabilities that shatters state-of-the-art results on benchmarks related to STEM, coding, logic, and more. Perhaps most impressively, o1 outperforms OpenAI’s GPT-4o roughly sixfold on the AIME math competition (83.3% accuracy vs. 13.4%). The model also boasts improved safety and is much more resistant to jailbreaking than GPT-4o.
A preview version of the model (“o1-preview”) was released on September 12, 2024; it’s unclear when the full “o1” version will be generally available.
How does “o1” work? What is chain-of-thought reasoning?
Much of o1’s capability improvement comes from its enhanced chain-of-thought reasoning, a technique in which the LLM thinks through its answer step by step before responding.
We’ll use an example to explain chain-of-thought reasoning and demonstrate its effectiveness.
Consider the following question:
“A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?”
When you ask an LLM this question without requesting chain-of-thought reasoning, it may jump straight to an answer that may or may not be correct. And because it doesn’t explain its thinking, it’s harder to validate the answer or to see where the reasoning went wrong if it’s incorrect.
o1-preview, on the other hand, builds its response from the ground up: it calculates that there are 8 golf balls (half of the 16 balls) and then 4 blue golf balls (half of the 8 golf balls), which is the correct answer.
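To make this concrete, here is a minimal sketch of manual chain-of-thought prompting, assuming the OpenAI Python SDK with an API key configured in the environment; the gpt-4o model name and the appended “Let’s think step by step” instruction are illustrative choices, not a prescribed recipe.

```python
# A minimal sketch of manual chain-of-thought prompting, assuming the
# OpenAI Python SDK and an OPENAI_API_KEY in the environment. The model
# name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

question = (
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)

# Plain prompt: the model may jump straight to an answer.
plain = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)
print(plain.choices[0].message.content)

# Chain-of-thought prompt: explicitly ask for step-by-step reasoning,
# which tends to improve accuracy and makes the answer easier to audit.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": question + " Let's think step by step."}
    ],
)
print(cot.choices[0].message.content)
```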
The researchers who introduced the technique found that chain-of-thought prompting consistently increases LLM accuracy, and results from many LLM performance benchmarks confirm the point.
Chain-of-thought prompting has always been a tool that any LLM user can apply, but the o1 model applies it automatically and has learned to do it better through OpenAI’s reinforcement learning training, which provided human feedback on whether the model’s reasoning was correct. That training helps o1 identify and correct its mistakes, break challenging problems into smaller subproblems, and pivot when its current approach is not working.
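With o1-preview, by contrast, no step-by-step instruction is needed; here is a minimal sketch, under the same OpenAI Python SDK assumption.

```python
# A minimal sketch of calling o1-preview, again assuming the OpenAI
# Python SDK. No "think step by step" instruction is needed: the model
# performs chain-of-thought reasoning internally before answering.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": (
                "A juggler can juggle 16 balls. Half of the balls are "
                "golf balls, and half of the golf balls are blue. "
                "How many blue golf balls are there?"
            ),
        }
    ],
)
print(response.choices[0].message.content)
```

Note that o1 keeps its raw chain of thought hidden and returns only the final answer, and at launch the preview API supported fewer parameters (no system messages, for example) than GPT-4o.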
Implications of OpenAI’s “o1” model (code-named “Strawberry”)
The o1 model is not a replacement for OpenAI’s GPT-4o but rather a complement to it. Its performance gains are concentrated in STEM topics, and unlike GPT-4o, it can’t access the internet.
It makes different trade-offs, too: o1 delivers better performance at the expense of higher cost and longer processing time. That trade-off is actually a good thing, because it gives enterprises option value.
Use cases that require speed and efficiency can rely on one of the many existing LLMs that are fast and cost-effective yet still performant.
Use cases like a researcher thinking through a challenging physics problem can benefit from the better accuracy and the explainability that o1’s chain-of-thought reasoning provides.
After the shocking improvements of the past few years, LLM capabilities may be approaching an asymptote: the current scaling approach (adding more data and more computational power) appears to be reaching diminishing marginal returns.
Moving past that plateau will require a paradigm shift beyond the current pattern-recognition approach. One such new technique is to combine logic-based systems with the massive amounts of data and computational power available today.
The impressive performance of o1 suggests this new paradigm may hold potential.
What should a business leader or executive do next?
- Evaluate use cases: Look for areas in your business—like R&D or technical problem-solving—where o1’s detailed, step-by-step reasoning could deliver value.
- Consider the cost: o1’s benefits come with higher resource demands, so make sure to evaluate whether those trade-offs make sense for your specific use cases.
- Plan for integration: Think about how o1 fits into your broader AI strategy and evaluate if and how it can complement your existing models. Consider using o1 for complex tasks while still leveraging faster, lower-cost models for routine operations.