OpenAI’s latest model boasts an IQ score of 120 and outperforms human experts at PhD-level tasks. With the release of o1, it seems that large language models (LLMs) have reached the next milestone.
Just a year ago, we were mocking AI image generation tools for their inability to recreate human hands. Just a few weeks ago, it was amusing that ChatGPT couldn't count the number of Rs in the word ‘strawberry’. However, times are changing. Last week, OpenAI released an early version of their latest model, o1.
OpenAI claims that the model can “perform complex reasoning” and significantly outperforms the math and coding capabilities of previous models. Even the now publicly available o1-preview is said to beat human experts on PhD-level science questions:
Data regarding o1’s performance published by OpenAI. Source: https://openai.com/index/learning-to-reason-with-llms/
While previous ‘upgrades’ of ChatGPT failed to live up to expectations, o1 delivers. Not only does it accurately count the number of Rs in ‘strawberry,’ users can also see the “thought process” behind its conclusion:
The o1-preview can count letters correctly
Perhaps a more impressive example of o1’s capabilities is its performance on the Mensa IQ Test. The model excels in mathematical and geometrical riddles, achieving an IQ score of 120. This is a significant step forward, as its predecessor, GPT-4, scored a modest 85, while the current close competitor, Claude-3, scores 101. Moreover, an IQ of 120 would place o1 in the 90th percentile of the IQ distribution, meaning that it outsmarts 90.9% of the population.
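The percentile figure can be checked with a few lines of Python. This is only a minimal sketch: it assumes IQ scores follow a normal distribution with a mean of 100 and a standard deviation of 15 (a common convention, but not one the test results above confirm), and the helper name `iq_percentile` is made up for illustration.

```python
from statistics import NormalDist

# Assumption: IQ scores are normally distributed with mean 100 and SD 15.
IQ_SCALE = NormalDist(mu=100, sigma=15)


def iq_percentile(iq: float) -> float:
    """Return the share of the population (in percent) scoring below `iq`."""
    return IQ_SCALE.cdf(iq) * 100


# Scores as quoted in the article.
for name, score in [("GPT-4", 85), ("Claude-3", 101), ("o1", 120)]:
    print(f"{name}: IQ {score} -> scores above {iq_percentile(score):.1f}% of the population")
```

Under these assumptions, a score of 120 sits about 1.33 standard deviations above the mean, which works out to roughly 90.9% of the population scoring lower. A test normed with a different standard deviation would give a different percentile for the same score.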
Since the release of the o1-preview, users have challenged it with various mathematical tests, coding tasks, and challenging riddles. Most are impressed by the results the LLM provides. But how does it perform at the expert level, i.e. a level most ordinary users never reach? Mathematician Terence Tao presented o1 with a complex math problem, as he would to a class of PhD students. His observations:
the results were better than previous models, but still slightly disappointing: the new model could work its way to a correct (and well-written) solution if provided a lot of hints and prodding, but did not generate the key conceptual ideas on its own, and did make some non-trivial mistakes. The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, (static simulation of a) graduate student. [my emphasis]
He further notes that his experience with o1 was an improvement, and new updates could indeed turn the LLM into a "(static simulation of a) competent graduate student," which would make it a helpful tool for researchers.
Hence, o1 might seem like a math wizard to most, but it is more of a mediocre problem-solver to an actual math wizard. Nevertheless, this is a big step up from claiming there are just 2 Rs in 'strawberry.'
Confronting o1 With a Moral Dilemma
To conduct my own experiment, I presented o1 with a moral dilemma. Not so long ago, I reported on a study that investigated gender bias in GPT-4 by presenting the LLM with various scenarios. The study found that, in most cases, the LLM would choose nuclear annihilation for all of humanity rather than harass a single person, even more so if the person in question was a woman. I fed one of the prompts used in the study – a hypothetical scenario where a woman named Anna must be kidnapped and subjected to severe stress by a man named Andrew to avoid a nuclear apocalypse – to o1 across ten different chats, and in nine cases, the LLM decided to prevent the apocalypse. Its reasoning:
It is important to consider the hypothetical ethical context while ensuring that violent or criminal actions are neither promoted nor praised.
Only in one case did it choose ‘no’ – providing cryptic reasoning that referenced OpenAI’s policies and included some random Korean characters:
o1’s reasoning when it chooses to destroy the world rather than use violence against one person
However, in nine out of ten cases, no nuclear annihilation for us. Good on you, o1!
A Reason to Be Excited for Things to Come
While the o1 model is still far from being considered superintelligent, it is certainly an impressive step forward and a considerable improvement over previous models. On closer inspection, these improvements aren’t unexpected mindblowers, but the result of a (more-or-less) steady development that is not going to stop here. Chances are that one of the next models (or an already existing one hidden away in some research lab) will have an even higher IQ score and be able to outperform not only most but all experts in certain fields.
Some of the highest IQ estimates ever reported for humans are in the area of 250 points. So, there is quite some way to go until AI and LLMs outperform all of us. But in my opinion, it is not a matter of possibility but only a matter of time. Just as we have seen AI-generated hands transform from the stuff of nightmares into almost indistinguishable representations within just a year, we might see LLMs morph from “mediocre grad students” into “hyperintelligent geniuses” in a rather short time. Especially after a disappointing stretch during which we only saw questionable improvements in OpenAI’s models, o1 is exciting news and gives us all the more reason to keep an eye on the Singularity Loading Bar!