
OpenAI claims its new model has reached human-level performance on a test of "general intelligence". What does that mean?


A new artificial intelligence (AI) model has just achieved human-level results on a test designed to measure "general intelligence."

On December 20, OpenAI's o3 system scored 85% on the ARC-AGI benchmark, well above the previous best AI score of 55% and on par with the average human score. It also scored well on a very difficult math test.

The creation of artificial general intelligence, or AGI, is the stated goal of all major AI research labs. At first glance, OpenAI seems to have at least taken a significant step towards this goal.

While skepticism remains, many AI researchers and developers feel something has just changed. For many, the prospect of AGI now seems more real, urgent and closer than anticipated. Are they right?

Generalization and intelligence

To understand what the o3 result means, you need to understand what the ARC-AGI test is all about. In technical terms, it’s a test of an AI system’s “sample efficiency” in adapting to something new – how many examples of a new situation the system needs to see to understand how it works.

An AI system like ChatGPT (GPT-4) is not very sample efficient. It was "trained" on millions of examples of human text, building probabilistic "rules" about which combinations of words are most likely.

The result is good enough for common tasks. It is bad at uncommon tasks, because it has less data (fewer samples) on those tasks.

Until AI systems can learn from a small number of examples and adapt with more sample efficiency, they will only be used for highly repetitive tasks and those where occasional failure is tolerable.

The ability to accurately solve previously unseen or novel problems from limited samples of data is known as the capacity to generalize. It is widely considered a necessary, even fundamental, element of intelligence.

Grids and patterns

The ARC-AGI benchmark tests for sample-efficient adaptation using small grid-square problems like the one below. The AI needs to work out the pattern that turns the grid on the left into the grid on the right.

[Image: an example task from the ARC-AGI benchmark test, showing patterns of colored squares on a black grid. Credit: ARC Prize]

Each question gives three examples to learn from. The AI system then needs to work out the rules that "generalize" from the three examples to the fourth.
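
To make that setup concrete, here is a minimal sketch in Python of what an ARC-style task looks like: a few input/output grid pairs to learn from, plus a held-out test input. The grids and the `candidate_rule` function are invented for illustration only; they are not taken from the actual benchmark or from OpenAI's system.

```python
# A toy ARC-style task: grids are 2D lists of integers, each integer a colour.
# This is an illustrative stand-in, not a real ARC-AGI task.
train_pairs = [
    # (input grid, output grid) -- here the hidden "rule" is mirroring each row
    ([[1, 0, 0],
      [2, 0, 0]],
     [[0, 0, 1],
      [0, 0, 2]]),
    ([[0, 3, 0],
      [4, 0, 0]],
     [[0, 3, 0],
      [0, 0, 4]]),
    ([[5, 5, 0],
      [0, 0, 6]],
     [[0, 5, 5],
      [6, 0, 0]]),
]
test_input = [[7, 0, 0],
              [0, 8, 0]]

def candidate_rule(grid):
    """One hypothetical rule a solver might propose: mirror each row."""
    return [list(reversed(row)) for row in grid]

# A rule only counts if it reproduces *all* of the training examples...
assert all(candidate_rule(inp) == out for inp, out in train_pairs)

# ...and the real test is whether it also produces the right answer for the
# held-out fourth grid, which the solver never saw solved.
print(candidate_rule(test_input))  # [[0, 0, 7], [0, 8, 0]]
```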

These are a lot like the IQ tests you might remember from school.

Weak rules and adaptation

We don't know exactly how OpenAI has done it, but the results suggest the o3 model is highly adaptable. From just a few examples, it finds rules that can be generalized.

To figure out a pattern, we should not make any unnecessary assumptions, or be more specific than we really have to be. In theory, if you can identify the "weakest" rules that do what you want, then you have maximized your ability to adapt to new situations.

What do we mean by weaker rules? The technical definition is complicated, but the weakest rules are usually the ones that can be described in simpler statements.

In the example above, an English expression of the rule might be something like: “Any shape with a protruding line will move to the end of that line and ‘cover’ any other shapes that overlap.”
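
One rough way to picture why a weaker rule is preferable (a crude sketch only; the formal definition involves more than description length) is to compare candidate rules by how much it takes to state them, and prefer the shortest statement that still fits every example. Both "rules" below are invented for the toy mirror-the-rows task sketched earlier.

```python
# Two candidate rules that both fit the toy task above (hypothetical example).
# Rule A is general; Rule B just memorises the three training pairs.
rule_a = "mirror every row left-to-right"
rule_b = ("if the grid is [[1,0,0],[2,0,0]] output [[0,0,1],[0,0,2]]; "
          "if it is [[0,3,0],[4,0,0]] output [[0,3,0],[0,0,4]]; "
          "if it is [[5,5,0],[0,0,6]] output [[0,5,5],[6,0,0]]")

# A crude proxy for "weakness": the length of the statement needed to express
# the rule. The shorter, more general rule makes fewer assumptions, so it is
# more likely to keep working on the unseen fourth grid.
weakest = min([rule_a, rule_b], key=len)
print(weakest)  # "mirror every row left-to-right"
```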

Looking for chains of thought?

While we don't yet know how OpenAI achieved this result, it seems unlikely they deliberately optimized the o3 system to find weak rules. However, to succeed at the ARC-AGI tasks, it must be finding them.

We do know that OpenAI started with a general-purpose version of the o3 model (which differs from most other models, because it can spend more time "thinking" about difficult questions) and then trained it specifically for the ARC-AGI test.

French AI researcher François Chollet, who designed the benchmark, believes o3 searches through different "chains of thought" describing steps to solve the task. It would then choose the "best" according to some loosely defined rule, or "heuristic".

This would be “not dissimilar” to how Google’s AlphaGo system searched through various possible sequences of moves to beat the Go world champion.

You can think of these chains of thought as programs that fit the examples. Of course, if it is like the Go-playing AI, then it needs a heuristic, or loose rule, to decide which program is best.

There could be thousands of different, seemingly equally valid programs generated. That heuristic could be "choose the weakest" or "choose the simplest".

However, if it is like AlphaGo, then they simply had an AI create the heuristic. This was the process for AlphaGo: Google trained a model to rate different sequences of moves as better or worse than others.
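
Putting these pieces together, here is a minimal sketch of that generate-and-select idea, reusing the toy mirror-the-rows task from earlier. It assumes, purely for illustration, that candidate "chains of thought" can be treated as small programs and ranked by a hand-coded `score` heuristic; nothing here is confirmed about how o3 actually works, and as with AlphaGo the heuristic could instead be a learned model.

```python
# The toy mirror-each-row task from the earlier sketch.
train_pairs = [
    ([[1, 0, 0], [2, 0, 0]], [[0, 0, 1], [0, 0, 2]]),
    ([[0, 3, 0], [4, 0, 0]], [[0, 3, 0], [0, 0, 4]]),
    ([[5, 5, 0], [0, 0, 6]], [[0, 5, 5], [6, 0, 0]]),
]
test_input = [[7, 0, 0], [0, 8, 0]]

# Candidate "chains of thought", represented as small programs (functions) that
# map an input grid to an output grid. In a real system these would be generated
# by the model itself; here they are hand-written for illustration.
candidates = {
    "mirror each row": lambda g: [list(reversed(row)) for row in g],
    "leave the grid unchanged": lambda g: [row[:] for row in g],
    "rotate the grid 180 degrees, then flip it top to bottom":
        lambda g: [list(reversed(row)) for row in g[::-1]][::-1],
}

def fits_examples(program, pairs):
    """Keep only programs that reproduce every training example."""
    return all(program(inp) == out for inp, out in pairs)

def score(description):
    """A loose, hand-coded heuristic: prefer the simplest (shortest) description.
    A learned model that rates candidates, as AlphaGo rated move sequences,
    could be substituted here."""
    return -len(description)

valid = {desc: prog for desc, prog in candidates.items()
         if fits_examples(prog, train_pairs)}
best = max(valid, key=score)
print(best)                     # "mirror each row"
print(valid[best](test_input))  # [[0, 0, 7], [0, 8, 0]]
```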

What we don’t know yet

The question then is, is this really closer to AGI? If this is how o3 works, then the underlying model might not be much better than previous models.

The concepts the model learns from language might not be any more suitable for generalization than before. Instead, we may just be seeing a more generalizable "chain of thought" found through the extra steps of training a heuristic specialized for this test. The proof, as always, will be in the pudding.

Almost everything about o3 remains unknown. OpenAI has limited disclosure to a few media presentations and early testing to a handful of researchers, laboratories and AI safety institutions.

Truly understanding the potential of o3 will require extensive work, including evaluations, an understanding of the distribution of its capabilities, how often it fails and how often it succeeds.

When o3 is finally released, we'll have a much better idea of whether it is roughly as adaptable as the average human.

If so, it could have a huge, revolutionary, economic impact, ushering in a new era of self-improving accelerated intelligence. We will need new benchmarks for AGI itself and a serious consideration of how it should be governed.

If not, then this will still be an impressive result. However, everyday life will remain much the same.

Michael Timothy Bennett, PhD Student, School of Computing, Australian National University, and Elija Perrier, Research Fellow, Stanford Center for Responsible Quantum Technology, Stanford University

This article is republished from The Conversation under a Creative Commons license. Read the original article.
