AI’s intelligence test dilemma: why we still don’t know how to measure human-level thinking in machines

Despite impressive progress in artificial intelligence, the field is still missing a definitive way to measure whether a machine truly thinks like a human. From prize challenges to philosophical debates, researchers are wrestling with the limits of current benchmarks—and the daunting prospect of testing superintelligence.

The challenge of designing the ultimate AI test

Two major San Francisco AI players—Scale AI and the Center for AI Safety (CAIS)—recently launched “Humanity’s Last Exam,” a global call to devise questions capable of truly testing large language models (LLMs) such as Google Gemini and OpenAI’s o1. Offering $5,000 prizes for the top 50 questions, the challenge aims to probe how close we are to “expert-level AI systems.”


The problem is that today’s LLMs already excel at many established benchmarks of intelligence, mathematics, and law, sometimes so convincingly that the results raise suspicion of memorisation. These models are trained on colossal amounts of data, quite possibly including the very questions later used to assess them. The AI research group Epoch predicts that by 2028 AIs may have effectively “read” everything humans have ever written, leaving almost no established test safe from contamination. That looming moment forces researchers to rethink what testing should look like once AIs already know all the answers.
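One common way to probe for this kind of contamination is to check whether benchmark questions appear, nearly verbatim, somewhere in a model’s training data. The sketch below illustrates the idea with simple word n-gram overlap between a benchmark question and a training document; the threshold, the toy strings, and the function names are illustrative assumptions, not any lab’s actual pipeline.

```python
# Minimal sketch of a contamination check: flag a benchmark question whose
# word n-grams overlap heavily with a training document. The 0.5 threshold
# and the toy strings are illustrative assumptions, not a real pipeline.

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(question: str, document: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the document."""
    q = ngrams(question, n)
    if not q:
        return 0.0
    return len(q & ngrams(document, n)) / len(q)

training_doc = ("a forum post that happens to quote the exam: what is the "
                "capital of the ancient kingdom of lydia and why did it matter")
benchmark_q = "What is the capital of the ancient kingdom of Lydia and why did it matter"

score = contamination_score(benchmark_q, training_doc)
verdict = "likely contaminated" if score > 0.5 else "looks clean"
print(f"n-gram overlap: {score:.2f} -> {verdict}")
```

Real decontamination efforts rely on far more robust matching than this, but the underlying question is the same: has the model already seen the test?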

Complicating matters further, a growing share of new online content is itself generated by AI, feeding “model collapse,” a phenomenon in which AI-generated content recirculates through training data and degrades model quality. Some companies are countering this by collecting fresh data from human interactions or real-world devices, from Tesla’s sensor-laden vehicles to Meta’s smart glasses, to keep both training and evaluation datasets diverse.
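A toy way to see why recirculating model output degrades quality is to refit a simple model, generation after generation, on samples drawn from its own predecessor and watch its diversity shrink. The sketch below does this with a 50-token categorical distribution; it is a deliberately simplified caricature of model collapse, not a simulation of any real training system.

```python
# Toy illustration of "model collapse": each generation estimates token
# frequencies from a finite sample generated by the previous generation's
# model. Rare tokens that get missed fall to probability zero and can never
# reappear, so the model's output diversity only ever shrinks.
import random
from collections import Counter

random.seed(42)

vocab = [f"token_{i}" for i in range(50)]
probs = {t: 1 / len(vocab) for t in vocab}   # generation 0: uniform "human" data
sample_size = 100

for generation in range(30):
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = random.choices(tokens, weights=weights, k=sample_size)  # model output
    counts = Counter(sample)
    probs = {t: counts[t] / sample_size for t in tokens}             # refit on it
    if generation % 5 == 0:
        surviving = sum(1 for p in probs.values() if p > 0)
        print(f"generation {generation:2d}: {surviving} of {len(vocab)} tokens still produced")
```

Because a token whose estimated probability hits zero can never be sampled again, the loss of diversity in this toy process is irreversible, which is the essence of the concern.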

Why current intelligence benchmarks fall short


Defining intelligence has always been contentious. Human IQ tests have been criticized for capturing only narrow cognitive skills, and AI assessments suffer from a similar problem. Many benchmarks—summarizing text, interpreting gestures, recognizing images—test specific tasks but fail to measure adaptability or reasoning across domains.

A striking example is the chess engine Stockfish. It outrates Magnus Carlsen, the world’s highest-rated human player, by a wide Elo margin, yet it cannot interpret a sentence or navigate a room. This illustrates why task-specific dominance cannot be equated with general intelligence, and why AI researchers are searching for more holistic measures.
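That Elo gap translates directly into expected results: under the standard Elo formula, a player rated d points higher is expected to score 1/(1 + 10^(−d/400)) per game. The snippet below evaluates that formula using rough, unofficial rating figures chosen purely for illustration.

```python
# Expected score per game under the standard Elo model:
#   E = 1 / (1 + 10 ** (-(r_a - r_b) / 400))
# The ratings below are rough illustrative figures, not official numbers.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score for player A (win = 1, draw = 0.5, loss = 0)."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

engine_rating = 3600   # ballpark figure for a top chess engine
human_rating = 2830    # roughly Magnus Carlsen's classical rating

print(f"expected engine score: {expected_score(engine_rating, human_rating):.3f} per game")
# prints roughly 0.988, i.e. the engine is an overwhelming favourite in every game
```

An expected score near 0.99 means near-certain dominance over the board, yet, as the example shows, that dominance says nothing about language or any other domain.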

As AI systems now exhibit broader competencies, the challenge is to create tests that capture this expansion. Benchmarks must go beyond scoring isolated abilities to gauge how well a model can generalize knowledge and adapt to new, unseen problems—skills considered the hallmark of human intelligence.

The rise of reasoning-based evaluation

French Google engineer François Chollet’s “abstraction and reasoning corpus” (ARC) is one of the most ambitious attempts to move testing in this direction. Launched in 2019, ARC presents AIs with visual puzzles that require inferring and applying abstract rules with minimal examples—far from the brute-force pattern recognition of traditional image classification.
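An ARC task is essentially a set of small coloured grids: a few input–output examples demonstrate a hidden transformation, and the solver must apply that rule to a fresh input. The sketch below encodes a deliberately easy ARC-style task whose hidden rule is “mirror the grid left to right” and checks one candidate rule against the demonstrations; the grids and the rule are invented for illustration and are far simpler than real ARC puzzles.

```python
# An ARC-style task as data: small integer grids, a few demonstration pairs,
# and a hidden transformation to infer. This toy task's hidden rule is
# "mirror the grid left to right"; real ARC puzzles are far harder.
Grid = list[list[int]]

train_pairs: list[tuple[Grid, Grid]] = [
    ([[1, 0, 0],
      [0, 2, 0]],
     [[0, 0, 1],
      [0, 2, 0]]),
    ([[3, 3, 0],
      [0, 0, 4]],
     [[0, 3, 3],
      [4, 0, 0]]),
]
test_input: Grid = [[5, 0, 0],
                    [0, 0, 6]]

def mirror_left_right(grid: Grid) -> Grid:
    """Candidate rule: reverse every row."""
    return [list(reversed(row)) for row in grid]

# A real solver searches over many candidate rules; here we just verify one.
if all(mirror_left_right(x) == y for x, y in train_pairs):
    print("rule fits all demonstrations")
    print("prediction for test input:", mirror_left_right(test_input))
```

A genuine ARC solver must discover rules like this one from scratch across hundreds of varied tasks, which is exactly where current models struggle.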

ARC remains a steep challenge for machines. Humans regularly score above 90%, while leading LLMs such as OpenAI’s o1-preview and Anthropic’s Claude 3.5 Sonnet hover around 21%. Even GPT-4o, which reached 50% through an extensive trial-and-error approach, remained well short of the 85% threshold needed to claim the $600,000 prize. This gap reassures researchers that, despite the hype, no AI has yet matched the flexible reasoning humans demonstrate with ease.

However, ARC is not the only experiment in play. The Humanity’s Last Exam project deliberately keeps its winning questions offline to avoid AI models “studying” the answers, a move that reflects how high the stakes are for ensuring a test’s integrity in a world where machines might pre-learn anything published.

What happens when we approach superintelligence?

The urgency to develop robust tests goes beyond academic curiosity. The arrival of human-level AI—let alone superintelligence—would bring profound safety, ethical, and governance challenges. Detecting that moment requires tools capable of identifying not just task mastery, but deep reasoning, adaptability, and value alignment with human norms.

Yet, as history shows, each time AI reaches parity in a narrow domain, the goalposts shift. What once counted as “intelligent” becomes routine automation. This means that measuring superintelligence will likely demand frameworks that we cannot fully conceive yet—ones that might integrate philosophical inquiry, social sciences, and long-term behavioral observation.

For now, the race is on to design the next generation of tests before AI’s knowledge eclipses our ability to measure it. And when machines finally pass “Humanity’s Last Exam,” the next question might be even more daunting: how do we examine something smarter than ourselves?
