AI’s intelligence test dilemma: why we still don’t know how to measure human-level thinking in machines

Despite impressive progress in artificial intelligence, the field is still missing a definitive way to measure whether a machine truly thinks like a human. From prize challenges to philosophical debates, researchers are wrestling with the limits of current benchmarks—and the daunting prospect of testing superintelligence.

The challenge of designing the ultimate AI test

Two major San Francisco AI players—Scale AI and the Center for AI Safety (CAIS)—recently launched “Humanity’s Last Exam,” a global call to devise questions capable of truly testing large language models (LLMs) such as Google Gemini and OpenAI’s o1. Offering $5,000 prizes for the top 50 questions, the challenge aims to probe how close we are to “expert-level AI systems.”

The problem is that today’s LLMs excel at many existing benchmarks in intelligence, mathematics, and law—sometimes to a degree that raises suspicion. These models are trained on colossal amounts of data, possibly including the very questions used to assess them. As the AI analytics group Epoch predicts, by 2028 AIs may have effectively “read” everything humans have ever written, leaving almost no established test safe from contamination. That looming moment forces researchers to rethink what testing should look like once AIs know all the answers.
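To make the contamination worry concrete, the sketch below shows the kind of simple n-gram overlap heuristic an evaluator might use to flag a test question that already appears, near-verbatim, in a training corpus. It is a minimal illustration with invented placeholder strings, not the pipeline any particular lab uses.

```python
# Hypothetical sketch: flag possible benchmark contamination by checking whether
# a test question's word n-grams already appear in a training corpus.
# The corpus and question strings below are invented placeholders.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(question: str, corpus: str, n: int = 8, threshold: float = 0.5) -> bool:
    q = ngrams(question, n)
    if not q:
        return False
    overlap = len(q & ngrams(corpus, n)) / len(q)
    return overlap >= threshold

corpus = "a large dump of web text that may already contain published exam questions"
question = "If a train leaves the station at 3 pm travelling at 60 km/h, when does it arrive?"
print(looks_contaminated(question, corpus))  # False here, but True if the question were scraped verbatim
```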

Complicating matters further, much of the internet’s recent growth is itself machine-written, raising the spectre of “model collapse”: a feedback loop in which AI-generated content recirculates through training data and degrades model quality. Some companies are countering this by collecting fresh data from human interactions and real-world devices, from Tesla’s sensor-rich vehicles to Meta’s smart glasses, to keep both training and evaluation datasets diverse.
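The collapse dynamic is often illustrated with a toy experiment: fit a simple model to samples drawn only from the previous generation’s model, repeat, and watch the data’s diversity erode. The sketch below is exactly that kind of toy, with arbitrary sample sizes and generation counts rather than anything resembling a production training run.

```python
# Toy illustration of model collapse: each "generation" is fitted only to
# samples drawn from the previous generation's Gaussian. The spread
# (a stand-in for data diversity) tends to shrink toward zero over time.
import random
import statistics

mean, std = 0.0, 1.0                       # generation 0: the original "human" data
for generation in range(1, 101):
    samples = [random.gauss(mean, std) for _ in range(10)]   # small, synthetic-only dataset
    mean, std = statistics.fmean(samples), statistics.stdev(samples)
    if generation % 20 == 0:
        print(f"generation {generation:3d}: spread ~ {std:.4f}")
# Exact numbers vary from run to run, but the spread almost always decays sharply,
# meaning the tails of the original distribution have been forgotten.
```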

Why current intelligence benchmarks fall short

Defining intelligence has always been contentious. Human IQ tests have been criticized for capturing only narrow cognitive skills, and AI assessments suffer from a similar problem. Many benchmarks—summarizing text, interpreting gestures, recognizing images—test specific tasks but fail to measure adaptability or reasoning across domains.

A striking example is the chess engine Stockfish. It surpasses world champion Magnus Carlsen by a wide Elo margin, yet cannot interpret a sentence or navigate a room. This illustrates why task-specific dominance cannot be equated with general intelligence, and why AI researchers are searching for more holistic measures.
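For readers unfamiliar with Elo arithmetic, the short sketch below shows what such a margin implies in expected results. The ratings are rough assumptions for illustration only; engine ratings come from engine-versus-engine lists and are not strictly comparable to human FIDE ratings.

```python
# Elo expected-score formula applied to an assumed human-vs-engine rating gap.
# The ratings below are illustrative assumptions, not official figures.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

carlsen, stockfish = 2830, 3630            # assumed ratings for illustration only
print(f"Expected score for the human: {expected_score(carlsen, stockfish):.4f}")
# ~0.0099: on the order of one point per hundred games, yet the engine
# cannot read a sentence or find its way across a room.
```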

As AI systems now exhibit broader competencies, the challenge is to create tests that capture this expansion. Benchmarks must go beyond scoring isolated abilities to gauge how well a model can generalize knowledge and adapt to new, unseen problems—skills considered the hallmark of human intelligence.

The rise of reasoning-based evaluation

French Google engineer François Chollet’s “abstraction and reasoning corpus” (ARC) is one of the most ambitious attempts to move testing in this direction. Launched in 2019, ARC presents AIs with visual puzzles that require inferring and applying abstract rules with minimal examples—far from the brute-force pattern recognition of traditional image classification.
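To give a flavour of the format, the sketch below writes down an ARC-style task as plain data: a few demonstration grids encoding a hidden rule, plus a test grid the solver must transform. The grids and the “flip every cell” rule are invented for illustration and are not drawn from the actual corpus.

```python
# Illustrative ARC-style task: demonstration input/output grid pairs that
# encode a hidden rule, plus a test input the solver must transform.
# The grids and rule are invented, not taken from ARC itself.
task = {
    "train": [
        {"input":  [[0, 1], [1, 0]],
         "output": [[1, 0], [0, 1]]},     # demonstration: every cell is flipped
        {"input":  [[1, 1], [0, 0]],
         "output": [[0, 0], [1, 1]]},
    ],
    "test": [{"input": [[1, 0], [1, 1]]}],  # the solver must infer "flip every cell"
}

def solve(grid: list[list[int]]) -> list[list[int]]:
    """Hand-written solution to this one invented task: invert each binary cell."""
    return [[1 - cell for cell in row] for row in grid]

print(solve(task["test"][0]["input"]))  # [[0, 1], [0, 0]]
```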

ARC remains a steep challenge for machines. Humans regularly score above 90%, while leading LLMs such as OpenAI’s o1-preview and Anthropic’s Claude 3.5 Sonnet hover around 21%. Even GPT-4o, which reached 50% through an extensive trial-and-error approach, fell well short of the 85% needed to claim the $600,000 prize. This gap reassures researchers that, despite the hype, no AI has yet matched the flexible reasoning humans demonstrate with ease.

However, ARC is not the only experiment in play. The Humanity’s Last Exam project deliberately keeps its winning questions offline to avoid AI models “studying” the answers, a move that reflects how high the stakes are for ensuring a test’s integrity in a world where machines might pre-learn anything published.

What happens when we approach superintelligence?

The urgency to develop robust tests goes beyond academic curiosity. The arrival of human-level AI—let alone superintelligence—would bring profound safety, ethical, and governance challenges. Detecting that moment requires tools capable of identifying not just task mastery, but deep reasoning, adaptability, and value alignment with human norms.

Yet, as history shows, each time AI reaches parity in a narrow domain, the goalposts shift. What once counted as “intelligent” becomes routine automation. This means that measuring superintelligence will likely demand frameworks that we cannot fully conceive yet—ones that might integrate philosophical inquiry, social sciences, and long-term behavioral observation.

For now, the race is on to design the next generation of tests before AI’s knowledge eclipses our ability to measure it. And when machines finally pass “Humanity’s Last Exam,” the next question might be even more daunting: how do we examine something smarter than ourselves?
