AI’s intelligence test dilemma: why we still don’t know how to measure human-level thinking in machines

Despite impressive progress in artificial intelligence, the field is still missing a definitive way to measure whether a machine truly thinks like a human. From prize challenges to philosophical debates, researchers are wrestling with the limits of current benchmarks—and the daunting prospect of testing superintelligence.

The challenge of designing the ultimate AI test

Two major San Francisco AI players—Scale AI and the Center for AI Safety (CAIS)—recently launched “Humanity’s Last Exam,” a global call to devise questions capable of truly testing large language models (LLMs) such as Google Gemini and OpenAI’s o1. Offering $5,000 prizes for the top 50 questions, the challenge aims to probe how close we are to “expert-level AI systems.”


The problem is that today’s LLMs excel at many existing benchmarks in intelligence, mathematics, and law—sometimes to a degree that raises suspicion. These models are trained on colossal amounts of data, possibly including the very questions used to assess them. As the AI analytics group Epoch predicts, by 2028 AIs may have effectively “read” everything humans have ever written, leaving almost no established test safe from contamination. That looming moment forces researchers to rethink what testing should look like once AIs know all the answers.

Complicating matters further, the internet’s rapid growth is now intertwined with “model collapse,” a phenomenon where AI-generated content recirculates in training data, degrading model quality. Some companies are countering this by collecting fresh data from human interactions or real-world devices—from Tesla’s sensor-rich vehicles to Meta’s smart glasses—to ensure diversity in both training and evaluation datasets.

Why current intelligence benchmarks fall short


Defining intelligence has always been contentious. Human IQ tests have been criticized for capturing only narrow cognitive skills, and AI assessments suffer from a similar problem. Many benchmarks—summarizing text, interpreting gestures, recognizing images—test specific tasks but fail to measure adaptability or reasoning across domains.

A striking example is the chess engine Stockfish. It surpasses world champion Magnus Carlsen by a wide Elo margin, yet cannot interpret a sentence or navigate a room. This illustrates why task-specific dominance cannot be equated with general intelligence, and why AI researchers are searching for more holistic measures.
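The Elo gap mentioned above can be made concrete. Under the Elo model, a rating difference maps to an expected score through a simple logistic formula. The ratings below are rough public estimates used purely for illustration (Carlsen's peak classical rating was 2882; Stockfish is commonly estimated well above 3500), not official figures:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Illustrative ratings (rough estimates, not official figures)
stockfish, carlsen = 3500, 2882
print(round(elo_expected_score(stockfish, carlsen), 4))  # ~0.97
```

Even with these conservative numbers, the model gives the engine an expected score above 97% per game, which is why no human has beaten a top engine in a match in roughly two decades.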

As AI systems now exhibit broader competencies, the challenge is to create tests that capture this expansion. Benchmarks must go beyond scoring isolated abilities to gauge how well a model can generalize knowledge and adapt to new, unseen problems—skills considered the hallmark of human intelligence.

The rise of reasoning-based evaluation

French Google engineer François Chollet's Abstraction and Reasoning Corpus (ARC) is one of the most ambitious attempts to move testing in this direction. Launched in 2019, ARC presents AIs with visual puzzles that require inferring and applying abstract rules from minimal examples, far from the brute-force pattern recognition of traditional image classification.
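A minimal sketch of what an ARC-style task looks like in code: each task is a handful of input/output grids of small integers, and a solver must hypothesize the hidden transformation and reproduce the target grid exactly. The "mirror horizontally" rule and the demonstration pairs below are invented for illustration; real ARC rules are far more varied:

```python
from typing import List, Tuple

Grid = List[List[int]]

def solve(grid: Grid) -> Grid:
    """Candidate solver: hypothesize that the hidden rule is horizontal mirroring."""
    return [row[::-1] for row in grid]

def score(tasks: List[Tuple[Grid, Grid]]) -> float:
    """ARC-style scoring is exact-match: the predicted grid must equal the target."""
    correct = sum(1 for inp, out in tasks if solve(inp) == out)
    return correct / len(tasks)

# Two demonstration pairs consistent with the (assumed) mirroring rule
train = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]
print(score(train))  # 1.0 when the hypothesized rule matches every pair
```

The exact-match requirement is what makes ARC hard for pattern matchers: a model cannot get partial credit for an almost-right grid, so it must actually infer the rule rather than interpolate from similar training data.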

ARC remains a steep challenge for machines. Humans regularly score above 90%, but leading LLMs such as OpenAI's o1-preview and Anthropic's Claude 3.5 Sonnet hover around 21%. Even GPT-4o, which reached 50% through an extensive trial-and-error approach, stayed far from the 85% threshold required to claim the $600,000 prize. This gap reassures researchers that, despite the hype, no AI has yet matched the flexible reasoning humans demonstrate with ease.

However, ARC is not the only experiment in play. The Humanity’s Last Exam project deliberately keeps its winning questions offline to avoid AI models “studying” the answers, a move that reflects how high the stakes are for ensuring a test’s integrity in a world where machines might pre-learn anything published.

What happens when we approach superintelligence?

The urgency to develop robust tests goes beyond academic curiosity. The arrival of human-level AI—let alone superintelligence—would bring profound safety, ethical, and governance challenges. Detecting that moment requires tools capable of identifying not just task mastery, but deep reasoning, adaptability, and value alignment with human norms.

Yet, as history shows, each time AI reaches parity in a narrow domain, the goalposts shift. What once counted as “intelligent” becomes routine automation. This means that measuring superintelligence will likely demand frameworks that we cannot fully conceive yet—ones that might integrate philosophical inquiry, social sciences, and long-term behavioral observation.

For now, the race is on to design the next generation of tests before AI’s knowledge eclipses our ability to measure it. And when machines finally pass “Humanity’s Last Exam,” the next question might be even more daunting: how do we examine something smarter than ourselves?
