AI’s elusive exam: The quest to measure human-level intelligence in machines

Artificial intelligence has dazzled with feats from mastering chess to drafting legal briefs, yet one puzzle remains unsolved: how to tell if it truly thinks like us. As large language models consume more of the world’s knowledge, their ability to ace traditional tests might be less a sign of brilliance than a reflection of perfect recall. The challenge is not just inventing harder exams—it’s redefining what “intelligence” means in an era when machines may soon read everything we’ve ever written.

The race to design the ultimate test

Two of San Francisco’s leading AI players—Scale AI and the Center for AI Safety—have thrown down a high-stakes challenge. Their initiative, Humanity’s Last Exam, invites the public to craft questions capable of probing the limits of large language models like Google Gemini and OpenAI’s o1. The reward is enticing: US$5,000 for each of the 50 best questions selected, with the promise of contributing to what could become the definitive benchmark for “expert-level” AI systems.


The push for new tests arises because current benchmarks may be misleading. Many language models now ace exams in mathematics, law, and reasoning, yet much of this success could stem from having already “seen” the answers during training. As these models are fed ever-expanding datasets—approaching, by some estimates, the sum total of human-written text by 2028—traditional assessments risk becoming obsolete. In this landscape, crafting questions that a model cannot anticipate becomes a formidable challenge.

The problem of pre-learned intelligence

In AI, the leap from conventional programming to machine learning was driven by data—not just vast in scale, but carefully structured for both training and testing. Developers rely on “test datasets” that, in theory, are untouched by the model during learning. But when a system has effectively consumed the internet, finding genuinely novel material is no simple feat.
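The idea of a held-out test set can be sketched in a few lines. This is a minimal illustration of the principle described above, not any particular lab's pipeline: the data is partitioned once, and the test partition is never shown to the model during training.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Partition records into disjoint training and held-out test sets.

    The test set stays untouched during training, so evaluation measures
    generalisation rather than recall of memorised answers.
    """
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = records[:]              # copy so the caller's list is unchanged
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]

train, test = train_test_split(list(range(100)))
# The two sets share no records: nothing in `test` was "seen" in training.
assert not set(train) & set(test)
```

The guarantee breaks down exactly as the article describes: once a model's training corpus approaches "everything", no such disjoint partition of human-written text remains.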

Adding complexity is the looming threat of “model collapse,” where AI-generated content begins to dominate the web. If this synthetic material cycles back into training datasets, the quality of future models could deteriorate. To combat this, developers are turning to real-world interactions, harvesting fresh human input from tools like smart glasses, autonomous vehicles, and other sensor-rich devices. The hope is to keep AI’s understanding grounded in genuine, lived experience rather than an echo chamber of its own creations.

Why narrow success isn’t true intelligence

One of AI’s most dazzling feats—the chess dominance of Stockfish over Magnus Carlsen—illustrates a core problem in measuring machine intelligence. Stockfish is unmatched in its domain, yet utterly incapable of holding a conversation or recognising a human face. Its brilliance is narrow, not general. This mirrors a limitation in many AI benchmarks: they measure discrete skills without capturing adaptability or creativity.

To address this, Google engineer François Chollet introduced the Abstraction and Reasoning Corpus (ARC) in 2019. The ARC tests strip away massive datasets and instead present puzzles requiring the model to infer rules from minimal examples. Humans routinely score over 90% on ARC challenges. The best AI systems—such as GPT-4o or Anthropic’s Claude 3.5 Sonnet—still fall short, with scores ranging from 21% to 50%, often relying on brute-force solution generation rather than elegant reasoning.
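The flavour of ARC-style rule inference can be conveyed with a toy example. This is a deliberately simplified sketch, not the real ARC format or a real solver: given a few input/output grid pairs, a solver searches a tiny hand-picked space of candidate transformations for one consistent with every example.

```python
# Toy ARC-style few-shot rule inference: learn a grid transformation
# from a handful of examples, with no large training dataset.

def flip_horizontal(grid):
    return [row[::-1] for row in grid]

def flip_vertical(grid):
    return grid[::-1]

def rotate_90(grid):
    # Rotate the grid 90 degrees clockwise.
    return [list(row) for row in zip(*grid[::-1])]

# Hypothetical candidate space; real ARC solvers search far richer programs.
CANDIDATES = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "rotate_90": rotate_90,
}

def infer_rule(examples):
    """Return the name of a transformation matching all examples, or None."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in examples):
            return name
    return None

examples = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),  # each output mirrors its input
    ([[2, 3], [4, 5]], [[3, 2], [5, 4]]),
]
```

Here `infer_rule(examples)` identifies `"flip_horizontal"` as the only candidate consistent with both pairs; the second example rules out the rotation that happens to fit the first. Enumerating candidates like this is precisely the brute-force style the article contrasts with elegant reasoning.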

The road toward measuring superintelligence

ARC may be the most credible benchmark yet for testing general reasoning, but it is not the endgame. Initiatives like Humanity’s Last Exam suggest that the community is still searching for diverse, robust ways to assess intelligence without giving away the answers. In some cases, the most incisive questions may never be made public, ensuring that future AIs cannot “study” for them in advance.

This secrecy hints at the deeper stakes: the moment we can reliably detect human-level reasoning, we also edge closer to the harder question—how to recognise, and perhaps contain, superintelligence. Testing for such an entity will require more than clever puzzles. It will demand an understanding of intelligence that transcends human experience, forcing us to define capabilities we have never encountered before. Until then, the exam papers remain unwritten, and the race to craft them continues.
