AI’s intelligence test dilemma: why we still don’t know how to measure human-level thinking in machines

Despite impressive progress in artificial intelligence, the field is still missing a definitive way to measure whether a machine truly thinks like a human. From prize challenges to philosophical debates, researchers are wrestling with the limits of current benchmarks—and the daunting prospect of testing superintelligence.

The challenge of designing the ultimate AI test

Two major San Francisco AI players—Scale AI and the Center for AI Safety (CAIS)—recently launched “Humanity’s Last Exam,” a global call to devise questions capable of truly testing large language models (LLMs) such as Google Gemini and OpenAI’s o1. Offering $5,000 prizes for the top 50 questions, the challenge aims to probe how close we are to “expert-level AI systems.”

The problem is that today’s LLMs excel at many existing benchmarks in intelligence, mathematics, and law—sometimes to a degree that raises suspicion. These models are trained on colossal amounts of data, possibly including the very questions used to assess them. As the AI analytics group Epoch predicts, by 2028 AIs may have effectively “read” everything humans have ever written, leaving almost no established test safe from contamination. That looming moment forces researchers to rethink what testing should look like once AIs know all the answers.
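One rough way to picture a contamination check is to measure how many word sequences from a test question already appear verbatim in a body of text a model might have been trained on. The sketch below is illustrative only and does not describe how Scale AI, CAIS, or Epoch work. The function names, the whitespace tokenization, and the default window of eight words are all assumptions made for the example.

```python
def ngrams(text, n=8):
    """Return the set of n-word sequences in text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question, corpus_docs, n=8):
    """Fraction of the question's n-grams that also occur somewhere in corpus_docs.

    Hypothetical helper: 1.0 means every n-gram of the question appears in the
    corpus, 0.0 means none do (or the question is shorter than n words).
    """
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(question_grams & corpus_grams) / len(question_grams)
```

A high ratio would only suggest that a question may have leaked. Paraphrased or translated copies would slip past an exact-match check like this one.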

Complicating matters further is “model collapse,” a phenomenon in which AI-generated content spreads across the internet and recirculates into training data, degrading model quality. Some companies are countering this by collecting fresh data from human interactions or real-world devices, from Tesla’s sensor-rich vehicles to Meta’s smart glasses, to ensure diversity in both training and evaluation datasets.

Why current intelligence benchmarks fall short

Defining intelligence has always been contentious. Human IQ tests have been criticized for capturing only narrow cognitive skills, and AI assessments suffer from a similar problem. Many benchmarks—summarizing text, interpreting gestures, recognizing images—test specific tasks but fail to measure adaptability or reasoning across domains.

A striking example is the chess engine Stockfish. It surpasses world champion Magnus Carlsen by a wide Elo margin, yet cannot interpret a sentence or navigate a room. This illustrates why task-specific dominance cannot be equated with general intelligence, and why AI researchers are searching for more holistic measures.
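To give a sense of what a wide Elo margin means, the standard Elo formula converts a rating gap into an expected score per game (1 for a win, 0.5 for a draw, 0 for a loss). The sketch below uses made-up ratings purely for illustration. The article does not cite figures, and engine and human ratings come from separate rating pools, so any direct comparison is only approximate.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Illustrative ratings only, assumed for this example rather than taken from the article.
engine_rating = 3500
human_rating = 2850

# A 650-point gap gives the higher-rated side an expected score of roughly 0.98.
print(round(expected_score(engine_rating, human_rating), 2))
```

However large that number gets, it only describes performance at chess, which is the point the Stockfish example makes.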

As AI systems now exhibit broader competencies, the challenge is to create tests that capture this expansion. Benchmarks must go beyond scoring isolated abilities to gauge how well a model can generalize knowledge and adapt to new, unseen problems—skills considered the hallmark of human intelligence.

The rise of reasoning-based evaluation

French Google engineer François Chollet’s Abstraction and Reasoning Corpus (ARC) is one of the most ambitious attempts to move testing in this direction. Launched in 2019, ARC presents AIs with visual puzzles that require inferring and applying abstract rules from only a handful of examples, a far cry from the brute-force pattern recognition of traditional image classification.

ARC remains a steep challenge for machines. Humans regularly score above 90%, but leading LLMs like OpenAI’s o1-preview and Anthropic’s Claude 3.5 Sonnet hover around 21%. Even GPT-4o, which reached 50% through an extensive trial-and-error approach, fell well short of the 85% threshold for the $600,000 prize. This gap reassures researchers that, despite the hype, no AI has yet matched the flexible reasoning humans demonstrate with ease.

However, ARC is not the only experiment in play. The Humanity’s Last Exam project deliberately keeps its winning questions offline to avoid AI models “studying” the answers, a move that reflects how high the stakes are for ensuring a test’s integrity in a world where machines might pre-learn anything published.

What happens when we approach superintelligence?

The urgency to develop robust tests goes beyond academic curiosity. The arrival of human-level AI—let alone superintelligence—would bring profound safety, ethical, and governance challenges. Detecting that moment requires tools capable of identifying not just task mastery, but deep reasoning, adaptability, and value alignment with human norms.

Yet, as history shows, each time AI reaches parity in a narrow domain, the goalposts shift. What once counted as “intelligent” becomes routine automation. This means that measuring superintelligence will likely demand frameworks that we cannot fully conceive yet—ones that might integrate philosophical inquiry, social sciences, and long-term behavioral observation.

For now, the race is on to design the next generation of tests before AI’s knowledge eclipses our ability to measure it. And when machines finally pass “Humanity’s Last Exam,” the next question might be even more daunting: how do we examine something smarter than ourselves?
