Artificial intelligence could score full marks on one of the world's most challenging knowledge tests, Humanity's Last Exam (HLE), within months, according to its developers. The test was created by technology leaders to rigorously evaluate the intelligence of their systems and features 2,500 carefully selected questions spanning roughly one hundred topics, from rocket science and mythology to physiology.
The Formidable Challenge of Humanity's Last Exam
Each question on the HLE demands at least PhD-level understanding, and a person scoring anywhere near 100 percent would qualify as a 'universal expert'. Just two years ago, OpenAI's ChatGPT managed only 3 percent on the exam, with rivals from Google and Anthropic faring only marginally better. At first, the test eased concerns about AI's growing dominance, as researchers pointed to a 'marked gap' between large language models (LLMs) and the world's leading academics.
Rapid Progress in AI Performance
However, the seemingly insurmountable HLE may soon become another milestone in AI's rapid ascent. Google's Gemini system scored 45.9 percent on the exam last month, up from 18.8 percent only months earlier on its first attempt. Calvin Zhang, the research lead at Scale, the AI company behind HLE, predicts that full marks are imminent. He explained, 'We aimed to create this close-ended academic benchmark, set to the frontier of expert humans, that only a handful of people on earth can truly solve.'
Zhang added, 'We've witnessed insane progress on these language models over the past few years. It's impressive; model builders have done a phenomenal job at enhancing these reasoning models.' Kate Olszewska, a product manager at Google DeepMind, agreed: 'If we genuinely prioritized this as the sole focus, I believe we could reach it quite rapidly.' Meanwhile, Anthropic, the company behind the Claude AI system, has reached 34.2 percent on the HLE and is improving quickly.
Significance of a Perfect Score
An AI achieving 100 percent on the exam would represent a monumental development, given that the test is 'designed to be the final closed-ended academic benchmark of its kind', as per its authors. This implies that if the technology conquers the HLE, future evaluations will necessitate questions that no human knows the answer to, pushing AI beyond existing human knowledge boundaries.
Creation and Curation of the Exam
The HLE was developed by researchers at Scale and the Center for AI Safety, a non-profit organization, to assess both the breadth of knowledge and depth of reasoning in AI systems. In response to a global appeal in September 2024, which offered a $500,000 prize pool, experts from around 50 countries submitted 70,000 questions for consideration. These questions had to require short, unambiguous answers and be difficult to locate on the internet.
The initial list was refined to 13,000 after eliminating questions that any existing model could answer. Ultimately, 2,500 questions were chosen, though some have since been removed or edited based on user feedback. These questions demand a wide array of expertise, from biology to language proficiency, and a significant number remain confidential to prevent systems from gaining an advantage through online discussions of answers.
Historical Context and Future Implications
Success on the HLE would echo IBM's supercomputer Deep Blue defeating world chess champion Garry Kasparov in 1997, defying most expert predictions. Since then, many major AI benchmarks have fallen, including the multi-disciplinary Massive Multitask Language Understanding benchmark, released in 2020, which was retired after systems found it too easy, routinely scoring above 90 percent.
As AI nears mastery of human-created tests, pushing beyond the current limits of human knowledge has increasingly become a primary focus for developers, as Ms Olszewska noted. Nevertheless, Zhang maintains there will always be room for human specialization, with physical fields such as surgery, and decision-based skills such as judgment and creativity, posing greater challenges for AI to master.