
Challenging the Limits of AI: Humanity’s Last Exam

Published December 27, 2024

In an intriguing move, two prominent organizations in artificial intelligence, Scale AI and the Center for AI Safety (CAIS), are inviting the public to submit questions capable of genuinely testing the capabilities of large language models (LLMs) such as Google Gemini and OpenAI’s o1. The initiative, dubbed Humanity’s Last Exam, aims to attract broad participation in order to map the limits and potential of these advanced AI systems.

To add an incentive, prizes of US$5,000 (£3,800) will be awarded for each of the top 50 questions selected for the exam. According to Scale and CAIS, the goal is to gauge how close we are to achieving “expert-level AI systems” by drawing on insights from the widest range of experts ever assembled.

The motivation for the initiative is straightforward. Major LLMs already excel at many established intelligence tests, including ones focused on mathematics and law. However, it is hard to know how meaningful those results are: because LLMs are trained on vast amounts of publicly available text from the internet, their training data may well already contain the answers to many of these tests.

Data plays a crucial role in the transition from traditional computing methods to artificial intelligence. It allows machines to learn by observing rather than just following explicit commands. This shift necessitates not only quality training datasets but also effective testing methods. Developers commonly evaluate AI via datasets not included in their training, known as “test datasets.”
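
To make the idea concrete, the short Python sketch below holds back part of a dataset as a test set. The example data, the "model" callable, and the accuracy measure are hypothetical stand-ins for illustration, not anything a real LLM developer uses.

```python
import random

# Hypothetical labeled examples: (question, answer) pairs.
examples = [(f"question {i}", f"answer {i}") for i in range(1000)]

# Shuffle, then hold back 20% as a test set the model never trains on.
random.seed(0)
random.shuffle(examples)
split = int(0.8 * len(examples))
train_set, test_set = examples[:split], examples[split:]

def accuracy(model, held_out):
    """Fraction of held-out questions a model answers correctly.

    `model` is any callable mapping a question string to an answer string.
    """
    correct = sum(model(question) == answer for question, answer in held_out)
    return correct / len(held_out)
```

Because the test set is excluded from training, a high score is evidence the model has learned something general rather than memorized the answers.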

Even if today’s LLMs have not yet effectively pre-learned the answers to standard assessments such as bar exams, they will probably be able to soon. By some predictions, AI systems will have processed nearly everything humans have ever written by 2028. An equally pressing question is how to keep evaluating AI systems once that milestone has been passed.

The internet is expanding by millions of new pages a day, which could in principle supply a steady stream of fresh material for training and testing. But a growing share of that new content is itself generated by AI, and training future models on AI-produced text risks a troubling phenomenon known as “model collapse”, in which each generation of models performs worse than the last. To counter this, many developers are already collecting data from humans’ real interactions with AIs to use for ongoing training and testing.
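
As a rough illustration of the feedback loop involved (not of any real training pipeline), the toy Python simulation below treats a "model" as nothing more than a normal distribution and refits it, generation after generation, only to data sampled from the previous generation's fit.

```python
import random
import statistics

# Toy illustration of recursive training: each "generation" is a model
# (here just a normal distribution) fitted only to samples drawn from the
# previous generation's model, never to the original human data again.
rng = random.Random(42)
mu, sigma = 0.0, 1.0  # the original data distribution

for generation in range(10):
    synthetic = [rng.gauss(mu, sigma) for _ in range(50)]  # model-generated data
    mu, sigma = statistics.mean(synthetic), statistics.stdev(synthetic)
    print(f"generation {generation}: mean={mu:+.3f}, stdev={sigma:.3f}")

# Because no generation ever sees the original data, estimation errors
# compound and the fitted distribution gradually drifts away from the one
# it started with.
```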

Beyond Basic Testing

Even with a constant supply of fresh data, defining and measuring intelligence remains a puzzle, especially for artificial general intelligence (AGI): AI that can match or surpass human intellect across a broad range of tasks.

Traditional human IQ tests have long been criticized for failing to capture the many facets of intelligence, such as language proficiency, mathematical skill, empathy, and spatial awareness. The tests applied to AIs suffer from an analogous problem. There are well-established benchmarks for tasks such as text summarization and language understanding, but these tend to measure narrow abilities and can quickly become outdated. For example, the chess engine Stockfish plays far better than Magnus Carlsen, the top-rated human player, yet it can do nothing outside chess; mastery of one skill does not equate to general intelligence.

As AI systems begin to exhibit broader intelligent behavior, there is an urgent need for new benchmarks to track their progress. One notable approach comes from François Chollet, a French engineer at Google. He proposed the “abstraction and reasoning corpus” (ARC), a collection of puzzles built from simple visual grids and designed to test an AI’s ability to infer and apply abstract rules.

Unlike benchmarks that test visual recognition by training models on millions of labeled examples, ARC gives the solver only a handful of demonstrations per puzzle and requires it to work out the underlying rule, rather than letting it memorize every possible answer. The puzzles are relatively easy for humans to solve, yet a US$600,000 prize awaits the first AI system to reach a score of 85%. At the time of writing, leading models such as OpenAI’s o1-preview and Anthropic’s Claude 3.5 Sonnet score around 21% on the public leaderboard, known as ARC-AGI-Pub.
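
ARC tasks are distributed as small JSON files containing a few “train” input/output grid pairs and a “test” input whose output must be reproduced exactly. The toy task and the mirroring rule in the Python sketch below are invented purely for illustration; real ARC puzzles are far less obvious.

```python
# A toy task in the ARC JSON layout: "train" holds input/output grid pairs,
# and "test" holds inputs whose outputs must be reproduced exactly.
# The grids and the rule (mirror each row left to right) are invented here
# for illustration only.
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[4, 5, 6]], "output": [[6, 5, 4]]},
    ],
    "test": [{"input": [[7, 8], [9, 0]]}],
}

def candidate_rule(grid):
    """One guessed transformation: mirror every row."""
    return [row[::-1] for row in grid]

# A guess only counts if it reproduces every training output exactly.
if all(candidate_rule(p["input"]) == p["output"] for p in task["train"]):
    print(candidate_rule(task["test"][0]["input"]))  # [[8, 7], [0, 9]]
```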

A separate attempt using OpenAI’s GPT-4o reached 50%, though somewhat controversially: the approach generated a large number of candidate solutions and then selected the one that best matched each task. Even so, that remains far short of the 85% prize threshold and of human performance, which exceeds 90%.
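
The general shape of that “generate many candidates, keep the one that fits” strategy can be sketched as follows. The "sample_program" callable and the task layout (the same toy format as above) are assumptions for illustration, not the method actually used with GPT-4o.

```python
import random

def solve_by_sampling(task, sample_program, n_candidates=1000, seed=0):
    """Best-of-N sketch: sample many candidate programs, keep only those
    that reproduce every training pair, then apply one to the test input.

    `task` follows the grid layout shown earlier; `sample_program` is a
    hypothetical callable returning a candidate grid-to-grid function,
    for instance code proposed by a language model.
    """
    rng = random.Random(seed)
    survivors = []
    for _ in range(n_candidates):
        program = sample_program(rng)
        if all(program(p["input"]) == p["output"] for p in task["train"]):
            survivors.append(program)
    if not survivors:
        return None  # no sampled program explained the training pairs
    # With several survivors one could vote on the most common prediction;
    # here we simply use the first program that fits.
    return survivors[0](task["test"][0]["input"])
```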

While ARC remains a leading method for evaluating true intelligence in artificial systems, the collaborative effort between Scale and CAIS illustrates that the search for effective alternatives is ongoing. Notably, some selected prize-winning questions may never be available online, ensuring that AIs cannot access the exam content in advance.

Understanding when machines approach human-level reasoning raises numerous safety, ethical, and moral concerns. As we delve deeper into the AI landscape, we are inevitably confronted with an even more daunting question: how do we evaluate the emergence of superintelligence? This task presents a profound intellectual challenge that merits urgent attention.
