If you’re looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that A.I. systems can’t pass.

For years, A.I. systems were measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging, S.A.T.-caliber problems in areas like math, science and logic. Comparing the models’ scores over time served as a rough measure of A.I. progress.

But A.I. systems eventually got too good at those tests, so new, harder tests were created, often with the types of questions graduate students might encounter on their exams.

Those tests aren’t in good shape, either. New models from companies like OpenAI, Google and Anthropic have been getting high scores on many Ph.D.-level challenges, limiting those tests’ usefulness and leading to a chilling question: Are A.I. systems getting too smart for us to measure?

This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: a new evaluation, called “Humanity’s Last Exam,” that they claim is the hardest test ever administered to A.I. systems.
Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known A.I. safety researcher and director of the Center for AI Safety.
(The test’s original name, “Humanity’s Last Stand,” was discarded for being overly dramatic.)