Researchers introduce FrontierMath, a benchmark of hundreds of original, unpublished mathematics problems created and vetted by experienced mathematicians. Current state-of-the-art AI models solve fewer than 2% of the problems. This provides a rigorous testing environment for quantifying the progress of AI systems.
https://arstechnica.com/ai/2024/11/new-secret-math-benchmark-stumps-ai-models-and-phds-alike/
Seed Statement:
Research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems, of which leading AI models solve less than 2 percent, according to Epoch AI. The benchmark tests AI language models (such as GPT-4o, which powers ChatGPT) against original mathematics problems that typically require hours or days for specialist mathematicians to complete.
The design of FrontierMath differs from many existing AI benchmarks because the problem set remains private and unpublished to prevent data contamination.
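To make the contamination-avoidance idea concrete, here is a minimal sketch of how a held-out benchmark evaluation loop might work: the problems live only on the evaluator's side, the model is queried per problem, and only the aggregate score is ever published. This is an illustration under stated assumptions, not Epoch AI's actual harness; the helper names (load_private_problems, query_model) are hypothetical.

    # Minimal sketch of a held-out benchmark evaluation loop (Python).
    # Assumptions: each problem has a machine-checkable answer, and the
    # model under test is reached through some API. All names here are
    # hypothetical placeholders, not Epoch AI's real pipeline.

    from fractions import Fraction

    def load_private_problems():
        # Placeholder: in a real setup these would come from private
        # storage and would never be published or leak into training data.
        return [
            {"statement": "Compute 1/2 + 1/3.", "answer": Fraction(5, 6)},
        ]

    def query_model(statement: str) -> str:
        # Placeholder for an API call to the model being evaluated.
        return "5/6"

    def solve_rate() -> float:
        problems = load_private_problems()
        solved = 0
        for p in problems:
            try:
                candidate = Fraction(query_model(p["statement"]))
            except (ValueError, ZeroDivisionError):
                continue  # unparseable model output counts as a miss
            if candidate == p["answer"]:
                solved += 1
        return solved / len(problems)

    if __name__ == "__main__":
        # Only the aggregate score is reported; the problem set stays private.
        print(f"solve rate: {solve_rate():.1%}")

Keeping the problems server-side like this is what lets a benchmark report a meaningful score even after models have been trained on most of the public internet.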
I looked through the examples. I am a CS major. I couldn’t even understand what half of them were asking for. They are DIFFICULT.
We need this for every field to kill the myth of “AI now scores 80% in physics/chemistry/biology blablabla…”, which so far has really just meant “we fed it all the answers, and just like Wikipedia it can rehash them”. Benchmarks like this, combined with the ARC challenge, will force innovation, as they expose Silicon Valley hype for the cash grab that it is.