The discrepancy between OpenAI's claims and Epoch AI's findings has sparked concerns about OpenAI's transparency and model-testing practices.

In December, with the launch of its o3 model, OpenAI claimed the model could answer over a fourth of the questions on FrontierMath. It has since emerged that this figure was an upper bound, achieved by a version of o3 with more computing power behind it than the model the company publicly launched last week.

On Friday, Epoch AI, the research institute behind FrontierMath, released the results of its independent benchmark tests of o3. It found that o3 scored around 10%, well below OpenAI's highest claimed score.

This stark gap between OpenAI's claims and Epoch's results raised questions about the leading AI company's credibility. To temper the speculation, however, Epoch wrote:

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private).”

OpenAI has not yet issued a formal response to these concerns, but the credibility of its testing practices has been called into question. The episode may continue to weigh on the company until it provides a clear explanation for the discrepancy.