OpenAI’s o3 artificial intelligence (AI) model, which was released last week, is underperforming on a specific benchmark. Epoch AI, the company behind the FrontierMath benchmark, highlighted that the publicly available version of the o3 AI model scored 10 percent on the test, far below the figure OpenAI claimed at launch. The San Francisco-based AI firm’s chief research officer, Mark Chen, had said that the model scored 25 percent on the test, setting a new record. However, the discrepancy does not mean that OpenAI lied about the metric.
OpenAI’s o3 AI Model Scores 10 Percent on FrontierMath
In December 2024, OpenAI held a livestream on YouTube and other social media platforms announcing the o3 AI model. At the time, the company highlighted the large language model’s (LLM) improved capabilities, in particular its performance on reasoning-based queries.
One of the ways the company substantiated the claim was by sharing the model’s benchmark scores across several popular tests. One of these was FrontierMath, created by Epoch AI. The mathematical test is known for being challenging and tamper-proof: more than 70 mathematicians developed it, and its problems are all new and unpublished. Notably, until December, no AI model had solved more than nine percent of the questions in a single attempt.
However, at launch, Chen claimed that o3 had set a new record by scoring 25 percent on the test. External verification of the performance was not possible at the time, as the model was not publicly available. After o3 and o4-mini were released last week, Epoch AI posted on X (formerly known as Twitter) that the o3 model had, in fact, scored 10 percent on the test.
While a score of 10 percent still makes the AI model the highest-ranking on the test, the figure is less than half of what the company claimed. The post has prompted several AI enthusiasts to question the validity of published benchmark scores.
The discrepancy does not mean that OpenAI lied about the performance of its AI model. Instead, the unreleased model the firm tested internally likely used higher compute to achieve that score, while the commercial version was likely fine-tuned to be more power efficient, a process that toned down some of its performance.
Separately, ARC Prize, the organisation behind the ARC-AGI benchmark, which tests an AI model’s general intelligence, also posted on X about the discrepancy. The post stated, “The released o3 is a different model from what we tested in December 2024.” The organisation claimed that the released o3 model’s compute tiers are smaller than those of the version it tested. However, it did confirm that o3 was not trained on ARC-AGI data, even at the pre-training stage.
ARC Prize said that it will re-test the released o3 AI model and publish the updated results. It will also re-test the o4-mini model and label the prior scores as “preview”. It is not yet certain whether the released version of o3 will underperform on this test as well.