MMJ News Network
Tech

Did xAI lie about Grok 3’s benchmarks?

By Editor, March 22, 2025


Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babuschkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What is cons@64? It is short for “consensus@64”: the model gets 64 tries at each problem in a benchmark, and the answer it generates most frequently is taken as its final answer. Because cons@64 tends to boost models’ benchmark scores quite a bit, omitting it from a graph can make it appear as though one model surpasses another when in reality that isn’t the case.
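To make the difference concrete, here is a minimal Python sketch of the two scoring modes, under stated assumptions. The function names, the toy answers, and the use of 5 samples instead of 64 are all hypothetical and chosen for illustration; this is not xAI's or OpenAI's evaluation code, and real harnesses may compute "@1" differently (for example, by averaging over many single attempts).

```python
from collections import Counter
from typing import Sequence


def consensus_answer(samples: Sequence[str]) -> str:
    """Return the answer that appears most often among the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    # most_common(1) yields [(answer, count)]; ties go to the answer seen first.
    return Counter(samples).most_common(1)[0][0]


def score_at_1(attempts: Sequence[Sequence[str]], references: Sequence[str]) -> float:
    """Fraction of problems whose FIRST sampled answer matches the reference."""
    correct = sum(a[0] == ref for a, ref in zip(attempts, references))
    return correct / len(references)


def score_cons_at_k(
    attempts: Sequence[Sequence[str]], references: Sequence[str], k: int = 64
) -> float:
    """Fraction of problems whose majority answer over the first k samples is correct."""
    correct = sum(
        consensus_answer(a[:k]) == ref for a, ref in zip(attempts, references)
    )
    return correct / len(references)


# Hypothetical toy data: 4 problems, 5 sampled answers each (not real AIME results).
REFERENCES = ["42", "7", "204", "5"]
ATTEMPTS = [
    ["70", "42", "42", "42", "13"],      # first try wrong, majority right
    ["7", "7", "7", "9", "7"],           # right either way
    ["210", "204", "204", "198", "204"], # first try wrong, majority right
    ["6", "6", "5", "6", "8"],           # wrong either way
]

print(score_at_1(ATTEMPTS, REFERENCES))            # 0.25
print(score_cons_at_k(ATTEMPTS, REFERENCES, k=5))  # 0.75
```

On this made-up data the same model scores 25% at "@1" and 75% with consensus voting, which is why comparing one model's consensus score against another model's single-attempt score is not like-for-like.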

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” (meaning the first score the models got on the benchmark) fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails slightly behind OpenAI’s o1 model set to “medium” compute. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past — albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:

Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda
(I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic

— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations — and their strengths.




