Benchmark LLM Models - Search News

MLCommons releases new AILuminate benchmark for measuring AI model safety

MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.

LLM Consensus Matches or Outperforms the Best AI Models in Expert Evaluation Without Performance Degradation

Claude Opus 4.6 and Gemini 3.1 Pro across 100 expert-level questions in finance, law, medicine and technology, with no ...

PrismML Introduces The First Commercially Viable 1-Bit LLM

A Caltech Lab at PrismML Just Fit an 8 Billion Parameter AI Model Into 1.15 GB. Announcing a Breakthrough in AI Compression: ...

Are We Overestimating AI’s Abilities? New Study Questions How Models Are Tested

According to the study, current testing of AI models and LLMs works by assigning scores to their results. These results ...

Microsoft's new Harrier models top benchmarks, outperforming Google's Gemini Embedding 2

With the surprise release of their new "Harrier" family of embedding models, Microsoft is making a massive play for the ...

PrismML debuts energy-sipping 1-bit LLM in bid to free AI from the cloud

PrismML's approach is based on work done by Caltech electrical engineering professor Babak Hassibi and colleagues. The ...

VentureBeat

Beyond generic benchmarks: How Yourbench lets enterprises evaluate AI models against actual data

Every AI model release inevitably includes charts touting how it outperformed its competitors in this benchmark test or that evaluation matrix. However, these benchmarks often test for general ...

Qehwa AI: Pakistani Developer Creates World’s First Pashto AI LLM and Chatbot

A new large language model, Qehwa, has been developed by Junaid Ahmed, in a solo effort, to serve more than 60 million Pashto ...

InfoQ

Hugging Face Upgrades Open LLM Leaderboard v2 for Enhanced AI Model Comparison

IEEE Spectrum on MSN

Why are large language models so terrible at video games?

AI models code simple games, but struggle to play them ...
