New AI benchmarks could help developers reduce bias in AI models, potentially making them fairer and less likely to cause ...
The researchers said the industry has lacked a sufficient method of evaluating honesty in AI models until now. "Many benchmarks claiming to measure honesty in fact simply measure accuracy -- the ...
Here at TC, we often reluctantly report benchmark figures because they're one of the few (relatively) standardized ways the AI industry measures model improvements. Popular AI benchmarks tend to ...
With the growth of AI agents expected to continue in 2025, specialized benchmarks will follow. AI agents are autonomous systems capable of interpreting their surroundings, making informed decisions ...
Patronus AI launches the first multimodal LLM-as-a-Judge for evaluating AI systems that process images, with Etsy already implementing the technology to validate product image captions across its ...
Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher. It wasn't quite the same version of Super Mario Bros. as the original 1985 ...
Today we're launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork ...
Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view. This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing ...
Chinese technology giant Baidu released two new artificial intelligence (AI) models ... which cover images, audio and video, outperformed OpenAI's GPT-4o on several benchmark platforms including ...