When it comes to real-world evaluation, appropriate benchmarks need to be carefully selected to match the context of AI ...
Researchers behind the MASK benchmark found that more knowledge doesn't mean more 'moral virtue.' See which model lies the ...
To measure the success of their work, companies cite industry-standard benchmark tests whenever they release a new model. The ...
Anthropic used Pokémon to benchmark its newest AI model. Yes, really. In a blog post published Monday, Anthropic said that it tested its latest model, Claude 3.7 Sonnet, on the Game Boy classic ...
Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.
Contextual AI launches its Grounded Language Model (GLM) that achieves 88% factual accuracy, outperforming major competitors while minimizing hallucinations for enterprise applications.