MMLU-Pro holds steady at 85.0, AIME 2025 slightly improves to 89.3, while GPQA-Diamond dips from 80.7 to 79.9. Coding and agent benchmarks tell a similar story, with Codeforces ratings rising from ...
Anthropic's Claude Sonnet 4.5 now scores 77% on a key software engineering benchmark and can work autonomously for over 30 ...
The IB SA Exam 2025, held on 29 September, saw massive participation. Candidates faced questions from English, ...
Meta has released Code World Model (CWM), a 32-billion-parameter AI model for researchers that simulates code execution to ...