Every week a new model drops with a blog post claiming state-of-the-art results on some benchmark. But if you look at the full picture across all evaluations, no model wins everything. I spent months pulling data from different sources: one site for MMLU scores, another for pricing, another for context windows.

The data was scattered, inconsistent, and often outdated by the time I compiled it. What Actua