I've been building a memory layer for AI agents (MnemoPay) and LongMemEval is the public benchmark I've been beating my head against for the last two weeks. Started at 62-64% (Sonnet-4 answerer, GPT-4o judge). Ended today at 82.8%.
Here's what actually moved the number and what didn't.

Scoreboard

500-question oracle variant, GPT-4o as judge.

Run | Overall | Notes
Baseline | 62-64% | Sonnet-4 answ