
How I took LongMemEval oracle from 62% to 82.8% without touching the retriever


t49qnsx7qt-kpanks·Dev.to·2h ago · Tuesday, April 21, 2026·1 min read

I've been building a memory layer for AI agents (MnemoPay) and LongMemEval is the public benchmark I've been beating my head against for the last two weeks. Started at 62-64% (Sonnet-4 answerer, GPT-4o judge). Ended today at 82.8%.

Here's what actually moved the number and what didn't.

Scoreboard

500-question oracle variant, GPT-4o as judge.

Run        Overall   Notes
Baseline   62-64%    Sonnet-4 answ


This article was sourced from Dev.to's RSS feed. Visit the original for the complete story.
