The thing that bugs me about most "agent memory" benchmarks is that they test retrieval on short histories. Your agent handles 10 sessions, you confirm it remembers your name, you ship. LongMemEval tests something harder: given a long multi-session history with conflicting updates and temporal gaps, does the agent retrieve the right fact at the right time? 500 questions, judged by GPT-4o against oracle