The Xiaomi MiMo team has publicly released two new models: MiMo-V2.5-Pro and MiMo-V2.5. The benchmarks, combined with some genuinely striking real-world task demos, make a compelling case that open agentic AI is catching up to the frontier faster than most expected. Both models are available immediately via API and are priced competitively.
What is an Agentic Model, and Why Does It Matter?

Most LLM benchmarks test a model’s ability to answer a single, self-contained question. Agentic benchmarks test something much harder: whether a model can complete a multi-step goal autonomously, using tools (web search, code execution, file I/O, API calls) over many turns, without losing track of the original objective.
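That loop, in which a model repeatedly decides between calling a tool and emitting a final answer, can be sketched in a few lines. Everything below (the tool names, the message format, the scripted stand-in policy) is illustrative and assumed for the sketch, not MiMo's actual harness:

```python
# Minimal sketch of an agentic tool loop. The tool registry and the
# scripted "model" are hypothetical stand-ins, not MiMo's real harness.

TOOLS = {
    "search": lambda q: f"results for {q!r}",
    "run_code": lambda src: str(eval(src)),  # toy executor; never eval untrusted input
}

def agent_loop(model, goal, max_turns=10):
    """Drive the model until it emits a final answer or the turn budget runs out."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        action = model(history)              # model decides the next step
        if action["type"] == "final":
            return action["content"]
        tool = TOOLS[action["tool"]]
        observation = tool(action["input"])  # execute the tool call
        history.append({"role": "tool", "content": observation})
    return None  # budget exhausted without reaching the objective

# A scripted stand-in for a real model: compute 2 + 3 via run_code, then answer.
def scripted_model(history):
    if not any(m["role"] == "tool" for m in history):
        return {"type": "tool", "tool": "run_code", "input": "2 + 3"}
    return {"type": "final", "content": history[-1]["content"]}

print(agent_loop(scripted_model, "What is 2 + 3?"))  # prints 5
```

The hard part a benchmark measures is not this scaffolding but whether the model's decisions inside the loop stay aimed at the original goal over hundreds of turns.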
Think of it as the difference between a model that can answer “how do I write a lexer?” and one that can actually write a complete compiler, run tests against it, catch regressions, and fix them, all without a human in the loop. The latter is exactly what the Xiaomi MiMo team is demonstrating here.

MiMo-V2.5-Pro: The Flagship

MiMo-V2.5-Pro is Xiaomi’s most capable model to date, delivering significant improvements over its predecessor, MiMo-V2-Pro, in general agentic capabilities, complex software engineering, and long-horizon tasks.
The key benchmark numbers are competitive with top closed-source models: SWE-bench Pro 57.2, Claw-Eval 63.8, and τ3-Bench 72.9, placing it alongside Claude Opus 4.6 and GPT-5.4 across most evaluations. V2.5-Pro can sustain complex, long-horizon tasks spanning more than a thousand tool calls. It shows substantial improvements in instruction following within agentic scenarios, reliably adhering to subtle requirements embedded in context and maintaining strong coherence across ultra-long contexts. One behavioral property that distinguishes V2.5-Pro from earlier models is what the Xiaomi MiMo team calls “harness awareness”: it makes full use of the affordances of its harness environment, manages its memory, and shapes how its own context is populated toward the final objective.
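One concrete form of that memory management is context compaction: when the transcript nears a token budget, old turns are folded into a summary while the original goal is kept verbatim. The sketch below is a toy illustration of that idea under assumed message shapes, not MiMo's actual mechanism:

```python
# Illustrative context-compaction sketch (assumed, not MiMo's real logic):
# watch the context size and fold old turns into one summary note so the
# final objective survives ultra-long tasks.

def approx_tokens(messages):
    # crude stand-in for a real tokenizer: roughly 1 token per 4 characters
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages, budget, keep_recent=2):
    """If over budget, fold middle turns into a single summary message."""
    if approx_tokens(messages) <= budget:
        return messages
    goal, middle, recent = messages[0], messages[1:-keep_recent], messages[-keep_recent:]
    summary = {"role": "memory",
               "content": f"[summary of {len(middle)} earlier steps]"}
    return [goal, summary, *recent]  # the goal is always preserved verbatim

msgs = [{"role": "user", "content": "build the compiler"}] + \
       [{"role": "tool", "content": "x" * 200} for _ in range(10)]
msgs = compact(msgs, budget=100)
print(len(msgs))  # 4: goal + summary + 2 recent turns
```

The design point is the invariant, not the summarizer: whatever gets compressed, the original objective must stay in context word-for-word.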
This means the model doesn’t just execute instructions mechanically. It actively optimizes its own working environment to stay on track across very long tasks.

The three real-world task demos Xiaomi published illustrate exactly what “long-horizon agentic capability” means in practice.
Demo 1 — SysY Compiler in Rust: Adapted from Peking University’s Compiler Principles course project, this task asks the model to implement a complete SysY compiler in Rust from scratch: lexer, parser, AST, Koopa IR codegen, RISC-V assembly backend, and performance optimization. The reference project typically takes a PKU CS undergraduate several weeks. MiMo-V2.5-Pro finished in 4.3 hours across 672 tool calls, scoring a perfect 233/233 against the course’s hidden test suite.
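To make the first stage of that pipeline concrete: a lexer turns raw source text into a token stream for the parser. Here is a minimal tokenizer sketch for a SysY-like C subset (the token set is illustrative; the model's actual implementation is in Rust and far more complete):

```python
# Minimal regex-based tokenizer for a SysY-like C subset (illustrative only).
import re

TOKEN_SPEC = [
    ("INT",   r"\d+"),            # integer literal
    ("IDENT", r"[A-Za-z_]\w*"),   # identifier or keyword
    ("OP",    r"[+\-*/=;(){}]"),  # single-char operators and punctuation
    ("SKIP",  r"\s+"),            # whitespace, discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(src):
    """Yield (kind, text) pairs; raise on any unrecognized character."""
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if not m:
            raise SyntaxError(f"bad char at {pos}: {src[pos]!r}")
        pos = m.end()
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(lex("int x = 42;")))
# [('IDENT', 'int'), ('IDENT', 'x'), ('OP', '='), ('INT', '42'), ('OP', ';')]
```

Each subsequent stage (parser, AST, IR codegen, backend) consumes the previous stage's output, which is why the layer-by-layer build order described below is the natural one.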
What’s notable isn’t just the final score — it’s the architecture of execution. Rather than thrashing through trial and error, the model built the compiler layer by layer: scaffold the full pipeline first, perfect Koopa IR (110/110), then the RISC-V backend (103/103), then performance (20/20). The first compile alone passed 137/233 tests, a 59% cold start that suggests the architecture was designed correctly before a single test was run.
When a refactoring step later caused regressions, the model diagnosed the failures, recovered, and pushed on. This is structured, self-correcting engineering behavior — not pattern-matched code generation.

Demo 2 — Full-Featured Desktop Video Editor: With just a few simple prompts, MiMo-V2.5-Pro delivered a working desktop app: multi-track timeline, clip trimming, cross-fades, audio mixing, and export pipeline.
The final build is 8,192 lines of code, produced over 1,868 tool calls across 11.5 hours of autonomous work.

Demo 3 — Analog EDA (FVF-LDO Design): This is the most technically specialized demo: a graduate-level analog-circuit EDA task requiring the design and optimization of a complete FVF-LDO (Flipped-Voltage-Follower low-dropout regulator) from scratch in the TSMC 180nm CMOS process. The model had to size the power transistor, tune the compensation network, and pick bias voltages so that six metrics land within spec simultaneously — phase margin, line regulation, load regulation, quiescent current, PSRR, and transient response.
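The multi-metric constraint is what makes this hard: every parameter change moves several metrics at once, and the task only succeeds when all six clear their targets in the same design. A closed-loop tuning iteration of that shape can be sketched as follows; the `simulate` function below is a toy stand-in for an ngspice run, and the spec values and update rule are invented for illustration:

```python
# Sketch of a closed-loop spec-tuning pattern: simulate, check every metric
# against spec, nudge a parameter, repeat. `simulate` is a toy stand-in for
# an ngspice run; targets and the update step are illustrative, not real.

SPEC = {"phase_margin_deg": 60.0, "psrr_db": 50.0}  # hypothetical targets

def simulate(params):
    # stand-in: pretend both metrics improve linearly with the compensation cap
    c = params["comp_cap"]
    return {"phase_margin_deg": 30 + 10 * c, "psrr_db": 20 + 12 * c}

def in_spec(metrics):
    # every metric must land at once; one passing metric is not enough
    return all(metrics[k] >= SPEC[k] for k in SPEC)

def tune(params, max_iters=50):
    for i in range(max_iters):
        metrics = simulate(params)   # "call the simulator, read waveforms"
        if in_spec(metrics):
            return params, metrics, i
        params["comp_cap"] += 0.25   # "tweak parameters"
    raise RuntimeError("did not converge within the iteration budget")

params, metrics, iters = tune({"comp_cap": 1.0})
print(iters)  # 8
```

A real run replaces the toy model with netlist edits and waveform parsing, but the control structure, verify all constraints jointly before accepting a design, is the same.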
Wired into an ngspice simulation loop, the model iterated for about an hour in a closed loop — calling the simulator, reading waveforms, tweaking parameters — and produced a design where every target metric is met, with four key metrics improved by an order of magnitude over its own initial attempt.

Token Efficiency: Intelligence at frontier level is only useful if it’s cost-effective. On Claw-Eval, V2.5-Pro lands at 64% Pass^3 using only ~70K tokens per trajectory — roughly 40–60% fewer tokens than Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 at comparable capability levels. For engineers building production agent pipelines, this is a material cost reduction, not just a marketing stat.

https://mimo.xiao
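For context on the Pass^3 figure: Pass^k, as used in agentic evaluations such as τ-bench, credits a task only if the agent succeeds in all k independent trials, so it rewards reliability rather than lucky runs. Under an independence assumption with per-trial success rate p, Pass^k is roughly p^k, so a 64% Pass^3 implies a per-trial rate well above 64%. The numbers below are illustrative, not eval data:

```python
# Pass^k sketch: a task counts only if all k of its trials succeed.
# The trial data here is made up for illustration.

def pass_hat_k(trials, k):
    """Fraction of tasks whose first k trials are all successes."""
    return sum(all(t[:k]) for t in trials) / len(trials)

# 3 tasks x 3 trials each (True = success); only task 0 passes all 3
trials = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
]
print(pass_hat_k(trials, 3))  # ~0.33

# per-trial rate implied by a 64% Pass^3, assuming independent trials
print(0.64 ** (1 / 3))  # ~0.86
```

That gap between p and p^k is why an all-trials metric like Pass^3 is a much stricter bar than ordinary single-run accuracy.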
