Moonshot AI, the Chinese AI lab behind the Kimi assistant, today open-sourced Kimi K2.6 — a native multimodal agentic model that pushes the boundaries of what an AI system can do when left to run autonomously on hard software engineering problems. The release targets practical deployment scenarios: long-running coding agents, front-end generation from natural language, massively parallel agent swarms coordinating hundreds of specialized sub-agents simultaneously, and a new open ecosystem where humans and agents from any device collaborate on the same task. The model is available now on Kimi.com, the Kimi App, the API, and Kimi Code CLI.
Weights are published on Hugging Face under a Modified MIT License.

What Kind of Model is This, Technically?

Kimi K2.6 is a Mixture-of-Experts (MoE) model — an architecture that has become increasingly dominant at frontier scale.
Instead of activating all of a model’s parameters for every token it processes, a MoE model routes each token to a small subset of specialized ‘experts.’ This allows you to build a very large model while keeping inference compute tractable. Kimi K2.6 has 1 trillion total parameters, but only 32 billion are activated per token. It has 384 experts in total, with 8 selected per token, plus 1 shared expert that is always active.
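The routing scheme described above can be sketched in a few lines. This is an illustrative toy, not Moonshot's implementation: the weights are random, a single linear layer stands in for each expert's feed-forward network, and the hidden size is shrunk for readability. Only the expert counts (384 routed, top-8 selected, 1 shared) mirror the published figures.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 384   # routed experts (matches the published figure)
TOP_K = 8           # experts selected per token
D_MODEL = 64        # toy hidden size (the real model uses 7,168)

def expert(x, w):
    # Stand-in for a full expert FFN: a single linear layer here.
    return x @ w

# One weight matrix per routed expert, plus one always-active shared expert.
expert_ws = rng.standard_normal((NUM_EXPERTS, D_MODEL, D_MODEL)) * 0.02
shared_w = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS)) * 0.02

def moe_layer(x):
    """Route one token vector through its top-k experts plus the shared expert."""
    logits = x @ router_w                     # one score per routed expert
    top_idx = np.argsort(logits)[-TOP_K:]     # indices of the 8 best experts
    gates = np.exp(logits[top_idx])
    gates /= gates.sum()                      # softmax over the selected 8
    out = sum(g * expert(x, expert_ws[i]) for g, i in zip(gates, top_idx))
    return out + expert(x, shared_w)          # shared expert always fires

token = rng.standard_normal(D_MODEL)
y = moe_layer(token)
print(y.shape)  # (64,)
```

The key property the sketch shows: although 384 expert weight matrices exist, only 8 of them (plus the shared expert) contribute to any given token, which is how a 1T-parameter model keeps per-token compute at 32B active parameters.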
The model has 61 layers (including one dense layer), uses an attention hidden dimension of 7,168, a MoE hidden dimension of 2,048 per expert, and 64 attention heads. Beyond text, K2.6 is a native multimodal model — meaning vision is baked in architecturally, not bolted on. It uses a MoonViT vision encoder with 400M parameters and supports image and video input natively.
Other architectural details: it uses Multi-head Latent Attention (MLA) as its attention mechanism, SwiGLU as the activation function, a vocabulary of 160K tokens, and a context length of 256K tokens. For deployment, Moonshot recommends running K2.6 on vLLM, SGLang, or KTransformers. It shares the same architecture as Kimi K2.5, so existing deployment configurations can be reused directly.
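For readers unfamiliar with SwiGLU, here is a minimal sketch of a SwiGLU-gated feed-forward block. The sizes are toy values and the names W_gate, W_up, and W_down are generic conventions, not Moonshot's actual parameter names.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 16, 32  # toy dimensions for illustration

W_gate = rng.standard_normal((d_model, d_ff)) * 0.05
W_up   = rng.standard_normal((d_model, d_ff)) * 0.05
W_down = rng.standard_normal((d_ff, d_model)) * 0.05

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x):
    # SwiGLU: the silu-activated gate branch elementwise-multiplies the
    # linear up-projection, then W_down projects back to d_model.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

x = rng.standard_normal(d_model)
print(swiglu_ffn(x).shape)  # (16,)
```

The gating (one branch multiplicatively modulating the other) is what distinguishes SwiGLU from a plain two-layer FFN, and it has become a near-default choice in recent large language models.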
The required transformers version is >=4.57.1, <5.0.0.

The Long-Horizon Coding Headline Numbers

The metric likely to get the most attention from dev teams is SWE-Bench Pro — a benchmark testing whether a model can resolve real-world GitHub issues in professional software repositories. Kimi K2.6 scores 58.6 on SWE-Bench Pro, compared to 57.7 for GPT-5.4 (xhigh), 53.4 for Claude Opus 4.6 (max effort), 54.2 for Gemini 3.1 Pro (thinking high), and 50.7 for Kimi K2.5.
On SWE-Bench Verified it scores 80.2, sitting within a tight band of top-tier models. On Terminal-Bench 2.0 using the Terminus-2 agent framework, K2.6 achieves 66.7, compared to 65.4 for both GPT-5.4 and Claude Opus 4.6, and 68.5 for Gemini 3.1 Pro. On LiveCodeBench (v6), it scores 89.6 vs. Claude Opus 4.6’s 88.8.

Perhaps the most striking number for agentic workloads is Humanity’s Last Exam (HLE-Full) with tools: K2.6 scores 54.0 — leading every model in the comparison, including GPT-5.4 (52.1), Claude Opus 4.6 (53.0), and Gemini 3.1 Pro (51.4). HLE is widely considered one of the hardest knowledge benchmarks, and the with-tools variant specifically tests how well a model can leverage external resources autonomously.
Internally, Moonshot evaluates long-horizon coding gains using its Kimi Code Bench, an in-house benchmark covering diverse, complicated end-to-end tasks across languages and domains, where K2.6 demonstrates significant improvements over K2.5.

https://www.kimi.com/blog/kimi-k2-6

What 13 Hours of Autonomous Coding Actually Looks Like

Two engineering case studies in the release document what ‘long-horizon coding’ means in practice. In the first, Kimi K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, then implemented and optimized model inference in Zig — a highly niche programming language — demonstrating strong out-of-distribution generalization.
Across 4,000+ tool calls, over 12 hours of continuous execution, and 14 iterations, K2.6 improved throughput from roughly 15 to roughly 193 tokens/sec, ultimately about 20% faster than LM Studio.

In the second, Kimi K2.6 autonomously overhauled exchange-core, an 8-year-old open-source financial matching engine. Over a 13-hour run, the model iterated through 12 optimization strategies, issuing over 1,000 tool calls to precisely modify more than 4,000 lines of code.
Acting as an expert systems architect, K2.6 analyzed CPU and allocation flame graphs to pinpoint hidden bottlenecks and reconfigured the core thread topology from 4ME+2RE to 2ME+1RE — extracting a 185% median throughput gain (from 0.43 to 1.24 MT/s) and a 133% performance throughput gain (from 1.23 to 2.86 MT/s).

Agent Swarms: Scaling Horizontally, Not Just Vertically

One of K2.6’s most architecturally interesting capabilities is its Agent Swarm — an approach to parallelizing complex tasks across many specialized sub-
