The Self-Evolving Paradigm
MiniMax M2.7 is not just another large language model release. It is the first commercially deployed model we know of to participate substantially in its own training process: autonomously executing over 100 rounds of scaffold optimisation during development, managing 30 to 50 percent of its own reinforcement learning research workflows, and entering itself in 22 ML competitions with a 66.6 percent medal rate.
That last number is worth pausing on. A model that can evaluate and improve its own performance represents a fundamentally different paradigm from the standard train-and-deploy cycle. M2.7 is not a static artefact deployed and left alone; it is a system designed to keep getting better at the tasks that matter most in production.
Key fact: M2.7 scores 56.22 percent on SWE-Pro, nearly matching Claude Opus 4.6 at roughly 57 percent, and 78 percent on SWE-bench Verified, well ahead of Opus at 55 percent. On VIBE-Pro, which measures end-to-end project delivery rather than isolated patches, it scores 55.6 percent.
Benchmark Performance
M2.7 activates only 10 billion parameters, making it the smallest model in the Tier-1 performance class. Despite this, it competes head-to-head with models many times its size. The headline numbers:
- SWE-Pro: 56.22% (vs Claude Opus 4.6 ~57%, GPT-5.3 Codex 56.2%)
- SWE-bench Verified: 78% (vs Opus 4.6 55%)
- VIBE-Pro (end-to-end delivery): 55.6%
- Terminal Bench 2: 57.0%
- MLE-Bench Lite medal rate: 66.6% (ties with Google Gemini 3.1)
- GDPval-AA (office tasks): Elo 1495, the highest among open-source models
Speed and Pricing: The Real Story
Raw benchmark scores tell one story. Cost-adjusted performance tells another entirely. M2.7 runs at roughly 100 tokens per second, about three times Opus's ~33 TPS. On pricing, the comparison is stark:
- Input cost: $0.30/M tokens (vs Claude Opus $15/M — 50x cheaper)
- Output cost: $1.20/M tokens (vs Opus $75/M — 62x cheaper)
- Blended cost with cache: $0.06/M tokens
For teams running high-volume agent workloads, coding assistants, or document processing pipelines, this cost structure fundamentally changes the economics of what is feasible to run in production.
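To make the cost structure concrete, here is a minimal back-of-the-envelope sketch in Python using the per-million-token list prices quoted above. The monthly token volumes are illustrative assumptions for a high-volume agent workload, not figures from the article.

```python
# Back-of-the-envelope cost comparison using the list prices quoted above.
# Monthly token volumes are illustrative assumptions, not measured figures.

M2_7_INPUT, M2_7_OUTPUT = 0.30, 1.20    # USD per million tokens
OPUS_INPUT, OPUS_OUTPUT = 15.00, 75.00  # USD per million tokens

input_mtok_per_month = 5_000   # assume 5B input tokens per month
output_mtok_per_month = 500    # assume 0.5B output tokens per month

def monthly_cost(in_price: float, out_price: float) -> float:
    """Total monthly spend in USD for the assumed token volumes."""
    return input_mtok_per_month * in_price + output_mtok_per_month * out_price

m2_7 = monthly_cost(M2_7_INPUT, M2_7_OUTPUT)
opus = monthly_cost(OPUS_INPUT, OPUS_OUTPUT)

print(f"M2.7:     ${m2_7:,.2f}/month")
print(f"Opus 4.6: ${opus:,.2f}/month")
print(f"Ratio:    {opus / m2_7:.1f}x")
```

At those assumed volumes the monthly bill drops from roughly $112,500 to roughly $2,100, a factor of about 54x before any cache savings are applied.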
Core Capabilities
Agentic Workflows
Built on the OpenClaw framework, M2.7 has native support for multi-agent collaboration, role boundary management, adversarial reasoning, and protocol adherence. It participates in execution and decision-making rather than passively generating responses.
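The article does not document OpenClaw's actual interfaces, so the sketch below is a purely hypothetical Python illustration of what role boundary management in a multi-agent loop can look like: each agent declares the actions it is allowed to take, and the orchestrator fails fast when a step crosses that boundary. The names `Agent`, `allowed_actions`, and `orchestrate` are invented for illustration and are not OpenClaw APIs.

```python
# Hypothetical sketch of role boundary management in a multi-agent workflow.
# None of these names come from OpenClaw; they are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    allowed_actions: set[str]                     # the agent's role boundary
    log: list[str] = field(default_factory=list)

    def act(self, action: str, payload: str) -> str:
        if action not in self.allowed_actions:
            raise PermissionError(f"{self.name} may not perform '{action}'")
        self.log.append(f"{action}: {payload}")
        return f"{self.name} completed {action}"

def orchestrate(steps: list[tuple[Agent, str, str]]) -> list[str]:
    """Run each (agent, action, payload) step, enforcing role boundaries."""
    return [agent.act(action, payload) for agent, action, payload in steps]

planner = Agent("planner", {"decompose_task", "review_plan"})
coder = Agent("coder", {"write_patch", "run_tests"})

print(orchestrate([
    (planner, "decompose_task", "add retry logic to the upload client"),
    (coder, "write_patch", "wrap upload() in exponential backoff"),
    (coder, "run_tests", "pytest tests/test_upload.py"),
]))
# coder.act("review_plan", ...) would raise PermissionError: outside the coder's role.
```

The point of the pattern is that collaboration is constrained by explicit contracts rather than by prompt phrasing alone.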
Software Engineering Beyond Benchmarks
On real-world engineering tasks — end-to-end project delivery, log analysis and debugging, code security review, and ML pipeline development — M2.7 demonstrates consistent, reliable execution on complex multi-step workflows. Its 97 percent skill adherence rate across 40+ complex tasks (each exceeding 2,000 tokens) is the most practically relevant benchmark number.
Office Suite Excellence
M2.7 scores highest among open-source models on office productivity tasks, handling complex Excel operations and formula generation, PowerPoint creation and editing, Word document manipulation, and multi-turn modification — iterating on documents through conversation rather than static prompts.
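As a rough illustration of iterating on a document through conversation, the sketch below uses a generic OpenAI-compatible chat completions client. The base URL and the model identifier `minimax/m2.7` are placeholders rather than confirmed values from the article; the exact endpoint and model name depend on the provider you use.

```python
# Illustrative multi-turn document edit via an OpenAI-compatible chat API.
# The base_url and model name below are placeholders, not confirmed values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)
MODEL = "minimax/m2.7"  # placeholder model identifier

# First turn: ask for a draft.
messages = [
    {"role": "user", "content": "Draft a one-page Q3 summary memo for the sales team."},
]
first = client.chat.completions.create(model=MODEL, messages=messages)
memo = first.choices[0].message.content

# Second turn: modify the previous output instead of re-prompting from scratch.
messages += [
    {"role": "assistant", "content": memo},
    {"role": "user", "content": "Tighten it to three bullet points and add a risks section."},
]
second = client.chat.completions.create(model=MODEL, messages=messages)
print(second.choices[0].message.content)
```

The same conversational loop applies to spreadsheet formulas or slide outlines: each follow-up turn edits the prior result rather than regenerating from a blank prompt.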
What This Means for the Industry
The arrival of a 10B-parameter model that matches frontier-class performance at 50-60x lower cost is significant not just as a technical achievement but as an economic signal. The assumption that cutting-edge AI capability requires cutting-edge compute budgets is being directly challenged.
M2.7 is available now via the Kilo AI platform, the OpenClaw CLI, and directly via API. It represents a meaningful shift in what teams can afford to deploy at scale — and a reminder that the most capable model is not always the most expensive one.
Benchmark data sourced from Wavespeed AI, the Kilo AI blog, MLWorks analysis, and OpenRouter.


