Last updated: June 8, 2026
If you’ve been watching the AI model space lately, you already know the story: everyone promises a million-token context window, but using one in production feels like setting your wallet on fire. MiniMax M3, launched June 1, 2026, takes a different approach — it rebuilt the attention mechanism from scratch to make long context genuinely cheap. And on top of that, it packs frontier-level coding and native multimodality into one open-weight package. Let’s break down whether it lives up to the hype.
TL;DR
MiniMax M3 is the first open-weight model to combine three frontier capabilities — coding & agentic performance, a 1M-token context window, and native multimodality — in a single release. It beats GPT-5.5 and Gemini 3.1 Pro on SWE-Bench Pro, surpasses Claude Opus 4.7 on BrowseComp, and costs roughly 15x less per input token than Opus. If you’re running agentic workflows or processing massive documents, M3 deserves a serious look.
| Feature | Detail |
|---|---|
| Context Window | Up to 1M tokens (512K guaranteed minimum) |
| Max Output | 512,000 tokens |
| Architecture | MiniMax Sparse Attention (MSA) + MoE (256 experts, 9.8B active per token) |
| Total Parameters | 229.9 billion |
| Modalities | Text, image, video input; desktop computer operation |
| Open-Weight | Yes |
| Output Speed | ~100 tokens/second |
| API Input Price | $0.30/M tokens (promotional) / $0.60/M tokens (standard) |
| API Output Price | $1.20/M tokens (promotional) / $2.40/M tokens (standard) |
What Makes M3 Different: The MSA Architecture
Here’s the thing about long context — standard transformer attention is quadratic. Double the context, quadruple the compute. At a million tokens, that math becomes brutal. Most models that claim 1M context either throttle your speed or charge you a premium for it.
M3’s solution is MSA (MiniMax Sparse Attention). Instead of computing attention between every pair of tokens, MSA uses a lightweight index branch to scan incoming tokens and select which blocks of past key-values actually deserve attention. Then it runs full attention only on those relevant blocks.
The key design choice: MSA operates on real, uncompressed key-values rather than compressed representations. Previous sparse attention approaches (like DeepSeek’s Multi-head Latent Attention) compressed keys and values into a low-dimensional latent space, which introduced accuracy loss. MSA skips the compression entirely — you get the efficiency gains without the quality degradation.
The numbers at 1M context are striking:
| Metric | MSA vs Previous Generation |
|---|---|
| Per-token compute | 1/20th |
| Prefill speed | 9.7x faster |
| Decoding speed | 15.6x faster |
| vs Flash-Sparse-Attention | 4x faster |
| vs flash-moba | 4x faster |
Analysts who studied MiniMax’s architecture estimate that at 1M tokens, each query effectively touches only about 6-7% of the blocks — an effective receptive field around 60,000-70,000 tokens. That sparsity ratio lands in the sweet spot that sparse-attention research has identified as optimal at this scale.
Coding & Agent Benchmarks: Where M3 Actually Lands
Let’s get to the numbers that developers actually care about:
| Benchmark | MiniMax M3 | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 59.0% | 58.6% | 64.3% | 54.2% |
| Terminal-Bench 2.1 | 66.0% | 78.2% | 66.1% | 70.3% |
| MCP Atlas | 74.2% | 75.3% | 79.1% | 78.2% |
| BrowseComp | 83.5 | – | 79.3 | – |
| SVG-Bench | 63.7 | 58.2 | 62.3 | 59.2 |
Sources: MiniMax M3 technical report, SiliconFlow blog, AIUnpacker review
M3 edges past GPT-5.5 and Gemini 3.1 Pro on SWE-Bench Pro. It beats Claude Opus 4.7 on BrowseComp (83.5 vs 79.3) and SVG-Bench. On Terminal-Bench 2.1, it’s solid but GPT-5.5 leads by a significant margin (78.2% vs 66.0%).
The pattern is clear: M3 doesn’t dominate every benchmark, but it’s competitive with the frontier across the board — and frequently beats models that cost 10-50x more per token.
Agent Performance: M3’s Sweet Spot
MiniMax built M3 specifically for agentic workloads, and it shows. The most impressive demos:
CUDA Kernel Optimization: M3 ran for ~24 hours autonomously, making 1,959 tool calls and 147 benchmark submissions to optimize an FP8 GEMM kernel on NVIDIA Hopper GPUs. It pushed hardware utilization from 7.6% to 71.3% — a 9.4x speedup — with zero human intervention.
Research Paper Reproduction: M3 autonomously reproduced an ICLR 2025 best paper over 12 hours, producing 18 code commits and 23 experimental figures.
PostTrainBench: Given four pretrain-only base models, M3 autonomously ran data synthesis, training, evaluation, and iteration within 12 hours to make them usable for math, code, and knowledge tasks — no human in the loop.
Why does long context matter so much for agents? Because a real agent task might involve reading dozens of source files, executing commands and analyzing output, maintaining a growing operation log, and keeping state consistent across many tool calls. Without enough context, the agent keeps “forgetting” what it was doing. M3’s 1M window with MSA’s efficiency means agents can work for hours without hitting a memory wall.
Native Multimodality: Not Bolted On
M3 was trained on text, image, and video data from day one of pretraining — over 100 trillion tokens of interleaved multimodal data. This isn’t a text model with a vision adapter strapped on after the fact.
The model also supports native desktop computer operation, which is what powers MiniMax Code when you ask it to interact with applications on your machine. Think: voice command on your phone triggers your PC to open an ERP program and fill in Excel data.
Market Impact: OpenRouter Numbers Don’t Lie
M3 isn’t just winning benchmarks — it’s winning users. According to OpenRouter data for the first week of June 2026, M3 hit 2.5 trillion tokens in weekly API calls, landing it in the global top three. MiniMax’s entire model family reached 3.05 trillion weekly tokens, surpassing both Xiaomi and Tencent. Chinese models as a group have now outpaced US models for six consecutive weeks, with DeepSeek-V4-Flash leading the pack and M3 right behind.
The pricing is clearly a major driver. On lmmarketcap.com, M3 scores a pricing index of 99 (lower is cheaper) — among the most cost-effective frontier models available.
Pricing: The Real Disruption
This is where M3 gets genuinely interesting for anyone paying API bills:
| Model | Input Price/M tokens | Output Price/M tokens |
|---|---|---|
| MiniMax M3 (promotional) | $0.30 | $1.20 |
| MiniMax M3 (standard) | $0.60 | $2.40 |
| Claude Opus 4.7 | $5.00 | $25.00 |
| GPT-5.5 | ~$8.00 | ~$24.00 |
At promotional pricing, M3 is roughly 15x cheaper than Claude Opus on input tokens and 20x cheaper on output tokens. Even at standard pricing, M3 sits at just 8-12% of the cost of leading proprietary models.
For subscription users, MiniMax offers tiered plans:
– Plus ($20/month): ~1.7 billion tokens across all modalities
– Max ($50/month): ~5.1 billion tokens
– Ultra ($120/month): ~9.8 billion tokens
With cache optimization, blended costs can drop as low as $0.06 per million tokens. That’s absurdly cheap for a frontier-capable model.
MiniMax Code: The Agent Team Architecture
Alongside M3, MiniMax upgraded its desktop product to MiniMax Code, featuring an Agent Team architecture with Leader-Worker-Verifier mode. Multiple AI agents collaborate on complex development tasks — one plans, others execute, and a third verifies the output. According to a 36Kr interview with MiniMax’s Agent R&D team, downloads and subscriptions saw significant growth after the M3 launch, validating the API-to-product commercialization path.
Where M3 Falls Short
Let’s be honest about the trade-offs:
Terminal-Bench gap: GPT-5.5’s 78.2% vs M3’s 66.0% on Terminal-Bench 2.1 is a significant difference. If terminal/command-line work is your primary use case, M3 isn’t the best pick.
SWE-Bench still trails Opus: Claude Opus 4.7 leads at 64.3% vs M3’s 59.0%. For the absolute hardest coding tasks, Opus remains the benchmark leader.
KV cache doesn’t shrink: Because MSA keeps real KV rather than compressed KV, the cache itself is still large. You’re trading some memory economy for output quality — fine for most workloads, but worth knowing if you’re GPU-constrained.
New model, evolving ecosystem: M3 launched June 1, 2026. The tooling, integrations, and community best practices are still maturing compared to established models.
Chinese lab considerations: MiniMax is headquartered in Shanghai and publicly traded on the Hong Kong Stock Exchange. Some enterprises may have compliance requirements around data residency or vendor geography.
MiniMax M3 vs The Competition
| Feature | MiniMax M3 | Claude Opus 4.7 | GPT-5.5 | DeepSeek-V4-Flash | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| Context Window | 1M tokens | 200K | 256K | 128K | 2M |
| Open-Weight | ✅ | ❌ | ❌ | ✅ | ❌ |
| Native Multimodal | ✅ (text+image+video) | ✅ | ✅ | ❌ | ✅ |
| Coding (SWE-Bench Pro) | 59.0% | 64.3% | 58.6% | ~50% | 54.2% |
| Agent (BrowseComp) | 83.5 | 79.3 | – | – | – |
| Input Price/M tokens | $0.30-0.60 | $5.00 | ~$8.00 | ~$0.15 | ~$1.25 |
| Speed (tok/s) | ~100 | ~30 | ~50 | ~150 | ~80 |
Who Should Use MiniMax M3?
Best for:
– Developers running long-horizon coding agents (the 1M context + low cost is a killer combo)
– Teams processing massive documents (legal contracts, research papers, full codebases)
– Budget-conscious startups that need frontier capability without frontier pricing
– Anyone building autonomous agents that need to maintain state over hours of work
Not ideal for:
– Terminal-heavy DevOps workflows (GPT-5.5 still leads here)
– Use cases requiring the absolute best coding performance (Opus 4.7 edges it out)
– Organizations with strict data residency requirements that exclude Chinese-hosted services
Getting Started
M3 is available on multiple platforms:
– SiliconFlow: OpenAI-compatible and Anthropic-compatible APIs
– OpenRouter: Standard API access
– OpenCode: Free access for first 7 days after launch
– MiniMax Platform: Direct API access at minimaxi.com
The Bottom Line
MiniMax M3 isn’t trying to be the best model on every benchmark. It’s trying to be the best model for what most people actually need — agentic workflows, long-context processing, and multimodal understanding — at a price point that makes production deployment realistic. The OpenRouter numbers back this up: users are voting with their tokens. On that front, M3 is one of the most compelling model releases of 2026.
Our Rating: 4.5/5
| Category | Score |
|---|---|
| Coding Performance | 4/5 |
| Agent Capability | 5/5 |
| Context Length | 5/5 |
| Multimodal Quality | 4/5 |
| Value for Money | 5/5 |
| Ecosystem & Tooling | 3.5/5 |
Disclaimer: This review is based on publicly available benchmarks, technical reports, and hands-on testing. Pricing reflects promotional rates as of June 2026 and may change.