Coinbase is now running more of its internal AI workload on cheaper Chinese models, with GLM 5.2 and Kimi 2.7 emerging as the headline options in a stack that pairs automatic routing with aggressive caching. The move matters because it is not just a vendor swap. It is a reminder that, for a large deployment, the economics of AI tooling are increasingly shaped by how requests are routed, what gets cached, and how often teams actually consume the expensive parts of a model stack.
According to The Decoder’s reporting, Coinbase CEO Brian Armstrong said the company is using more tokens than ever while paying about half what it used to. That is the core pressure point here: usage can rise even as spend falls, if the system is designed to direct work toward lower-cost models and avoid recomputing the same answers over and over. For technical teams watching model selection turn into a procurement problem, Coinbase’s setup is a concrete example of the cost curve bending at the orchestration layer rather than the model layer alone.
Routing and caching as the real cost engine
The important architectural detail is the automatic routing layer. Instead of hard-wiring every request to a single frontier model, Coinbase’s system chooses among models based on the task, the price of the model, and the likelihood that a response can be served from cache. In practice, that means the router is doing more than picking a model; it is deciding whether a request should be treated as a fresh inference problem or as something the system can satisfy with prior work.
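A minimal sketch of that kind of decision, not Coinbase's actual router: the model names, prices, capability tiers, and task labels below are all hypothetical, and the cache is a plain exact-match dictionary. It only illustrates the order of checks described above: try the cache first, then pick the cheapest model that is judged capable enough for the task.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Model:
    name: str                  # hypothetical model label
    cost_per_1k_tokens: float  # hypothetical price, arbitrary units
    capability: int            # hypothetical tier: 1 (basic) to 3 (frontier)


# Hypothetical catalogue; real prices and tiers would come from configuration.
MODELS: List[Model] = [
    Model("cheap-model", 0.10, 1),
    Model("mid-model", 0.50, 2),
    Model("frontier-model", 3.00, 3),
]

# Hypothetical mapping from task type to the minimum tier it needs.
REQUIRED_TIER: Dict[str, int] = {
    "autocomplete": 1,
    "refactor": 2,
    "architecture_review": 3,
}


def route(task: str, prompt: str, cache: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return ("cache", answer) on a cache hit, else (model_name, None)."""
    # Step 1: reuse prior work if this exact prompt was already answered.
    if prompt in cache:
        return ("cache", cache[prompt])
    # Step 2: unknown task types fall back to the most capable tier.
    tier = REQUIRED_TIER.get(task, 3)
    # Step 3: among models that meet the tier, choose the cheapest.
    eligible = [m for m in MODELS if m.capability >= tier]
    chosen = min(eligible, key=lambda m: m.cost_per_1k_tokens)
    return (chosen.name, None)
```

In this sketch a call such as `route("refactor", "rename this function", {})` selects the mid-tier model, while the same prompt with a populated cache never reaches a model at all. A production router would presumably weigh far more signals than a task label.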
That matters because routing and caching change the economics in different ways. Routing reduces the average cost per request by steering routine or lower-risk work to cheaper models. Caching reduces the number of times the system has to pay for inference at all. When The Decoder says better caching alone lifted the hit rate from roughly 5 percent to 60 percent, the implication is not simply that the cache got better. It is that the system became much more effective at reusing prior outputs for repeated or similar tasks, which is exactly where enterprise tooling can leak money if every prompt is treated as a one-off.
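The two effects can be put into a back-of-the-envelope formula. The sketch below is illustrative arithmetic only: apart from the 5 percent and 60 percent hit rates reported by The Decoder, every number is an assumption, including the idea that cached tokens are billed at a fixed fraction of the normal price and the share of traffic sent to a cheaper model.

```python
def blended_cost(
    hit_rate: float,            # fraction of tokens served from cache (0..1)
    cached_price_ratio: float,  # assumed price of a cached token vs. a fresh one
    cheap_share: float,         # assumed fraction of traffic routed to the cheap model
    cheap_price: float,         # assumed price per token of the cheap model
    premium_price: float,       # assumed price per token of the premium model
) -> float:
    """Average cost per token after routing and caching are both applied."""
    for name, value in (("hit_rate", hit_rate), ("cheap_share", cheap_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # Routing: weighted average price across the two model tiers.
    routed_price = cheap_share * cheap_price + (1.0 - cheap_share) * premium_price
    # Caching: uncached tokens pay full price, cached tokens pay a fraction.
    cache_factor = (1.0 - hit_rate) + hit_rate * cached_price_ratio
    return routed_price * cache_factor


# Caching alone (no traffic moved to the cheap model), premium price normalized to 1.0.
before = blended_cost(0.05, 0.1, 0.0, 0.2, 1.0)
after = blended_cost(0.60, 0.1, 0.0, 0.2, 1.0)
# Caching plus routing half the traffic to a model priced at one fifth.
combined = blended_cost(0.60, 0.1, 0.5, 0.2, 1.0)
print(before, after, combined)
```

Under these assumed inputs the two levers multiply rather than add, which is why a router and a cache together can move spend more than either would alone. The figures are not Coinbase's and should not be read as an explanation of its reported savings.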
For developers, this kind of architecture also changes how the product feels. Developers can still choose models, Coinbase says, but the router shapes the traffic underneath. The visible model picker is therefore no longer the main control surface. The bigger determinant of cost is the policy logic around model assignment and session reuse.
The token economics are shifting under the stack
The Decoder’s report adds two data points that make this more than a theoretical architecture story. First, Coinbase is reportedly using more tokens than ever. Second, it is paying less than before. Those two facts can coexist when the average cost per token falls, because more tokens go to cheaper models and more of them are served from cache.
That is where the reported 91 percent figure becomes important; it appears to refer to the share of developers who never hit the old usage limits. If most developers never reached those limits, then the actual binding constraint was not the nominal cap but the efficiency of the stack. In that kind of environment, lowering model cost on the back end can have a larger effect than tightening product-level restrictions on the front end.
The reference to “context engineering” also fits this pattern. Developers are being told to keep context lean and start fresh sessions for new tasks. That is not a cosmetic guidance memo; it is an operational strategy. Shorter, cleaner sessions make caching and routing more effective because the system can identify reusable patterns more easily and avoid dragging large, stale context windows into every request. In other words, less carried state can improve both latency and economics.
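A toy model of why lean context can help, assuming a prefix-based cache in which only the leading tokens shared with an earlier request can be reused. Whether Coinbase's cache works this way is not stated in the reporting; the prompts, the word-level "tokens", and the helper names here are all hypothetical.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading tokens that two requests share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cacheable_fraction(previous: Sequence[str], current: Sequence[str]) -> float:
    """Share of the current request that a prefix cache could reuse."""
    if not current:
        return 0.0
    return common_prefix_len(previous, current) / len(current)


# Hypothetical stable prefix (system prompt plus tool list), split into words.
prefix: List[str] = "SYSTEM you are a code assistant TOOLS search edit".split()

earlier = prefix + "fix the login bug".split()

# Fresh session for a new task: stable prefix plus a short task description.
lean = prefix + "add a retry test".split()

# Same task, but dragging 20 tokens of unrelated carried-over context along.
bloated = prefix + ["stale"] * 20 + "add a retry test".split()

print(cacheable_fraction(earlier, lean))     # larger reusable share
print(cacheable_fraction(earlier, bloated))  # smaller reusable share
```

In this simplified picture both requests reuse the same nine prefix tokens, but the bloated one pays for 24 uncached tokens instead of 4. Real sessions are messier, since an unchanged conversation history can itself be cached, so this shows only the direction of the effect.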
GLM 5.2 and Kimi 2.7 are now the relevant cheap defaults
The specific models matter too. GLM 5.2 and Kimi 2.7 are the names surfacing as the cheaper alternatives that Coinbase is willing to operationalize. The significance is not that they are novelty picks. It is that they are now credible enough to anchor real production traffic inside a demanding engineering environment.
That creates a benchmark problem for Western labs. If a company like Coinbase can deliver acceptable internal performance with cheaper Chinese models, then OpenAI, Anthropic, and their peers have to justify their pricing not only through quality, but through total system value: latency, reliability, tool use, and the amount of manual routing they force customers to do themselves.
The Decoder also notes that Lindy’s CEO recently made a similar move to DeepSeek v4 and that Snowflake is testing Chinese models as lower-cost alternatives. Those examples matter because they suggest Coinbase is not an isolated outlier. It is part of a broader normalization of model substitution based on economics rather than brand affinity.
Why this is a stress test for Western AI labs
For Western labs, the strategic risk is straightforward: if cheaper models are “good enough” for a growing share of enterprise and developer workflows, then pricing power gets harder to defend. And if large customers can recover even more savings through routing and caching, the price sensitivity of the stack rises further.
That is especially awkward for vendors counting on growth trajectories that assume premium pricing will persist. The Decoder frames this as a stress test for the growth numbers some labs need to justify the capital they have raised. That is a fair reading of the moment: once customers start treating model choice as an optimization problem, the competitive field expands from raw benchmark performance to the efficiency of the surrounding system.
There is also a tooling implication. If orchestration increasingly determines cost, then model providers will need to compete not only on model quality but on how easy they are to route, cache, and swap in and out of production systems. Developers building around these stacks will likely favor toolchains that expose clear routing policies, cache controls, and session management, because those are the levers that decide whether an AI product is expensive by default or economical by design.
What to watch as this rolls out
The near-term question is not whether this approach is technically possible; Coinbase is already deploying it. The questions are operational. How do latency and quality compare when more traffic is steered to cheaper models? Where does routing introduce failure modes or inconsistent behavior? Which workloads still need premium models, and which can be safely absorbed by lower-cost alternatives?
Policy and procurement issues may also become more visible as more enterprises adopt Chinese models for cost reasons. But even before that debate fully matures, the signal from Coinbase is clear enough: automatic routing and caching can materially change the economics of AI deployment, and they can do so fast enough to force a broader repricing of the toolchain.
For technical readers, the takeaway is not that one model family has won. It is that the center of gravity has shifted. In production AI, the winning stack may be the one that can route intelligently, cache aggressively, and make expensive models the exception rather than the default.



