Gemini 3.5 Flash is faster but much pricier to run

Google’s Gemini 3.5 Flash arrives with the kind of upgrade teams usually want: better speed, stronger capability, and the same one-million-token context window that keeps it viable for long-horizon work. But the pricing change is large enough to alter deployment calculus in a way that is hard to paper over with benchmark wins.

According to Artificial Analysis, Gemini 3.5 Flash runs about 5.5 times more expensively than Gemini 3 Flash in benchmark-style testing, and roughly 75% more than Gemini 3.1 Pro. The jump is driven by both higher unit prices and higher token consumption. Google is now charging $1.50 per million input tokens and $9.00 per million output tokens, up from $0.50 and $3.00 for Gemini 3 Flash. On paper, those rates still undercut Gemini 3.1 Pro’s $2.00 input and $12.00 output pricing. In practice, the usage pattern matters more than the sticker price.

That is the crux of the problem for teams building agents, copilots, retrieval pipelines, and other token-hungry systems. A model can be cheaper per token and still cost more to operate if it emits more tokens, reasons longer, or is used in workflows that multiply invocations. Flash models have historically been selected because they offered enough quality at a lower operating cost, making them a natural fit for high-volume tasks and latency-sensitive applications. Gemini 3.5 Flash narrows that bargain.

The difference shows up fastest in workloads that chain model calls. An agent that once used Flash for a fast first-pass answer, a summarization step, and a tool-augmented follow-up can see its spend climb sharply if each hop increases both output length and overall token throughput. The model’s bigger capability envelope may reduce failure rates or retries, but the economics are not linear. If the system generates more intermediate text, or if orchestration logic encourages longer context accumulation, the total bill can rise faster than the latency savings justify.

That is why the context window staying at one million tokens is not the whole story. A large window is useful, but it also invites larger prompts, broader retrieval inserts, and more aggressive multi-step prompting. In other words, the same feature that supports more ambitious applications can increase the probability that teams spend into the model’s upper cost bands. For large-context use cases, the economics are set less by what the model can hold than by how much of that window product teams actually allow their systems to consume.

The practical impact is immediate for operations teams. Model selection can no longer be treated as a simple quality-versus-latency decision; it has to be a routing problem. The obvious response is cost-aware orchestration: send trivial or repetitive requests to cheaper models, reserve Gemini 3.5 Flash for cases where its speed materially reduces user wait time or lowers downstream retries, and add thresholds that prevent expensive fallbacks from becoming the default path.

Batching and caching matter more in this regime as well. If teams are serving repeated prompt patterns, cache hits can blunt the effect of higher input prices. If they are processing bursts of small jobs, batching can reduce overhead and better align throughput with quota management. But both techniques depend on instrumentation. Without per-route cost visibility, teams will know that the model is faster and still miss the point that the workflow as a whole is getting more expensive.

That makes dashboards and quotas less like finance controls and more like core infrastructure. Teams need per-model spend tracking, token budgets tied to workload classes, and guardrails that make it visible when an assistant is drifting from short answers into expensive, open-ended generation. The right question is not whether Gemini 3.5 Flash is good. It is whether the incremental speed or capability is worth the extra cost for each product surface, each customer segment, and each class of automated task.

The broader market signal is also clear. Google is not moving alone here. The Decoder’s framing of Gemini 3.5 Flash fits a wider industry pattern in which newer models often arrive with materially higher pricing, even when their per-token rates remain competitive within a family. For buyers, that shifts the value proposition away from raw price and toward performance density: how much useful work a model can complete per dollar, per millisecond, and per orchestration step.

That is a more demanding standard. It favors teams that can measure real workload economics rather than rely on headline token prices. It also favors deployment architectures that are selective by design, not just capable by default. In a market where faster models can also be pricier to run, the winning strategy is increasingly to treat model choice as an operational policy, not a static vendor decision.

Gemini 3.5 Flash is faster — and the bill gets a lot harder to ignore

AI News Desk

Claude Cowork’s biggest use case is the office work nobody wants to own

Altman’s ‘pretty sure’ moment shifts the AI debate from layoffs to throughput

Brown’s 96-to-48 Split Is a Stress Test for AI-Era Assessment