When SQL starts speaking natural language, the bottleneck is no longer syntax. It is inference economics.

Google Cloud’s latest paper on AI-powered SQL functions argues that the current approach is hard to sustain for operational workloads: every LLM invocation can add 10–100x to query latency and roughly 1,000x to cost. That tradeoff may be acceptable for occasional analyst tooling, but it becomes a problem fast when the query engine is asked to classify millions of rows, route support tickets, or flag product reviews in near real time. The proposed alternative is a proxy model: an ultra-lightweight classifier tuned to a specific prompt and your data, built from embeddings and LLM-labeled data rather than from repeated direct calls to the frontier model.

The result, according to Google’s SIGMOD work, is a path to cost and latency reductions of more than two orders of magnitude for some AI+SQL prompts. That is not a generic efficiency claim; it is a shift in deployment math. If the workload is stable enough, the expensive model can move out of the hot path and into the data-generation pipeline, where it labels examples once while the proxy handles the repeated inference.

How proxy models work in practice

The architecture is straightforward, but the leverage comes from where each component sits.

First, embeddings provide the semantic feature space. In Google’s example, Gemini-style embeddings encode the meaning of rows, cells, or text fields into vectors that are much cheaper to compute and far smaller to serve than full LLM reasoning. Those embeddings become inputs to a lightweight model that is trained for a narrow task: answer this prompt against this dataset.
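
To make that concrete, here is a minimal Python sketch of the feature side of the pattern. Everything in it is illustrative: embed_texts is a deterministic stand-in for a real embedding call, and scikit-learn’s logistic regression stands in for the “ultra-lightweight classifier,” which the paper does not pin to any particular architecture.

    import hashlib
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def embed_texts(texts: list[str]) -> np.ndarray:
        # Stand-in embedding: a deterministic pseudo-vector derived from a
        # hash, used only so this sketch runs end to end. In production this
        # would be a call to a real embedding endpoint.
        vecs = []
        for t in texts:
            seed = int.from_bytes(hashlib.sha256(t.encode()).digest()[:8], "big")
            vecs.append(np.random.default_rng(seed).standard_normal(64))
        return np.stack(vecs)

    def train_proxy(texts: list[str], labels: list[int]) -> LogisticRegression:
        # The proxy is just a small classifier over the embedding space;
        # the only requirement is that it is cheap relative to an LLM call.
        proxy = LogisticRegression(max_iter=1000)
        proxy.fit(embed_texts(texts), labels)
        return proxy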

Second, the labels come from the LLM itself. The larger model is used to generate training data, not to answer every production query. That matters because the proxy is not a generic chatbot replacement; it is a task-specific classifier optimized for a prompt such as “which product reviews are negative about durability?” or “which support tickets were resolved by a workaround?” The proxy learns from LLM-labeled data, then serves subsequent AI+SQL prompts with near-LLM fidelity at a fraction of the cost.
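
The labeling side, sketched in the same spirit: llm_label is a hypothetical wrapper for the teacher call, stubbed here with a keyword rule purely so the example runs. In a real pipeline it would be an actual LLM request answering the prompt for one row.

    PROMPT = "Which product reviews are negative about durability?"

    def llm_label(prompt: str, text: str) -> int:
        # Stand-in for the teacher LLM, used only so this sketch runs.
        # In practice this is a real model call that answers `prompt` for
        # one row; expect some noise in real labels.
        return int("broke" in text.lower() or "not durable" in text.lower())

    reviews = [
        "Strap broke after two weeks, not durable at all",
        "Battery life is excellent, survives my commute",
        "Great color, fast shipping",
        "Screen still flawless after a year",
    ]

    # Labeled once, offline; these pairs become the proxy's training set.
    labeled = [(r, llm_label(PROMPT, r)) for r in reviews]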

This is the basic pattern behind the promised speedup: front-load the expensive reasoning into offline labeling, then use a smaller model to amortize that work across repeated queries. In database settings, that can turn a token-heavy operation into a streaming inference path with a much tighter latency distribution.
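
Putting the two halves together gives the amortization pattern in miniature, reusing embed_texts, train_proxy, llm_label, PROMPT, and reviews from the sketches above:

    # Offline, once per prompt: the expensive teacher labels the sample
    # and the proxy trains on those labels.
    proxy = train_proxy(reviews, [llm_label(PROMPT, r) for r in reviews])

    # Online, per query: the hot path is embeddings plus a tiny classifier,
    # with no LLM round trip per row.
    def serve(batch: list[str]) -> list[int]:
        return proxy.predict(embed_texts(batch)).tolist()

    serve(["Handle snapped off within days"])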

What changes for product rollout

For product teams, proxy models are less about a new model class than about a new operating model.

The first implication is SLO design. If the query no longer depends on a remote LLM call, latency budgets become more predictable, and the performance profile starts to resemble a conventional inference service rather than an external AI dependency. That changes how teams think about caching, fallbacks, and concurrency. A proxy can sit closer to the database layer, where it can be evaluated row-by-row or batch-by-batch without paying the full tax of an LLM round trip.
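
One way to make that operating model concrete is a confidence-gated fallback: the proxy answers when it is sure, and only low-confidence rows escalate to the LLM. The paper does not prescribe this policy; it is a common pattern, sketched below against the earlier proxy (reusing embed_texts, llm_label, and PROMPT):

    def classify_with_fallback(batch: list[str], proxy, threshold: float = 0.9):
        # Rows the proxy is confident about never touch the LLM; the rest
        # escalate, which bounds both tail latency and per-query spend.
        probs = proxy.predict_proba(embed_texts(batch))
        results = []
        for text, p in zip(batch, probs):
            if p.max() >= threshold:
                results.append((int(proxy.classes_[p.argmax()]), "proxy"))
            else:
                results.append((llm_label(PROMPT, text), "llm"))
        return results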

The second implication is budgeting. A system that previously consumed billable LLM tokens on every request can move toward a fixed-cost or bounded-cost inference path. That does not make the workload free, but it does make it easier to forecast spend for high-volume analytics or operational use cases. For data-heavy products, that may be the difference between a feature that is technically possible and one that is economically sustainable.
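
The budgeting shift is easy to see with back-of-envelope numbers. None of the figures below come from the paper; they are placeholders chosen only to show the break-even structure:

    # Assumed unit costs in dollars per row (placeholders, not from the paper).
    llm_cost_per_row   = 0.002
    proxy_cost_per_row = 0.00002      # embedding plus tiny classifier
    labeled_rows       = 5_000        # one-time teacher-labeled sample
    labeling_cost      = labeled_rows * llm_cost_per_row   # fixed, offline

    def total_cost(rows: int, use_proxy: bool) -> float:
        if use_proxy:
            return labeling_cost + rows * proxy_cost_per_row
        return rows * llm_cost_per_row

    # Break-even volume: the fixed labeling cost divided by per-row savings.
    break_even = labeling_cost / (llm_cost_per_row - proxy_cost_per_row)
    print(f"proxy pays off after ~{break_even:,.0f} rows")   # ~5,051 rows here

Below the break-even volume, direct LLM calls are cheaper; above it, every additional row widens the gap.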

The third implication is architecture. Proxy models are most compelling when the prompt is specific and repeated, the underlying data is stable enough to label, and the error tolerance is known. In those cases, the organization can treat the LLM as a teacher and the proxy as the production workhorse.

The risks are real

The same specialization that makes proxy models efficient also makes them brittle.

Their quality depends on the quality of the labeling pipeline. If the LLM-generated labels are inconsistent, biased, or overly literal, the proxy will learn those errors quickly and cheaply. That is not a feature; it is a scaling mechanism for mistakes.
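
A cheap guardrail, not from the paper but easy to bolt onto the labeling pipeline, is to ask the teacher for each label more than once and keep only rows where it agrees with itself (llm_label as in the earlier sketch; real LLM outputs vary across samples even when the stub here does not):

    from collections import Counter

    def consistent_label(text: str, k: int = 3) -> int | None:
        # Query the teacher k times and keep the row only if the votes are
        # unanimous; inconsistent rows are dropped, not taught to the proxy.
        votes = Counter(llm_label(PROMPT, text) for _ in range(k))
        label, count = votes.most_common(1)[0]
        return label if count == k else None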

Concept drift is another concern. A proxy tuned to a narrow prompt and a particular data distribution may degrade as product language changes, support workflows evolve, or customer behavior shifts. Because the model is intentionally lightweight, it may fail silently unless teams instrument it carefully.
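
That instrumentation can be simple: re-check a small audit sample against the teacher on a schedule and alert when disagreement climbs. A minimal sketch, again reusing the earlier names, with an assumed 1% sample and 5% alert threshold:

    import random
    import numpy as np

    def audit_disagreement(rows, proxy, sample_rate=0.01, alert_at=0.05):
        # Re-score a random slice of live traffic with the teacher and
        # compare; a rising rate is an early signal of drift.
        sample = [r for r in rows if random.random() < sample_rate]
        if not sample:
            return 0.0
        proxy_out = proxy.predict(embed_texts(sample))
        teacher_out = [llm_label(PROMPT, r) for r in sample]
        rate = float(np.mean([p != t for p, t in zip(proxy_out, teacher_out)]))
        if rate > alert_at:
            print(f"ALERT: proxy/teacher disagreement at {rate:.1%}")
        return rate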

Governance also becomes more complicated, not less. Organizations still need provenance for training data, auditability for labels, and a clear explanation of how outputs were generated. That matters in regulated environments and in any workflow where the output influences downstream decisions. A cheaper model does not reduce the need to know why a row was classified a certain way.

What teams should test next

The most useful pilots will not ask whether proxy models are universally better than LLMs. They will ask where the economics work.

Start with a narrow set of AI+SQL prompts that are repeated frequently and easy to score. Use a representative slice of production data, not a sanitized toy dataset. Compare direct LLM calls against a proxy built from the same embeddings and the same LLM-labeled data, then measure three things together: accuracy, p95 latency, and unit cost per query.
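
A pilot harness does not need to be elaborate. The sketch below times any classify callable (a direct LLM call or a proxy) over the same evaluation slice and reports the three numbers together; per-row cost is passed in because billing is provider-specific:

    import time
    import numpy as np

    def benchmark(classify, rows, gold, cost_per_row):
        # Run one inference path over a fixed evaluation slice and report
        # the three metrics that should always be read side by side.
        preds, latencies = [], []
        for row in rows:
            t0 = time.perf_counter()
            preds.append(classify(row))
            latencies.append(time.perf_counter() - t0)
        return {
            "accuracy": float(np.mean([p == g for p, g in zip(preds, gold)])),
            "p95_latency_s": float(np.percentile(latencies, 95)),
            "cost_per_query": cost_per_row,
        }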

It also helps to test failure modes deliberately. Evaluate performance on edge cases, distribution shifts, and ambiguous rows. Track how often the proxy disagrees with the teacher model and how often those disagreements matter operationally. If the model is being used for triage, ranking, or filtering, the benchmark should reflect the actual business cost of false positives and false negatives.
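
Scoring disagreements by their operational price keeps that benchmark honest. The 10:1 false-negative weighting below is an assumption chosen for illustration; set the weights from your own triage or filtering workflow:

    def business_cost(preds, gold, fp_cost=1.0, fn_cost=10.0):
        # Count the two error types separately and weight each by what it
        # actually costs the business, rather than treating all mistakes
        # as equal.
        fp = sum(1 for p, g in zip(preds, gold) if p == 1 and g == 0)
        fn = sum(1 for p, g in zip(preds, gold) if p == 0 and g == 1)
        return fp * fp_cost + fn * fn_cost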

The broader technical implication is that AI-on-data may start to split into two layers: a large model for data generation and prompt design, and a lightweight proxy for production inference. That division could reshape how teams choose between direct LLM integration and database-native AI functions. The more the query repeats, the more attractive the proxy becomes. The remaining question is not whether the speedup is large; it is whether an organization can build the labeling, monitoring, and governance machinery to trust it.