Google Updates Gemini Cache to help reduce costs when using AI

By Aayush

Google has quietly rolled out a useful update for developers: implicit caching is now enabled across all Gemini 2.5 models via the API, as of this past Thursday. The feature reduces token costs automatically, so developers save money without any extra work.

Logan Kilpatrick, Senior Product Manager for Gemini and Google AI Studio, shared the news on his X (formerly Twitter) account. The company also published a detailed post on its official developer blog confirming the change.

So, what does implicit caching actually do? In simple terms, when a request shares a large common prefix with one the model recently processed, the system can reuse the cached computation for that repeated portion instead of re-analyzing it from scratch. Google passes the savings on automatically, billing the cached tokens at a 75% discount.

To increase the chances of triggering the cache, Google suggests two strategies:

  • Put large, shared content at the start of the prompt, with the variable part (such as the user's question) at the end.
  • Send requests that share the same prefix in quick succession, since cache entries are short-lived.

If you’re curious whether your API call hit the cache, check the usage_metadata field of the response object; its cached_content_token_count value shows how many tokens were served from the cache.
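For example, a call that front-loads a large shared document and then reads the cache counters might look like the sketch below, using the google-genai Python SDK. The model name, file, and question are illustrative placeholders, not values from Google's announcement.

```python
# Sketch: structure a prompt for implicit caching and inspect the result.
# Assumes the google-genai SDK (pip install google-genai) and a valid API key.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Put the large, repeated content first so follow-up requests
# share the same prompt prefix and can hit the implicit cache.
shared_context = open("meeting_transcript.txt").read()  # hypothetical file

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=shared_context + "\n\nQuestion: Summarize the key decisions.",
)

# usage_metadata reports how many of the prompt tokens were served
# from the cache (and therefore billed at the discounted rate).
usage = response.usage_metadata
print("prompt tokens:", usage.prompt_token_count)
print("cached tokens:", usage.cached_content_token_count)
```

Sending a second request with the same shared_context shortly afterward is what makes a nonzero cached_content_token_count likely.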

Lower Token Thresholds for Cache Hits

Google also revealed that it has lowered the minimum token requirement to trigger the context cache:

  • 1,024 tokens for Gemini 2.5 Flash
  • 2,048 tokens for Gemini 2.5 Pro

These updates are designed to make the development process more efficient and cost-effective, especially for apps or workflows that rely on repeated prompts.

What Is the Difference Between Implicit and Explicit Caching?

When working with the Gemini API, developers can choose between two types of cache handling: implicit and explicit.

The implicit cache runs in the background automatically. While it’s designed to reduce token usage, cost savings aren’t guaranteed, as they depend heavily on how the API is used. Developers leveraging this option should pay attention to their prompt structure and understand how caching works to make the most of it.

Tip: Placing repetitive or common content at the start of your prompts and keeping similar requests close together in time can help maximize cache hits.

On the other hand, the explicit cache gives the developer full control. With this method, you create a cache yourself, set how long it should be stored (its time-to-live, or TTL), and reference the saved content in future calls.

This approach requires more setup, but it offers greater control and more predictable savings, making it ideal for teams building complex or resource-heavy applications.
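As a rough sketch of what that setup looks like with the same google-genai Python SDK, where the TTL value, model, and document are assumptions for illustration:

```python
# Sketch: explicit caching with a developer-controlled TTL.
# Assumes the google-genai SDK and a valid API key.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

big_document = open("annual_report.txt").read()  # hypothetical shared content

# Create the cache once and keep it alive for an hour.
cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        contents=[big_document],
        ttl="3600s",
    ),
)

# Later calls reference the cache by name instead of resending the document.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="What are the report's main findings?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```

Because cached content accrues a storage cost for as long as it lives, explicit caching pays off when the same large context is reused many times within the TTL window.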
