Google's Gemini Marks New Era in AI With Multimodal Capabilities

The way people interact with technology is about to change.
Sundar Pichai's vision for Gemini: reshaping digital interaction through multimodal AI.

Gemini 1.5 Pro now processes 2 million tokens, a context window large enough to analyze long, complex inputs across multiple formats at once. More than 1.5 million developers are integrating Gemini into applications, and AI Overviews are rolling out in US Google Search this week to support longer, more natural queries.

  • Gemini 1.5 Pro processes 2 million tokens, equivalent to roughly 96 Cheesecake Factory menus
  • Over 1.5 million developers are integrating Gemini into applications
  • AI Overviews rolling out in US Google Search this week
  • Trillium TPU delivers 4.7x performance gains over previous generation

Google's Gemini AI model processes text, images, and video together. More than 1.5 million developers are already building with it, and Google is bringing it to its own products, including Search and Photos.

At Google's annual developer conference, Sundar Pichai stood before the crowd and made a simple promise: the way people interact with technology is about to change. The vehicle for that change is Gemini, a new class of artificial intelligence that can understand and reason across text, images, video, and code all at once—not as separate tasks, but as a unified way of thinking.

For more than a decade, Google has been building toward this moment. The company invested heavily in AI research, in the chips that run it, in the products that would carry it into people's daily lives. Gemini represents the culmination of that work. It is, by Google's measure, the most capable model the company has built to date, and it is beginning to appear everywhere: in Google Search, in Photos, in Workspace, in the hands of developers building the next generation of applications.

The numbers tell part of the story. More than 1.5 million developers are already using Gemini models to build applications. The latest version, Gemini 1.5 Pro, can process two million tokens of information at once, a token being a small unit of data such as a word or a fragment of a word. To put that in perspective, Google says it is equivalent to taking in the contents of roughly 96 Cheesecake Factory menus in a single pass. This is not a marginal improvement. It is a fundamental shift in how much an AI system can hold in context and reason about simultaneously.
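The menu comparison reduces to simple arithmetic. The sketch below is only a back-of-the-envelope illustration: the two figures come from the article, while the words-per-token ratio is an assumed rule of thumb for English text, not a number Google stated, and the function names are hypothetical.

```python
# Back-of-the-envelope arithmetic for a 2 million token context window.
CONTEXT_TOKENS = 2_000_000  # figure quoted in the article
MENUS = 96                  # Google's comparison, as quoted in the article

# ASSUMPTION: about 0.75 English words per token, a common rule of thumb.
# The real ratio depends on the tokenizer and on the text itself.
WORDS_PER_TOKEN = 0.75


def tokens_per_menu(context_tokens: int = CONTEXT_TOKENS, menus: int = MENUS) -> float:
    """Implied token count of a single menu in the comparison."""
    return context_tokens / menus


def approx_words(tokens: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Rough word count for a given number of tokens, under the assumed ratio."""
    return tokens * words_per_token


if __name__ == "__main__":
    print(f"{tokens_per_menu():,.0f} tokens per menu (implied)")
    print(f"{approx_words(CONTEXT_TOKENS):,.0f} words in the full window (approx.)")
```

Under these assumptions each menu works out to about 20,800 tokens, and the full window to about 1.5 million English words.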

The practical implications are beginning to show. In Google Photos, Gemini now allows users to ask complex questions about their stored memories. Instead of scrolling through thousands of images, a person can ask: "When did Lucia learn to swim?" and receive a detailed, contextualized answer drawn from the photos themselves. In Google Search, the company is rolling out AI Overviews this week in the United States, a feature that lets people ask longer, more natural questions and receive synthesized answers rather than a list of links.

Behind these consumer-facing features sits a massive infrastructure challenge. Google introduced Trillium, the sixth generation of its Tensor Processing Units, the specialized chips that run AI workloads. Each Trillium chip delivers 4.7 times the compute performance of its previous-generation counterpart. It is the kind of hardware leap that makes the software leaps possible.

Pichai framed all of this not as a technical achievement but as a responsibility. He emphasized that Google wants everyone to benefit from what Gemini can do, and that the company is approaching this era of AI with both ambition and caution. The goal, he said, is to build a future where AI is not only advanced but also inclusive, where the innovations are deployed safely and with privacy in mind.

The roadmap ahead is ambitious. Google is working toward what Pichai calls "infinite context"—a system that could theoretically understand and reason about unlimited amounts of information. For now, two million tokens is the frontier. But the direction is clear. This is not a moment of arrival; it is a moment of acceleration, and the company is betting that the next chapter of computing will be defined by AI systems that think the way humans do: across multiple types of information, with deep context, and with the ability to reason about the whole picture at once.

We have been investing in AI for more than a decade, innovating at every step, from research and product to infrastructure.
— Sundar Pichai, CEO of Google and Alphabet
We want everyone to benefit from what Gemini can do.
— Sundar Pichai
The Hearth Conversation
Another perspective on the story
Inventor

When you say Gemini is multimodal, what does that actually mean for someone using Google Photos or Search?

Model

It means the AI isn't looking at text separately from images or video. It's processing them together, understanding how they relate. In Photos, it's not just recognizing that there's a child in a picture—it's understanding the context of the whole memory, the sequence of moments, and can answer a question about when something happened by reasoning across all of that at once.

Inventor

Two million tokens sounds like a lot, but what does that actually let you do that you couldn't do before?

Model

Before, you'd hit a wall pretty quickly if you tried to feed an AI system a long document or a series of images. Now you can give it something equivalent to 96 restaurant menus and it can hold all of that in mind, find patterns, make connections. It's the difference between reading a paragraph and reading an entire book.

Inventor

Why does Google care so much about getting this into the hands of developers?

Model

Because developers are the ones who will build the applications that actually matter to people. Google can make the model, but a developer in some startup might figure out how to use it in a way that solves a real problem. 1.5 million developers already using it means 1.5 million experiments happening right now.

Inventor

There's a lot of talk about doing this "responsibly." What does that mean in practice?

Model

It means thinking about privacy—your photos stay yours, your searches aren't being used to train the next model without your knowledge. It means being honest about what the AI can and can't do, and not overselling it. It means building safeguards so the system doesn't just optimize for engagement but for actually being helpful.

Inventor

Where does this end? Is there a finish line?

Model

Pichai talks about infinite context as the goal—a system that could theoretically process any amount of information. But that's probably not the real finish line. The finish line is when AI becomes so integrated into how we work and think that we stop calling it AI. It's just how things work.

Want the full story? Read the original at infobae.