Google's Gemini Omni Unifies Text, Audio, Images Into AI-Generated Video

AI is moving from predicting text to simulating reality.
Google CEO Sundar Pichai describes the philosophical shift Gemini Omni represents in artificial intelligence development.

After three years of quiet convergence, Google has unveiled a system that doesn't merely process the world's media — it reasons across all of it at once. Gemini Omni, announced at the company's annual developer conference, represents a shift from AI as a text predictor toward AI as a reality simulator, capable of generating physics-aware, culturally coherent video from any combination of inputs. The first version, Omni Flash, arrives today in consumer products with deliberate constraints, while the deeper ambition — transforming how advertising, filmmaking, and human creativity itself are practiced — is only beginning to surface.

  • Google's Gemini Omni doesn't stitch media together like a timeline editor — it understands the relationships between text, image, audio, and video simultaneously, producing output that reflects genuine knowledge of physics and culture.
  • A claymation protein-folding explainer, generated entirely by the model with accurate narration and synchronized visuals, signals that the gap between human creative intent and machine execution has narrowed dramatically.
  • Consumer rollout begins today with a ten-second cap, which Google frames as a deliberate choice rather than a technical ceiling: a way to accelerate adoption and match real-world usage patterns on platforms like YouTube Shorts.
  • Deepfake risk looms large enough that Google built biometric onboarding and mandatory SynthID watermarking into the product before launch, acknowledging the weight of what it is releasing.
  • Advertisers and filmmakers are watching the API rollout closely — the promise of generating an entire campaign from a brief and a product image could restructure creative workflows within months.

Google spent three years converging on a single idea: an AI that could understand and create across every medium simultaneously. At its annual developer conference, CEO Sundar Pichai announced the arrival of Gemini Omni — a family of multimodal models designed to generate coherent output from any combination of text, images, audio, and video.

The distinction from what came before matters. Google already had Veo for video generation and Gemini as a multimodal system. But Omni doesn't layer inputs the way an editor layers tracks. It reasons across them, drawing on knowledge of physics, culture, and science to produce unified results. When researchers asked it to create a claymation explainer of protein folding, the model generated stop-motion visuals, accurate voice-over narration, and synchronized storytelling — not as separate tasks, but as a single act of understanding.

The version launching today, Gemini Omni Flash, renders ten seconds of video and is available in the Gemini app, YouTube Shorts, and Google's creative studio Flow. The duration limit is framed as a deliberate choice rather than a technical constraint. A digital avatars feature allows users to place themselves in scenes they never inhabited, with biometric onboarding and SynthID watermarking built in to guard against misuse.

The professional implications are already drawing attention. API access rolls out in coming weeks, opening Omni to advertisers and filmmakers who could generate entire campaigns from a brief and a product image. Google's product leadership flagged text-rendering accuracy as a particular strength for advertising use cases, while noting that editing prompts must be precise to avoid unintended changes — a lesson learned from earlier tools.

A more capable Omni Pro version is in development, targeted at professional studios and creators. The larger arc Pichai described points toward AI moving beyond text prediction and into something closer to reality simulation — and Omni, for now, is the clearest step in that direction.

Google spent three years building toward a single idea: an artificial intelligence system that could understand and create across every medium at once. Today, at its annual developer conference, the company announced it had arrived at that destination with Gemini Omni, a family of multimodal models that CEO Sundar Pichai says will be able to "create anything from any input."

The breakthrough is not merely technical. Yes, Google already had Veo, a dedicated video model that could turn text and images into finished videos. Yes, the company had built Gemini itself as a multimodal system three years prior. But Omni represents something different: a model that doesn't simply stitch inputs together the way a video editor might layer tracks. Instead, it reasons across text, images, audio, and video simultaneously, understanding the relationships between them, and produces output that reflects coherent knowledge of physics, culture, history, and science. When researchers at Google DeepMind asked Omni to create "a claymation explainer of protein folding," the system generated a stop-motion video complete with a voice-over that accurately described how amino acids chain together and fold into helical and sheet structures. The model understood not just what the words meant, but how to visualize them, how to narrate them, and how to make those elements work together.

The first version rolling out today is called Gemini Omni Flash, and it arrives in the Gemini app, YouTube Shorts, and Google's creative studio Flow. It can render ten seconds of video—a constraint that Google frames not as a limitation of the model itself, but as a deliberate choice to get the tool into more hands quickly and to match what users actually want to create. Longer durations are coming. The company is also introducing digital avatars, a feature that lets users generate videos of themselves in scenarios they choose—winning an award, standing on the moon, or simply appearing in a scene where they weren't present. To prevent deepfakes, Google requires users to record themselves speaking a series of numbers during onboarding, storing that biometric signature for future use. Every video created with Omni will carry Google's SynthID digital watermark, a machine-readable tag that allows anyone to verify whether the video was generated by Gemini.

Google is positioning Flash as a consumer product, and the examples its engineers offered reflect that focus. But the professional implications are already visible. The company will make Omni available via API in the coming weeks, opening it to advertisers and filmmakers who could use an end-to-end multimodal workflow to generate entire campaigns from a brief and a product image. Nicole Brichtova, director of product management at Google DeepMind, highlighted the model's text-rendering capabilities as particularly valuable for advertising, where accuracy matters—a slogan or product placement can't be approximate. She acknowledged that editing prompts need to be highly specific; otherwise, the model risks over-editing or changing elements the user wanted to preserve, a problem that plagued earlier tools like Nano Banana.

A more powerful version, Omni Pro, is in development and will launch when Google feels it represents a meaningful step forward from Flash. That model is expected to perform better across all tasks and will likely appeal to professional creators and studios. The broader vision Pichai articulated during the briefing suggests where this is heading: artificial intelligence is moving away from simply predicting what text should come next and toward simulating reality itself. Omni is positioned as the next step in that direction—a model that understands the world deeply enough to render it convincingly across any medium a user requests.

Create anything from any input.
— Sundar Pichai, Google CEO
They're like personalized memes.
— Gabe Barth-Maron, DeepMind research engineer, describing avatar use cases
The Hearth Conversation: Another perspective on the story
Inventor

What makes Omni different from just combining Google's existing video model with Gemini?

Model

The difference is in how it thinks. Omni doesn't stitch inputs together the way a video editor layers tracks; it reasons across them. When you give it a prompt, it's not running separate processes for text understanding and video rendering. It's holding all the inputs in mind at once, understanding how they relate, and creating output that's coherent across all of them.

Inventor

So when it made that protein-folding video, it wasn't just illustrating a script?

Model

Right. It understood the science, generated appropriate visuals, wrote and performed narration that matched those visuals, and made sure the pacing and tone all worked together. That's multimodal reasoning, not assembly.

Inventor

Why limit it to ten seconds if the model can do more?

Model

Speed to market, partly. But also honesty about what people actually want. Most users aren't trying to make feature films. Ten seconds is enough for a meme, a social clip, a quick explainer. Longer videos are coming, but there's no point forcing capability nobody's asking for yet.

Inventor

The avatar feature seems like it could be abused pretty easily.

Model

That's why Google built in the onboarding: you have to record yourself speaking a series of numbers. It's a biometric anchor. And every video carries a SynthID watermark so anyone can verify whether it was generated by Gemini. It's not foolproof, but it's a real attempt to slow down the deepfake problem.

Inventor

Who's really going to use this?

Model

Consumers will make personalized videos—themselves in scenarios they weren't in. But the real money is in advertising and filmmaking. An agency could brief the model with a product image and a campaign concept, and get back a finished ad. That's transformative if it works at scale.

Inventor

Is Google worried about competition?

Model

Startup Luma AI is building something similar. But Google has the infrastructure, the training data, and the distribution. The question isn't whether Omni works—it clearly does. It's whether they can keep it ahead of what's coming next.

Want the full story? Read the original at TechCrunch.