Google DeepMind has announced Gemini Omni, a new family of “any-to-any” AI models that can process text, images, audio and video together. Shown off at the Google I/O conference, the “world model” can understand simulated physics, gravity and real-world logic, bringing the tech industry closer to artificial general intelligence.
Unlike older systems that split the work between separate image and text networks, Omni handles all of these modalities in a single, native multimodal pipeline. The standout feature of the new architecture is conversational video editing: users upload raw footage or an AI-generated clip and then change specific details, such as the environment, camera angles or objects, using plain natural language instructions, with changes carrying over smoothly from one prompt to the next.
The first variant, Gemini Omni Flash, will be available worldwide to paying Google AI subscribers via Google Flow and the Gemini app, with free integration into YouTube Shorts and YouTube Create. To prioritize digital safety, Google is embedding its proprietary, invisible SynthID watermark and C2PA cryptographic credentials in all outputs, a measure meant to help mitigate deepfake risks and provide clear digital provenance.
