Gemma 3n Is Here: Google’s Bold Leap into Mobile-First, Multimodal AI

[Image: Gemma 3n]

The first Gemma model, launched early last year, has sparked a massive wave of innovation, amassing over 160 million downloads and creating what’s now called the “Gemmaverse.” From safety-focused AI to medical models and community-driven variants like Japanese Gemma from the Institute of Science Tokyo, it’s been a journey of rapid evolution.

And now, Google is taking the next big step.

Say Hello to Gemma 3n

After teasing us with a preview last month, Google has officially launched Gemma 3n, a powerful model built on a mobile-first architecture. Designed for developers, it is fully optimized for on-device performance while supporting multimodal capabilities: text, image, video, and audio.

You can fine-tune and deploy it using tools you already love: Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, MLX, and more.
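
As a taste of that workflow, here is a minimal inference sketch using Hugging Face Transformers. It assumes a recent Transformers release with Gemma 3n support and the google/gemma-3n-E4B-it checkpoint name on Hugging Face; swap in whichever Gemma 3n variant you actually download.

```python
from transformers import pipeline

# Assumed checkpoint name; use the Gemma 3n variant you pulled from Hugging Face.
pipe = pipeline(
    "image-text-to-text",           # Gemma 3n is multimodal, so it loads under this task
    model="google/gemma-3n-E4B-it",
    device_map="auto",              # place weights on the available GPU/CPU automatically
)

# Chat-style messages; image parts can be added to the same structure (see the vision section below).
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Explain on-device AI in one sentence."},
    ]},
]

out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # the model's reply
```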

Let’s break down what’s new.

What Makes Gemma 3n Special?

  1. Multimodal by Design: Gemma 3n understands text, image, video, and audio inputs and generates text from them, all directly on your device.
  2. Two Sizes, Big Impact: Available in E2B (5B parameters) and E4B (8B parameters) sizes, the models are optimized to run in as little as 2GB and 3GB of memory, respectively, thanks to smart architectural tricks.
  3. MatFormer Architecture: Like nested dolls, MatFormer (a "Matryoshka Transformer") contains smaller models inside a larger one. You can run the full E4B model or the faster E2B variant that is already extracted from it. Want something in between? Use Mix-n-Match to custom-size a model for your hardware (a toy illustration follows this list).
  4. Per-Layer Embeddings (PLE): These allow large models to run on smaller devices by offloading embeddings to the CPU, while keeping core transformer weights in GPU/TPU memory.
  5. KV Cache Sharing: For long audio and video inputs, this new technique speeds up prefill by 2x, cutting the delay before the first token of a response arrives.
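
To make the nesting idea concrete, here is a toy PyTorch sketch of the Mix-n-Match principle. It is an illustration only, not Google's implementation: a feed-forward block whose smaller configurations are literal slices of the full weight matrices, so one set of weights can be served at several effective sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NestedFFN(nn.Module):
    """Toy MatFormer-style feed-forward block: smaller configs are slices of the full weights."""

    def __init__(self, d_model: int = 512, d_ff_full: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff_full)     # weights sized for the largest configuration
        self.down = nn.Linear(d_ff_full, d_model)

    def forward(self, x: torch.Tensor, d_ff_active: int = 2048) -> torch.Tensor:
        # Mix-n-Match: choose how much of the hidden width to activate at inference time.
        h = F.linear(x, self.up.weight[:d_ff_active], self.up.bias[:d_ff_active])
        h = F.gelu(h)
        return F.linear(h, self.down.weight[:, :d_ff_active], self.down.bias)


block = NestedFFN()
x = torch.randn(1, 16, 512)
full = block(x)                      # full-width pass (think "E4B")
small = block(x, d_ff_active=1024)   # same weights, half the FFN width (think "E2B")
```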

[Image: LMArena results]

Built-In Audio Understanding

Using a version of Google’s Universal Speech Model (USM), Gemma 3n enables the following (a minimal transcription sketch follows the list):

  • Speech-to-text (ASR) directly on the device.
  • Speech translation (AST) between English and languages such as Spanish, French, and Italian.
  • Support for audio clips up to 30 seconds today (longer clips coming soon).
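
Here is the transcription sketch promised above, assuming a recent Transformers release that exposes Gemma3nForConditionalGeneration and accepts audio entries in chat templates; the checkpoint name and the meeting_clip.wav path are placeholders.

```python
import torch
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

model_id = "google/gemma-3n-E4B-it"  # assumed Hugging Face checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3nForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One user turn containing an audio clip (up to ~30 seconds today) plus an instruction.
messages = [
    {"role": "user", "content": [
        {"type": "audio", "audio": "meeting_clip.wav"},
        {"type": "text", "text": "Transcribe this recording."},
    ]},
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, i.e. the transcript.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```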

[Image: MatFormer]

Meet MobileNet-V5: A Vision Powerhouse

Gemma 3n features the brand-new MobileNet-V5-300M vision encoder, delivering real-time image and video analysis. Key benefits:

  • Supports resolutions from 256x256 to 768x768
  • Handles 60 FPS on a Pixel device
  • Up to 13x faster (with quantization) than the vision encoder it replaces, with lower memory usage
  • Outperforms many cloud models in vision-language tasks

This is achieved through smart design: a deeper, wider pyramid model, distillation techniques, and a new fusion adapter for better token quality.
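
As a rough sketch of the vision path, the same chat interface shown earlier can take an image alongside a text prompt; the checkpoint name is assumed as before and the image URL is only a placeholder.

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3n-E4B-it", device_map="auto")

# An image part plus a text instruction in a single user turn.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "https://example.com/street_scene.jpg"},  # placeholder URL
        {"type": "text", "text": "Describe what is happening in this photo."},
    ]},
]

print(pipe(text=messages, max_new_tokens=96)[0]["generated_text"][-1]["content"])
```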

Developer-First, Community-Driven

From AMD and Hugging Face to Docker, Red Hat, and many more, Gemma 3n is supported across your favorite open-source tools. It’s built with and for the developer community.

To celebrate this, Google is launching the Gemma 3n Impact Challenge. With $150,000 in prizes, it’s inviting devs to create on-device AI solutions that make a real-world difference.

Ready to Get Started?

Here’s how you can explore Gemma 3n right now:

  • Try instantly: Launch experiments in Google AI Studio
  • Download the models: Available on Hugging Face and Kaggle
  • Explore the docs: Fine-tuning, inference, and deployment guides are available
  • Use your favorite tools: From Transformers to llama.cpp, Ollama, MLX, and more
  • Deploy anywhere: Cloud Run, Vertex AI, GenAI API, NVIDIA API Catalog, and more

Final Thoughts

Gemma 3n isn’t just another AI release—it’s a big leap in making advanced, multimodal AI truly portable. Whether you're building the next smart app, translating real-time audio, or designing edge vision systems, this model gives you the power to build—and deploy—intelligently, right from your device.

With the open challenge, Google is making it clear: AI’s future is not just cloud-first, it's device-ready.