Inside Google’s Agentic Data Cloud Architecture for Enterprise AI

Over the last ten years, “data platform” and “AI platform” have operated in separate worlds. Data teams focused on warehouses, lakes, catalogs, and pipelines, while AI teams delivered models, APIs, and agents. The two sides communicated through fragile batch processes, manual tasks, and clunky exports.

At Google Cloud Next 2026, Google claimed that this separation is now the biggest limitation holding back enterprise agentic AI. To address it, the company unveiled the Agentic Data Cloud: an update to its data platform design that treats agents as first-class consumers of enterprise data, giving them the same precision BigQuery once offered BI tools. Six major features sit at the core of this change: the Knowledge Catalog, Smart Storage, the Deep Research Agent, the Data Agent Kit, the Lightning Engine for Apache Spark, and the Cross-Cloud Lakehouse.

This post walks an enterprise architect through what each component does, how the pieces connect, the metrics Google shared, and where the tangible ROI lies.

Why “agentic data” demands a fresh architecture

Agents break down in enterprise settings for common, predictable reasons. Around seventy percent of companies discover the flaws in their data infrastructure only after deploying agents: missing metadata, unmanaged unstructured data, scattered multi-cloud estates, sluggish analytics tools, and no reliable way for agents to locate or use data assets. The Agentic Data Cloud is built to tackle exactly those problems.

Every element follows three key design ideas:

  1. Context served as a tool. Agents need a continuously updated, machine-readable picture of what data exists, what it means, and how entities relate.
  2. Open and multi-cloud by default. An agent strategy cannot depend on a data layer locked to a single vendor, so Iceberg, MCP, and cross-cloud queries form the foundation.
  3. Unified execution for SQL, Spark, and AI. Agents, BI dashboards, and Spark jobs should work against one governed copy of the data rather than duplicating it (a minimal sketch follows this list).
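To make the third principle concrete, here is a minimal sketch, assuming a hypothetical BigQuery-managed Iceberg table plus the standard google-cloud-bigquery client and the spark-bigquery connector: the same governed table serves a SQL query and a Spark job without either side making a copy.

```python
# Minimal sketch of "unified execution": one governed table, two engines, no copies.
# Project, dataset, and table names are illustrative placeholders.
from google.cloud import bigquery
from pyspark.sql import SparkSession

TABLE = "acme-analytics.sales.orders"  # hypothetical BigQuery-managed Iceberg table

# An agent or BI dashboard issues SQL through the BigQuery engine...
bq = bigquery.Client(project="acme-analytics")
rows = bq.query(
    f"SELECT region, SUM(amount) AS revenue FROM `{TABLE}` GROUP BY region"
).result()
for row in rows:
    print(row.region, row.revenue)

# ...while a Spark job reads the same table via the spark-bigquery connector,
# with no export or duplicate copy of the data.
spark = SparkSession.builder.appName("unified-read").getOrCreate()
orders = spark.read.format("bigquery").option("table", TABLE).load()
orders.groupBy("region").sum("amount").show()
```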

Quick look at the reference architecture

The diagram below illustrates how six components come together to create a cohesive, agent-enabled data platform.

At the foundation of the system lies the Cross-Cloud Lakehouse, built on an open Iceberg framework that spans GCS, S3, and ADLS. Above it, the Lightning Engine accelerates Spark-based workloads, while the BigQuery engine handles SQL queries and AI processing. Everything connects to the Knowledge Catalog, which integrates Smart Storage to manage unstructured data and provides a continuously updating semantic context. Agents, whether Google’s Deep Research Agent or custom agents built with the Data Agent Kit, interact with this stack through the Model Context Protocol (MCP), which acts as a universal “API for data.”

Breaking down the components

1. Knowledge Catalog — acting as the central context system

The Knowledge Catalog, introduced on April 10, 2026 as the rebranded and expanded version of Dataplex Universal Catalog, aims to answer a fundamental question for every agent: “What data is out there, what does it mean, and can I rely on it?”

It works by enriching data in the background: it examines query logs, profiles tables, analyzes BI semantic models such as Looker and LookML, and extracts entity relationships from unstructured files. Unlike the older catalog, which had to be curated by hand, the Knowledge Catalog builds and maintains a dynamic knowledge graph that evolves with the organization.

What’s new for architects: a LookML Agent now auto-generates semantics, BigQuery Graph (in preview) exposes entity-relationship logic to conversational agents, and there are native MCP endpoints. Any agent, whether Gemini, Claude, or a custom build, can discover assets without bespoke connectors.

Why it matters: the catalog is the backstop against “ungoverned RAG.” When an agent needs an exact definition, such as what “active customer” means, the Knowledge Catalog serves as the authoritative source rather than a hallucinated guess.
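As a rough illustration of “context served as a tool,” here is a hypothetical sketch using the MCP Python SDK’s FastMCP server. The tool name, the lookup table, and the definition are all invented; the point is only the shape of the interaction, where an agent asks a governed endpoint rather than guessing.

```python
# Hypothetical sketch: exposing a governed metric definition as an MCP tool,
# so an agent asks the catalog instead of guessing. The local dictionary is a
# stand-in for whatever the Knowledge Catalog's MCP endpoint actually serves.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("knowledge-catalog-sketch")

GOVERNED_DEFINITIONS = {
    "active_customer": {
        "definition": "A customer with at least one completed order in the last 90 days.",
        "source_table": "sales.orders",   # illustrative
        "owner": "revenue-analytics",     # illustrative
        "last_validated": "2026-04-01",
    }
}

@mcp.tool()
def get_metric_definition(metric: str) -> dict:
    """Return the governed definition of a business metric, or an explicit miss."""
    return GOVERNED_DEFINITIONS.get(
        metric, {"error": f"No governed definition for '{metric}'; do not invent one."}
    )

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio for any MCP-capable agent
```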

2. Smart Storage — adding intelligence within GCS

Smart Storage (preview) adds intelligence directly to Google Cloud Storage. As soon as a file lands in a bucket, GCS labels it, creates embeddings, extracts key entities, and links it to the Knowledge Catalog. PDFs, images, audio files, contracts, and support tickets all become searchable assets, with no external pipeline to set up.

Enterprises with years of messy, unorganized data sitting in object storage can use this to shorten long AI preparation projects. Extraction, OCR, embedding, indexing, and organization, work that once took months, now runs as part of the storage layer itself. The architectural impact is significant: a single bucket policy now governs both the raw data and the AI-generated metadata derived from it.
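To make the scale of that absorbed work concrete, here is a sketch of the kind of do-it-yourself enrichment loop Smart Storage is meant to replace. The OCR, embedding, and indexing helpers are deliberately left as placeholders; only the GCS calls are real.

```python
# What teams used to build themselves, sketched with hypothetical helpers:
# pull each object from GCS, OCR it, embed it, and push it into a search index.
# Smart Storage's pitch is that this loop happens inside the bucket instead.
from google.cloud import storage

def ocr_text(data: bytes) -> str: ...        # placeholder for a DIY OCR step
def embed(text: str) -> list[float]: ...     # placeholder for a DIY embedding step
def index_document(name: str, text: str, vector: list[float]) -> None: ...  # placeholder

client = storage.Client()
bucket = client.bucket("acme-contracts")     # illustrative bucket name

for blob in bucket.list_blobs(prefix="contracts/"):
    payload = blob.download_as_bytes()
    text = ocr_text(payload)
    vector = embed(text)
    index_document(blob.name, text, vector)
```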

3. Deep Research Agent — solving the toughest questions

Google introduced the Deep Research Agent and its more advanced counterpart, Deep Research Max, to highlight how an agent performs when integrated with the Knowledge Catalog and Cross-Cloud Lakehouse. Using Gemini 3.1 Pro, it creates detailed research plans, navigates both internal files and the open web, and delivers reports with citations along with an audio summary.

For businesses in finance, life sciences, market research, or competitive intelligence, work that used to take analysts one to three weeks now yields a solid, citation-backed draft within minutes. Two SKUs let architects balance speed against quality: standard Deep Research suits interactive use, while Deep Research Max fits complex, critical synthesis.

4. Data Agent Kit — embed the agent where developers work

The Data Agent Kit (preview) serves as the platform’s interface for developers. This kit provides a set of MCP tools, prebuilt agents, and extensions tailored to specific environments. It integrates Google’s data agents with platforms like VS Code, Gemini CLI, Codex, and Claude Code.

The package includes three ready-made agents. The Data Engineering Agent converts natural-language requests into managed pipelines, choosing between tools like BigQuery, dbt, Spark, and Airflow. The Data Science Agent automates the full model lifecycle on BigQuery and Spark. The Database Observability Agent finds and fixes infrastructure problems. MCP support covers BigQuery, Spanner (in preview), AlloyDB, Cloud SQL (general availability), and Looker (also in preview).

For architects, the payoff is a lower cost of building agents: teams no longer need to hand-roll fragile data wrappers and can instead start from a reliable, secure, and reusable set of tools.
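As a rough sketch of what that reuse could look like in agent code, the snippet below uses the MCP Python SDK’s streamable HTTP client to discover and call a tool. The endpoint URL and the run_sql tool name are placeholders I invented, not documented Data Agent Kit interfaces.

```python
# Hypothetical sketch of consuming a data MCP endpoint from agent code,
# assuming the MCP Python SDK's streamable HTTP client. The endpoint URL and
# tool name are placeholders, not documented Data Agent Kit interfaces.
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

ENDPOINT = "https://example.internal/mcp/bigquery"  # hypothetical MCP endpoint

async def main() -> None:
    async with streamablehttp_client(ENDPOINT) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover what the endpoint offers instead of hard-coding wrappers.
            tools = await session.list_tools()
            print([t.name for t in tools.tools])

            # Call a (hypothetical) query tool the same way any IDE or agent would.
            result = await session.call_tool(
                "run_sql", {"query": "SELECT COUNT(*) FROM sales.orders"}
            )
            print(result.content)

asyncio.run(main())
```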

5. Lightning Engine — Spark, made faster

Lightning Engine is a C++ vectorized execution layer for Google Cloud Serverless for Apache Spark, built on Apache Gluten and Velox. The performance numbers are notable: up to 4.9 times faster than open-source Spark, 3.6 times faster on TPC-H-style workloads at 10 TB, and roughly twice the price-performance of the leading commercial Spark competitor. It is available now with Serverless for Apache Spark runtime 2.3 on the premium pricing tier.

Why this matters in an agent-driven setup: agents do more than ask questions. They generate code, often Spark and PySpark, and execute it in loops. Even a modest boost in Spark performance compounds across the hundreds of jobs that agents kick off. Lightning Engine helps make “agent-generated Spark” affordable at scale.
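For a sense of what “agent-generated Spark” typically looks like, here is an illustrative PySpark job of the kind an agent might produce and submit repeatedly. The table, columns, and output path are invented, and the runtime 2.3 premium-tier selection happens at submission time rather than in the code.

```python
# The flavor of Spark an agent typically generates and runs in a loop:
# read a governed table, filter, aggregate, write a result set.
# Table name, columns, and output path are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("agent-generated-job").getOrCreate()

orders = (
    spark.read.format("bigquery")
    .option("table", "acme-analytics.sales.orders")   # hypothetical table
    .load()
)

daily_revenue = (
    orders.filter(F.col("status") == "COMPLETE")
    .groupBy(F.to_date("created_at").alias("day"), "region")
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").parquet("gs://acme-analytics-out/daily_revenue/")
```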

6. Cross-Cloud Lakehouse — one query works across three clouds

The Cross-Cloud Lakehouse (preview) delivers a capability that enterprise architects running genuinely multi-cloud environments have long anticipated. Using the Iceberg REST Catalog, Cross-Cloud Interconnect, and a smart caching layer, it lets BigQuery query data stored in AWS S3 and Azure ADLS directly. Google claims total cost and performance on par with native cloud warehouses.

This tackles the “where does the data live?” problem for agents. A procurement agent can combine Ariba data in BigQuery with contract PDFs stored in S3 and inventory details in Azure, all under a unified Iceberg catalog and without anyone building an ETL process. For enterprises that run on AWS or Azure (and based on my work with SAP, Salesforce, Workday, and Oracle systems, all Fortune 500 companies do), this is the most consequential part of the announcement.
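Here is a hedged sketch of the procurement example as a single BigQuery query issued from Python. All dataset and table names are invented; the premise, taken from the announcement, is that S3- and ADLS-backed Iceberg tables surface in BigQuery like any other table.

```python
# Hypothetical sketch of the procurement example as one BigQuery query.
# All dataset/table names are invented; the premise is that S3- and ADLS-backed
# Iceberg tables appear in BigQuery like any other table via the shared catalog.
from google.cloud import bigquery

client = bigquery.Client(project="acme-procurement")

sql = """
SELECT
  po.supplier_id,
  po.total_spend,          -- Ariba purchase orders, native BigQuery (GCS-backed)
  c.renewal_date,          -- contract metadata extracted from PDFs in S3
  inv.on_hand_units        -- inventory snapshots landed in Azure ADLS
FROM `ariba.purchase_orders`     AS po
JOIN `contracts_s3.metadata`     AS c   USING (supplier_id)
JOIN `inventory_adls.snapshots`  AS inv USING (supplier_id)
WHERE po.fiscal_year = 2026
"""

for row in client.query(sql).result():
    print(row.supplier_id, row.total_spend, row.renewal_date, row.on_hand_units)
```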

How this affects enterprise architects

People who design integration and data systems need to keep a few things in mind.

First, MCP is no longer just a developer tool; it’s a governance priority. Manage MCP endpoints the way you manage your API gateway today: strict versioning, authentication, authorization, monitoring, and usage limits. The Knowledge Catalog and Data Agent Kit provide the hooks, but enforcing those standards is still on you.
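As a generic illustration (not a Google API), the sketch below wraps an MCP tool with the kind of controls you would expect at an API gateway: an authorization check, a crude per-caller rate limit, and a version tag. The scopes, quota store, and tool are all invented.

```python
# Generic illustration only: the same controls you apply at an API gateway,
# wrapped around an MCP tool. Scopes, quotas, and the tool are placeholders.
import time
from collections import defaultdict
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("governed-data-tools")

ALLOWED_SCOPES = {"finance-agent": {"read:sales"}}   # illustrative authz table
CALLS: dict[str, list[float]] = defaultdict(list)    # naive per-caller usage log
RATE_LIMIT = 60  # calls per hour, illustrative

def authorize(caller: str, scope: str) -> None:
    """Check scope and a simple hourly quota before the tool touches any data."""
    if scope not in ALLOWED_SCOPES.get(caller, set()):
        raise PermissionError(f"{caller} lacks scope {scope}")
    window = [t for t in CALLS[caller] if t > time.time() - 3600]
    if len(window) >= RATE_LIMIT:
        raise RuntimeError(f"{caller} exceeded {RATE_LIMIT} calls/hour")
    CALLS[caller] = window + [time.time()]

@mcp.tool()
def read_sales_summary(caller: str, region: str) -> dict:
    """Versioned, audited tool: authorization runs before any data access."""
    authorize(caller, "read:sales")
    return {"region": region, "revenue": 1_234_567, "tool_version": "v1"}  # stub result

if __name__ == "__main__":
    mcp.run()
```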

Second, the catalog shifts from documentation to runtime dependency. If an agent’s correctness depends on the Knowledge Catalog delivering accurate semantics, then catalog freshness, lineage history, and service-level agreements become operational concerns. Plan for that in your operating model from the start.

Third, multi-cloud stops being a migration story and becomes an interoperability story. The Cross-Cloud Lakehouse and managed Iceberg reduce the penalty for leaving data where it is rather than moving it. Architecture discussions shift from “how do we consolidate onto one cloud” to “how do we integrate multiple clouds under one governed catalog,” which is both a different operating model and a better match for where most enterprises already are.

Final thought

The Agentic Data Cloud is not a single product launch. It’s a deliberate re-architecture of Google’s data stack so agents can interact with enterprise data safely, precisely, and cost-effectively. The practical next steps for architects: map where your data actually lives, prioritize the agents that will deliver value first, and pilot Smart Storage and the Knowledge Catalog in a domain where getting answers wrong is expensive enough to force rigor.

Companies that handle this well over the next year won’t just gain access to agents. They’ll create a solid, agent-ready data system that the rest of their business can keep using for the next ten years.
