Data Catalog is a fully managed and scalable metadata management service that enables you to discover and govern your data assets in Google Cloud.
Let’s see it in action. Imagine you have a dataset of customer orders in BigQuery. You want to make it discoverable for your data analysts, but also ensure that sensitive PII is protected.
First, we’ll enable the Data Catalog API.
gcloud services enable datacatalog.googleapis.com
Next, we’ll find the entry for our BigQuery table. Data Catalog creates entries for BigQuery tables automatically, taking the name, schema, and description from BigQuery, so there is nothing to create by hand (gcloud data-catalog entries create is meant for custom entries and Cloud Storage filesets). Instead, we look the entry up by the table’s fully qualified resource name:
gcloud data-catalog entries lookup \
'//bigquery.googleapis.com/projects/your-gcp-project-id/datasets/your_dataset_id/tables/customer_orders'
The output includes the entry’s full resource name; its last segment is the entry ID we need when attaching tags.
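Data Catalog refers to a BigQuery table by its fully qualified resource name, so a small helper can keep that string consistent across scripts. This is a minimal sketch: the function name bq_linked_resource and the RUN_LOOKUP switch are our own assumptions, the IDs are placeholders, and the lookup is attempted only when RUN_LOOKUP=1 is set and gcloud is installed.

```shell
# Build the linked resource name Data Catalog uses for a BigQuery table.
# Arguments: project ID, dataset ID, table ID (placeholders in this sketch).
bq_linked_resource() {
  printf '//bigquery.googleapis.com/projects/%s/datasets/%s/tables/%s\n' "$1" "$2" "$3"
}

resource="$(bq_linked_resource your-gcp-project-id your_dataset_id customer_orders)"
echo "$resource"

# Only call the API when explicitly requested and gcloud is available.
if [ "${RUN_LOOKUP:-0}" = "1" ] && command -v gcloud >/dev/null 2>&1; then
  gcloud data-catalog entries lookup "$resource"
fi
```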
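Now, we can attach further metadata to this entry as tags. A tag is an instance of a tag template, which defines the fields a tag may carry and their types. For example, assuming a template with a boolean field has_pii already exists, we can tag this table to indicate it contains personally identifiable information. The field values go in a small JSON file, here tag.json, containing {"has_pii": true}. The entry location must match the location of the BigQuery dataset, and BigQuery entries live in the @bigquery entry group.
gcloud data-catalog tags create \
--project=your-gcp-project-id \
--location=us-central1 \
--entry-group=@bigquery \
--entry=your-datacatalog-entry-id \
--tag-template=your_tag_template_id \
--tag-template-location=us-central1 \
--tag-file=tag.json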
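The tags create command reads a tag's field values from a JSON or YAML file passed with --tag-file. As a rough sketch, such a file could be written like this; the field IDs has_pii and data_owner, the file path, and the owner address are hypothetical and would have to match the fields defined in your own tag template.

```shell
# Write the field values for one tag. The field IDs has_pii and data_owner
# are hypothetical; they must match fields defined in your tag template.
tag_file="${TMPDIR:-/tmp}/customer_orders_tag.json"
cat > "$tag_file" <<'EOF'
{
  "has_pii": true,
  "data_owner": "orders-team@example.com"
}
EOF

# Show the file that would be passed as --tag-file.
cat "$tag_file"
```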
You can also create custom tag templates to define your own metadata schemas. This allows for structured and consistent tagging across your organization.
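A tag template can be created from the command line with gcloud data-catalog tag-templates create. The sketch below only prints the command unless DRY_RUN=0 is set. The run helper, the template ID data_sensitivity, and both fields are assumptions chosen for illustration, not names from any existing project.

```shell
# Print commands by default; set DRY_RUN=0 to execute them for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical template with one boolean field and one enum field.
create_template() {
  run gcloud data-catalog tag-templates create data_sensitivity \
    --project=your-gcp-project-id \
    --location=us-central1 \
    --display-name="Data Sensitivity" \
    --field=id=has_pii,display-name=PII,type=bool \
    --field=id=sensitivity,display-name=Sensitivity,type='enum(LOW|MEDIUM|HIGH)'
}

create_template
```

Defining the enum values in the template is what keeps tagging consistent: a tag that uses this template can only set sensitivity to one of the listed values.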
The real power of Data Catalog comes from its ability to automatically ingest metadata from Google Cloud services such as BigQuery, Pub/Sub, and Dataproc Metastore. This means you don’t have to catalog everything manually: as new tables are created in BigQuery or new topics in Pub/Sub, Data Catalog picks them up automatically. Cloud Storage is the exception. Files there are not discovered automatically, so you register them yourself by creating fileset entries.
Once your data assets are cataloged, users can search for them using the Google Cloud Console or programmatically via the Data Catalog API. They can search by table name, description, tags, or even specific column names.
Consider a scenario where a marketing analyst needs to find all tables containing customer email addresses. They can simply search Data Catalog for "email" and filter by tags like "PII" or "customer data." This dramatically reduces the time spent searching for the right data.
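The analyst's search can be narrowed with Data Catalog's qualified predicates, such as column: and tag:. Here is a minimal sketch, assuming a hypothetical tag template data_sensitivity with a has_pii field in your-gcp-project-id. The build_query and run helpers are our own, and the command is printed rather than executed unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to execute them for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build a query for BigQuery tables that have a column matching the term ($1)
# and carry the hypothetical has_pii=true tag from a template in project $2.
build_query() {
  printf 'system=bigquery type=table column:%s tag:%s.data_sensitivity.has_pii=true' "$1" "$2"
}

query="$(build_query email your-gcp-project-id)"
run gcloud data-catalog search "$query" \
  --include-project-ids=your-gcp-project-id \
  --limit=20
```

Space-separated predicates are combined with a logical AND, so each one narrows the result set further.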
Data Catalog also integrates with other Google Cloud services for data governance. For instance, you can use IAM to control who has access to view or modify metadata in Data Catalog. You can also integrate with Data Loss Prevention (DLP) to automatically scan and classify sensitive data, and then use those classifications as tags in Data Catalog.
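As one example of the IAM side, read-only access to catalog metadata can be granted with the predefined role roles/datacatalog.viewer. The sketch below only composes the command for review and does not run it. The function name viewer_binding_cmd and the group address are placeholders we made up for illustration.

```shell
# Compose (but do not run) the IAM binding that lets a principal view
# Data Catalog metadata in a project.
# Arguments: project ID, member (for example group:NAME@DOMAIN).
viewer_binding_cmd() {
  printf 'gcloud projects add-iam-policy-binding %s --member=%s --role=roles/datacatalog.viewer\n' "$1" "$2"
}

viewer_binding_cmd your-gcp-project-id group:data-analysts@example.com
```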
The "search index" for Data Catalog is not a traditional Lucene or Elasticsearch index that you can query directly. Instead, Data Catalog builds and manages its own internal search infrastructure from the metadata it ingests. When you perform a search, you’re querying this managed index through the Data Catalog API, which returns the matching cataloged assets. This abstraction means you don’t need to worry about the underlying indexing technology, but it also means you can’t tune or inspect the search index yourself.
The next step is to explore how to build custom search experiences and integrate Data Catalog with data governance policies.