How to share Google Dataplex/Data Catalog metadata outside the organization

We are sharing some BQ datasets with a third party (outside our org) using Analytics Hub, so the subscriber is able to create a linked dataset in their project and any queries they run are billed to their project.

Now, we are exploring enriching the metadata of some of these shared datasets using Dataplex. Are there any use cases/best practices on how this metadata can be shared? Two options come to mind:

  • Option 1: Enrich the metadata in Dataplex/Data Catalog, push it to a BQ dataset, and then share that dataset using Analytics Hub. This would be the cleanest solution, but I am not clear on the following points:
    • Is it possible to push Dataplex/Data Catalog metadata to BQ?
    • It seems possible to create two types of metadata for a BQ table in Dataplex: the Dataplex entity and the Data Catalog entry. Can both, either, or neither be pushed? Related question about Dataplex vs. Data Catalog metadata: Link
  • Option 2: Give a user from the third party permissions to view the metadata in Dataplex. I am not sure whether it is possible for someone outside the organization to have access to only the metadata for certain datasets in Dataplex. Maybe organize the data into lakes and then grant access at the lake level, but in that case we would not be able to share the Data Catalog metadata. This would, however, not be the preferred approach, since it means a third-party user accessing resources directly inside the project.

5 REPLIES

Dataplex:

Dataplex is designed to manage data across various storage mediums, organizing it into lakes and zones for structured access. It is not primarily used for metadata management but does maintain some metadata about these structures.

Data Catalog:

Data Catalog acts as a centralized metadata repository that enables search and discovery across various data assets in Google Cloud. It does not manage data directly but rather the metadata that describes data assets, such as those in BigQuery.

Analytics Hub:

Primarily used for sharing datasets, Analytics Hub does not directly handle the sharing of raw metadata stored in Dataplex or Data Catalog without converting this metadata into a structured dataset first.

Recommended Approach: A Hybrid Solution

Curate Essential Metadata:

Identify the most valuable metadata elements to share with the third party, which might include:

  • Technical Metadata: Column names, data types, descriptions.
  • Business Metadata: Ownership details, classifications, and tagging.
  • Data Lineage: Details concerning data origins and transformations, although Data Catalog's capabilities here are limited.
  • Dataplex-Specific Metadata: Information about lakes and zones if relevant.

Structured Metadata Export:

  • Dataplex: Use custom scripts or processes (potentially leveraging Google Cloud services like Dataflow) to programmatically extract metadata from Dataplex; the Dataplex API lets you list lakes, zones, and entities, but there is no built-in, one-step export for user consumption.
  • Data Catalog: Use the Data Catalog API to systematically export metadata associated with BigQuery datasets, likely requiring transformation into a BigQuery-friendly format. (A minimal extraction sketch covering both services follows this list.)
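As a rough illustration of both bullets, here is a minimal Python sketch using the google-cloud-dataplex and google-cloud-datacatalog client libraries; the project, location, lake, zone, and table names are placeholders to substitute with your own:

```python
from google.cloud import dataplex_v1, datacatalog_v1

# Dataplex: list the table entities registered in one zone.
dataplex_client = dataplex_v1.MetadataServiceClient()
entities = dataplex_client.list_entities(
    request=dataplex_v1.ListEntitiesRequest(
        # Placeholder resource path -- replace with your own lake/zone.
        parent="projects/my-project/locations/us-central1/lakes/customer-data/zones/curated",
        view=dataplex_v1.ListEntitiesRequest.EntityView.TABLES,
    )
)
for entity in entities:
    print(entity.id, entity.data_path)

# Data Catalog: look up the entry for a BigQuery table and read its schema.
dc_client = datacatalog_v1.DataCatalogClient()
entry = dc_client.lookup_entry(
    request=datacatalog_v1.LookupEntryRequest(
        linked_resource="//bigquery.googleapis.com/projects/my-project"
        "/datasets/sales/tables/customer_transactions"
    )
)
for column in entry.schema.columns:
    print(column.column, column.type_, column.description)
```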

Create a Metadata Dataset:

  • Prepare a BigQuery dataset specifically designed for sharing, which may include separate tables for different metadata types (technical, business, lineage); a minimal setup sketch follows this list.
  • Use clear and descriptive naming conventions for easy understanding and consumption by third parties.
  • If sharing Dataplex entities, include a mapping table correlating them with their respective BigQuery datasets.
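Picking up the first bullet, here is a minimal sketch that creates such a dataset and a technical-metadata table mirroring the illustrative example later in this post; all project, dataset, and table names are hypothetical:

```python
from google.cloud import bigquery

bq = bigquery.Client(project="my-project")

# Dedicated dataset for the metadata that will be published.
bq.create_dataset("shared_metadata", exists_ok=True)

# One table per metadata type; this is the technical-metadata table.
schema = [
    bigquery.SchemaField("entity_type", "STRING"),
    bigquery.SchemaField("entity_name", "STRING"),
    bigquery.SchemaField("column_name", "STRING"),
    bigquery.SchemaField("data_type", "STRING"),
    bigquery.SchemaField("description", "STRING"),
    bigquery.SchemaField("pii_flag", "BOOLEAN"),
]
table = bigquery.Table("my-project.shared_metadata.technical_metadata", schema=schema)
bq.create_table(table, exists_ok=True)
```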

Share Metadata via Analytics Hub:

  • Publish your well-structured metadata dataset on Analytics Hub, allowing third parties to access it as they would any shared dataset.
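If you want to script the publishing step as well, the Analytics Hub client library (google-cloud-bigquery-analyticshub) can create the listing. This is a hedged sketch: it assumes a data exchange named my_exchange already exists, and all resource names are placeholders:

```python
from google.cloud import bigquery_analyticshub_v1 as ah

client = ah.AnalyticsHubServiceClient()

listing = ah.Listing(
    display_name="Shared metadata",
    description="Curated Dataplex/Data Catalog metadata for subscribers.",
    bigquery_dataset=ah.Listing.BigQueryDatasetSource(
        dataset="projects/my-project/datasets/shared_metadata"  # placeholder
    ),
)

client.create_listing(
    # Assumes the data exchange was created beforehand.
    parent="projects/my-project/locations/us/dataExchanges/my_exchange",
    listing=listing,
    listing_id="shared_metadata",
)
```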

Additional Considerations:

  • Metadata Security: Review all metadata for sensitive information. Utilize obfuscation or anonymization for sensitive elements and employ Data Catalog’s IAM controls to enhance security.
  • Metadata Updates: Implement a regular process to update the exported metadata in BigQuery, ensuring that changes in source systems are reflected in a timely manner.
  • Metadata Documentation: Provide comprehensive documentation explaining the structure, meaning, and special considerations of the metadata to ensure it is usable and understandable.

Example (Illustrative):

Imagine a Dataplex setup with a lake named "customer_data" and a table "customer_transactions". You intend to share this metadata:

  • Dataplex: Lake name, table name.
  • Data Catalog: Column names, data types, descriptions.
  • Custom Metadata: A "PII Flag" column indicating the presence of Personally Identifiable Information (PII).

Your metadata dataset structure could look like this:

| Entity Type | Entity Name            | Column Name        | Data Type | Description                | PII Flag |
|-------------|------------------------|--------------------|-----------|----------------------------|----------|
| Lake        | customer_data          | customer_id        | STRING    | Customer's unique ID       | Yes      |
| Table       | customer_transactions  | transaction_amount | FLOAT64   | Transaction amount in USD  | No       |

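Once subscribed, the third party can query the linked metadata dataset like any other shared dataset. For example (the linked dataset and table names are hypothetical):

```python
from google.cloud import bigquery

# Runs in the subscriber's project, so the query is billed to them.
bq = bigquery.Client(project="subscriber-project")

rows = bq.query(
    """
    SELECT entity_name, column_name, description
    FROM `subscriber-project.linked_metadata.technical_metadata`
    WHERE pii_flag = TRUE
    """
).result()

for row in rows:
    print(row.entity_name, row.column_name, row.description)
```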

This hybrid solution leverages the strengths of Dataplex, Data Catalog, and Analytics Hub. It gives full control over which metadata elements are shared and provides a structured, easily consumable method for third parties to access your metadata.

Hi @ms4446 

Thank you for your response. You suggest identifying and then extracting the required metadata from Dataplex and Data Catalog.

For Dataplex, you mention writing custom scripts to ETL the metadata into BQ, as there is no built-in export for user consumption. So, essentially: make the API calls via scripts or client libraries, massage the responses, and load the end result into the sink.

Wouldn't it be the same for Data Catalog too, as in writing a custom script or program in a client library to extract metadata, transform and load it into BQ (or some other sink)?

Additionally, can you also please provide input to the below:

What are scalable ways to add metadata in Data Catalog, other than through the UI?

In the documentation I see options for gcloud and client libraries for creating and attaching tags, but for the rich text overview and the steward field there do not seem to be any APIs; only the console option is mentioned for those two features. Also, I assume it is not possible to use Terraform to update/enrich the metadata.


Yes, you're correct in your understanding of how to handle metadata extraction from both Dataplex and Data Catalog. For both services, the process essentially involves using APIs to fetch metadata, followed by scripting or using client libraries to transform this data into a suitable format, and then loading it into BigQuery or another suitable data sink.

Extracting and Loading Metadata

For Data Catalog, the process mirrors that of Dataplex:

  1. Extract Metadata: Use the Data Catalog API to fetch metadata about your BigQuery tables or other data assets. This might involve retrieving information about entries, tags, and schemas.

  2. Transform Data: The raw metadata obtained from the Data Catalog API will likely need transformation to fit your organizational needs and to be compatible with BigQuery. This might include formatting data as JSON or CSV, reshaping it to match your BigQuery schema, or enriching the metadata with additional context.

  3. Load into BigQuery: The transformed metadata can then be loaded into BigQuery using a loading method suitable for your data format, such as the BigQuery client's JSON/CSV load jobs or streaming inserts. (A combined sketch of these steps follows.)
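Here is a combined sketch of the three steps, assuming a destination table like the one described earlier; all resource names are placeholders:

```python
from google.cloud import bigquery, datacatalog_v1

# 1. Extract: fetch the Data Catalog entry for a BigQuery table.
dc = datacatalog_v1.DataCatalogClient()
entry = dc.lookup_entry(
    request=datacatalog_v1.LookupEntryRequest(
        linked_resource="//bigquery.googleapis.com/projects/my-project"
        "/datasets/sales/tables/customer_transactions"
    )
)

# 2. Transform: flatten the schema into one row per column.
rows = [
    {
        "entity_type": "Table",
        "entity_name": "customer_transactions",
        "column_name": col.column,
        "data_type": col.type_,
        "description": col.description,
        "pii_flag": False,  # placeholder -- derive from your PII tags
    }
    for col in entry.schema.columns
]

# 3. Load: replace the previous export with the fresh one.
bq = bigquery.Client(project="my-project")
job = bq.load_table_from_json(
    rows,
    "my-project.shared_metadata.technical_metadata",
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
)
job.result()  # wait for the load job to finish
```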

Adding Metadata in Data Catalog

To scale the addition of metadata in Google Data Catalog beyond the UI, you can use several methods:

  1. gcloud Commands: The gcloud CLI provides commands for managing Data Catalog resources, including creating and managing tags and tag templates. This can be scripted and automated to handle metadata for multiple assets efficiently.

  2. Client Libraries: Google provides client libraries for languages like Python, Java, Node.js, etc., which can be used to programmatically interact with Data Catalog. These libraries can create, read, update, and delete metadata entries, tag templates, and tags.

  3. Bulk Operations: Although creating bulk operations natively in Data Catalog through APIs or gcloud isn't as straightforward as single entry operations, you can script these actions to iterate over multiple assets. This is particularly useful when you need to apply similar metadata (like tags) across many assets.
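For example, a scripted bulk-tagging pass over several tables might look like this sketch; the tag template and the table list are hypothetical, and the template is assumed to already exist with a boolean "pii" field:

```python
from google.cloud import datacatalog_v1

dc = datacatalog_v1.DataCatalogClient()

# Hypothetical pre-existing tag template with a boolean "pii" field.
template = "projects/my-project/locations/us-central1/tagTemplates/governance"

# Hypothetical list of BigQuery tables to tag.
tables = [
    "//bigquery.googleapis.com/projects/my-project/datasets/sales/tables/customer_transactions",
    "//bigquery.googleapis.com/projects/my-project/datasets/sales/tables/customers",
]

for resource in tables:
    entry = dc.lookup_entry(
        request=datacatalog_v1.LookupEntryRequest(linked_resource=resource)
    )
    tag = datacatalog_v1.Tag(
        template=template,
        fields={"pii": datacatalog_v1.TagField(bool_value=True)},
    )
    dc.create_tag(parent=entry.name, tag=tag)
```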

Limitations for Rich Text and Steward Fields

For certain features like adding a rich text overview and specifying a data steward, Data Catalog's current API and gcloud tooling do not provide direct support. These features are typically managed via the Google Cloud Console:

  • Rich Text Overview: This type of metadata, which is designed to provide a detailed description of a dataset or table, currently needs to be added through the Data Catalog UI.

  • Data Steward: Assigning a steward for data governance purposes is also a feature handled in the UI.

These limitations mean that for comprehensive metadata management, especially for large-scale operations, you might need to combine automated scripting for supported features with manual processes for those not accessible via APIs or gcloud.

Terraform and Metadata Management

Currently, Terraform's support for Google Data Catalog is limited to managing certain resources like entries and tag templates. Direct management of metadata fields such as rich text descriptions or steward assignments via Terraform is not supported. This aligns with the general practice where Terraform is used for infrastructure setup and configuration, but content management, which often requires frequent and dynamic updates, is handled through scripts or manual intervention.

In summary, to scale the management of metadata in Data Catalog, you would typically use automated scripts or client libraries for the supported features, and rely on the UI for those features not accessible via APIs. Combining these methods allows for a more comprehensive approach to metadata management across your data assets.

Hi @ms4446,

Thanks again for your feedback. Seems like writing custom scripts to extract the metadata and push to a sink is the way to go. But before we make decisions based on that, would like to get confirmation on one thing.

In this post, you mention that there is a "Metadata" tab in Dataplex, where it can be configured so that Dataplex metadata is published to BigQuery. However, I have found no such tab, and I have all the necessary permissions. Other users have reported a similar issue of not being able to view the "Metadata" tab, and I found no reference to it in the documentation. Just wanted to confirm whether this tab is no longer available, or whether it only appears for specific storage types attached as assets to zones. I am trying to attach BQ resources as assets, enrich them, and then push the metadata back to a BQ dataset. Since you do not mention the "Metadata" tab in this thread, I assume that feature is discontinued?

Sorry for the confusion. The "Metadata" tab mentioned in the older Google Cloud Community post is inaccurate. There is no reference to this "Metadata" tab in the current Dataplex documentation, and many users, including yourself, have reported not being able to locate it.

As of now, the recommended approach for pushing Dataplex metadata to BigQuery involves using the Dataplex REST API to programmatically export metadata and then load it into BigQuery using your preferred method (e.g., Python scripts, Cloud Functions).
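A compact sketch of that flow using the Python client library rather than raw REST calls; all resource names are placeholders:

```python
from google.cloud import bigquery, dataplex_v1

# Pull entity metadata for one zone from the Dataplex Metadata API.
dataplex = dataplex_v1.MetadataServiceClient()
rows = []
for entity in dataplex.list_entities(
    request=dataplex_v1.ListEntitiesRequest(
        parent="projects/my-project/locations/us-central1/lakes/customer-data/zones/curated",
        view=dataplex_v1.ListEntitiesRequest.EntityView.TABLES,
    )
):
    rows.append(
        {"entity_name": entity.id, "data_path": entity.data_path, "asset": entity.asset}
    )

# Load the extracted metadata into a BigQuery table (schema auto-detected).
bigquery.Client(project="my-project").load_table_from_json(
    rows, "my-project.shared_metadata.dataplex_entities"
).result()
```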

Alternative to Consider:

While the direct "Metadata" tab option seems unavailable, you might want to explore a Dataplex metadata export feature if one becomes available. Please note that, as of the latest documentation, Dataplex does not explicitly mention an automatic metadata export to Google Cloud Storage in Avro format for direct use; you would typically need to implement a custom solution for exporting metadata. Check the Dataplex documentation for the most current capabilities.