Solved: Dataplex metadata vs. Data Catalog metadata

shekhar0413

Hi,

I have a dataset and a table in Google BigQuery (BQ). For the dataset, I can add a description, and for the table I can add a description and column policy tags to control column-level access (I am ignoring the "Labels" and "Tags" that one can attach to any BQ resource).

Next, in Dataplex, I created a lake and a zone, and then attached the previous BQ dataset to the zone.

Then I searched for the BQ table on the "Search" page under "Discover" in Dataplex. Two results come up, one with "System" as "BIGQUERY" and one with "System" as "DATAPLEX". When I open the two results, I find the following:

The one with System as BIGQUERY refers to the BQ table, whereas the one with System as DATAPLEX refers to the entity created for the table in the Dataplex zone.
For the one with System as BIGQUERY, I can add an Overview and a Steward, and attach Tags using Tag templates. Additionally, for the columns, I can attach tags and add business terms. For the one with System as DATAPLEX, I cannot add an Overview or Steward, but I can attach Tags using Tag templates and attach Attributes. Additionally, for the columns, I can only add Column Attributes.

What I understand is that the entry with System as BIGQUERY is Data Catalog metadata (the URL contains the string ...entryGroups/@bigquery/entries/...), whereas the one with System as DATAPLEX is a Dataplex entry. Also, for the same table, I was able to add different metadata using the Data Catalog entry and the Dataplex entry. The system is perfectly fine with it: the metadata from Data Catalog does not surface in the Dataplex entry and vice versa, and metadata from neither surfaces in the BQ UI.
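To make the URL check above concrete, here is a minimal sketch that pulls the entry name components out of a URL or resource name and tests whether the entry group is `@bigquery`. The function names and the sample values are hypothetical, and it assumes the `@` appears literally in the string rather than URL-encoded.

```python
import re
from typing import Dict, Optional

# Data Catalog entry names follow the pattern
# projects/{project}/locations/{location}/entryGroups/{group}/entries/{entry}.
_ENTRY_NAME = re.compile(
    r"projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/entryGroups/(?P<entry_group>[^/]+)/entries/(?P<entry>[^/?#]+)"
)


def parse_entry_name(url_or_name: str) -> Optional[Dict[str, str]]:
    """Return the entry name components found in the string, or None."""
    match = _ENTRY_NAME.search(url_or_name)
    return match.groupdict() if match else None


def is_bigquery_entry(url_or_name: str) -> bool:
    """True if the entry lives in the system-managed @bigquery entry group."""
    parts = parse_entry_name(url_or_name)
    return parts is not None and parts["entry_group"] == "@bigquery"


# Hypothetical entry name, for illustration only.
sample = "projects/my-project/locations/us/entryGroups/@bigquery/entries/abc123"
print(parse_entry_name(sample))
print(is_bigquery_entry(sample))
```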

Is the above behavior expected? Seems that there are 3 sources of metadata for the same table, one in BQ, one in Data Catalog, and one in the Dataplex entry, all independent of each other (albeit the Data Catalog and Dataplex metadata is a superset of the BQ metadata).

ms4446

Yes, your understanding is mostly correct. The behavior you're observing is expected and reflects the distinct purposes of each metadata store:

BigQuery Metadata (Native):

Location: Directly within BigQuery.
Purpose: Manages basic descriptions, column names, data types, and table labels—essential for BigQuery's operation.

Data Catalog (via Entry Groups):

Location: A separate service within Google Cloud.
Purpose: Provides rich, business-focused metadata, including technical details (schema, lineage), business context (owners, glossary terms), and governance aspects (tags, policies). Designed as a central repository, Data Catalog facilitates the discovery and understanding of data assets not only within Google Cloud but across multiple platforms, enhancing interoperability and insights.

Dataplex Metadata (via Entities):

Location: Within the Dataplex lake/zone structure.
Purpose: Specialized for data lake management, it includes metadata such as data quality metrics, classification labels, and associations with other Dataplex objects. Dataplex is tailored to streamline and govern data integration from diverse sources into cohesive data lakes.

Why Three Independent Sources:

Decoupling: This separation allows each system to focus on specific functions, providing flexibility and preventing overload in any single system.
Granularity and Specificity: BigQuery handles basic metadata efficiently, Data Catalog offers expansive, business-oriented metadata capabilities, and Dataplex focuses on operational data lake contexts.
Use Cases: Use BigQuery metadata for operational functionality, Data Catalog for comprehensive data governance and discovery, and Dataplex for data lake lifecycle management.

Metadata Flow (Or Lack Thereof):

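No Automatic Synchronization: As you observed, the basic BigQuery metadata is visible in the Data Catalog and Dataplex entries, but there is no built-in mechanism that propagates enriched metadata (tags, overviews, stewards, attributes) between these systems. Each store can be edited without overwriting the others, at the cost of possible drift between them.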
Manual Synchronization Options: Custom scripts or tools like Cloud Data Fusion can be used to manually synchronize or transform metadata between systems as needed, especially for specialized requirements like compliance or centralized reporting.
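As a rough illustration of what such a custom script has to work out, the sketch below computes a one-way sync plan between two sets of tag fields held as plain dictionaries. It deliberately makes no API calls: reading the tags from each service and applying the plan are left out. The function name, field names, and values are hypothetical.

```python
from typing import Any, Dict


def plan_one_way_sync(source: Dict[str, Any], target: Dict[str, Any]) -> Dict[str, Any]:
    """Compute the changes that would make `target` match `source`."""
    # Fields present in the source but missing from the target.
    to_add = {k: v for k, v in source.items() if k not in target}
    # Fields present in both but with different values.
    to_update = {k: v for k, v in source.items() if k in target and target[k] != v}
    # Fields that exist only in the target.
    to_remove = sorted(k for k in target if k not in source)
    return {"add": to_add, "update": to_update, "remove": to_remove}


# Hypothetical tag fields for the same table in the two systems.
catalog_tags = {"pii": True, "owner": "sales-team", "retention_days": 365}
dataplex_tags = {"pii": False, "owner": "sales-team", "tier": "gold"}

plan = plan_one_way_sync(catalog_tags, dataplex_tags)
print(plan)
```

Keeping the plan separate from the code that applies it makes it possible to review or log the intended changes before anything is written to either system.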

Recommendations:

Choose Your Source of Truth: Decide whether Data Catalog or Dataplex will serve as your primary source for enriched metadata, depending on your specific business needs. This decision can be strategic, with some metadata elements managed predominantly in one service while others are in another, based on their relevance and utility.
Develop a Governance Strategy: Implement practices to manage and maintain metadata consistently and accurately. Consider using Google Cloud's IAM for access management and enable auditing to track metadata changes.
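One way to picture a per-field source of truth is an ownership map that says which system is authoritative for each metadata field. The sketch below resolves a combined view from such a map. It is only an illustration under assumptions: the field names, the system labels, and the rule that a field is dropped when its owning system has no value are all hypothetical choices, not behavior of either service.

```python
from typing import Any, Dict


def resolve_metadata(
    owners: Dict[str, str],
    values_by_source: Dict[str, Dict[str, Any]],
    default_owner: str = "data_catalog",
) -> Dict[str, Any]:
    """Build one view of the metadata, taking each field from its owning system."""
    all_fields = set()
    for values in values_by_source.values():
        all_fields.update(values)

    resolved = {}
    for field in sorted(all_fields):
        owner = owners.get(field, default_owner)
        owner_values = values_by_source.get(owner, {})
        # Values held by a non-owning system are ignored on purpose.
        if field in owner_values:
            resolved[field] = owner_values[field]
    return resolved


# Hypothetical ownership map and metadata values.
owners = {"steward": "data_catalog", "classification": "dataplex"}
values = {
    "data_catalog": {"steward": "alice", "classification": "public", "overview": "Orders table"},
    "dataplex": {"classification": "confidential", "steward": "bob"},
}
print(resolve_metadata(owners, values))
```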

shekhar0413

Thank you,

I understand the distinction between Dataplex (data lake management, governance etc.) and Data Catalog (central metadata repository).

In that sense, the attribute feature in Dataplex makes sense, since it relates to governance and applies a unified policy to the lake components. The policy also surfaces in BQ, which keeps it consistent.

However, having independent tags on the Dataplex and Data Catalog entries for the same underlying data asset does not make sense to me. It feels inconsistent and runs against the whole idea of having a unified, central view of the data assets.

ms4446

The distinction between the roles of Dataplex and Data Catalog is indeed clear.

The independence of tags in Dataplex and Data Catalog, while providing flexibility, can lead to inconsistencies and confusion. Here’s why this might happen and some potential ways to address it:

Different Tools for Different Needs: The separation allows specialized metadata management that is optimized for specific environments—operational metadata in Dataplex and broader governance in Data Catalog. This can be useful in complex enterprises where data governance needs differ widely between departments or functions.
Potential for Inconsistency: Having separate tagging systems means that changes in one environment may not be reflected in the other unless manually synchronized. This can create challenges in maintaining a unified view of metadata.

Recommendations to Mitigate Inconsistencies:

Integration Tools: Consider using tools that can help synchronize metadata changes between systems. This might involve custom solutions that use APIs to bridge metadata updates between Dataplex and Data Catalog.
Unified Metadata Strategy: Develop a comprehensive metadata strategy that defines roles and responsibilities for each type of metadata and outlines processes for updating and maintaining consistency across systems.
Regular Audits and Reviews: Schedule regular audits of metadata to ensure that tags and policies reflect the current state of data governance and usage policies across platforms. This can help identify and rectify discrepancies.
Training and Guidelines: Educate data stakeholders on the purposes and specific uses of Dataplex and Data Catalog. Clear guidelines on when and how to use each tool for tagging can prevent misuse and promote consistency.

While the independence of tagging systems in Dataplex and Data Catalog offers tailored approaches to metadata management, it also requires careful strategy and tools to ensure that the overarching goal of unified data governance is not compromised.