Hi,
I have a dataset and a table in Google BigQuery (BQ). For the dataset, I can add a description, and for the table I can add a description and column policy tags to control column-level access (I am ignoring the "Labels" and "Tags" that one can attach to any BQ resource).
Next, in Dataplex, I created a lake and a zone, and then attached the previous BQ dataset to the zone.
Then I searched for the BQ table on the "Search" page under "Discover" in Dataplex. Two results come up, one with "System" as "BIGQUERY" and one with "System" as "DATAPLEX". When I open the two results, I find the following points:
What I understand is that the entry with System as BIGQUERY is Data Catalog metadata (the URL contains the string ...entryGroups/@bigquery/entries/...), whereas the one with System as DATAPLEX is a Dataplex entry. Also, for the same table, I was able to add different metadata through the Data Catalog entry and the Dataplex entry. The system is perfectly fine with this: the metadata from Data Catalog does not surface in the Dataplex entry and vice versa, and metadata from neither surfaces in the BQ UI.
Is the above behavior expected? It seems that there are three sources of metadata for the same table, one in BQ, one in Data Catalog, and one in the Dataplex entry, all independent of each other (although the Data Catalog and Dataplex metadata are each a superset of the BQ metadata).
Yes, your understanding is mostly correct. The behavior you're observing is expected and reflects the distinct purposes of each metadata store:
BigQuery Metadata (Native):
Data Catalog (via Entry Groups):
Dataplex Metadata (via Entities):
Why Three Independent Sources:
Metadata Flow (Or Lack Thereof):
Recommendations:
Thank you,
I understand the distinction between Dataplex (data lake management, governance etc.) and Data Catalog (central metadata repository).
In that sense, the attribute feature in Dataplex makes sense, since it is related to governance and applies a unified policy to the lake components. The policy also surfaces in BQ, which keeps it consistent.
However, having independent tags on the Dataplex and Data Catalog entries for the same underlying data asset does not make sense to me. It feels inconsistent and contrary to the whole idea of having a unified central view of the data assets.
The distinction between the roles of Dataplex and Data Catalog is indeed clear, with each serving its own purpose.
The independence of tags in Dataplex and Data Catalog, while providing flexibility, can lead to inconsistencies and confusion. Here’s why this might happen and some potential ways to address it:
Different Tools for Different Needs: The separation allows specialized metadata management that is optimized for specific environments: operational metadata in Dataplex and broader governance in Data Catalog. This can be useful in complex enterprises where data governance needs differ widely between departments or functions.
Potential for Inconsistency: Having separate tagging systems means that changes in one environment may not be reflected in the other unless manually synchronized. This can create challenges in maintaining a unified view of metadata.
Recommendations to Mitigate Inconsistencies:
Integration Tools: Consider using tools that can help synchronize metadata changes between systems. This might involve custom solutions that use APIs to bridge metadata updates between Dataplex and Data Catalog.
Unified Metadata Strategy: Develop a comprehensive metadata strategy that defines roles and responsibilities for each type of metadata and outlines processes for updating and maintaining consistency across systems.
Regular Audits and Reviews: Schedule regular audits of metadata to ensure that tags and policies reflect the current state of data governance and usage policies across platforms. This can help identify and rectify discrepancies.
Training and Guidelines: Educate data stakeholders on the purposes and specific uses of Dataplex and Data Catalog. Clear guidelines on when and how to use each tool for tagging can prevent misuse and promote consistency.
While the independence of tagging systems in Dataplex and Data Catalog offers tailored approaches to metadata management, it also requires careful strategy and tools to ensure that the overarching goal of unified data governance is not compromised.
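As a rough illustration of the audit and synchronization ideas above, here is a minimal Python sketch. It assumes the tags for one table have already been fetched from each system and flattened into plain dictionaries (field name to value); the fetching itself, for example through the Data Catalog and Dataplex client libraries, is left out. The names `Discrepancy`, `find_discrepancies`, and `plan_sync` are hypothetical helpers for this example, not part of any Google Cloud API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumption: tags from each system are flattened to {field_name: value}.
TagMap = Dict[str, str]


@dataclass(frozen=True)
class Discrepancy:
    """One tag field whose value differs between the two systems."""
    field: str
    data_catalog_value: Optional[str]  # None means the field is absent
    dataplex_value: Optional[str]      # None means the field is absent


def find_discrepancies(data_catalog_tags: TagMap,
                       dataplex_tags: TagMap) -> List[Discrepancy]:
    """Audit step: list every field that is missing or different on one side."""
    diffs = []
    for field in sorted(set(data_catalog_tags) | set(dataplex_tags)):
        dc_value = data_catalog_tags.get(field)
        dp_value = dataplex_tags.get(field)
        if dc_value != dp_value:
            diffs.append(Discrepancy(field, dc_value, dp_value))
    return diffs


def plan_sync(discrepancies: List[Discrepancy],
              source_of_truth: str = "data_catalog") -> Dict[str, Optional[str]]:
    """Sync step: compute the updates to apply to the *other* system.

    A value of None in the returned plan means "remove this field".
    """
    if source_of_truth not in ("data_catalog", "dataplex"):
        raise ValueError("source_of_truth must be 'data_catalog' or 'dataplex'")
    plan: Dict[str, Optional[str]] = {}
    for d in discrepancies:
        if source_of_truth == "data_catalog":
            plan[d.field] = d.data_catalog_value
        else:
            plan[d.field] = d.dataplex_value
    return plan
```

Running `find_discrepancies` on a schedule covers the audit recommendation, and `plan_sync` makes the "which system wins" decision explicit instead of leaving it to whoever edits the tags last. Applying the plan would still require calls to the relevant API, which this sketch does not attempt.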