Dataplex metadata in BigQuery

I need to read Dataplex metadata in BigQuery to perform data analysis on the Catalog, is it possible?

Solved
2 ACCEPTED SOLUTIONS

Yes, Dataplex metadata can be analyzed in BigQuery, but first you need to set up Dataplex to publish this metadata to BigQuery. This can be done either through the Dataplex UI or its API.

Dataplex Configuration

Before diving into the analysis, ensure the following configuration steps are completed in Dataplex:

  • Designate the specific BigQuery dataset where the metadata will be stored.
  • Define any essential data filters or transformations.

Querying Dataplex Metadata in BigQuery

Once the metadata is available in BigQuery, follow these steps for analysis:

  1. Pinpoint the BigQuery dataset housing the Dataplex metadata. This can be located in the Dataplex UI or retrieved via the Dataplex API.
  2. Construct a BigQuery query targeting the metadata of interest. Remember to reference the specific Dataplex metadata tables, not the generic INFORMATION_SCHEMA views.
  3. Execute the query to obtain your results.

For instance, to list all tables within a Dataplex data zone, you might use:

SELECT
  table_name,
  schema_name,
  data_type,
  location
FROM
  `YOUR_PROJECT_ID.YOUR_DATASET_NAME.ENTITY_TABLES`

Deep Dive into Dataplex Metadata Analysis

The depth and type of analysis you can perform on Dataplex metadata in BigQuery largely depend on the specific columns and data in the metadata tables. It's beneficial to acquaint yourself with the structure of these tables to maximize the insights you can extract.
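
For example, one quick way to get acquainted with the structure of those tables is to inspect their columns through the dataset's INFORMATION_SCHEMA. The sketch below is generic BigQuery SQL rather than anything Dataplex-specific; YOUR_PROJECT_ID and YOUR_DATASET_NAME are placeholders for wherever your metadata was published.

-- List every column of every table in the dataset that holds the published metadata.
SELECT
  table_name,
  column_name,
  data_type,
  is_nullable
FROM
  `YOUR_PROJECT_ID.YOUR_DATASET_NAME.INFORMATION_SCHEMA.COLUMNS`
ORDER BY
  table_name, ordinal_position;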

Potential analyses include:

  • Pinpointing the most frequently queried tables in your Dataplex lake (a query sketch follows this list).
  • Identifying underutilized tables.
  • Highlighting tables lacking crucial metadata.
  • Monitoring Dataplex metadata modifications over time.
  • Tracing data lineage connections between tables.
  • Conducting data quality assessments.
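
As a sketch of the first two items, one way to approximate how often tables are queried is to count references in BigQuery's INFORMATION_SCHEMA.JOBS_BY_PROJECT view. This relies on generic BigQuery job metadata rather than the Dataplex metadata tables, assumes you have permission to read project-level job metadata, and the region qualifier (region-us here) must match your location.

-- Approximate the most (and, by omission, least) queried tables over the last 30 days.
SELECT
  ref.dataset_id,
  ref.table_id,
  COUNT(*) AS query_count
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(referenced_tables) AS ref
WHERE
  job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
  ref.dataset_id, ref.table_id
ORDER BY
  query_count DESC
LIMIT 20;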

Enhancing Data Lineage and Quality Checks

While BigQuery offers robust capabilities for analyzing Dataplex metadata, certain analyses, like visual data lineage or automated quality checks, might necessitate supplementary tools or integrations. Here are some tools that can augment your analysis:

  • Data Catalog: A robust metadata management tool that aids in data discovery and understanding. It's excellent for visualizing data lineage and conducting quality checks.
  • Data Fusion: A cloud-native data integration service ideal for crafting and overseeing data pipelines. It's adept at automating data lineage tracking and quality checks.
  • Dataflow: Streamlines data transformation in both streaming and batch modes. It's suitable for crafting custom data quality checks.
  • Looker Studio (formerly Data Studio): A user-friendly tool for building interactive dashboards and reports, well suited to visualizing data lineage and quality metrics.
  • Looker: A BI and data analytics platform that's great for data exploration and visualization.

Leveraging BigQuery to scrutinize Dataplex metadata can yield invaluable insights, enhancing your data understanding and refining your data governance strategies.


Hi @ms4446 

I am running into the same issue of not being able to see the metadata tab after selecting a lake. I have attached a BQ dataset to the zone in the lake and want to publish the metadata for its tables into BQ (my assumption is that it is possible to publish Dataplex metadata for all asset types, e.g. BQ tables and Cloud Storage, into BQ). Please see screenshot below:

shekhar0413_0-1714763297588.png

I have the "editor" role, and looking at the list of permissions for the editor role, I see that it contains the "dataplex.lakes.get" and the "bigquery.datasets.get" permissions. Also, I am able to view the lakes in dataplex (I was the one who created the lake) and view/query/create tables and datasets in BQ, so permission does not seem to be an issue. Additionally, when I search for for the BQ table in "Search" in dataplex, I see two results for the table, one with "System" as "BIGQUERY" and another as "DATAPLEX", so the metadata entry seems to be present.

Can you please provide some input?

Also, we are sharing some BQ datasets with a third party using Analytics Hub. So, the ultimate goal is to share the rich Dataplex metadata for these tables with the third party as well. My approach was to publish the Dataplex metadata to BQ and then share that metadata through Analytics Hub too. That way, the third party can simply subscribe to the dataset containing the metadata and create a linked dataset in their project. If there are other approaches or best practices for sharing this rich Dataplex metadata, similar to sharing BQ data in Analytics Hub, I would be interested in those as well.

Thank you!

 


11 REPLIES


Regarding the new business glossary feature, will we already have tables for it in this metadata dataset?

The new business glossary feature in Dataplex is currently in development, so the specific tables to be included in the metadata dataset have not yet been finalized. Based on the preliminary information available, we anticipate the metadata dataset will comprise tables for:

  • Business Glossary Terms: Capturing the terms defined within the business glossary.
  • Term Definitions: Detailing the definitions associated with each business glossary term.
  • Term Relationships: Documenting the relationships between different business glossary terms.
  • Term Usage: Recording how business glossary terms are used within the data lake.

Beyond these, the metadata dataset might include additional tables pertinent to the business glossary feature, such as a table detailing user interactions with glossary terms.

Upon the official release of the business glossary feature, you will be able to query the metadata dataset in BigQuery to extract insights about the glossary terms and how they are applied. This data can help improve the precision and consistency of your datasets, making them easier for end users to understand.

Potential applications of the business glossary metadata in BigQuery include:

  • Identifying frequently used business glossary terms.
  • Pinpointing terms that remain unused.
  • Highlighting terms lacking definitions.
  • Monitoring modifications to business glossary terms over time.
  • Tracing data lineage connections between business glossary terms.
  • Conducting quality assessments on business glossary terms.

Leveraging BigQuery to analyze the business glossary metadata can yield invaluable insights, fostering improved data governance practices.

Is there an official guideline or documentation covering the necessary steps? I need to share it with our DevOps team.

Hello ms4446.

Could you help me again by showing me, step by step, how to configure Dataplex metadata in BigQuery through the UI?

 

To configure Dataplex metadata publishing to BigQuery through the UI:

Note: The exact labels and sequence of options might vary based on the version and updates to the Dataplex UI. Always refer to the official Dataplex documentation or help section within the UI for the most up-to-date instructions.

  1. Go to the Dataplex UI.
  2. Click on the Lakes tab.

Note: The "Lakes" tab shows all the data lakes configured within Dataplex.

  3. Select the lake whose metadata you want to publish to BigQuery.
  4. Click on the Metadata tab.
  5. Click on the Publish to BigQuery button.

Note: Defining data filters or transformations is an optional step for advanced users who have specific requirements. Basic users can skip this step.

  6. Select the BigQuery dataset where you want to store the metadata.
  7. Click on the Publish button.

Note: Users should periodically check the BigQuery dataset to ensure that the metadata is being updated as expected.

Example SQL query:

SELECT
  table_name,
  schema_name,
  data_type,
  location
FROM
  `YOUR_PROJECT_ID.YOUR_DATASET_NAME.ENTITY_TABLES`
WHERE
  zone_name = 'YOUR_DATA_ZONE_NAME';

Note: The table name and column name in the SQL query may vary depending on the actual Dataplex metadata schema.

Once you have published the metadata to BigQuery, you can query the Dataplex metadata tables using BigQuery SQL.
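
To confirm that the published metadata is actually landing and being refreshed (per the note above about checking the dataset periodically), one simple check is to look at each table's last-modified time. This sketch uses BigQuery's built-in __TABLES__ view with placeholder project and dataset names; it is not specific to Dataplex.

-- Show creation and last-modified times for every table in the metadata dataset.
SELECT
  table_id,
  TIMESTAMP_MILLIS(creation_time) AS created,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified,
  row_count
FROM
  `YOUR_PROJECT_ID.YOUR_DATASET_NAME.__TABLES__`
ORDER BY
  last_modified DESC;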

Hi ms4446,

I am new to Dataplex, and I am trying to follow the above steps to integrate Dataplex metadata with BQ to generate a table audit report. However, while I can see an option to manage lakes, when I click any of the lakes I cannot see the metadata tab. Am I missing any permissions needed to view the metadata tab?

Could you please shed some light on this?

Raghu.

The Dataplex metadata tab is visible for all lakes, but you must have the necessary permissions to view it. To view the metadata tab, you must have the following permissions:

  • dataplex.lakes.get permission on the lake.
  • bigquery.datasets.get permission on the BigQuery dataset where the metadata will be stored.

If you are missing either of these permissions, you will not be able to see the metadata tab.

To check your permissions, you can use the following steps:

  1. Go to the Dataplex UI.
  2. Click on the Lakes tab.
  3. Select the lake that you want to view the metadata for.
  4. Click on the Permissions tab.
  5. Review the permissions that are granted to you.

If you are missing either of the required permissions, you will need to contact your administrator to have them granted to you.

Once you have the necessary permissions, you should be able to see the metadata tab for the lake.

Here are the steps to generate a table audit report using Dataplex metadata in BigQuery:

  1. Publish the Dataplex metadata to BigQuery.
  2. Query the Dataplex metadata tables in BigQuery.
  3. Create a report using the query results.

To publish the Dataplex metadata to BigQuery, you can use the following steps:

  1. Go to the Dataplex UI.
  2. Click on the Lakes tab.
  3. Select the lake that you want to publish the metadata for.
  4. Click on the Metadata tab.
  5. Click on the Publish to BigQuery button.
  6. Select the BigQuery dataset where you want to store the metadata.
  7. Click on the Publish button.

To query the Dataplex metadata tables in BigQuery, you can use the following steps:

  1. Go to the BigQuery UI.
  2. Select the dataset where the Dataplex metadata is stored.
  3. Click on the Query tab.
  4. Write a query to select the metadata that you want to include in your report.
  5. Click on the Run button.

To create a report using the query results, you can use a variety of tools, such as the BigQuery UI, Looker Studio (formerly Data Studio), or Google Sheets.

Here is an example of a query to select the table name, schema name, and data type for all tables in a Dataplex data zone:

 

SELECT
  table_name,
  schema_name,
  data_type
FROM
  `YOUR_PROJECT_ID.YOUR_DATASET_NAME.ENTITY_TABLES`
WHERE
  zone_name = 'YOUR_DATA_ZONE_NAME';

You can use this query as a starting point for more complex queries that generate table audit reports.
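
For instance, here is a slightly fuller audit-style sketch. It uses only BigQuery's standard INFORMATION_SCHEMA views, so it works regardless of the exact Dataplex metadata schema; the project and dataset names are placeholders.

-- Per-table audit summary: type, creation time, and column count for each table in the dataset.
SELECT
  t.table_name,
  t.table_type,
  t.creation_time,
  COUNT(c.column_name) AS column_count
FROM
  `YOUR_PROJECT_ID.YOUR_DATASET_NAME.INFORMATION_SCHEMA.TABLES` AS t
LEFT JOIN
  `YOUR_PROJECT_ID.YOUR_DATASET_NAME.INFORMATION_SCHEMA.COLUMNS` AS c
  ON c.table_name = t.table_name
GROUP BY
  t.table_name, t.table_type, t.creation_time
ORDER BY
  t.table_name;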

Thanks for the quick response @ms4446

Do you have an official release date for the business glossary feature?

Thank you for the clarity of the information.
