Using the BQ Info Schema Block

The BigQuery Information Schema block has two broad uses: monitoring and optimization.

This article will focus on the optimization uses of the block, including how to locate problems and resolve common issues.

Ways to locate issues

Sometimes you may already know where a performance problem lies, for example when someone reaches out about a slow query. More often than not, though, data warehouse optimization is a vaguely defined task, which means it tends to be neglected and put off for long periods. Having a strategy for proactively identifying issues can surface important optimization work that would otherwise never get done.

While the block’s Pulse dashboard is the most convenient way to notice anomalies, the Time Window Investigation dashboard provides the best route to finding them in a proactive way.

Potential issues to look out for

  • Join patterns: Perhaps the most impactful issue to identify is poor joins. In the “Join Types” tile, look for joins labeled “EACH WITH EACH” rather than “HASH”. These join types have multiplicative rather than additive resource usage, so they can account for a disproportionate share of resource consumption, especially when applied to large tables.
  • Data scanning: Excessive data scanning can result in both slower queries and increased cost. The “Top 15 Tables” tile is a good place to identify tables that may be driving excessive data scanning.
  • Data spilling to disk: Another clear indicator of issues is data being spilled to disk, which means computation steps required more memory than was available.
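
If you want to go beyond the block’s dashboards, you can also query the underlying INFORMATION_SCHEMA views directly. Here is a minimal sketch that surfaces the heaviest recent queries by bytes scanned; it assumes the `region-us` location (adjust to your own region) and permission to read `JOBS_BY_PROJECT`:

```sql
-- Sketch: top recent queries by bytes scanned.
-- Assumes the region-us location; adjust to match your datasets.
SELECT
  job_id,
  user_email,
  total_bytes_processed,
  total_slot_ms,
  LEFT(query, 200) AS query_snippet
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_bytes_processed DESC
LIMIT 15;
```

Sorting by total_slot_ms instead highlights compute-heavy queries rather than scan-heavy ones.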

How to resolve common issues

Issues can generally be thought of as either query-specific issues, where you will want to make fixes to the queries you are issuing, or systemic issues, where a change to overall table or database settings can help.

Query-specific issues are likely going to represent the majority of opportunities. From the tiles where you identified the issue, you can click on the metrics to drill down into a granular listing of queries.

Of course, since you are interested in optimizing future queries, it’s worth focusing on queries that are not only heavy in resource usage but also recurring rather than ad hoc.

Once you have identified a query to analyze, you can use the drill links from the query ID to reach the Query Inspection dashboard, where you can see the query’s text, query plan, and other metrics.

Here are examples of how you might address the common issues described above:

  • Join patterns: Look for non-equality joins and refactor the logic to be based on equality comparisons. If non-equality comparisons cannot be removed, consider persisting the results of the join, so that it can be run less frequently.
  • Data scanning: Look for opportunities to set or use a partition column, for example via always_filter. This is particularly useful for queries and datasets that can be filtered on time/recency. For query patterns where the full dataset is required, look for opportunities to pre-aggregate the data.
  • Data spilling to disk: Sometimes this overlaps with problematic join patterns, but it can also be caused by other operations that require a lot of memory, such as distinct aggregates, ordered window functions, or working with large arrays in nested data.
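
To illustrate the join refactor, here is a hedged sketch using hypothetical `orders` and `currency_rates` tables (the names and columns are assumptions for illustration, not part of the block):

```sql
-- Before: a non-equality (range) join, which executes EACH WITH EACH,
-- so cost grows multiplicatively with the sizes of both tables.
SELECT o.order_id, r.rate
FROM orders o
JOIN currency_rates r
  ON o.order_date BETWEEN r.valid_from AND r.valid_to;

-- After: expand each rate's validity range to one row per day, so the
-- join becomes an equality comparison eligible for a HASH join.
WITH daily_rates AS (
  SELECT rate, day
  FROM currency_rates,
       UNNEST(GENERATE_DATE_ARRAY(valid_from, valid_to)) AS day
)
SELECT o.order_id, r.rate
FROM orders o
JOIN daily_rates r
  ON o.order_date = r.day;
```

If the expanded table is large, persisting `daily_rates` as its own table lets the expensive expansion run once rather than on every query.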

Beyond query-specific optimization, you may also notice patterns that can be resolved with table or project level optimizations.

  • BI Engine: BI Engine can transparently speed up queries on frequently accessed aggregates.
  • Aggregates: These may be created in a manually defined derived table, an automatically managed aggregate table, or even a transformation external to Looker. In all cases, the goal is primarily to reduce data scanning by allowing queries to operate on a smaller, pre-aggregated dataset.
  • Partition columns: For tables that are frequently filtered on a particular column, such as date, make sure that column has been set as the partition column.
  • Cluster keys: For tables that are frequently joined together on predictable sets of keys, consider using clustering to have the joins execute faster.
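
To make the table-level settings concrete, here is a sketch of BigQuery DDL that combines a partition column with cluster keys (the table and column names are hypothetical):

```sql
-- Partition on the date column used in most filters, and cluster on
-- the keys used in frequent joins and filters.
CREATE TABLE mydataset.events
(
  event_id STRING,
  customer_id STRING,
  event_type STRING,
  created_at TIMESTAMP
)
PARTITION BY DATE(created_at)
CLUSTER BY customer_id, event_type;
```

Queries that filter on DATE(created_at) will then prune partitions, and joins or filters on customer_id and event_type benefit from the clustered layout.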

Thanks, @fabio1, this is insightful, I am going to check this out.

Note: The link you provided above is not working, seems it takes us to the internal Looker instance.

Thanks for the heads up! I've fixed the link
