Hi. I read the articles "Outer Join on False - or: How I learned to stop fanning-out and love the NULL" and "[Analytic Block] - Pivot by Top X - Introducing bind_all_filters: yes". First of all, thanks for the great articles.
I’m trying to make a model / Explore which:

- Uses 4 PDTs, joined in a way like the first article describes, where the data is not fanned-out by performing “normal” joins between the 4 tables
- Uses NDTs (with `bind_all_filters`), which get pre-filtered before being merged for the final user-facing table. This is for performance reasons; our source tables are > 25TB each, and I’d rather do the joining of these tables after they are relatively small by being already-filtered.
- Treats each filter as an `AND` filter. May provide the user the ability to make this an `OR` for certain searches.
- Uses a `sql_always_where` clause of “`OR TRUE`” (without parentheses) on the final table. Still have to test this.

I’m pretty sure I can do something similar to the above using Liquid variables for every single filterable field (there are like 3 dozen), but I’d rather try doing it with `bind_all_filters`, if I can do it in a performance-friendly way.
I’m a bit confused by the `bind_all_filters` feature, though. The article states:

> Note: A native derived table with `bind_all_filters` is required to be joined in the same explore it came from; i.e., `explore` = `explore_source`
This is confusing me. At what “stage” in the source Explore does the NDT receive its input table? My Explore was going to include:

- The 4 PDTs
- The [4+1] NDTs, if the noted NDTs have to use `explore_source` targets contained in the explore (not arbitrary SQL tables)

But if the [4+1] NDTs all get their tables from the single Explore source, and those tables are included in the Explore source, this seems cyclical. Do the NDTs get the “full” table obtained after doing all the joining in the Explore? And then they just pick the columns that they need? If so, wouldn’t this cause a possible performance loss (tons of superfluous rows per NDT) by joining those tables into a massive table before applying my custom logic?
Questions:

1. At what stage does an NDT receive its table from its `explore_source`? I’ve been meaning to just play with what Looker generates as SQL while playing with such an Explore, but I haven’t been able to dedicate enough time.
2. Any thoughts on the performance of `bind_all_filters` versus Liquid variables?

Thanks so much!!
Sean
Hi!
I have personally not attempted to use `bind_all_filters` with an OJOF or join-paths explore. I think it seems a bit ill-suited to pushing join-specific filters into a subquery, since there is no gesture to suggest that only some filters are relevant; thus the subquery would be very wide, with lots of joins. In other words, the “all” is at the explore level, not at the view level.
On the other hand, even with larger datasets, I believe that the large majority of the benefit comes from pushing down one or two selected filters that correspond to a table’s partition (and maybe cluster) keys, and so a one-time application of Liquid has been enough for me.
In theory, I do suspect the query could be optimized further like you’re suggesting. In practice, I haven’t seen it done and have had acceptable performance with just filtering the one column, and just allowing the grouping to happen after the joins.
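For concreteness, a minimal sketch of that kind of one-time Liquid pushdown (the table, view, and field names below are all hypothetical, not from the thread):

```lookml
view: events_prefiltered {
  derived_table: {
    sql:
      SELECT *
      FROM `my-project.my_dataset.events`
      -- Push the user's date filter onto the partition column, so the
      -- warehouse can prune partitions before any joins happen
      WHERE {% condition events_prefiltered.event_date %} event_date {% endcondition %}
    ;;
  }

  dimension_group: event {
    type: time
    timeframes: [date, month]
    sql: ${TABLE}.event_date ;;
  }
}
```

When a user filters on the date field, the condition renders inside the subquery, so only matching partitions are scanned and joined.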
Thanks for the reply @fabio1! Find my responses after your comments below.
> …no gesture to suggest that only some filters are relevant
If needed, I don’t mind using `bind_filter` instead of `bind_all_filters`.
> …the subquery would be very wide,
So you’re saying that the NDTs would in fact include all 4 PDTs, making them each wide, including all columns from all 4 PDTs? That was my fear.
So how does the “sequence of joining” exactly go, when you have an explore like I’m proposing, with 4 NDTs? Each NDT gets a copy of the 4-way join of all 4 PDTs, then selects the columns it wants out of that, and all rows are included? If so, I think my hope of using `bind_filter` / `bind_all_filters` might be untenable… So it’s back to Liquid, either for all variables or using your idea of just including a few key variables for Liquid.
The problem with not including all variables as Liquid is: when I do my Outer Join On False clause, each row will only contain data for 1/4th of the columns. So if the user does any filtering where fields from two tables are included, they will get no results. They’d have to include `myKeyword1, myKeyword2, NULL` in all their searches to get any results - a little awkward. I kind of think I’d rather just spend the time up-front and make all the Liquid variables; it’s not all that painful…? Unless we can think of some other slick idea…
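To make the failure mode concrete (the `combined` table and `keyword` columns below are hypothetical): after an Outer Join On False, each row carries values from only one source table, so ANDing plain predicates from two views can never match; each predicate also has to admit `NULL`:

```sql
-- After FULL OUTER JOIN ... ON FALSE, each row carries columns from
-- only one of the source tables; the other side's columns are NULL.
-- This returns zero rows, because no row has both columns populated:
SELECT * FROM combined
WHERE a_keyword = 'myKeyword1' AND b_keyword = 'myKeyword2';

-- Letting each predicate also match NULL keeps the other table's rows:
SELECT * FROM combined
WHERE (a_keyword = 'myKeyword1' OR a_keyword IS NULL)
  AND (b_keyword = 'myKeyword2' OR b_keyword IS NULL);
```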
I wish there was a way to just retrieve Looker’s planned `WHERE` clause and then override it if you want.
> …so a one-time application of liquid has been enough for me.
I’ve been thinking about doing something like that, too. Most of the 4 PDTs come from data which includes arrays of structs, and the PDTs pre-unnest this data. I was thinking of alternatively leaving the arrays as-is, and using a Liquid parameter to allow searching the relevant fields without joining the unnested array contents. That alone could be a big performance booster.
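A sketch of that idea in BigQuery SQL, with made-up table and column names: `EXISTS` over `UNNEST` tests the array per row, so the output is never fanned out by the array contents.

```sql
-- Filter on a field inside an ARRAY<STRUCT<...>> column without
-- joining the unnested contents into the result (no row fan-out):
SELECT t.id, t.stamp
FROM `my-project.my_dataset.my_table` AS t
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(t.tags) AS tag   -- tags ARRAY<STRUCT<name STRING, ...>>
  WHERE tag.name = 'myKeyword1'
);
```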
> …acceptable performance with just filtering the one column, and just allowing the grouping to happen after the join
Regarding “grouping”, are you referring to the grouping I might have mentioned in my post, where I create a temporary table to do: “find all time-bin values from all 4 filtered tables, then inner join these, group the results, and join the 4 filtered tables to this beast”? FYI, the reason for this process is: I want to only show results where the filter criteria was met in all 4 tables, but with fuzzy timestamp values – the timestamps cannot be expected to line up exactly.
Also, regarding “acceptable performance” - just wanted to emphasize that our dataset is reaaaaaally large! 😉 Like, the 4 tables combined are a fair-sized fraction of a petabyte.
So I decided to spend the time to get to the bottom of this, examining the SQL that Looker generates when an Explore uses a certain View, and that same View sets its `explore_source:` field to that same Explore.
TL;DR: It looks like (1) the order that the View is joined within the Explore determines what fields it can access, and what joins are made to define that view’s NDT, and (2) the joins are unconditional: Looker’s normal friendly “optimize out joins that aren’t needed” behavior does not apply. @fabio1, I think you were concerned that using `bind_all_filters` might cause the table to get extra-wide by bringing in a bunch of extra fields. Maybe it might have, if the optimization algorithms ever kicked in with this type of paradigm, but it looks like getting the extra joins is a foregone conclusion.
I started off by defining two dummy views, `dummyPredecessor` and `dummySuccessor`, plus used a view I already had - we’ll call it `myPDT` to avoid anything company-specific:
```lookml
view: dummyPredecessor {
  derived_table: {
    sql: ;;
    # attempted the below with some columns; it causes a circular dependency error
    # explore_source: dummyPredecessor { ... }
  }
  dimension: pred {
    type: string
    sql: select "pred" ;;
  }
}
```
```lookml
view: dummySuccessor {
  derived_table: {
    explore_source: dummyPredecessor {
      column: pred2 {
        field: dummyPredecessor.pred
      }
      column: myField2 {
        field: myPDT.stamp
      }
    }
  }
  dimension: succ {
    type: string
    sql: select "succ" ;;
  }
  dimension: pred2 {
    type: string
  }
  dimension: myField2 {
    type: number
  }
}
```
You can see that `dummySuccessor` references the Explore `dummyPredecessor`, which I’ll show in a moment. It uses one value from each of the other two tables (`pred` from `dummyPredecessor` and `myField2` from `myPDT`). The Model file defining the Explore contains:
```lookml
explore: dummyPredecessor {
  join: myPDT {
    type: full_outer
    sql_on: FALSE ;;
    relationship: one_to_one
  }
  join: dummySuccessor {
    type: full_outer
    sql_on: FALSE ;;
    relationship: one_to_one
  }
}
```
Then I navigated to the `dummyPredecessor` Explore, added some fields, and looked at the generated SQL. I intentionally did not select `myField2` (the field that `dummySuccessor` takes from `myPDT`), so my hope was that Looker’s SQL optimization routines would kick in, and `dummySuccessor` wouldn’t even be joined to `myPDT`. Unfortunately, that is not the case, as you can see below in the generated SQL:
```sql
WITH dummyPredecessor AS ()
, dummySuccessor AS (WITH dummyPredecessor AS ()
SELECT
    select "pred" AS pred2,
    myPDT.myField AS myField2
FROM dummyPredecessor
FULL OUTER JOIN `server-name.looker_pdt.longID_myPDT` AS myPDT ON FALSE
GROUP BY 1,2)
SELECT
    select "succ" AS dummysuccessor_succ_1,
    dummySuccessor.stamp2 AS dummysuccessor_stamp2_1,
    dummySuccessor.pred2 AS dummysuccessor_pred2_1,
    select "pred" AS dummypredecessor_pred_1,
    myPDT.myField AS myPDT_myField
FROM dummyPredecessor
FULL OUTER JOIN `server-name.looker_pdt.longID_myPDT` AS myPDT ON FALSE
FULL OUTER JOIN dummySuccessor ON FALSE
GROUP BY 1,2,3,4,5
ORDER BY 1
LIMIT 500
```
After fixing a few errors by hand, I ran this SQL through the BigQuery console, to get some metrics on the impact of this. It took over 30 seconds and 2.3 GB of processing (for a very small test PDT) to generate the following result:

```
Row | dummysuccessor_succ_1
  1 | succ
```
So I think that when I use this on my actual dataset (a tenth of a petabyte vs probably [well] under 1 TB for this test), the performance optimizations of using Liquid are in fact well-founded.
Good news: I found this post, and it looks like it’s now possible to use regular non-parameter, non-templated filter fields in `sql_` fields using the `_filters[view.field]` syntax! This provides a more standardized look & feel for users, so there aren’t both filter-only fields and “filterable-but-shouldn’t-be-filtered” regular dimensions. I’ll have to play around with it… Between that and the `{% if field._in_query %}` syntax, I should be able to come up with some really efficient SQL, but with all the “magic” hidden from the end-user experience.
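One sketch of how the `_in_query` idea could look (the explore, view, and field names here are hypothetical): the check keeps the expensive unnest join out of the generated SQL unless the user’s query actually touches the unnested view.

```lookml
explore: my_explore {
  join: unnested_tags {
    # Only emit the unnest join when a field from unnested_tags is
    # actually selected or filtered in the user's query
    sql: {% if unnested_tags.tag_name._in_query %}
           LEFT JOIN UNNEST(${my_explore.tags}) AS unnested_tags
         {% endif %} ;;
    relationship: one_to_many
  }
}
```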
I think I can get by with Liquid, but if I were to make a feature request, it’d be for (1) applying the SQL optimization algorithms to joins involving NDTs in situations like this, or (2) making `bind_all_filters` work with SQL derived tables and just mapping the pertinent filters, or only requiring that the NDT be part of the Explore, but not use the Explore as its `explore_source:` parameter.
Anyways, if you’ve read this far, thanks again for the help, you got me thinking too!
Hi Sean!
The only reason I was suggesting that the SQL for the NDT would be very “wide”/join-heavy was because of the “all” in `bind_all_filters`. If you are binding specific filters (which is essentially what the Liquid solution does as well), then unnecessary joins should not be included in the NDT’s SQL.
In your test example, the NDT is wide because the following code references fields from both views:
```lookml
derived_table: {
  explore_source: dummyPredecessor {
    column: pred2 {
      field: dummyPredecessor.pred
    }
    column: myField2 {
      field: myPDT.stamp
    }
  }
}
```
If you did want to use NDTs in this way, at the very least, you would have to make a separate NDT for each view/join that you wanted to filter & group pre-join, and each such NDT would not reference unneeded tables. (Not saying that I believe the effort & maintainability will be justified, but if you wanted to, that’s how you would have to do it.)
Ahhh, great point @fabio1! That was in fact the plan; maybe my example was too artificial. Sounds like I have more options back on the table; will have to think about it a bit before making a decision. I’m leaning back towards my original plan now, with your clarification to remember to make separate NDTs so each one doesn’t become wide. It sounds like they not only won’t become wide, but they won’t become “tall” either (that was my original concern for performance), because they won’t be joined more than they need to be. That’s great news!
I think that’s a better option than using `_filters[view.field]` in the manner referenced in the post I linked to. It looks like some fairly “heavy” Liquid processing of `_filters[view.field]` is required to massage the raw value into the right format, using the method in that post. And it’d probably apply differently for each filter’s `type` parameter.
As far as I know, binding a filter with an NDT should work the same as using the `{% condition %}` tag.
Either way, I feel compelled to reiterate that there are many challenges with attempting this 🙂
Totally understand the warning @fabio1, but I’m an efficiency junkie, so even knowing why our queries take so long (huge dataset), I feel compelled to do something about 20+ minute queries. There’s probably some other low-hanging fruit we could tackle, but I’m going to hit it with a sledgehammer. 😉
So I tried an example use-case to test out your comment:
> If you are binding specific filters (which is essentially what the liquid solution does as well), then unnecessary joins should not be included in the NDT’s SQL.
Unfortunately, I got the same results. I tried both a native `derived_table` and an `aggregate_table` to see if the behavior applied to both, and it appears to. Even when the NDT only references fields from one view in the parent Explore, the table is constructed by joining both views first.
LookML - View file:
```lookml
view: myPDT {
  derived_table: {
    persist_for: "24 hours"
    sql:
      ...
  }
  dimension: ... { ... }
}

view: dummy_source_2 {
  # This would be a PDT too, in the real use-case
  dimension: dummy {
    type: string
    sql: SELECT "dummy" ;;
  }
}

view: dummy_consumer {
  # this would have a bunch of bind_filter's, in the real use-case
  derived_table: {
    explore_source: myPDT {
      column: dummy2 {
        field: dummy_source_2.dummy
      }
    }
  }
  dimension: dummy2 {
    type: string
  }
}
```
LookML - Model file:
```lookml
explore: myPDT {
  join: dummy_source_2 {
    sql_on: FALSE ;;
    relationship: one_to_one
  }
  join: dummy_consumer {
    sql_on: FALSE ;;
    relationship: one_to_one
  }
}
```
Resulting SQL:
```sql
WITH dummy_consumer AS (SELECT
    SELECT "dummy" AS dummy2
FROM `server-name.looker_pdt.longID_myPDT` AS myPDT
LEFT JOIN dummy_source_2 ON FALSE
GROUP BY 1)
SELECT
    dummy_consumer.dummy2 AS dummy_consumer_dummy2
FROM `server-name.looker_pdt.longID_myPDT` AS myPDT
LEFT JOIN dummy_consumer ON FALSE
GROUP BY 1
ORDER BY 1
LIMIT 500
```
Any thoughts?
Actually, I think I found a work-around. It looks like only the first View defined in an Explore is unconditionally joined with whatever view contains the fields used by an NDT. So you can just make the “main” view empty to get the desired behavior, as shown below.
My test included confirming that the behavior isn’t something weird like “whatever view contains the fields used by the NDT, plus everything joined above it” – that’s not the case – great news!
LookML - View file - Source data (would be PDTs):
```lookml
view: dummy_empty {
}

view: dummy_source_1 {
  derived_table: {
    sql: ;;
  }
  dimension: dummy1 {
    type: string
    sql: 'dummy1' ;;
  }
}

view: dummy_source_2 {
  ... similar, replace dummy1 with dummy2
}

view: dummy_source_3 {
  ... similar, replace dummy1 with dummy3
}
```
LookML - View file - Consumer derived tables (would be NDTs):
```lookml
view: dummy_consumer1 {
  derived_table: {
    explore_source: dummy_explore {
      column: dummy1_consumed {
        field: dummy_source_1.dummy1
      }
    }
  }
  dimension: dummy1_consumed {
    type: string
  }
}

view: dummy_consumer2 {
  ... similar, replace dummy1 with dummy2
}

view: dummy_consumer3 {
  ... similar, replace dummy1 with dummy3
}
```
LookML - Model file:
```lookml
explore: dummy_explore {
  from: dummy_empty
  join: dummy_source_1 {
    sql_on: FALSE ;;
    relationship: one_to_one
  }
  join: dummy_source_2 { similar... }
  join: dummy_source_3 { similar... }
  join: dummy_consumer1 {
    sql_on: FALSE ;;
    relationship: one_to_one
  }
  join: dummy_consumer2 { similar... }
  join: dummy_consumer3 { similar... }
  aggregate_table: dummy_aggregate {
    query: {
      dimensions: [dummy_source_1.dummy1]
    }
    materialization: {
      persist_for: "30 minutes"
    }
  }
}
```
Resulting SQL for NDTs:
```sql
WITH dummy_consumer2 AS (WITH dummy_source_2 AS ()
SELECT
    'dummy2' AS dummy2_consumed
FROM dummy_empty AS dummy_explore LEFT JOIN dummy_source_2 ON FALSE
GROUP BY
    1)
-- Did not use dummy_explore::dummy_aggregate; it does not include the following fields in the query: dummy_consumer2.dummy2_consumed
SELECT
    dummy_consumer2.dummy2_consumed AS dummy_consumer2_dummy2_consumed
FROM dummy_empty AS dummy_explore LEFT JOIN dummy_consumer2 ON FALSE
GROUP BY
    1
ORDER BY
    1
LIMIT 500
```
Similar for `dummy_consumer1`, `dummy_consumer3`, and `dummy_aggregate`.