Hi. I read the articles "Outer Join on False - or: How I learned to stop fanning-out and love the NULL" and "[Analytic Block] - Pivot by Top X - Introducing bind_all_filters". First of all, thanks for the great articles.
I'm trying to make a model / Explore which:
- Joins 4 PDTs in a way like the first article describes, where the data is not fanned-out by performing "normal" joins between the 4 tables.
- Uses NDTs (with bind_all_filters), which get pre-filtered before being merged for the final user-facing table. This is for performance reasons; our source tables are > 25 TB each, and I'd rather do the joining of these tables after they are relatively small by being already-filtered.
- Applies the user's criteria as an AND filter. I may provide the user the ability to make this an OR for certain searches, perhaps via a sql_always_where clause of "OR TRUE" (without parentheses) on the final table. Still have to test this.

I'm pretty sure I can do something similar to the above using Liquid variables for every single filterable field (there are like 3 dozen), but I'd rather try doing it with bind_all_filters, if I can do it in a performance-friendly way.
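For reference, the overall shape I have in mind - pre-filtered derived tables merged via outer-join-on-false, as in the first article - would be roughly like this (a sketch only; all view names are made-up placeholders):

```lookml
# Rough sketch - each *_filtered view would be a pre-filtered derived
# table built from one of the 4 large source tables. Names hypothetical.
explore: combined_search {
  from: table_a_filtered
  join: table_b_filtered {
    type: full_outer
    sql_on: FALSE ;;        # outer-join-on-false: stack rows, never fan out
    relationship: one_to_one
  }
  # ... same pattern for table_c_filtered and table_d_filtered
}
```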
I'm a bit confused by the bind_all_filters feature, though. The article states:

"Note: A native derived table with bind_all_filters is required to be joined in the same explore it came from; i.e., explore = explore_source"

This is confusing me. At what "stage" in the source Explore does the NDT receive its input table? My explore was going to include:
- The 4 PDTs
- The [4+1] NDTs, if the noted NDTs have to use explore_source targets contained in the explore (not arbitrary SQL tables)

But if the [4+1] NDTs all get their tables from the single Explore source, and those tables are included in the Explore source, this seems cyclical. Do the NDTs get the "full" table obtained after doing all the joining in the Explore? And then they just pick the columns that they need? If so, wouldn't this cause a possible performance loss (tons of superfluous rows per NDT) by joining those tables into a massive table before applying my custom logic?
Questions:
- At what stage does the NDT receive its input table from its explore_source? I've been meaning to just play with what Looker generates as SQL while playing with such an Explore, but I haven't been able to dedicate enough time.
- Any thoughts on bind_all_filters versus Liquid variables?

Thanks so much!!
Sean
Hi!
I have personally not attempted to use bind_all_filters with an OJOF or join-paths explore. It seems a bit ill-suited to pushing join-specific filters into a subquery, since there is no gesture to suggest that only some filters are relevant; thus the subquery would be very wide, with lots of joins. In other words, the "all" is at the explore level, not at the view level.
On the other hand, even with larger datasets, I believe the large majority of the benefit comes from pushing down one or two selected filters that correspond to a table's partition (and maybe cluster) keys, and so a one-time application of Liquid has been enough for me.
In theory, I do suspect the query could be optimized further like you're suggesting. In practice, I haven't seen it done, and I've had acceptable performance with just filtering the one column and allowing the grouping to happen after the joins.
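For example, a one-time Liquid pushdown of just the partition-key filter might look roughly like this (a sketch only - the dataset, table, and field names are made up, and it assumes a BigQuery ingestion-time-partitioned table):

```lookml
# Sketch: push only the user's date filter into the derived table so the
# warehouse can prune partitions before the big join. Names hypothetical.
view: big_table_pruned {
  derived_table: {
    sql:
      SELECT *
      FROM source_dataset.big_table
      WHERE {% condition big_table_pruned.partition_date %} _PARTITIONDATE {% endcondition %}
    ;;
  }
  filter: partition_date {
    type: date
  }
}
```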
Thanks for the reply @fabio1! Find my responses after your comments below.

"…no gesture to suggest that only some filters are relevant"

If needed, I don't mind using bind_filters instead of bind_all_filters.
"…the subquery would be very wide"

So you're saying that the NDTs would in fact include all 4 PDTs, making them each wide, including all columns from all 4 PDTs? That was my fear.
So how exactly does the "sequence of joining" go, when you have an explore like I'm proposing, with 4 NDTs? Each NDT gets a copy of the 4-way join of all 4 PDTs, then selects the columns it wants out of that, and all rows are included? If so, I think my hope of using bind_filters / bind_all_filters might be untenable… So it's back to Liquid, either for all variables, or your idea of just including a few key variables.
The problem with not including all variables as Liquid is: when I do my Outer Join On False clause, each row will only contain data for 1/4th of the columns. So if the user does any filtering where fields from two tables are included, they will get no results. They'd have to include myKeyword1, myKeyword2, NULL in all their searches to get any results - a little awkward. I kind of think I'd rather just spend the time up-front and make all the Liquid variables; it's not all that painful…? Unless we can think of some other slick idea…
I wish there were a way to just retrieve Looker's planned WHERE clause and then override it if you want.
"…so a one-time application of liquid has been enough for me."

I've been thinking about doing something like that, too. Most of the 4 PDTs come from data which includes arrays of structs, and the PDTs pre-unnest this data. I was thinking of alternatively leaving the arrays as-is, and using a Liquid parameter to allow searching the relevant fields without joining the unnested array contents. That alone could be a big performance booster.
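The idea would be something along these lines (a sketch with made-up names; assumes BigQuery ARRAY-of-STRUCT columns):

```lookml
# Sketch: filter on array contents with EXISTS instead of pre-unnesting,
# so the table keeps one row per record. All names are hypothetical.
view: events_prefiltered {
  derived_table: {
    sql:
      SELECT *
      FROM source_dataset.events AS e
      WHERE EXISTS (
        SELECT 1
        FROM UNNEST(e.tags) AS t
        WHERE {% condition events_prefiltered.tag_search %} t.name {% endcondition %}
      )
    ;;
  }
  filter: tag_search {
    type: string
  }
}
```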
"…acceptable performance with just filtering the one column, and just allowing the grouping to happen after the join"

Regarding "grouping", are you referring to the grouping I might have mentioned in my post, where I create a temporary table to do: "find all time-bin values from all 4 filtered tables, then inner join these, group the results, and join the 4 filtered tables to this beast"? FYI, the reason for this process is: I want to only show results where the filter criteria were met in all 4 tables, but with fuzzy timestamp values - the timestamps cannot be expected to line up exactly.
Also, regarding "acceptable performance" - just wanted to emphasize that our dataset is reaaaaaally large! Like, the 4 tables combined are a fair-sized fraction of a petabyte.
So I decided to spend the time to get to the bottom of this, examining the SQL that Looker generates when an Explore uses a certain View, and that same View sets its explore_source: field to that same Explore.
TL;DR: It looks like (1) the order in which the View is joined within the Explore determines what fields it can access and what joins are made to define that view's NDT, and (2) the joins are unconditional: Looker's normal friendly "optimize out joins that aren't needed" behavior does not apply. @fabio1, I think you were concerned that using bind_all_filters might cause the table to get extra-wide by bringing in a bunch of extra fields. Maybe it would have, if the optimization algorithms ever kicked in with this type of paradigm, but it looks like getting the extra joins is a foregone conclusion.
I started off by defining two dummy views, dummyPredecessor and dummySuccessor, plus used a view I already had - we'll call it myPDT to avoid anything company-specific:
view: dummyPredecessor {
derived_table: {
sql: ;;
# attempted the below with some columns; it causes a circular dependency error
# explore_source: dummyPredecessor { ... }
}
dimension: pred {
type: string
sql: select "pred" ;;
}
}
view: dummySuccessor {
derived_table: {
explore_source: dummyPredecessor {
column: pred2 {
field: dummyPredecessor.pred
}
column: myField2 {
field: myPDT.stamp
}
}
}
dimension: succ {
type: string
sql: select "succ" ;;
}
dimension: pred2 {
type: string
}
dimension: myField2 {
type: number
}
}
You can see that dummySuccessor references Explore dummyPredecessor, which I'll show in a moment. It uses one value from each of the other two tables (pred from dummyPredecessor and myField2 from myPDT). The Model file defining the Explore contains:
explore: dummyPredecessor {
join: myPDT {
type: full_outer
sql_on: FALSE ;;
relationship: one_to_one
}
join: dummySuccessor {
type: full_outer
sql_on: FALSE ;;
relationship: one_to_one
}
}
Then I navigated to the dummyPredecessor Explore, added some fields, and looked at the generated SQL. I intentionally did not select myField2 (the field that dummySuccessor takes from myPDT), so my hope was that Looker's SQL optimization routines would kick in, and dummySuccessor's derived table wouldn't even be joined to myPDT. Unfortunately, that is not the case, as you can see below in the generated SQL:
WITH dummyPredecessor AS ()
, dummySuccessor AS (WITH dummyPredecessor AS ()
SELECT
select "pred" AS pred2,
myPDT.myField AS myField2
FROM dummyPredecessor
FULL OUTER JOIN `server-name.looker_pdt.longID_myPDT` AS myPDT ON FALSE
GROUP BY 1,2)
SELECT
select "succ" AS dummysuccessor_succ_1,
dummySuccessor.stamp2 AS dummysuccessor_stamp2_1,
dummySuccessor.pred2 AS dummysuccessor_pred2_1,
select "pred" AS dummypredecessor_pred_1,
myPDT.myField AS myPDT_myField
FROM dummyPredecessor
FULL OUTER JOIN `server-name.looker_pdt.longID_myPDT` AS myPDT ON FALSE
FULL OUTER JOIN dummySuccessor ON FALSE
GROUP BY 1,2,3,4,5
ORDER BY 1
LIMIT 500
After fixing a few errors by hand, I ran this SQL through the BigQuery console, to get some metrics on the impact of this. It took over 30 seconds and 2.3 GB of processing (for a very small test PDT) to generate the following result:
Row | dummysuccessor_succ_1
----+----------------------
1   | succ
So I think that when I use this on my actual dataset (a tenth of a petabyte vs probably [well] under 1 TB for this test), the performance optimizations of using Liquid are in fact well-founded.
Good news: I found this post, and it looks like it's now possible to use regular non-parameter, non-templated filter fields in sql_ fields using the _filters['view.field'] syntax! This provides a more standardized look & feel for users, so there aren't both filter-only fields and "filterable-but-shouldn't-be-filtered" regular dimensions. I'll have to play around with it… Between that and the {% if field._in_query %} syntax, I should be able to come up with some really efficient SQL, but with all the "magic" hidden from the end-user experience.
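For instance, _in_query could gate an expensive UNNEST so it only appears when the user actually queries or filters on that field (a sketch only; all names are hypothetical):

```lookml
# Sketch: only pay for the array unnest when the tag field is in the query.
# Dataset, table, and field names are made up.
view: big_fact {
  derived_table: {
    sql:
      SELECT f.*
        {% if big_fact.tag._in_query %} , tag.name AS tag_name {% endif %}
      FROM source_dataset.big_fact AS f
      {% if big_fact.tag._in_query %}
      LEFT JOIN UNNEST(f.tags) AS tag
      {% endif %}
    ;;
  }
  dimension: tag {
    type: string
    sql: ${TABLE}.tag_name ;;
  }
}
```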
I think I can get by with Liquid, but if I were to make a feature request, it'd be to (1) apply the SQL optimization algorithms to joins involving NDTs in situations like this, or (2) make bind_all_filters work with SQL derived tables and just map the pertinent filters, or only require that the NDT be part of the Explore, but not use the Explore as its explore_source: parameter.
Anyways, if you've read this far, thanks again for the help - you got me thinking too!
Hi Sean!
The only reason I was suggesting that the SQL for the NDT would be very "wide"/join-heavy was the "all" in "bind_all_filters". If you are binding specific filters (which is essentially what the Liquid solution does as well), then unnecessary joins should not be included in the NDT's SQL.
In your test example, the NDT is wide because the following code references fields from both views:
derived_table: {
explore_source: dummyPredecessor {
column: pred2 {
field: dummyPredecessor.pred
}
column: myField2 {
field: myPDT.stamp
}
  }
}
If you did want to use NDTs in this way, at the very least you would have to make a separate NDT for each view/join that you wanted to filter & group pre-join, and each such NDT would not reference unneeded tables. (Not saying that I believe the effort & maintainability will be justified, but if you wanted to, that's how you would have to do it.)
Ahhh, great point @fabio1! That was in fact the plan; maybe my example was too artificial. Sounds like I have more options back on the table; will have to think about it a bit before making a decision. I'm leaning back towards my original plan now, with your clarification to remember to make separate NDTs so each one doesn't become wide. It sounds like they not only won't become wide, but they won't become "tall" either (that was my original concern for performance), because they won't be joined more than they need to be. That's great news!
I think that's a better option than using _filters['view.field'] in the manner referenced in the post I linked to. It looks like some fairly "heavy" Liquid processing is required to massage the raw value from _filters['view.field'] into the right format, using the method in that post. And it'd probably apply differently for each filter's type parameter.
As far as I know, binding a filter with an NDT should work the same as using the {% condition %} tag.
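For completeness, binding one specific filter in an NDT looks roughly like this (a sketch with made-up names):

```lookml
# Sketch: an NDT that pulls one user-facing filter down into its subquery,
# much like {% condition %} does for a SQL derived table. Names hypothetical.
view: orders_prefiltered {
  derived_table: {
    explore_source: orders {
      column: order_id { field: orders.order_id }
      bind_filters: {
        from_field: orders.created_date   # filter the user sets in the explore
        to_field: orders.created_date     # field it constrains in the subquery
      }
    }
  }
  dimension: order_id {
    type: number
  }
}
```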
Either way, I feel compelled to reiterate that there are many challenges with attempting this!
Totally understand the warning @fabio1, but I'm an efficiency junkie, so even knowing why our queries take so long (huge dataset), I feel compelled to do something about 20+ minute queries. There's probably some other low-hanging fruit we could tackle, but I'm going to hit this with a sledgehammer.
So I tried an example use-case to test out your comment:
If you are binding specific filters (which is essentially what the liquid solution does as well), then unnecessary joins should not be included in the NDT's SQL.
Unfortunately, I got the same results. I tried both a native derived_table and an aggregate_table to see if the behavior applied to both, and it appears to. Even when the NDT only references fields from one view in the parent Explore, the table is constructed by joining both views first.
LookML - View file:
view: myPDT {
derived_table: {
persist_for: "24 hours"
sql:
...
}
dimension: ... { ...}
}
view: dummy_source_2 {
# This would be a PDT too, in the real use-case
dimension: dummy {
type: string
sql: SELECT "dummy" ;;
}
}
view: dummy_consumer {
# this would have a bunch of bind_filter's, in the real use-case
derived_table: {
explore_source: myPDT {
column: dummy2 {
field: dummy_source_2.dummy
}
}
}
dimension: dummy2 {
type: string
}
}
LookML - Model file:
explore: myPDT {
join: dummy_source_2 {
sql_on: FALSE ;;
relationship: one_to_one
}
join: dummy_consumer {
sql_on: FALSE ;;
relationship: one_to_one
}
}
Resulting SQL:
WITH dummy_consumer AS (SELECT
SELECT "dummy" AS dummy2
FROM `server-name.looker_pdt.longID_myPDT` AS myPDT
LEFT JOIN dummy_source_2 ON FALSE
GROUP BY 1)
SELECT
dummy_consumer.dummy2 AS dummy_consumer_dummy2
FROM `server-name.looker_pdt.longID_myPDT` AS myPDT
LEFT JOIN dummy_consumer ON FALSE
GROUP BY 1
ORDER BY 1
LIMIT 500
Any thoughts?
Actually, I think I found a workaround. It looks like only the first View defined in an Explore is unconditionally joined with whatever view contains the fields used by an NDT. So you can just make the "main" view empty to get the desired behavior, as shown below.
My test included confirming that the behavior isn't something weird like "whatever view contains the fields used by the NDT, plus everything joined above it" - that's not the case - great news!
LookML - View file - Source data (would be PDTs):
view: dummy_empty {
}
view: dummy_source_1 {
derived_table: {
sql: ;;
}
dimension: dummy1 {
type: string
sql: 'dummy1' ;;
}
}
view: dummy_source_2 {
... similar, replace dummy1 with dummy2
}
view: dummy_source_3 {
... similar, replace dummy1 with dummy3
}
LookML - View file - Consumer derived tables (would be NDTs):
view: dummy_consumer1 {
derived_table: {
explore_source: dummy_explore {
column: dummy1_consumed {
field: dummy_source_1.dummy1
}
}
}
dimension: dummy1_consumed {
type: string
}
}
view: dummy_consumer2 {
... similar, replace dummy1 with dummy2
}
view: dummy_consumer3 {
... similar, replace dummy1 with dummy3
}
LookML - Model file:
explore: dummy_explore {
from: dummy_empty
join: dummy_source_1 {
sql_on: FALSE ;;
relationship: one_to_one
}
join: dummy_source_2 { similar... }
join: dummy_source_3 { similar... }
join: dummy_consumer1 {
sql_on: FALSE ;;
relationship: one_to_one
}
join: dummy_consumer2 { similar... }
join: dummy_consumer3 { similar... }
aggregate_table: dummy_aggregate {
query: {
dimensions: [dummy_source_1.dummy1]
}
materialization: {
persist_for: "30 minutes"
}
}
}
Resulting SQL for NDTs:
WITH dummy_consumer2 AS (WITH dummy_source_2 AS ()
SELECT
'dummy2' AS dummy2_consumed
FROM dummy_empty AS dummy_explore LEFT JOIN dummy_source_2 ON FALSE
GROUP BY
1)
-- Did not use dummy_explore::dummy_aggregate; it does not include the following fields in the query: dummy_consumer2.dummy2_consumed
SELECT
dummy_consumer2.dummy2_consumed AS dummy_consumer2_dummy2_consumed
FROM dummy_empty AS dummy_explore LEFT JOIN dummy_consumer2 ON FALSE
GROUP BY
1
ORDER BY
1
LIMIT 500
Similar for dummy_consumer1, dummy_consumer3, and dummy_aggregate.