Question

Combine "FULL OUTER JOIN ON FALSE" with "Introducing bind_all_filters"

  • 28 September 2020
  • 8 replies
  • 574 views

Hi. I read the articles "Outer Join on False - or: How I learned to stop fanning-out and love the NULL" and "[Analytic Block] - Pivot by Top X - Introducing bind_all_filters". First of all, thanks for the great articles.


I’m trying to make a model / Explore which:



  • Combines multiple (4) existing PDTs in the way the first article describes, so the data is not fanned out by performing "normal" joins between the 4 tables

  • Uses non-persistent derived tables (probably native, to use bind_all_filters), which get pre-filtered before being merged into the final user-facing table. This is for performance reasons; our source tables are > 25 TB each, and I'd rather join these tables after they're relatively small from already being filtered.

  • Creates a derived table that finds common time bins (a time interval defined by a Liquid parameter) across the 4 tables. To do this: take the filtered tables, keep only the calculated time-bin column, and union these 4 sub-tables. Then merge the 4 tables back into this derived table using left joins. This essentially performs a fuzzy AND filter. I may give the user the ability to make this an OR for certain searches.

  • Creates a final table which left joins the 4 filtered tables using the common time bins from the table above. Do this by including 2 columns in the common-time-bin table: the time-bin values that are common to all 4 tables, plus a column with the name of each table (one of 4 string literals).

  • Returns results to the user without re-applying the user's filter on the final table above. I think I can do this with a sql_always_where clause of "OR TRUE" (without parentheses) on the final table. I still have to test this.


I'm pretty sure I could do something similar to the above using Liquid variables for every single filterable field (there are about 3 dozen), but I'd rather try doing it with bind_all_filters if I can do so in a performance-friendly way.
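For concreteness, the per-field Liquid fallback I have in mind is roughly the following sketch (all view, field, table, and column names here are made-up placeholders):

```lookml
# Hypothetical sketch: one templated filter per filterable field, applied
# inside the derived table so rows are dropped before any join happens.
view: filtered_table_1 {
  derived_table: {
    sql:
      SELECT *
      FROM my_dataset.big_pdt_1
      WHERE {% condition filtered_table_1.keyword %} keyword_col {% endcondition %}
        AND {% condition filtered_table_1.event_time %} event_ts {% endcondition %}
    ;;
  }

  # Filter-only fields; each maps to one column in the WHERE clause above.
  filter: keyword { type: string }
  filter: event_time { type: date_time }
}
```

Multiplied by ~3 dozen fields across 4 tables, this is tedious but mechanical.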


I’m a bit confused by the bind_all_filters feature, though. The article states:



Note: A native derived table with bind_all_filters is required to be joined in the same explore it came from; i.e., explore = explore_source



This is confusing me. At what “stage” in the source Explore does the NDT receive its input table? My explore was going to include:



  1. the 4 filtered tables above,

  2. maybe the 4 PDTs, if the noted NDTs have to use explore_source targets contained in the Explore (not arbitrary SQL tables),

  3. the above table containing the timestamp bins common to all 4 filtered tables.


But if the [4+1] NDTs all get their tables from the single Explore source, and those tables are included in the Explore source, this seems cyclical. Do the NDTs get the “full” table obtained after doing all the joining in the Explore? And then they just pick the columns that they need? If so, wouldn’t this cause a possible performance loss (tons of superfluous rows per NDT) by joining those tables into a massive table before applying my custom logic?


Questions:



  1. Can anybody clarify the order of operations performed when an Explore contains a view, and that view uses the Explore for its explore_source? I’ve been meaning to just play with what Looker generates as SQL while playing with such an Explore, but I haven’t been able to dedicate enough time.

  2. Am I over-worried about doing my join (after which I calculate time bins, group by time bins, and then define this as a temporary table) before filtering the rather-large individual tables?

  3. Should I just stick with using Liquid variables for this task if my main goal is performance, or does anybody have ideas for how I could accomplish a high-performance version of this using bind_all_filters versus Liquid variables?


Thanks so much!!


Sean


8 replies


Hi!


I have personally not attempted to use bind_all_filters with an OJOF or join-paths explore. It seems a bit ill-suited to pushing join-specific filters into a subquery, since there is no gesture to suggest that only some filters are relevant; the subquery would therefore be very wide, with lots of joins. In other words, the "all" is at the explore level, not at the view level.


On the other hand, even with larger datasets, I believe the large majority of the benefit comes from pushing down one or two selected filters that correspond to a table's partition (and maybe cluster) keys, so a one-time application of Liquid has been enough for me.
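To make that concrete, the kind of one-time Liquid application I mean is a sketch like this (view and column names are illustrative; `_PARTITIONDATE` assumes a BigQuery ingestion-time-partitioned table):

```lookml
# Sketch: push only the partition-key filter into the subquery so the
# warehouse can prune partitions before any join; every other filter can
# safely run after the join without much cost.
view: big_table_filtered {
  derived_table: {
    sql:
      SELECT *
      FROM my_dataset.big_partitioned_table
      WHERE {% condition big_table_filtered.partition_date %} _PARTITIONDATE {% endcondition %}
    ;;
  }

  filter: partition_date { type: date }
}
```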


In theory, I do suspect the query could be optimized further like you’re suggesting. In practice, I haven’t seen it done and have had acceptable performance with just filtering the one column, and just allowing the grouping to happen after the joins.

Thanks for the reply @fabio! Find my responses after your comments below.



…no gesture to suggest that only some filters are relevant



If needed, I don’t mind using bind_filter instead of bind_all_filters.



…the subquery would be very wide,



So you’re saying that the NDTs would in fact include all 4 PDTs, making them each wide, including all columns from all 4 PDTs? That was my fear.


So how exactly does the "sequence of joining" go when you have an explore like I'm proposing, with 4 NDTs? Each NDT gets a copy of the 4-way join of all 4 PDTs, then selects the columns it wants out of that, with all rows included? If so, I think my hope of using bind_filter / bind_all_filters might be untenable… So it's back to Liquid, either for all variables or using your idea of including just a few key variables.


The problem with not including all variables as Liquid is: when I do my Outer Join On False clause, each row will only contain data for a quarter of the columns. So if the user does any filtering that involves fields from two tables, they will get no results. They'd have to include myKeyword1, myKeyword2, NULL in all their searches to get any results, which is a little awkward. I kind of think I'd rather just spend the time up front and make all the Liquid variables; it's not all that painful…? Unless we can think of some other slick idea…


I wish there was a way to just retrieve Looker’s planned WHERE clause and then override it if you want.



… so a one-time application of liquid has been enough for me.



I’ve been thinking about doing something like that, too. Most of the 4 PDTs come from data which includes arrays of structs, and the PDTs pre-unnest this data. I was thinking of alternatively leaving the arrays as-is, and using a Liquid parameter to allow searching the relevant fields without joining the unnested array contents. That alone could be a big performance booster.
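A rough sketch of that idea, assuming BigQuery and a made-up `tags` array-of-structs column:

```lookml
# Sketch: filter on array contents with EXISTS + UNNEST in a correlated
# subquery, instead of joining the unnested rows into the result set.
view: events_with_arrays {
  sql_table_name: my_dataset.raw_events ;;

  filter: tag_filter { type: string }

  dimension: has_matching_tag {
    type: yesno
    sql: EXISTS (
           SELECT 1
           FROM UNNEST(${TABLE}.tags) AS t
           WHERE {% condition tag_filter %} t.name {% endcondition %}
         ) ;;
  }
}
```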



…acceptable performance with just filtering the one column, and just allowing the grouping to happen after the join



Regarding "grouping", are you referring to the grouping I mentioned in my post, where I create a temporary table to "find all time-bin values from all 4 filtered tables, inner join these, group the results, and join the 4 filtered tables to this beast"? FYI, the reason for this process: I only want to show results where the filter criteria were met in all 4 tables, but with fuzzy timestamp values; the timestamps cannot be expected to line up exactly.
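Spelled out as SQL, the common-time-bin step I'm describing would look roughly like this (t1…t4 stand for the 4 pre-filtered tables and `bin` for the truncated timestamp; all names are placeholders):

```sql
-- Sketch: keep only the time bins that appear in all 4 filtered tables
-- (the "fuzzy AND"); relaxing the HAVING would give the OR variant.
WITH bins AS (
            SELECT DISTINCT bin, 't1' AS src FROM t1
  UNION ALL SELECT DISTINCT bin, 't2' AS src FROM t2
  UNION ALL SELECT DISTINCT bin, 't3' AS src FROM t3
  UNION ALL SELECT DISTINCT bin, 't4' AS src FROM t4
)
SELECT bin
FROM bins
GROUP BY bin
HAVING COUNT(DISTINCT src) = 4
```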


Also, regarding “acceptable performance” - just wanted to emphasize that our dataset is reaaaaaally large! 😉 Like, the 4 tables combined are a fair-sized fraction of a petabyte.

So I decided to spend the time to get to the bottom of this, examining the SQL that Looker generates when an Explore uses a certain View, and that same View sets its explore_source: field to that same Explore.


TL;DR: It looks like (1) the order in which the View is joined within the Explore determines what fields it can access and what joins are made to define that view's NDT, and (2) the joins are unconditional: Looker's normal friendly "optimize out joins that aren't needed" behavior does not apply. @fabio, I think you were concerned that using bind_all_filters might cause the table to get extra-wide by bringing in a bunch of extra fields. Maybe it would have, if the optimization algorithms ever kicked in with this paradigm, but it looks like getting the extra joins is a foregone conclusion.


I started off by defining two dummy views dummyPredecessor and dummySuccessor, plus used a view I already had - we’ll call it myPDT to avoid anything company-specific:


view: dummyPredecessor {
  derived_table: {
    sql: ;;
    # attempted the below with some columns; it causes a circular dependency error
    # explore_source: dummyPredecessor { ... }
  }

  dimension: pred {
    type: string
    sql: select "pred" ;;
  }
}

view: dummySuccessor {
  derived_table: {
    explore_source: dummyPredecessor {
      column: pred2 {
        field: dummyPredecessor.pred
      }
      column: myField2 {
        field: myPDT.stamp
      }
    }
  }

  dimension: succ {
    type: string
    sql: select "succ" ;;
  }

  dimension: pred2 {
    type: string
  }
  dimension: myField2 {
    type: number
  }
}

You can see that dummySuccessor references the Explore dummyPredecessor, which I'll show in a moment. It uses one value from each of the other two tables (pred from dummyPredecessor and myField2 from myPDT). The model file defining the Explore contains:


explore: dummyPredecessor {
  join: myPDT {
    type: full_outer
    sql_on: FALSE ;;
    relationship: one_to_one
  }
  join: dummySuccessor {
    type: full_outer
    sql_on: FALSE ;;
    relationship: one_to_one
  }
}

Then I navigated to the dummyPredecessor Explore, added some fields, and looked at the generated SQL. I intentionally did not select myField2 (the field that dummySuccessor takes from myPDT), so my hope was that Looker's SQL optimization routines would kick in and dummyPredecessor wouldn't even be joined to myPDT. Unfortunately, that is not the case, as you can see below in the generated SQL:


WITH dummyPredecessor AS ()
, dummySuccessor AS (WITH dummyPredecessor AS ()
SELECT
select "pred" AS pred2,
myPDT.myField AS myField2
FROM dummyPredecessor
FULL OUTER JOIN `server-name.looker_pdt.longID_myPDT` AS myPDT ON FALSE

GROUP BY 1,2)
SELECT
select "succ" AS dummysuccessor_succ_1,
dummySuccessor.stamp2 AS dummysuccessor_stamp2_1,
dummySuccessor.pred2 AS dummysuccessor_pred2_1,
select "pred" AS dummypredecessor_pred_1,
myPDT.myField AS myPDT_myField
FROM dummyPredecessor
FULL OUTER JOIN `server-name.looker_pdt.longID_myPDT` AS myPDT ON FALSE
FULL OUTER JOIN dummySuccessor ON FALSE

GROUP BY 1,2,3,4,5
ORDER BY 1
LIMIT 500

After fixing a few errors by hand, I ran this SQL through the BigQuery console, to get some metrics on the impact of this. It took over 30 seconds and 2.3 GB of processing (for a very small test PDT) to generate the following result:


Row | dummysuccessor_succ_1
----+----------------------
  1 | succ

So I think that when I use this on my actual dataset (a tenth of a petabyte vs probably [well] under 1 TB for this test), the performance optimizations of using Liquid are in fact well-founded.


Good news: I found this post, and it looks like it's now possible to use regular non-parameter, non-templated filter fields in sql_ parameters using the _filters[view.field] syntax! This provides a more standardized look and feel for users, so there aren't both filter-only fields and "filterable-but-shouldn't-be-filtered" regular dimensions. I'll have to play around with it… Between that and the {% if field._in_query %} syntax, I should be able to come up with some really efficient SQL, with all the "magic" hidden from the end-user experience.
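As a first sketch of what I mean (names are placeholders, and, as the linked post notes, the raw _filters value generally needs further Liquid massaging before it's safe to drop into SQL):

```lookml
# Sketch: only apply the WHERE clause when the user actually filtered on
# the field, using _in_query; _filters[] pulls the raw filter expression.
view: searchable {
  derived_table: {
    sql:
      SELECT *
      FROM my_dataset.big_table
      {% if searchable.keyword._in_query %}
      WHERE keyword_col = {{ _filters['searchable.keyword'] | sql_quote }}
      {% endif %}
    ;;
  }

  dimension: keyword {
    type: string
    sql: ${TABLE}.keyword_col ;;
  }
}
```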


I think I can get by with Liquid, but if I were to make a feature request, it'd be to (1) apply the SQL optimization algorithms to joins involving NDTs in situations like this, (2) make bind_all_filters work with SQL derived tables and just map the pertinent filters, or (3) only require that the NDT be part of the Explore, without using the Explore as its explore_source: parameter.


Anyways, if you’ve read this far, thanks again for the help, you got me thinking too!


Hi Sean!


The only reason I was suggesting that the SQL for the NDT would be very "wide"/join-heavy was the "all" in "bind_all_filters". If you are binding specific filters (which is essentially what the Liquid solution does as well), then unnecessary joins should not be included in the NDT's SQL.
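For example (names are placeholders), an NDT that binds one specific filter only needs the one source view:

```lookml
# Sketch: bind a single named filter instead of all of them, so the NDT's
# generated SQL only touches the views it actually references.
view: prefiltered_ndt {
  derived_table: {
    explore_source: my_explore {
      column: user_id { field: big_table.user_id }
      bind_filters: {
        from_field: big_table.created_date
        to_field: big_table.created_date
      }
    }
  }

  dimension: user_id { type: number }
}
```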


In your test example, the NDT is wide because the following code references fields from both views:


derived_table: {
  explore_source: dummyPredecessor {
    column: pred2 {
      field: dummyPredecessor.pred
    }
    column: myField2 {
      field: myPDT.stamp
    }
  }
}
If you did want to use NDTs this way, you would at the very least have to make a separate NDT for each view/join that you wanted to filter and group pre-join, and each such NDT would not reference unneeded tables. (I'm not saying I believe the effort and maintainability would be justified, but if you wanted to, that's how you'd have to do it.)

Ahhh, great point @fabio! That was in fact the plan; maybe my example was too artificial. Sounds like I have more options back on the table; I'll have to think about it a bit before making a decision. I'm leaning back towards my original plan now, with your clarification to make separate NDTs so each one doesn't become wide. It sounds like they not only won't become wide, they won't become "tall" either (that was my original performance concern), because they won't be joined more than they need to be. That's great news!


I think that's a better option than using _filters[view.field] in the manner referenced in the post I linked to. It looks like some fairly "heavy" Liquid processing is required to massage the raw value from _filters[view.field] into the right format using the method in that post, and it would probably differ for each filter's type parameter.


As far as I know, binding a filter with an NDT should work the same as using the {% condition %} tag.
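In other words (a sketch with placeholder names), these two forms should produce equivalent filtering:

```lookml
# (1) SQL derived table with a templated condition:
view: via_liquid {
  derived_table: {
    sql:
      SELECT * FROM my_dataset.big_table
      WHERE {% condition via_liquid.created_date %} created_at {% endcondition %}
    ;;
  }
  filter: created_date { type: date }
}

# (2) Native derived table binding the same filter:
view: via_ndt {
  derived_table: {
    explore_source: my_explore {
      column: user_id { field: big_table.user_id }
      bind_filters: {
        from_field: big_table.created_date
        to_field: big_table.created_date
      }
    }
  }
}
```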


Either way, I feel compelled to reiterate that there are many challenges with attempting this 🙂

Totally understand the warning @fabio, but I'm an efficiency junkie, so even knowing why our queries take so long (huge dataset), I feel compelled to do something about 20+ minute queries. There's probably some other low-hanging fruit we could tackle, but I'm going to hit it with a sledgehammer. 😉


So I tried an example use-case to test out your comment:



If you are binding specific filters (which is essentially what the liquid solution does as well), then unnecessary joins should not be included in the NDT’s SQL.



Unfortunately, I got the same results. I tried both a native derived_table and an aggregate_table to see whether the behavior applies to both, and it appears to. Even when the NDT only references fields from one view in the parent Explore, the table is constructed by joining both views first.


LookML - View file:


view: myPDT {
  derived_table: {
    persist_for: "24 hours"
    sql: ... ;;
  }
  dimension: ... { ... }
}

view: dummy_source_2 {
  # This would be a PDT too, in the real use-case
  dimension: dummy {
    type: string
    sql: SELECT "dummy" ;;
  }
}

view: dummy_consumer {
  # this would have a bunch of bind_filters, in the real use-case
  derived_table: {
    explore_source: myPDT {
      column: dummy2 {
        field: dummy_source_2.dummy
      }
    }
  }
  dimension: dummy2 {
    type: string
  }
}

LookML - Model file:


explore: myPDT {
  join: dummy_source_2 {
    sql_on: FALSE ;;
    relationship: one_to_one
  }
  join: dummy_consumer {
    sql_on: FALSE ;;
    relationship: one_to_one
  }
}

Resulting SQL:


WITH dummy_consumer AS (SELECT
SELECT "dummy" AS dummy2
FROM `server-name.looker_pdt.longID_myPDT` AS myPDT
LEFT JOIN dummy_source_2 ON FALSE

GROUP BY 1)
SELECT
dummy_consumer.dummy2 AS dummy_consumer_dummy2
FROM `server-name.looker_pdt.longID_myPDT` AS myPDT
LEFT JOIN dummy_consumer ON FALSE

GROUP BY 1
ORDER BY 1
LIMIT 500

Any thoughts?

Actually, I think I found a work-around. It looks like only the first View defined in an Explore is unconditionally joined with whatever view contains the fields used by an NDT. So you can just make the “main” view empty to get the desired behavior, as shown below.


My test included confirming that the behavior isn’t something weird like “whatever view contains the fields used by the NDT, plus everything joined above it” – that’s not the case – great news!


LookML - View file - Source data (would be PDTs):


view: dummy_empty {
}

view: dummy_source_1 {
  derived_table: {
    sql: ;;
  }
  dimension: dummy1 {
    type: string
    sql: 'dummy1' ;;
  }
}

view: dummy_source_2 {
  # ... similar, replace dummy1 with dummy2
}

view: dummy_source_3 {
  # ... similar, replace dummy1 with dummy3
}

LookML - View file - Consumer derived tables (would be NDTs):


view: dummy_consumer1 {
  derived_table: {
    explore_source: dummy_explore {
      column: dummy1_consumed {
        field: dummy_source_1.dummy1
      }
    }
  }
  dimension: dummy1_consumed {
    type: string
  }
}

view: dummy_consumer2 {
  # ... similar, replace dummy1 with dummy2
}

view: dummy_consumer3 {
  # ... similar, replace dummy1 with dummy3
}

LookML - Model file:


explore: dummy_explore {
  from: dummy_empty
  join: dummy_source_1 {
    sql_on: FALSE ;;
    relationship: one_to_one
  }
  join: dummy_source_2 {
    # ... similar
  }
  join: dummy_source_3 {
    # ... similar
  }
  join: dummy_consumer1 {
    sql_on: FALSE ;;
    relationship: one_to_one
  }
  join: dummy_consumer2 {
    # ... similar
  }
  join: dummy_consumer3 {
    # ... similar
  }
  aggregate_table: dummy_aggregate {
    query: {
      dimensions: [dummy_source_1.dummy1]
    }
    materialization: {
      persist_for: "30 minutes"
    }
  }
}

Resulting SQL for NDTs:


WITH dummy_consumer2 AS (WITH dummy_source_2 AS ()
SELECT
'dummy2' AS dummy2_consumed
FROM dummy_empty AS dummy_explore LEFT JOIN dummy_source_2 ON FALSE
GROUP BY
1)
-- Did not use dummy_explore::dummy_aggregate; it does not include the following fields in the query: dummy_consumer2.dummy2_consumed
SELECT
dummy_consumer2.dummy2_consumed AS dummy_consumer2_dummy2_consumed
FROM dummy_empty AS dummy_explore LEFT JOIN dummy_consumer2 ON FALSE
GROUP BY
1
ORDER BY
1
LIMIT 500

Similar for dummy_consumer1, dummy_consumer3, and dummy_aggregate.
