The problem of SQL fanouts

  • 22 June 2022
  • 0 replies
  • 1894 views

Userlevel 3

Looker will not be updating this content, nor guarantees that everything is up-to-date. 

NOTE: When this entry was originally written, Looker users needed to use the following strategies to properly calculate certain metrics. Symmetric aggregates  render much of the information in this article unnecessary.

However, this article is still useful for writing SQL in general, and specifically for working with SQL dialects that do not support symmetric aggregates.

 

If you're a SQL user, some of the first SQL concepts you probably learned about were joins and aggregate functions (such as COUNT and SUM). One thing that is not always taught is how these two concepts can interact and sometimes produce incorrect results. In this article we'll discuss what to look out for, the concept of a "fanout," and why it matters to SQL writers and Looker users alike.

Starting with a friendly join

 

Let's start off with a simple example, where we'll join together a couple of tables. Our first table will show our customers' names and the number of visits each customer has made to our website:
 

customer

customer_id first_name last_name visits
1 Amelia Earhart 2
2 Charles Lindbergh 2
3 Wilbur Wright 4


Our second table will include all the orders that those customers have placed. You can see that each order is linked to the customer who placed it by the customer's ID.
 

order

order_id amount customer_id
1 25.00 1
2 50.00 1
3 75.00 2
4 100.00 3


Joining these tables together in SQL would be pretty straightforward:

SELECT    *
FROM customer
LEFT JOIN order
ON customer.customer_id = order.customer_id


The result of that query would be this table:
 

customer_id first_name last_name visits order_id amount customer_id
1 Amelia Earhart 2 1 25.00 1
1 Amelia Earhart 2 2 50.00 1
2 Charles Lindbergh 2 3 75.00 2
3 Wilbur Wright 4 4 100.00 3


Aggregate functions gone bad


Now that we have a joined table, we need to be careful about how we use aggregate functions like COUNT and SUM.
 

Aggregate functions on a single table


Let's consider the customer table all by itself again. If we want to know the total number of customers, we can execute a simple query like this:

SELECT COUNT(*)
FROM customer

SQL will count up the rows in the table as follows:
 

customer

COUNT customer_id first_name last_name visits
1 1 Amelia Earhart 2
2 2 Charles Lindbergh 2
3 3 Wilbur Wright 4


We'll get a count of 3, which is correct.

Or, if we want to know the total number of customer visits, we can execute another straightforward query like this:

SELECT SUM(visits)
FROM customer

SQL will add up the number of visits in the table as follows:

customer

customer_id first_name last_name visits
1 Amelia Earhart 2
2 Charles Lindbergh + 2
3 Wilbur Wright + 4
SUM 8


We'll get a result of 8, which is also correct.
 

Aggregate functions on the joined tables


So far, so good. However, if we try to use the same aggregate functions on either of our joined tables, we'll start to see incorrect results.

Running a basic count on the joined table, we will no longer get the correct number of customers:

SELECT    COUNT(*)
FROM customer
LEFT JOIN order
ON customer.customer_id = order.customer_id

SQL will count up the rows in the table as follows:

COUNT customer_id first_name last_name visits order_id amount customer_id
1 1 Amelia Earhart 2 1 25.00 1
2 1 Amelia Earhart 2 2 50.00 1
3 2 Charles Lindbergh 2 3 75.00 2
4 3 Wilbur Wright 4 4 100.00 3


We'll get a result of 4, even though there are really only 3 customers. You can see that Amelia is counted twice.

Similarly, if we try to sum the number of visits, we will no longer get the correct result:

SELECT    SUM(visits)
FROM customer
LEFT JOIN order
ON customer.customer_id = order.customer_id

SQL will add up the number of visits in the table as follows:

customer_id first_name last_name visits order_id amount customer_id
1 Amelia Earhart 2 1 25.00 1
1 Amelia Earhart + 2 2 50.00 1
2 Charles Lindbergh + 2 3 75.00 2
3 Wilbur Wright + 4 4 100.00 3
SUM 10  


We'll get a result of 10, even though there are only 8 visits. Amelia's 2 visits are added twice.
 

A fanout happened while you weren't looking


In the example we've been looking at, the primary table (customer) had only three rows. The "primary table" is the table that is in the FROM clause of our SQL queries. After the join, we now have 4 rows. Since the joined table has more rows than the primary table, we say that a fanout has occurred.

To avoid a fanout, we can write the join in the opposite order. So, instead of this:

SELECT    *
FROM customer
LEFT JOIN order
ON customer.customer_id =
order.customer_id

We would do this:

SELECT    *
FROM order
LEFT JOIN customer
ON order.customer_id =
customer.customer_id

Now our joined table looks like this:

order_id amount customer_id customer_id first_name last_name visits
1 25.00 1 1 Amelia Earhart 2
2 50.00 1 1 Amelia Earhart 2
3 75.00 2 2 Charles Lindbergh 2
4 100.00 3 3 Wilbur Wright 4


The eagle-eyed observer will recognize that this is exactly like our other joined table, except for the order of the columns. That is true, and it's why most writers of SQL think of our two different joins as exactly the same.

However, when we do the join in this order, we do not have a fanout. Our original table (order) had four rows, and our joined table also has four rows, so there has been no fanout. Herein lies a key point: To help avoid fanouts, begin your joins with the most granular table.
 

Fanouts, schmanouts, who cares?


In both the fanout and the non-fanout cases, we still need to worry about the accuracy of aggregate functions. However, there is a subtle difference in the type of problems we're going to see. If you use Looker, this is a very important concept.

  • No fanout — You can trust aggregate functions on your primary table but not necessarily on your joined tables.
  • Fanout — You cannot necessarily trust aggregate functions on either your primary table or your joined tables.

To drive home this point, let's take a look at our previous examples.
 

Fanout example


If you'll recall from the previous example, the following join resulted in a fanout, because while the customer table only had three rows, the joined table had four rows.

SELECT    *
FROM customer
LEFT JOIN order
ON customer.customer_id = order.customer_id
customer_id first_name last_name visits order_id amount customer_id
1 Amelia Earhart 2 1 25.00 1
1 Amelia Earhart 2 2 50.00 1
2 Charles Lindbergh 2 3 75.00 2
3 Wilbur Wright 4 4 100.00 3


Since we are in a fanout situation, we cannot trust that aggregate functions will work on the primary table (customer). As we saw previously, SUM(visits) will give us a value of 10, even though only 8 visits have actually occurred.
 

No fanout example


When we reversed the join, we did not get a fanout, because the order table had four rows, and the joined table also had four rows.

SELECT    *
FROM order
LEFT JOIN customer
ON order.customer_id = customer.customer_id
order_id amount customer_id customer_id first_name last_name visits
1 25.00 1 1 Amelia Earhart 2
2 50.00 1 1 Amelia Earhart 2
3 75.00 2 2 Charles Lindbergh 2
4 100.00 3 3 Wilbur Wright 4


Without any fanout, we can trust that aggregate functions will work on the primary table (order). For example, SUM(amount) will give us a value of 250.00, which is the correct amount of money collected.
 

Two friendly joins, one frenemy join


If we want to avoid fanouts, it's important that we understand the three different types of joins.
 

One-to-one (friendly)


If one row of your primary table only ever matches up with one row of your joined table, you have a one-to-one join. This type of join will not result in a fanout, and aggregate functions will be accurate no matter where you use them.

Example: Suppose you have a person table and a DNA table. Since only one person can be matched with one DNA record, this is a one-to-one join.
 

Many-to-one (friendly)


If many rows of your primary table match up with the same row in your joined table, you have a many-to-one join. This type of join will also not result in a fanout, and aggregate functions will at least be accurate on the primary table.

Example: Suppose you have a person table and a state-of-residence table. Since many people can live in one state, this is a many-to-one join.
 

One-to-many (frenemy)


If one row of your primary table can match up with multiple rows in your joined table, you have a one-to-many join. This type of join can result in a fanout, and aggregate functions are not necessarily accurate anywhere.

Example: Suppose you have a person table and a children table. Since one person can have more than one child, this is a one-to-many join.
 

A fanout witch hunt


Understand your join type


The first, and preferred, method to check for a fanout is to understand the type of join that is occurring. One-to-one and many-to-one joins won't ever result in a fanout. However, if you know you are in a one-to-many situation, then there will always be the risk of a fanout. Even if a fanout has not already occurred, it will be a risk in the future if new rows are added.
 

Count rows before and after the join


The second method you can use to check for a fanout is to query a COUNT before and after the join. The queries would look like this:

SELECT COUNT(*)
FROM my_primary_table

SELECT COUNT(*)
FROM my_primary_table
LEFT JOIN my_joined_table
ON my_primary_table.my_column_1 = my_joined_table.my_column_2

If the count increases between the two queries, we know that a fanout has occurred. Since we're looking for an increase, it's important that the second query use a LEFT JOIN. We don't want to artificially decrease the number of rows being reported just because a row in the primary table doesn't have a corresponding row in the joined table.

Unfortunately, this method cannot tell you if there is a risk of a future fanout. To know that, you need to understand the type of join that is occurring.
 

Looker to the rescue


If you're one of the lucky folks who use Looker, there are several LookML options that can help protect you from these pitfalls.
 

Joining with foreign_key makes fanouts impossible


In Looker, you'll typically define joins between your views (views represent an existing or derived table) in your model file. At Looker, we recommend you implement joins using the foreign_key declaration when possible. In LookML, a join that uses this method will be written like this:

explore: person {
join: dna {
foreign_key: dna_id
}
}

In plain English, we would interpret this to mean "take the primary key from dna, and join it to dna_id from person".

It is impossible for a join performed with foreign_key to result in a fanout, as long as you've defined your primary keys correctly. Since the joined views are being joined with their primary key, which is unique, we know that only one row from the joined views can ever be matched with a row from the explore.

Sometimes the column relationships in your database will not allow you to use foreign_key, even though no fanout will occur. That's perfectly fine, and in this situation you can use sql_on to define your join. The example above could be written like this:

explore: person {
join: dna {
sql_on: ${dna.id} = ${person.dna_id} ;;
}
}

As we learned in this article, if you eliminate the possibility of a fanout, you can be sure that the measures in your explore can be trusted.
 

Aggregate measures from joined views are hidden unless the join is one-to-one


That sure is a mouthful, so let's work through it. In this article, we've seen that aggregate functions don't always work for tables that have been joined in. For that reason, Looker will not display aggregate functions (except for counts) for joined views. That way, your users will not accidentally retrieve bad information.

Counts can be used because Looker automatically uses a COUNT DISTINCT in these cases. This is safe because, even if a row from a joined table is repeated multiple times, a COUNT DISTINCT will only count it once.

Sometimes it is safe to use other aggregate functions, besides just COUNT DISTINCT, on joined views. This occurs when the join relationship is one-to-one. If you know that a join is one-to-one, you can tell Looker about this, using the relationship declaration. It looks like this:

explore: person {
join: dna {
foreign_key: dna_id
relationship: one_to_one
}
}

When you make this declaration, Looker will start displaying aggregate functions for the joined view.
 

When a fanout is OK


Sometimes you know that a fanout is going to occur, but you've thought it through and decided that you don't need aggregate functions or that the ones you plan to use will work properly. In that case, Looker will definitely allow you to define your joins however you like. Just use sql_on to define your join. As a reminder, the syntax will look like:

explore: person {
join: child {
sql_on: ${child.parent_id} = ${person.id} ;;
}
}

But I want to have my cake and eat it, too!


There are some clever ways to avoid inaccuracy with aggregate functions, even if you have problematic joins. If you can force the information in your joined tables to have a one-to-one relationship with your primary table, you can use aggregate functions to your heart's content.
 

Group your data in a one-to-one join


At the very beginning of this article, we started with the customer table, joined in order, and then saw some bad behavior when we extracted total customers and total visits. As a reminder, the tables look like this:

customer
customer_id first_name last_name visits
1 Amelia Earhart 2
2 Charles Lindbergh 2
3 Wilbur Wright 4
order
order_id amount customer_id
1 25.00 1
2 50.00 1
3 75.00 2
4 100.00 3


One approach we might take, to get this data joined together in a safe way, is to group the order table by customer_id. In so doing, we'll create a single row to be matched with each row in customer and end up with a one-to-one relationship. The SQL to perform this grouping might be written as:

SELECT   customer_id,
SUM(amount) AS total_amount
FROM order
GROUP BY customer_id

The resulting table will look like this:

customer_id total_amount
1 75.00
2 75.00
3 105.00


If you compare this result set to the customer table, you can see that it will join quite nicely. To actually execute the join in SQL, you would need to use a sub query:

SELECT *
FROM customer
LEFT JOIN
(
SELECT customer_id,
SUM(amount) AS total_amount
FROM order
GROUP BY customer_id
)
AS customer_totals
ON customer.customer_id = customer_totals.customer_id

The resulting table will look like this:

customer_id first_name last_name visits customer_id total_amount
1 Amelia Earhart 2 1 75.00
2 Charles Lindbergh 2 2 75.00
3 Wilbur Wright 4 3 105.00


Now any aggregate function calculations we want to use (such as a SUM on visits or total_amount) will work just fine.
 

Using Looker's derived tables to do the same thing


Looker has a great feature called derived tables. Derived tables have many sophisticated options but can be used simply to achieve the grouping of the order table that we want. This is how the LookML would be written:

view: customer_totals {
derived_table: {
sql:
SELECT
customer_id,
SUM(amount) AS total_amount
FROM order
GROUP BY customer_id ;;
}
}

You could now treat this data as if it were a real table in your database, just like any other. You can define measures and dimensions just like any other view. This data can also be joined in LookML just like any other table. In this example, the LookML might be:

explore: customer {
join: customer_totals {
foreign_key: customer_id
relationship: one_to_one
}
}

Quick summary


To summarize everything we've just covered:

  • Aggregate functions like SUM and COUNT can misbehave if used against joined tables.
  • If a join has a one-to-one relationship, aggregate functions will work just fine.
  • If a join has a many-to-one relationship, aggregate functions will work on the primary table but might not work on the joined tables.
  • If a join has a one-to-many relationship, aggregate functions may not work anywhere.
  • If you're using Looker, protect yourself from many of these pitfalls by making your joins with foreign_key if possible, and setting relationship: one-to-one when appropriate.
  • If you need to join tables that have a one-to-many relationship, and want to use aggregate functions, try grouping the joined tables first such that they will have a one-to-one relationship with the primary table.

This topic has been closed for comments