Three ways to model EAV schemas and many-to-many r...

Victor_P · 11-30-2015 03:27 PM

Introduction

The entity-attribute-value (EAV) schema is often encountered in business tools where the number of possible attributes for an entity is very large, but any given record only has data for a handful of them. This schema is very flexible and allows for efficient storage of sparse data, but unfortunately is poorly suited for analytical queries. In this article we demonstrate three options for modeling the schema in Looker so the attributes are available for effective querying. For a broader discussion of EAV schemas, be sure to check out our blog post on the topic too.

Many-to-many relationships can be imagined as a special case of the EAV schema where the join table has no value column – only the association between two entities is what matters. The solutions offered here work equally well for many-to-many relationships; you will just have boolean values for the attributes instead of strings. See Option 3 for an example many-to-many dataset based on a product tagging system.

Example Schema

EAV is especially common for storing user data since there is a vast number of attributes you might want to store about a person. Let’s take this bookstore data as an example:

       users                user_attribute_values              user_attributes
---------------------  ------------------------------------  --------------------
 id  |     email       user_id | attribute_id |   value           id | name
-----+---------------  --------+--------------+------------  --------+-----------
 500 | jose@gmail.com    500   |       1      | inbound            1 | channel
 600 | jane@gmail.com    500   |       2      | Young Adult        2 | genre
                         500   |       3      | 12-18              3 | age_bracket
                       --------+--------------+------------        4 | zipcode
                         600   |       1      | outreach
                         600   |       2      | Sci-Fi

users is our typical entity with a primary key and some standard attribute columns. user_attributes is the enumeration of possible attributes with their id and name. user_attribute_values joins these two tables and stores the value of that particular user+attribute combination. Note that each user need not be associated with every attribute; here we have no zipcodes for either user, nor the age bracket for Jane.

Option 1: Create a Fact Table

With this method we’ll use Looker’s Persistent Derived Tables (PDTs) to create a new table that captures the desired attribute data in standard table columns. Each row of this table represents a single user, so it can be joined one-to-one anywhere we explore the users entity.

- view: user_facts
  derived_table:
    sql_trigger_value: [insert trigger condition]
    sql: |
      SELECT
        users.id AS user_id
        , MAX( IF(attributes.name = 'channel',     attr_values.value, NULL) ) AS channel
        , MAX( IF(attributes.name = 'genre',       attr_values.value, NULL) ) AS genre
        , MAX( IF(attributes.name = 'age_bracket', attr_values.value, NULL) ) AS age_bracket
      FROM users
      LEFT JOIN user_attribute_values AS attr_values
        ON attr_values.user_id = users.id
      LEFT JOIN user_attributes AS attributes
        ON attributes.id = attr_values.attribute_id
      GROUP BY 1
    
  fields:
  - dimension: user_id
    primary_key: true
    sql: ${TABLE}.user_id

  - dimension: channel
    sql: ${TABLE}.channel

  - dimension: genre
    sql: ${TABLE}.genre

  - dimension: age_bracket
    sql: ${TABLE}.age_bracket

  - dimension: is_target_market
    type: yesno
    sql: ${age_bracket} = '12-18' OR ${genre} = 'Young Adult'

The advantage of this solution is that the user attributes become dimensions just like any other Looker view. You can also embellish the code to cast attributes into the appropriate type (e.g. dates, tiers) or build measures off them. The disadvantage is that the PDT code must enumerate every attribute to be exposed.

Note how we can create more complex filter conditions – like an “OR” between two fields – using custom dimensions with relevant business logic in the sql parameter. In this case is_target_market checks both the age_bracket and genre attributes for important values.

Option 2: Dynamic Joins

This solution trades some of the simplicity and reusability of the fact table for greater flexibility in choosing which attributes to report. It makes use of templated filters to take the desired attribute names as user input. While this method is a little more complex, it means users can report/filter on any attribute without needing to define them ahead of time.

First we need to make sure the attribute names are available in the same table as the values, like this:

        user_attribute_values     
-------------------------------------- 
user_id | attribute_name |   value           
--------+----------------+------------  
   500  | channel        | inbound           
   500  | genre          | Young Adult  
   500  | age_bracket    | 12-18    
--------+----------------+------------
   600  | channel        | outreach
   600  | genre          | Sci-Fi

If your schema is not already in this form, we can achieve the de-normalization with a simple PDT. In this snippet I’m also including the new fields that will be needed for our Explore:

  - view: user_attribute_values
    derived_table:
      sql_trigger_value: [insert trigger condition]
      sql: |
        SELECT
          attr_values.user_id AS user_id
          , attributes.name AS attribute_name
          , attr_values.value AS value
        FROM user_attribute_values AS attr_values
        LEFT JOIN user_attributes AS attributes
          ON attributes.id = attr_values.attribute_id

    fields:

    # material columns, hidden since only needed for joins
    #
    - dimension: user_id
      hidden: true
      sql: ${TABLE}.user_id

    - dimension: attribute_name
      hidden: true
      sql: ${TABLE}.attribute_name

    # these filters can be embellished
    # with default_values and/or suggestions
    #
    - filter: attribute_1_name

    - filter: attribute_2_name

    - filter: attribute_3_name

    # dimensions to report the dynamic values
    # --one for each dynamic join
    #
    - dimension: attribute_1_value
      sql: ${TABLE}.value

    - dimension: attribute_2_value
      sql: ${TABLE}.value

    - dimension: attribute_3_value
      sql: ${TABLE}.value

Here is the Explore definition. Note how templated filters are used to join only the attribute rows with the desired name:

- explore: users
  joins:
  - join: dynamic_1
    from: user_attribute_values
    view_label: "Dynamic User Attributes"
    sql_on: |
      ${dynamic_1.user_id} = ${users.id}
      AND
      {% condition dynamic_1.attribute_1_name %} ${dynamic_1.attribute_name} {% endcondition %}
    fields: [attribute_1_name, attribute_1_value]
    relationship: one_to_one

  - join: dynamic_2
    from: user_attribute_values
    view_label: "Dynamic User Attributes"
    sql_on: |
      ${dynamic_2.user_id} = ${users.id}
      AND
      {% condition dynamic_2.attribute_2_name %} ${dynamic_2.attribute_name} {% endcondition %}
    fields: [attribute_2_name, attribute_2_value]
    relationship: one_to_one

  - join: dynamic_3
    from: user_attribute_values
    view_label: "Dynamic User Attributes"
    sql_on: |
      ${dynamic_3.user_id} = ${users.id}
      AND
      {% condition dynamic_3.attribute_3_name %} ${dynamic_3.attribute_name} {% endcondition %}
    fields: [attribute_3_name, attribute_3_value]
    relationship: one_to_one

We use this Explore by setting the attribute name filter “is equal to” the desired attribute, and reporting on the value dimensions. You don’t have to use every field:

Like any Explore, we can also filter on the values and aggregate with measures:

With this solution, setting multiple conditions on a single attribute name will cause a fan-out of the user records. We can exploit this to natively achieve an OR filter between values, so long as the measure only considers distinct records:

Option 3: Search a serialized list of tags (many-to-many associations only)

The strategy behind this solution is to condense all the attributes for an entity into a single column, so we can then search that column for specific values of interest. This is only practical for many-to-many relationships (rather than a full EAV) because serializing both the attributes and their values would be overly complex. Here is an example dataset for product tags:

       products                tag_assigns            tags
------------------------   -------------------   --------------
 id  |       name          product_id | tag_id   id | name
-----+------------------   -----------+-------   ---+----------
 600 | Workman Coveralls       600    |   2       1 | Skirts
 700 | Citrus Skirt Suit       600    |   4       2 | Jumpsuits
 800 | Plaid Romper            600    |   1       3 | Casual
                               700    |   3       4 | Work
                               700    |   5       5 | 80s    
                               800    |   3
                               800    |   2

To consolidate the tags into a single column we use a PDT like the one below. Note that this requires a SQL dialect with aggregate function like GROUP_CONCAT (MySQL) or ARRAY_AGG (Postgres). Unfortunately, Redshift does not yet support this type of function.

- view: serialized_product_tags
  derived_table:
    sql_trigger_value: [insert trigger condition]
    sql: |
      SELECT
        products.id AS product_id
        , GROUP_CONCAT(DISTINCT LOWER(tags.name) SEPARATOR ',') AS tags_list
      FROM ${products.SQL_TABLE_NAME} AS products
      LEFT JOIN ${tag_assigns.SQL_TABLE_NAME} AS tag_assigns
        ON tag_assigns.product_id = products.id
      LEFT JOIN ${tags.SQL_TABLE_NAME} AS tags
        ON tags.id = tag_assigns.tag_id
      GROUP BY 1

  fields:
  - dimension: product_id
    primary_key: true
    hidden: true
    sql: ${TABLE}.product_id

  - dimension: tags_list
    sql: ${TABLE}.tags_list

Then we just join this view one-to-one in our Products explore. The advantage of this solution is that you can do fairly complex AND and OR filtering without any additional modeling. Just be sure to select the ‘contains’ filter type:

If we do add another dimension for the same list column, we can achieve even more complex filtering by combining types like “contains” and “doesn’t contain”:

I hope this post has inspired you to tackle some of the EAV data in your own data warehouse. If you have questions or suggestions, let us know in the comments below!

scott_hoover

For a few SQL dialects, including PostgreSQL and Snowflake, this can be handled with a fourth option—namely, aggregating the information into a semi-structured data type and using LookML to selectively choose the attributes of interest. Continuing with @Victor_P’s user_attribute_values example:

###PostgreSQL (9.4 or higher)

select user_id
   , json_object_agg(attribute_name, value) as attribute_value_map
from user_attribute_values
group by 1

###Snowflake

select user_id
   , object_agg(attribute_name, value::variant) as attribute_value_map
from user_attribute_values
group by 1

In both cases, this would yield:

At this point, it’s a matter of joining this data back into our primary table and defining some dimensions to traverse the json/variant data structure:

###PostgreSQL

- dimension: channel
  type: string
  sql: ${TABLE}.attribute_value_map -> 'channel'

###Snowflake

- dimension: channel
  type: string
  sql: ${TABLE}.attribute_value_map:channel::string

I like this solution for a few reasons:

It assumes minimal knowledge of the attributes at play. Rather than enumerating every attribute-value pair, we can use a general solution.
It follows that this solution is general enough to handle evolving attribute-value pairs.
Choice is placed in the LookML developer’s hands rather than the data engineer or author of the SQL.
This pattern yields a sparse data structure, which is nice for certain large-scale machine-learning algorithms.

Looker friends … please install the markdown-table plugin for Discourse ❤️

fabio1

Option #2 with “disjoint” attribute joins
Option #2 above can be appealing since it is so flexible. But depending on your data, you may run into issues with fan-out. Of course, you can address that. Once again continuing on Victor’s user_attributes example:

explore: users {
  join: attribute_number {
    #This ensures that one-to-many's don't result in multiplicative fan-out
    type: left_outer
    sql_on: ${attribute_number.n} IS NOT NULL ;;
  }
  join: attribute_1 {
    from:  user_attribute_values
    type: left_outer
    relationship: one_to_many
    sql_on:
        ${attribute_number.n} = 1
        AND ${attribute_1.user_id} = ${users.id}
        AND {% condition attribute_1.attribute_picker %} ${attribute_1.attribute_name} {% endcondition %}
    ;;
  }
  join: attribute_2 {
    from:  user_attribute_values
    type: left_outer
    relationship: one_to_many
    sql_on:
        ${attribute_number.n} = 2
        AND ${attribute_2.user_id} = ${users.id}
        AND {% condition attribute_2.attribute_picker %} ${attribute_2.attribute_name} {% endcondition %}
    ;;
  }
}

view: attribute_number {
  #The below allows the join to only expose attribute numbers for the selected filters
  sql_table_name:(
    SELECT CASE WHEN {% condition attribute_1.attribute_picker %} NULL {% endcondition %} THEN NULL ELSE 1 END as n
    UNION ALL
    SELECT CASE WHEN {% condition attribute_2.attribute_picker %} NULL {% endcondition %} THEN NULL ELSE 2 END as n
    );; # You can add on an additional UNION'ed select for each time you want to be able to join on attributes
  dimension: n {
    type: number
    hidden: yes
    sql: ${TABLE}.n ;;
  }
}

gregono

Option 3 is now possible in Redshift using the window function LISTAGG.

To apply this to the provided example and add an ordering by name:

  SELECT
    products.id AS product_id
    , LISTAGG(LOWER(tags.name),',') WITHIN GROUP (ORDER BY tags.name) AS tags_list
  FROM
    (
      SELECT DISTINCT products.id, tags.name
      FROM ${products.SQL_TABLE_NAME} AS products
        LEFT JOIN ${tag_assigns.SQL_TABLE_NAME} AS tag_assigns
          ON tag_assigns.product_id = products.id
        LEFT JOIN ${tags.SQL_TABLE_NAME} AS tags
          ON tags.id = tag_assigns.tag_id
    ) q
  GROUP BY 1

Note LISTAGG does not support DISTINCT, so a SELECT DISTINCT subquery is needed. If you have to use LISTAGG on multiple columns, it can get tricky. There are some solutions proposed here: