Question

Three ways to model EAV schemas and many-to-many relationships

  • 1 December 2015
  • 3 replies
  • 1182 views

Userlevel 2

Introduction


The entity-attribute-value (EAV) schema is often encountered in business tools where the number of possible attributes for an entity is very large, but any given record only has data for a handful of them. This schema is very flexible and allows for efficient storage of sparse data, but unfortunately is poorly suited for analytical queries. In this article we demonstrate three options for modeling the schema in Looker so the attributes are available for effective querying. For a broader discussion of EAV schemas, be sure to check out our blog post on the topic too.


Many-to-many relationships can be imagined as a special case of the EAV schema where the join table has no value column – only the association between two entities is what matters. The solutions offered here work equally well for many-to-many relationships; you will just have boolean values for the attributes instead of strings. See Option 3 for an example many-to-many dataset based on a product tagging system.


Example Schema


EAV is especially common for storing user data since there is a vast number of attributes you might want to store about a person. Let’s take this bookstore data as an example:


       users                user_attribute_values              user_attributes
--------------------- ------------------------------------ --------------------
id | email user_id | attribute_id | value id | name
-----+--------------- --------+--------------+------------ --------+-----------
500 | jose@gmail.com 500 | 1 | inbound 1 | channel
600 | jane@gmail.com 500 | 2 | Young Adult 2 | genre
500 | 3 | 12-18 3 | age_bracket
--------+--------------+------------ 4 | zipcode
600 | 1 | outreach
600 | 2 | Sci-Fi

users is our typical entity with a primary key and some standard attribute columns. user_attributes is the enumeration of possible attributes with their id and name. user_attribute_values joins these two tables and stores the value of that particular user+attribute combination. Note that each user need not be associated with every attribute; here we have no zipcodes for either user, nor the age bracket for Jane.


Option 1: Create a Fact Table


With this method we’ll use Looker’s Persistent Derived Tables (PDTs) to create a new table that captures the desired attribute data in standard table columns. Each row of this table represents a single user, so it can be joined one-to-one anywhere we explore the users entity.


- view: user_facts
derived_table:
sql_trigger_value: [insert trigger condition]
sql: |
SELECT
users.id AS user_id
, MAX( IF(attributes.name = 'channel', attr_values.value, NULL) ) AS channel
, MAX( IF(attributes.name = 'genre', attr_values.value, NULL) ) AS genre
, MAX( IF(attributes.name = 'age_bracket', attr_values.value, NULL) ) AS age_bracket
FROM users
LEFT JOIN user_attribute_values AS attr_values
ON attr_values.user_id = users.id
LEFT JOIN user_attributes AS attributes
ON attributes.id = attr_values.attribute_id
GROUP BY 1

fields:
- dimension: user_id
primary_key: true
sql: ${TABLE}.user_id

- dimension: channel
sql: ${TABLE}.channel

- dimension: genre
sql: ${TABLE}.genre

- dimension: age_bracket
sql: ${TABLE}.age_bracket

- dimension: is_target_market
type: yesno
sql: ${age_bracket} = '12-18' OR ${genre} = 'Young Adult'

The advantage of this solution is that the user attributes become dimensions just like any other Looker view. You can also embellish the code to cast attributes into the appropriate type (e.g. dates, tiers) or build measures off them. The disadvantage is that the PDT code must enumerate every attribute to be exposed.


Note how we can create more complex filter conditions – like an “OR” between two fields – using custom dimensions with relevant business logic in the sql parameter. In this case is_target_market checks both the age_bracket and genre attributes for important values.


Option 2: Dynamic Joins


This solution trades some of the simplicity and reusability of the fact table for greater flexibility in choosing which attributes to report. It makes use of templated filters to take the desired attribute names as user input. While this method is a little more complex, it means users can report/filter on any attribute without needing to define them ahead of time.


First we need to make sure the attribute names are available in the same table as the values, like this:


        user_attribute_values     
--------------------------------------
user_id | attribute_name | value
--------+----------------+------------
500 | channel | inbound
500 | genre | Young Adult
500 | age_bracket | 12-18
--------+----------------+------------
600 | channel | outreach
600 | genre | Sci-Fi

If your schema is not already in this form, we can achieve the de-normalization with a simple PDT. In this snippet I’m also including the new fields that will be needed for our Explore:


  - view: user_attribute_values
derived_table:
sql_trigger_value: [insert trigger condition]
sql: |
SELECT
attr_values.user_id AS user_id
, attributes.name AS attribute_name
, attr_values.value AS value
FROM user_attribute_values AS attr_values
LEFT JOIN user_attributes AS attributes
ON attributes.id = attr_values.attribute_id

fields:

# material columns, hidden since only needed for joins
#
- dimension: user_id
hidden: true
sql: ${TABLE}.user_id

- dimension: attribute_name
hidden: true
sql: ${TABLE}.attribute_name

# these filters can be embellished
# with default_values and/or suggestions
#
- filter: attribute_1_name

- filter: attribute_2_name

- filter: attribute_3_name

# dimensions to report the dynamic values
# --one for each dynamic join
#
- dimension: attribute_1_value
sql: ${TABLE}.value

- dimension: attribute_2_value
sql: ${TABLE}.value

- dimension: attribute_3_value
sql: ${TABLE}.value

Here is the Explore definition. Note how templated filters are used to join only the attribute rows with the desired name:


- explore: users
joins:
- join: dynamic_1
from: user_attribute_values
view_label: "Dynamic User Attributes"
sql_on: |
${dynamic_1.user_id} = ${users.id}
AND
{% condition dynamic_1.attribute_1_name %} ${dynamic_1.attribute_name} {% endcondition %}
fields: [attribute_1_name, attribute_1_value]
relationship: one_to_one

- join: dynamic_2
from: user_attribute_values
view_label: "Dynamic User Attributes"
sql_on: |
${dynamic_2.user_id} = ${users.id}
AND
{% condition dynamic_2.attribute_2_name %} ${dynamic_2.attribute_name} {% endcondition %}
fields: [attribute_2_name, attribute_2_value]
relationship: one_to_one

- join: dynamic_3
from: user_attribute_values
view_label: "Dynamic User Attributes"
sql_on: |
${dynamic_3.user_id} = ${users.id}
AND
{% condition dynamic_3.attribute_3_name %} ${dynamic_3.attribute_name} {% endcondition %}
fields: [attribute_3_name, attribute_3_value]
relationship: one_to_one

We use this Explore by setting the attribute name filter “is equal to” the desired attribute, and reporting on the value dimensions. You don’t have to use every field:



Like any Explore, we can also filter on the values and aggregate with measures:



With this solution, setting multiple conditions on a single attribute name will cause a fan-out of the user records. We can exploit this to natively achieve an OR filter between values, so long as the measure only considers distinct records:



Option 3: Search a serialized list of tags (many-to-many associations only)


The strategy behind this solution is to condense all the attributes for an entity into a single column, so we can then search that column for specific values of interest. This is only practical for many-to-many relationships (rather than a full EAV) because serializing both the attributes and their values would be overly complex. Here is an example dataset for product tags:


       products                tag_assigns            tags
------------------------ ------------------- --------------
id | name product_id | tag_id id | name
-----+------------------ -----------+------- ---+----------
600 | Workman Coveralls 600 | 2 1 | Skirts
700 | Citrus Skirt Suit 600 | 4 2 | Jumpsuits
800 | Plaid Romper 600 | 1 3 | Casual
700 | 3 4 | Work
700 | 5 5 | 80s
800 | 3
800 | 2

To consolidate the tags into a single column we use a PDT like the one below. Note that this requires a SQL dialect with aggregate function like GROUP_CONCAT (MySQL) or ARRAY_AGG (Postgres). Unfortunately, Redshift does not yet support this type of function.


- view: serialized_product_tags
derived_table:
sql_trigger_value: [insert trigger condition]
sql: |
SELECT
products.id AS product_id
, GROUP_CONCAT(DISTINCT LOWER(tags.name) SEPARATOR ',') AS tags_list
FROM ${products.SQL_TABLE_NAME} AS products
LEFT JOIN ${tag_assigns.SQL_TABLE_NAME} AS tag_assigns
ON tag_assigns.product_id = products.id
LEFT JOIN ${tags.SQL_TABLE_NAME} AS tags
ON tags.id = tag_assigns.tag_id
GROUP BY 1

fields:
- dimension: product_id
primary_key: true
hidden: true
sql: ${TABLE}.product_id

- dimension: tags_list
sql: ${TABLE}.tags_list

Then we just join this view one-to-one in our Products explore. The advantage of this solution is that you can do fairly complex AND and OR filtering without any additional modeling. Just be sure to select the ‘contains’ filter type:



If we do add another dimension for the same list column, we can achieve even more complex filtering by combining types like “contains” and “doesn’t contain”:



I hope this post has inspired you to tackle some of the EAV data in your own data warehouse. If you have questions or suggestions, let us know in the comments below!


3 replies

For a few SQL dialects, including PostgreSQL and Snowflake, this can be handled with a fourth option—namely, aggregating the information into a semi-structured data type and using LookML to selectively choose the attributes of interest. Continuing with @victor’s user_attribute_values example:


###PostgreSQL (9.4 or higher)


select user_id
, json_object_agg(attribute_name, value) as attribute_value_map
from user_attribute_values
group by 1

###Snowflake


select user_id
, object_agg(attribute_name, value::variant) as attribute_value_map
from user_attribute_values
group by 1

In both cases, this would yield:



At this point, it’s a matter of joining this data back into our primary table and defining some dimensions to traverse the json/variant data structure:


###PostgreSQL


- dimension: channel
type: string
sql: ${TABLE}.attribute_value_map -> 'channel'

###Snowflake


- dimension: channel
type: string
sql: ${TABLE}.attribute_value_map:channel::string

I like this solution for a few reasons:



  1. It assumes minimal knowledge of the attributes at play. Rather than enumerating every attribute-value pair, we can use a general solution.

  2. It follows that this solution is general enough to handle evolving attribute-value pairs.

  3. Choice is placed in the LookML developer’s hands rather than the data engineer or author of the SQL.

  4. This pattern yields a sparse data structure, which is nice for certain large-scale machine-learning algorithms.


Looker friends … please install the markdown-table plugin for Discourse <3

Userlevel 6
Badge

Option #2 with “disjoint” attribute joins

Option #2 above can be appealing since it is so flexible. But depending on your data, you may run into issues with fan-out. Of course, you can address that. Once again continuing on Victor’s user_attributes example:


explore: users {
join: attribute_number {
#This ensures that one-to-many's don't result in multiplicative fan-out
type: left_outer
sql_on: ${attribute_number.n} IS NOT NULL ;;
}
join: attribute_1 {
from: user_attribute_values
type: left_outer
relationship: one_to_many
sql_on:
${attribute_number.n} = 1
AND ${attribute_1.user_id} = ${users.id}
AND {% condition attribute_1.attribute_picker %} ${attribute_1.attribute_name} {% endcondition %}
;;
}
join: attribute_2 {
from: user_attribute_values
type: left_outer
relationship: one_to_many
sql_on:
${attribute_number.n} = 2
AND ${attribute_2.user_id} = ${users.id}
AND {% condition attribute_2.attribute_picker %} ${attribute_2.attribute_name} {% endcondition %}
;;
}
}

view: attribute_number {
#The below allows the join to only expose attribute numbers for the selected filters
sql_table_name:(
SELECT CASE WHEN {% condition attribute_1.attribute_picker %} NULL {% endcondition %} THEN NULL ELSE 1 END as n
UNION ALL
SELECT CASE WHEN {% condition attribute_2.attribute_picker %} NULL {% endcondition %} THEN NULL ELSE 2 END as n
);; # You can add on an additional UNION'ed select for each time you want to be able to join on attributes
dimension: n {
type: number
hidden: yes
sql: ${TABLE}.n ;;
}
}
Userlevel 2

Option 3 is now possible in Redshift using the window function LISTAGG.


To apply this to the provided example and add an ordering by name:


  SELECT
products.id AS product_id
, LISTAGG(LOWER(tags.name),',') WITHIN GROUP (ORDER BY tags.name) AS tags_list
FROM
(
SELECT DISTINCT products.id, tags.name
FROM ${products.SQL_TABLE_NAME} AS products
LEFT JOIN ${tag_assigns.SQL_TABLE_NAME} AS tag_assigns
ON tag_assigns.product_id = products.id
LEFT JOIN ${tags.SQL_TABLE_NAME} AS tags
ON tags.id = tag_assigns.tag_id
) q
GROUP BY 1

Note LISTAGG does not support DISTINCT, so a SELECT DISTINCT subquery is needed. If you have to use LISTAGG on multiple columns, it can get tricky. There are some solutions proposed here:


Reply