Question

Three ways to model EAV schemas and many-to-many relationships

  • 1 December 2015
  • 3 replies
  • 2908 views

Userlevel 2

Introduction



The entity-attribute-value (EAV) schema is often encountered in business tools where the number of possible attributes for an entity is very large, but any given record only has data for a handful of them. This schema is very flexible and allows for efficient storage of sparse data, but unfortunately is poorly suited for analytical queries. In this article we demonstrate three options for modeling the schema in Looker so the attributes are available for effective querying. For a broader discussion of EAV schemas, be sure to check out our blog post on the topic too.



Many-to-many relationships can be imagined as a special case of the EAV schema where the join table has no value column – only the association between two entities is what matters. The solutions offered here work equally well for many-to-many relationships; you will just have boolean values for the attributes instead of strings. See Option 3 for an example many-to-many dataset based on a product tagging system.



Example Schema



EAV is especially common for storing user data since there is a vast number of attributes you might want to store about a person. Let’s take this bookstore data as an example:



       users                user_attribute_values              user_attributes

--------------------- ------------------------------------ --------------------

id | email user_id | attribute_id | value id | name

-----+--------------- --------+--------------+------------ --------+-----------

500 | jose@gmail.com 500 | 1 | inbound 1 | channel

600 | jane@gmail.com 500 | 2 | Young Adult 2 | genre

500 | 3 | 12-18 3 | age_bracket

--------+--------------+------------ 4 | zipcode

600 | 1 | outreach

600 | 2 | Sci-Fi



users is our typical entity with a primary key and some standard attribute columns. user_attributes is the enumeration of possible attributes with their id and name. user_attribute_values joins these two tables and stores the value of that particular user+attribute combination. Note that each user need not be associated with every attribute; here we have no zipcodes for either user, nor the age bracket for Jane.



Option 1: Create a Fact Table



With this method we’ll use Looker’s Persistent Derived Tables (PDTs) to create a new table that captures the desired attribute data in standard table columns. Each row of this table represents a single user, so it can be joined one-to-one anywhere we explore the users entity.



- view: user_facts

derived_table:

sql_trigger_value: [insert trigger condition]

sql: |

SELECT

users.id AS user_id

, MAX( IF(attributes.name = 'channel', attr_values.value, NULL) ) AS channel

, MAX( IF(attributes.name = 'genre', attr_values.value, NULL) ) AS genre

, MAX( IF(attributes.name = 'age_bracket', attr_values.value, NULL) ) AS age_bracket

FROM users

LEFT JOIN user_attribute_values AS attr_values

ON attr_values.user_id = users.id

LEFT JOIN user_attributes AS attributes

ON attributes.id = attr_values.attribute_id

GROUP BY 1



fields:

- dimension: user_id

primary_key: true

sql: ${TABLE}.user_id



- dimension: channel

sql: ${TABLE}.channel



- dimension: genre

sql: ${TABLE}.genre



- dimension: age_bracket

sql: ${TABLE}.age_bracket



- dimension: is_target_market

type: yesno

sql: ${age_bracket} = '12-18' OR ${genre} = 'Young Adult'



The advantage of this solution is that the user attributes become dimensions just like any other Looker view. You can also embellish the code to cast attributes into the appropriate type (e.g. dates, tiers) or build measures off them. The disadvantage is that the PDT code must enumerate every attribute to be exposed.



Note how we can create more complex filter conditions – like an “OR” between two fields – using custom dimensions with relevant business logic in the sql parameter. In this case is_target_market checks both the age_bracket and genre attributes for important values.



Option 2: Dynamic Joins



This solution trades some of the simplicity and reusability of the fact table for greater flexibility in choosing which attributes to report. It makes use of templated filters to take the desired attribute names as user input. While this method is a little more complex, it means users can report/filter on any attribute without needing to define them ahead of time.



First we need to make sure the attribute names are available in the same table as the values, like this:



        user_attribute_values     

--------------------------------------

user_id | attribute_name | value

--------+----------------+------------

500 | channel | inbound

500 | genre | Young Adult

500 | age_bracket | 12-18

--------+----------------+------------

600 | channel | outreach

600 | genre | Sci-Fi



If your schema is not already in this form, we can achieve the de-normalization with a simple PDT. In this snippet I’m also including the new fields that will be needed for our Explore:



  - view: user_attribute_values

derived_table:

sql_trigger_value: [insert trigger condition]

sql: |

SELECT

attr_values.user_id AS user_id

, attributes.name AS attribute_name

, attr_values.value AS value

FROM user_attribute_values AS attr_values

LEFT JOIN user_attributes AS attributes

ON attributes.id = attr_values.attribute_id



fields:



# material columns, hidden since only needed for joins

#

- dimension: user_id

hidden: true

sql: ${TABLE}.user_id



- dimension: attribute_name

hidden: true

sql: ${TABLE}.attribute_name



# these filters can be embellished

# with default_values and/or suggestions

#

- filter: attribute_1_name



- filter: attribute_2_name



- filter: attribute_3_name



# dimensions to report the dynamic values

# --one for each dynamic join

#

- dimension: attribute_1_value

sql: ${TABLE}.value



- dimension: attribute_2_value

sql: ${TABLE}.value



- dimension: attribute_3_value

sql: ${TABLE}.value



Here is the Explore definition. Note how templated filters are used to join only the attribute rows with the desired name:



- explore: users

joins:

- join: dynamic_1

from: user_attribute_values

view_label: "Dynamic User Attributes"

sql_on: |

${dynamic_1.user_id} = ${users.id}

AND

{% condition dynamic_1.attribute_1_name %} ${dynamic_1.attribute_name} {% endcondition %}

fields: [attribute_1_name, attribute_1_value]

relationship: one_to_one



- join: dynamic_2

from: user_attribute_values

view_label: "Dynamic User Attributes"

sql_on: |

${dynamic_2.user_id} = ${users.id}

AND

{% condition dynamic_2.attribute_2_name %} ${dynamic_2.attribute_name} {% endcondition %}

fields: [attribute_2_name, attribute_2_value]

relationship: one_to_one



- join: dynamic_3

from: user_attribute_values

view_label: "Dynamic User Attributes"

sql_on: |

${dynamic_3.user_id} = ${users.id}

AND

{% condition dynamic_3.attribute_3_name %} ${dynamic_3.attribute_name} {% endcondition %}

fields: [attribute_3_name, attribute_3_value]

relationship: one_to_one



We use this Explore by setting the attribute name filter “is equal to” the desired attribute, and reporting on the value dimensions. You don’t have to use every field:





Like any Explore, we can also filter on the values and aggregate with measures:





With this solution, setting multiple conditions on a single attribute name will cause a fan-out of the user records. We can exploit this to natively achieve an OR filter between values, so long as the measure only considers distinct records:





Option 3: Search a serialized list of tags (many-to-many associations only)



The strategy behind this solution is to condense all the attributes for an entity into a single column, so we can then search that column for specific values of interest. This is only practical for many-to-many relationships (rather than a full EAV) because serializing both the attributes and their values would be overly complex. Here is an example dataset for product tags:



       products                tag_assigns            tags

------------------------ ------------------- --------------

id | name product_id | tag_id id | name

-----+------------------ -----------+------- ---+----------

600 | Workman Coveralls 600 | 2 1 | Skirts

700 | Citrus Skirt Suit 600 | 4 2 | Jumpsuits

800 | Plaid Romper 600 | 1 3 | Casual

700 | 3 4 | Work

700 | 5 5 | 80s

800 | 3

800 | 2



To consolidate the tags into a single column we use a PDT like the one below. Note that this requires a SQL dialect with aggregate function like GROUP_CONCAT (MySQL) or ARRAY_AGG (Postgres). Unfortunately, Redshift does not yet support this type of function.



- view: serialized_product_tags

derived_table:

sql_trigger_value: [insert trigger condition]

sql: |

SELECT

products.id AS product_id

, GROUP_CONCAT(DISTINCT LOWER(tags.name) SEPARATOR ',') AS tags_list

FROM ${products.SQL_TABLE_NAME} AS products

LEFT JOIN ${tag_assigns.SQL_TABLE_NAME} AS tag_assigns

ON tag_assigns.product_id = products.id

LEFT JOIN ${tags.SQL_TABLE_NAME} AS tags

ON tags.id = tag_assigns.tag_id

GROUP BY 1



fields:

- dimension: product_id

primary_key: true

hidden: true

sql: ${TABLE}.product_id



- dimension: tags_list

sql: ${TABLE}.tags_list



Then we just join this view one-to-one in our Products explore. The advantage of this solution is that you can do fairly complex AND and OR filtering without any additional modeling. Just be sure to select the ‘contains’ filter type:





If we do add another dimension for the same list column, we can achieve even more complex filtering by combining types like “contains” and “doesn’t contain”:





I hope this post has inspired you to tackle some of the EAV data in your own data warehouse. If you have questions or suggestions, let us know in the comments below!


3 replies

Userlevel 2

For a few SQL dialects, including PostgreSQL and Snowflake, this can be handled with a fourth option—namely, aggregating the information into a semi-structured data type and using LookML to selectively choose the attributes of interest. Continuing with @victor’s user_attribute_values example:



###PostgreSQL (9.4 or higher)



select user_id

, json_object_agg(attribute_name, value) as attribute_value_map

from user_attribute_values

group by 1



###Snowflake



select user_id

, object_agg(attribute_name, value::variant) as attribute_value_map

from user_attribute_values

group by 1



In both cases, this would yield:





At this point, it’s a matter of joining this data back into our primary table and defining some dimensions to traverse the json/variant data structure:



###PostgreSQL



- dimension: channel

type: string

sql: ${TABLE}.attribute_value_map -> 'channel'



###Snowflake



- dimension: channel

type: string

sql: ${TABLE}.attribute_value_map:channel::string



I like this solution for a few reasons:





  1. It assumes minimal knowledge of the attributes at play. Rather than enumerating every attribute-value pair, we can use a general solution.


  2. It follows that this solution is general enough to handle evolving attribute-value pairs.


  3. Choice is placed in the LookML developer’s hands rather than the data engineer or author of the SQL.


  4. This pattern yields a sparse data structure, which is nice for certain large-scale machine-learning algorithms.




Looker friends … please install the markdown-table plugin for Discourse <3

Userlevel 7
Badge

Option #2 with “disjoint” attribute joins


Option #2 above can be appealing since it is so flexible. But depending on your data, you may run into issues with fan-out. Of course, you can address that. Once again continuing on Victor’s user_attributes example:



explore: users {

join: attribute_number {

#This ensures that one-to-many's don't result in multiplicative fan-out

type: left_outer

sql_on: ${attribute_number.n} IS NOT NULL ;;

}

join: attribute_1 {

from: user_attribute_values

type: left_outer

relationship: one_to_many

sql_on:

${attribute_number.n} = 1

AND ${attribute_1.user_id} = ${users.id}

AND {% condition attribute_1.attribute_picker %} ${attribute_1.attribute_name} {% endcondition %}

;;

}

join: attribute_2 {

from: user_attribute_values

type: left_outer

relationship: one_to_many

sql_on:

${attribute_number.n} = 2

AND ${attribute_2.user_id} = ${users.id}

AND {% condition attribute_2.attribute_picker %} ${attribute_2.attribute_name} {% endcondition %}

;;

}

}



view: attribute_number {

#The below allows the join to only expose attribute numbers for the selected filters

sql_table_name:(

SELECT CASE WHEN {% condition attribute_1.attribute_picker %} NULL {% endcondition %} THEN NULL ELSE 1 END as n

UNION ALL

SELECT CASE WHEN {% condition attribute_2.attribute_picker %} NULL {% endcondition %} THEN NULL ELSE 2 END as n

);; # You can add on an additional UNION'ed select for each time you want to be able to join on attributes

dimension: n {

type: number

hidden: yes

sql: ${TABLE}.n ;;

}

}
Userlevel 3

Option 3 is now possible in Redshift using the window function LISTAGG.



To apply this to the provided example and add an ordering by name:



  SELECT

products.id AS product_id

, LISTAGG(LOWER(tags.name),',') WITHIN GROUP (ORDER BY tags.name) AS tags_list

FROM

(

SELECT DISTINCT products.id, tags.name

FROM ${products.SQL_TABLE_NAME} AS products

LEFT JOIN ${tag_assigns.SQL_TABLE_NAME} AS tag_assigns

ON tag_assigns.product_id = products.id

LEFT JOIN ${tags.SQL_TABLE_NAME} AS tags

ON tags.id = tag_assigns.tag_id

) q

GROUP BY 1



Note LISTAGG does not support DISTINCT, so a SELECT DISTINCT subquery is needed. If you have to use LISTAGG on multiple columns, it can get tricky. There are some solutions proposed here:



Reply