[Analytic Block] Tags and Attributes (with and without Arrays)

  • 5 February 2017
  • 0 replies
  • 2303 views

Userlevel 6
Badge

Looker has created this block to make analyzing data easier and more efficient. This Data Block is made available on an “as is” basis, meaning there will not be updates moving forward.



Tags and Attributes


Objects often have lists of attributes associated with them. Stack Overflow posts are tagged with the subjects they are about. Github issues are labeled with product areas they affect. Users may be tagged with different attributes over time to track the campaigns they were contacted by. Even words in a blog or news headline can be used as tags for an article or a post.


With tag data, analysts might want to ask questions such as:



  • “What are the most common tags in my data?”

  • “How much of my data is tagged a specific way?”

  • “Which tags appear most frequently together?”

  • “Is there any trend in price/percentage/views/purchases/etc. between the tags that occur most frequently?”


These questions can be difficult to answer with SQL. In this article, we will address two ways to model a pattern in LookML to make finding insights about tagged data much easier.


Two Implementations


Many SQL databases have an array type that can be used to store an array of strings in a particular data record. (An array is a variable that contains a list of values.) Postgres, Presto, Snowflake, BigQuery, and Athena all support arrays. They will all use the first implementation method outlined below.



Note: Redshift and MySQL don’t support arrays and will use the technique in the second implementation.



Implementation #1: Stack Overflow Data and SPLIT()


We will start with a very simple model of Stack Overflow data. Stack Overflow is a place where developers can ask questions and get answers. Posts generally contain the questions, and each post is tagged with subjects. Subject tags allow experts to easily find and answer questions that fall under their areas of expertise.


Here is a basic Stack Overflow data table:



Looking at these records, we can see that the tags are contained in a single string. For example, these are the tags for a question that reads “How to build EclipseLink”:


java | maven | jpa | eclipselink


Analysts might want to ask questions of this data, such as “What are the most common tags in articles that are also tagged with java?” and “How many articles are tagged with both java and maven?” The LookML model below will help us develop a model to answer these questions.


The Model


The stackoverflow_posts Explore joins together the stackoverflow_posts view, which contains post data, and the tag view, which contains tag data.


In order to analyze tag co-occurrence and answer the questions above, the tag view needs to be self-joined. The SPLIT() function in the sql join parameter makes this possible by breaking apart the array tags.



Note: Best practice is to divide views into separate view files and place Explores into a model file or Explore file.



explore: stackoverflow_posts {
join: tag {
sql: LEFT JOIN UNNEST(SPLIT(${stackoverflow_posts.tags},'|')) as tag ;;
relationship: one_to_many
}
join: tag2 {
from: tag
sql: LEFT JOIN UNNEST(SPLIT(${stackoverflow_posts.tags},'|')) as tag2 ;;
relationship: one_to_many
sql_where: ${tag.name} <> ${tag2.name} ;;
}
}

view: stackoverflow_posts {
sql_table_name: bigquery-public-data.stackoverflow.stackoverflow_posts ;;

dimension: id {
primary_key: yes
}

dimension: title {}

dimension: tags {}

dimension: owner_display_name {}

dimension: score {
type: number
}

dimension_group: creation {
type: creation_time
sql: ${TABLE}.creation_date ;;
}

measure: count {
type: count
drill_fields: [id, creation_time, title, tags, score]
}

measure: average_score {
type: average
sql: ${score} ;;
}
}

view: tag {

dimension: name {
sql: ${TABLE} ;;
}
}

With this model, we can effectively and easily answer our questions in Looker:


"What are the most common tags in articles that are also tagged with java?"



"How many articles are tagged with both java and maven?"



Getting Fancy: Comparing SQL Dialects


We can use the model above to get even deeper insights into the Stack Overflow data — such as determining which SQL dialects most often appear together in posts. We can build an Explore to answer this question by filtering multiple tags for a set of SQL dialects and then pivoting. We can also drill on the count measure for additional granularity:



The SQL Behind the Scenes


Let’s take a deeper dive into what’s happening behind the model. Take the basic Stack Overflow table from the first example:



The generated SQL for the basic Stack Overflow data table is fairly simple:


 SELECT
stackoverflow_posts.id AS stackoverflow_posts_id,
stackoverflow_posts.tags AS stackoverflow_posts_tags,
stackoverflow_posts.title AS stackoverflow_posts_title
FROM `bigquery-public-data.stackoverflow.stackoverflow_posts` AS stackoverflow_posts
WHERE
((stackoverflow_posts.tags IS NOT NULL AND LENGTH(stackoverflow_posts.tags) <> 0 ))
ORDER BY 1 DESC
LIMIT 500

Arrays, SPLIT, and UNNEST


An array is a variable that contains a list of values.


Different dialects have different names for functions that convert strings into arrays. Snowflakeand BigQuery have SPLIT, and Postgres has string_to_array. All of these functions will parse a string and return an array. For example, in BigQuery and Snowflake:


SELECT SPLIT('java|maven|jpa|eclipselink','|')


returns


['java','maven','jpa','eclipselink']


Once the data is in an array, we can turn it into a table function to fan out the table and join the data back in:


SELECT
stackoverflow_posts.id AS stackoverflow_posts_id,
tag AS tag_name,
stackoverflow_posts.tags AS stackoverflow_posts_tags,
stackoverflow_posts.title AS stackoverflow_posts_title
FROM `bigquery-public-data.stackoverflow.stackoverflow_posts` AS stackoverflow_posts
LEFT JOIN UNNEST(SPLIT(stackoverflow_posts.tags,'|')) as tag
WHERE
((stackoverflow_posts.tags IS NOT NULL AND LENGTH(stackoverflow_posts.tags) <> 0 ))
GROUP BY 1,2,3,4
ORDER BY 1 DESC
LIMIT 500

Notice that this is identical to the first Stack Overflow SQL except for two lines:


tag AS tag_name


and


LEFT JOIN UNNEST(SPLIT(stackoverflow_posts.tags,'|')) as tag


The first change is simply a dimension reference to tag. The interesting stuff happens in the second line. The SPLIT function turns the string field tags into an array of tag names. UNNEST joins the array as if it were a table. The effect is to repeat each row for each element of the array. If we run this query, we see the same data as before, except each article is repeated for each tag:



Implementation #2: Redshift, MySQL, and Other SQL Dialects That Don’t Support Arrays


Arrays can be simulated using a numbers table with a SPLIT_PART() function. SPLIT_PART() functions behave similarly to the SPLIT functions in SQL dialects that support arrays, except that they only return a single string value (generally the Nth one). In order to fan out the results like we did above, we will need a numbers table to cross-join.


An Example: Words in Product Names


We are going to use basic e-commerce data, which contains product information such as product name, brand, category, department, retail price, and SKU, as displayed in the basic table below:



What if we want to look at how frequently certain words occur in product names, correlate those words with the products’ prices, or search for words that appear together and their rates of co-occurrence? The model below leverages a numbers table to break out each individual word in the product name field — allowing for deeper insights into the data.


The Data Model


connection: "demonew_events_ecommerce"

explore: products {
join: word {
type: cross
relationship: one_to_many
sql_where: ${word.name} <> '' ;;
}

join: word2 {
from: word
type: cross
relationship: one_to_many
sql_where: ${word2.name} <> '' AND ${word.name} <> ${word2.name} ;;
}
}

explore: numbers16 {}

view: products {
sql_table_name: products ;;

dimension: id {
primary_key: yes
}

dimension: brand {}

dimension: category {}

dimension: department {}

dimension: name {}

dimension: retail_price {
type: number
}

dimension: sku {}

measure: count {
type: count
drill_fields: [id, name, brand, category, retail_price]
}

measure: average_price {
type: average
sql: ${retail_price} ;;
}
}

view: word {
extends: [numbers16]

dimension: name {
sql: SPLIT_PART(${products.name},' ', ${num}::integer) ;;
}
}

view: numbers16 {
derived_table: {
sql:
SELECT row_number() OVER () num
FROM
(SELECT 1 n UNION SELECT 2 n ) t1,
(SELECT 1 n UNION SELECT 2 n ) t2,
(SELECT 1 n UNION SELECT 2 n ) t3,
(SELECT 1 n UNION SELECT 2 n ) t4
;;
}

dimension: num {
hidden: yes
}
}

From this model, we can determine which words in product names occur most frequently, and which words are associated with those products that cost more on average:



This type of analysis shows clear results: for example, products that contain the word Women’s cost more than products that contain the word Men’s.


We can also determine the co-occurrence of words in product names:



Getting Fancy: Comparing Brands


Now that we’ve scratched the surface, what if we want to make a more complex analysis — such as comparing which words in product names occur most frequently, and which of those products cost more on average, across brands? We can build an Explore from the model above to answer this question, first by filtering the data by the brands we’d like to compare, and then by pivoting on those brands:



Conclusion


Looker and LookML make it much easier to perform tag and attribute analysis to gain actionable insights about your data.


0 replies

Be the first to reply!

Reply