This content, written by Lloyd Tabb, was initially posted in Looker Blog on Jan 8, 2016. The content is subject to limited support.
I love Pandora. Type in an artist's name and it starts playing similar stuff. Pandora's recommendation engine feels like magic.
BigQuery provides a sample data set of some playlist data (Google's says the original data set was created by , awesome data!). The data is very simple: a single row for each track in the playlist. Track data contains playlist_id, artist (id and name), album (id and title) and track (id and title).
Using this simple data, we built a recommendation engine in LookML that takes an artist, finds the most related artists and then recommends a playlist, all in about 300 lines of LookML.
First, go ahead, play with it
Change the filter to your favorite artist and based on this data, we'll recommend some songs. Click on the artist or song title to play music.
Step 1: Build out a simple LookML model
The table that we're working with is structured like this:
Table: playlists
To build our LookML views and model, we need to build out a LookML dimension for each field in the table. Then we label each 'object' (things that would be in their own table in a de-normalized schema). For example, tracks.data.artist.id becomes 'artist_id'.
dimension: artist_id {
view_label: "Artist"
type: int
sql: ${TABLE}.tracks.data.artist.id ;;
fanout_on: "tracks.data"
}
For each object, we also build a count. To count artists, we want to count the distinct values of artist_id. When drilling into an artist count, we want the artist's id, name and the other counts.
measure: artist_count {
type: count_distinct
sql: ${artist_id} ;;
drill_fields: [
artist_id,
artist_name,
count,
track_count,
track_instance_count,
album_count
]
}
For grins, we build some linkage of artists to external sites so we can see their Twitter, Facebook, Wikipedia and YouTube pages, if they have them.
dimension: artist_name {
link: {
label: "YouTube"
url: "http://www.google.com/search?q=site:youtube.com+&btnI"
icon_url: "http://youtube.com/favicon.ico"
}
link: {
label: "Wikipedia"
url: "http://www.google.com/search?q=site:wikipedia.com+&btnI"
icon_url: "https://en.wikipedia.org/static/favicon/wikipedia.ico"
}
link: {
label: "Twitter"
url: "http://www.google.com/search?q=site:twitter.com+&btnI"
icon_url: "https://abs.twimg.com/favicons/favicon.ico"
}
link: {
label: "Facebook"
url: "http://www.google.com/search?q=site:facebook.com+&btnI"
icon_url: "https://static.xx.fbcdn.net/rsrc.php/yl/r/H3nktOa7ZMg.ico"
}
}
Learn more about the data set
Looks like there are about 500K playlists, with a total of about 12M tracks. There are 92K-ish different artists, with about 900K individual songs. Click on Explore Data then click on any of the numbers to drill into the data further.
Playlists Album Count | Playlists Artist Count | Playlists Count | Playlists Track Count | Playlists Track Instance Count |
---|---|---|---|---|
213,310 | 92,630 | 504,169 | 901,642 | 12,138,977 |
Who is the top artist (in this data set)?
Of course, it depends on how you count it. Which artist has the most instances of songs on playlists? Looks like Linkin Park. The really fun part of this is that after clicking Explore Data, clicking any number takes you to the album, track or artist.
Top artists in data set
Step 2: Build out facts about what is popular
Ranking is a great tool for building up knowledge about particular fields in a data set. The "Top 40" in a given week has long been a way of rating music.
We are going to rank tracks (songs) in their overall popularity (against all songs) and their popularity within an artist. We'd like to end up with a table like:
track_id | artist_id | overall_rank | artist_rank |
---|
We can do this with a relatively simple 2-level query. The first level groups by track_id and artist_id and counts the number of playlists the song appears on. The second level (using ), calculates the overall rank of the song and the rank within (partitioned by) the artist.
SELECT
track_id
, artist_id
, row_number() OVER( PARTITION BY artist_id ORDER BY num_plays DESC) as artist_rank
, row_number() OVER( ORDER BY num_plays DESC) as overal_rank
FROM (
SELECT
playlists.tracks.data.id AS track_id,
playlists.tracks.data.artist.id AS artist_id,
COUNT(*) as num_plays
FROM (SELECT * FROM FLATTEN([bigquery-samples:playlists.playlists]
,tracks.data)) AS playlists
GROUP EACH BY 1,2
)
We build this into a derived table and add a couple of dimensions ():
dimension: rank_within_artist {
view_label: "Track"
type: int
sql: ${TABLE}.artist_rank ;;
}
dimension: overall_rank {
view_label: "Track"
type: int
sql: ${TABLE}.overal_rank ;;
}
With these new rankings we can now see the top 10 songs in our data set.
Next, look at the ranking of the songs for each artist. We'd like more popular songs to have lower numbers. We've already computed rank_within_artist, let's look at Frank Sinatra's and Joan Baez's top three songs. We notice that there is a data problem -- there are two ids in the data for "Frank Sinatra" -- but we're just going to ignore the problem.
Change the filter to see a different artist's top songs.
Step 3: Find artists that appear together.
We're now ready to build the core of our recommendation engine. SQL's cross join (cross product) will allow us to build a mapping table that will ultimately look like this:
artist_id | artist_name | artist_id2 | artist_name2 | num_playlists |
---|
To get here, we need to build an intermediate table, playlist_artist. There is a record for every artist that appears on a playlist. This intermediate table will look like this:
artist_id | artist_name | playlist_id |
---|
Here's the way we write this in LookML:
# Just so we can test that it works
explore: playlist_artist {
hidden: yes
}
view: playlist_artist {
derived_table: {
sql: SELECT
playlists.tracks.data.artist.id AS artist_id,
playlists.tracks.data.artist.name AS artist_name,
playlists.id AS playlist_id
FROM (SELECT * FROM FLATTEN([bigquery-samples:playlists.playlists]
,tracks.data)) AS playlists
WHERE playlists.tracks.data.artist.id IS NOT NULL
GROUP BY 1,2,3
;;
}
dimension: artist_id {}
dimension: artist_name {}
dimension: playlist_id {}
}
Next we join playlist_artist with itself to find pairings of artists on playlists and count the number of times the parings occur. For each pair of artists, we then create a closeness ranking, by again, ranking one artist, with another based on the number of times playlists include both artists. We'll ultimately have a table that looks like this:
a.artist_id | a.artist_name | a.playlist_id = b.playlist_id | b.artist_id | b.artist_name | count(*) |
---|
And here's how we get there in LookML:
explore: artist_artist {}
view: artist_artist {
extends: [artist]
derived_table: {
sql_trigger_value: SELECT COUNT(*) FROM [bigquery-samples:playlists.playlists] ;;
sql: SELECT
*,
row_number() OVER (partition by artist2_id order by num_playlists DESC) as closeness_rank
FROM (
SELECT
a.artist_id as artist_id,
a.artist_name as artist_name,
b.artist_id as artist2_id,
b.artist_name as artist2_name,
COUNT(*) as num_playlists
FROM ${playlist_artist.SQL_TABLE_NAME} AS a
JOIN EACH ${playlist_artist.SQL_TABLE_NAME} as b
ON a.playlist_id = b.playlist_id
WHERE a.artist_id <> b.artist_id
GROUP EACH BY 1,2,3,4
)
;;
}
# Inherited from 'view: artist'
dimension: artist_id {}
dimension: artist_name {}
dimension: artist2_id {}
dimension: artist2_name {}
dimension: num_playlists {
type: int
}
dimension: closeness_rank {
type: int
}
measure: total_playlists {
type: sum
sql: ${num_playlists} ;;
}
measure: count {
type: count
drill_fields: [artist_id, artist_name, artist2_id, artist2_name, num_playlists]
}
}
Now, for any given artist we can find the most closely related artists. Put another artist into the filter to find the other artists most closely related to them.
Step 5: Mission accomplished
For any artist, we now know their most popular songs and which artists are most closely related to them.
To recommend a playlist, we simply find the most closely related 10 artists and include each artist's top 3 tracks.
The code
This is the complete code to the data model.
playlist.model.lookml
connection: "bigquery_publicdata"
include: "*.view"
case_sensitive: no
explore: playlists {
hidden: yes
join: playlist_facts {
sql_on: ${playlists.playlist_id} = ${playlist_facts.playlist_id} ;;
relationship: one_to_one
view_label: "Playlists"
}
join: track_rank {
sql_on: ${playlists.track_id} = ${track_rank.track_id} ;;
relationship: one_to_one
type: left_outer_each
view_label: "Track"
fields: [track_id, overall_rank, rank_within_artist]
}
}
explore: recommender {
view_name: artist_artist
always_filter: {
filters: {
field: track_rank.rank_within_artist
value: "<= 3"
}
}
join: track_rank {
sql_on: ${artist_artist.artist_id} = ${track_rank.artist_id} ;;
relationship: one_to_many
type: left_outer_each
}
}
playlists.view.lookml
# Basic playlist view
view: playlists {
extends: [artist, track]
sql_table_name: [bigquery-samples:playlists.playlists]
;;
measure: count {
type: count_distinct
sql: ${playlist_id} ;;
drill_fields: [playlist_id]
}
dimension: rating {
type: int
sql: ${TABLE}.rating ;;
}
dimension: playlist_id {
type: int
sql: ${TABLE}.id ;;
}
dimension: artist_id {
view_label: "Artist ID"
type: int
sql: ${TABLE}.tracks.data.artist.id ;;
fanout_on: "tracks.data"
}
dimension: artist_name {
view_label: "Artist"
type: string
sql: ${TABLE}.tracks.data.artist.name ;;
fanout_on: "tracks.data"
}
measure: artist_count {
type: count_distinct
sql: ${artist_id} ;;
drill_fields: [
artist_id,
artist_name,
count,
track_count,
track_instance_count,
album_count
]
}
dimension: album_id {
view_label: "Album"
type: int
sql: ${TABLE}.tracks.data.album.id ;;
fanout_on: "tracks.data"
}
dimension: album_title {
view_label: "Album"
type: string
sql: ${TABLE}.tracks.data.album.title ;;
fanout_on: "tracks.data"
link: {
label: "iTunes"
url: "http://www.google.com/search?q=itunes.com++&btnI"
}
}
measure: album_count {
type: count_distinct
sql: ${album_id} ;;
drill_fields: [album_id, album_title, count, track_count, artist_count]
}
dimension: track_title {
view_label: "Track"
type: string
sql: ${TABLE}.tracks.data.title ;;
fanout_on: "tracks.data"
}
dimension: track_id {
view_label: "Track"
type: int
sql: ${TABLE}.tracks.data.id ;;
fanout_on: "tracks.data"
}
measure: track_count {
type: count_distinct
sql: ${track_id} ;;
drill_fields: [track_id, track_title, count]
}
measure: track_instance_count {
type: count_distinct
sql: CONCAT(CAST(${track_id} AS STRING),CAST(${playlist_id} AS STRING)) ;;
drill_fields: [detail*]
}
set: detail {
fields: [playlist_id, artist_name, album_title, track_title]
}
}
playlist_facts.view.lookml
# Facts about playlists, number of different artists and number of tracks on each playlist
# Used to filter out crappy playlists.
view: playlist_facts {
derived_table: {
sql_trigger_value: SELECT COUNT(*) FROM [bigquery-samples:playlists.playlists] ;;
max_billing_tier: 3
sql: SELECT
id as playlist_id
, COUNT(DISTINCT tracks.data.artist.id) as num_artists
, COUNT(DISTINCT tracks.data.id) as num_tracks
FROM FLATTEN([bigquery-samples:playlists.playlists],tracks.data)
GROUP BY 1
HAVING num_artists > 0
;;
}
artist.view.lookml
# Base definition for artist
# Declares external links
view: artist {
dimension: artist_id {}
dimension: artist_name {
link: {
label: "YouTube"
url: "http://www.google.com/search?q=site:youtube.com+&btnI"
icon_url: "http://youtube.com/favicon.ico"
}
link: {
label: "Wikipedia"
url: "http://www.google.com/search?q=site:wikipedia.com+&btnI"
icon_url: "https://en.wikipedia.org/static/favicon/wikipedia.ico"
}
link: {
label: "Twitter"
url: "http://www.google.com/search?q=site:twitter.com+&btnI"
icon_url: "https://abs.twimg.com/favicons/favicon.ico"
}
link: {
label: "Facebook"
url: "http://www.google.com/search?q=site:facebook.com+&btnI"
icon_url: "https://static.xx.fbcdn.net/rsrc.php/yl/r/H3nktOa7ZMg.ico"
}
}
}
artist_suggest.view.lookml
# Simplifed view of the top 5000 artists so we can make resonable suggestions for artists.
view: artist_suggest {
derived_table: {
sql_trigger_value: SELECT COUNT(*) FROM ${playlist_artist.SQL_TABLE_NAME} ;;
sql: SELECT
artist_name
, COUNT(*)
FROM ${playlist_artist.SQL_TABLE_NAME}
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5000
;;
}
dimension: artist_name {}
}
playlist_artist.view.lookml
# for debugging.
explore: playlist_artist {
hidden: yes
}
# Simple table of playlists artist appears on. One row for every artist/playlist combination
view: playlist_artist {
derived_table: {
sql_trigger_value: SELECT COUNT(*) FROM [bigquery-samples:playlists.playlists] ;;
sql: SELECT
playlists.tracks.data.artist.id AS artist_id,
playlists.tracks.data.artist.name AS artist_name,
playlists.id AS playlist_id
FROM (SELECT * FROM FLATTEN([bigquery-samples:playlists.playlists]
,tracks.data)) AS playlists
JOIN ${playlist_facts.SQL_TABLE_NAME} AS playlist_facts
ON playlists.id = playlist_facts.playlist_id
WHERE playlists.tracks.data.artist.id IS NOT NULL
AND playlist_facts.num_artists < 10
GROUP EACH BY 1,2,3
;;
}
dimension: artist_id {}
dimension: artist_name {}
dimension: playlist_id {}
}
artist_artist.view.lookml
for debugging.
explore: artist_artist {
hidden: yes
}
# The core of the recommendaiton engine. Cross joins playlist_artist to build a list of
# related artists.
include: "*.view.lkml"
view: artist_artist {
extends: [artist]
derived_table: {
sql_trigger_value: SELECT COUNT(*) FROM [bigquery-samples:playlists.playlists] ;;
sql: SELECT
*,
row_number() OVER (partition by artist2_id order by num_playlists DESC) as closeness_rank
FROM (
SELECT
a.artist_id as artist_id,
a.artist_name as artist_name,
b.artist_id as artist2_id,
b.artist_name as artist2_name,
COUNT(*) as num_playlists
FROM ${playlist_artist.SQL_TABLE_NAME} AS a
JOIN EACH ${playlist_artist.SQL_TABLE_NAME} as b
ON a.playlist_id = b.playlist_id
WHERE a.artist_id <> b.artist_id
GROUP EACH BY 1,2,3,4
)
;;
}
# Inherited from 'view: artist'
dimension: artist_id {}
dimension: artist_name {}
dimension: artist2_id {}
dimension: artist2_name {}
dimension: num_playlists {
type: number
}
dimension: closeness_rank {
type: number
}
measure: total_playlists {
type: sum
sql: ${num_playlists} ;;
}
measure: count {
type: count
drill_fields: [artist_id, artist_name, artist2_id, artist2_name, num_playlists]
}
}
track_rank.view.lookml
# for debugging
explore: track_rank {
hidden: yes
}
# Rank tracks both overall and within a given artist.
include: "*.view.lkml"
view: track_rank {
extends: [track]
derived_table: {
sql_trigger_value: SELECT COUNT(*) FROM [bigquery-samples:playlists.playlists] ;;
max_billing_tier: 3
sql: SELECT
track_id
, track_title
, artist_id
, artist_name
, row_number() OVER( PARTITION BY artist_id ORDER BY num_plays DESC) as artist_rank
, row_number() OVER( ORDER BY num_plays DESC) as overall_rank
FROM (
SELECT
playlists.tracks.data.id AS track_id,
playlists.tracks.data.title AS track_title,
playlists.tracks.data.artist.id AS artist_id,
playlists.tracks.data.artist.name AS artist_name,
COUNT(*) as num_plays
FROM (SELECT * FROM FLATTEN([bigquery-samples:playlists.playlists]
,tracks.data)) AS playlists
WHERE playlists.tracks.data.artist.id IS NOT NULL
AND playlists.tracks.data.title IS NOT NULL
GROUP EACH BY 1,2,3,4
)
;;
}
dimension: track_id {
primary_key: yes
hidden: yes
type: number
sql: ${TABLE}.track_id ;;
}
dimension: track_title {
sql: ${TABLE}.track_title ;;
}
dimension: artist_id {
type: number
sql: ${TABLE}.artist_id ;;
}
dimension: artist_name {
type: number
sql: ${TABLE}.artist_name ;;
}
dimension: rank_within_artist {
type: number
sql: ${TABLE}.artist_rank ;;
}
dimension: overall_rank {
view_label: "Track"
type: number
sql: ${TABLE}.overall_rank ;;
}
set: detail {
fields: [track_id, artist_id, rank_within_artist, overall_rank]
}
}