Word Frequency Analysis / Word Clouds: Analytical ...

alex_dinsmoor · 05-31-2019 01:07 PM

The Objective:

Looking to analyze text fields? Want to understand top or trending keywords?

Some common use cases might be:

What are customers saying about my brand on social (Facebook, Twitter, etc.)?
How can we optimize campaigns or SEO to reflect what terms customers are searching for?
Which words are most commonly posted together (e.g. “Game” and “Thrones” as of late)

The list goes on… Let’s walk through how this can be done in Looker without taking up a data scientist’s time (or downloading R / Python ourselves).

In this example, we’ll take descriptions of products and search for top keywords and look for associations between keywords.

Starting Data & Desired End State:

To start, we need to loop through our product name field and break out each word used into its own data entry. Here’s what we have to start:

And here’s what we’d want for analytics:

So that we can create reports like the following word cloud to show trending keywords:

How to Get There:

To accomplish this, we’ll use Looker’s derived tables feature to create an analytical pattern (this example uses Snowflake, but could be done in other syntaxes as well) that unpacks our string field into individual records.

We’ll also make use of Lloyd Tabb’s SQL generate series SQL pattern here to generate a number series that we’ll leverage to simulate a loop that iterates through every word in a string field.

select id, brand,
SPLIT_PART(SPLIT_PART(tablename.description, ' ', numbers.n+1), ' ', -1) as parsed_description
from (
SELECT
  p0.n
  + p1.n*2
  + p2.n * POWER(2,2)
  + p3.n * POWER(2,3)
  + p4.n * POWER(2,4)
  + p5.n * POWER(2,5)
  + p6.n * POWER(2,6)
  + p7.n * POWER(2,7)
  as n
FROM
  (SELECT 0 as n UNION SELECT 1) p0,
  (SELECT 0 as n UNION SELECT 1) p1,
  (SELECT 0 as n UNION SELECT 1) p2,
  (SELECT 0 as n UNION SELECT 1) p3,
  (SELECT 0 as n UNION SELECT 1) p4,
  (SELECT 0 as n UNION SELECT 1) p5,
  (SELECT 0 as n UNION SELECT 1) p6,
  (SELECT 0 as n UNION SELECT 1) p7
) as numbers
INNER JOIN (
  SELECT id, brand, name as description
  FROM public.products
) as tablename
on LENGTH(tablename.description) - LENGTH(REPLACE(tablename.description, ' ', '')) >= numbers.n
order by id, n

Let’s review what’s happening here:

The “numbers” nested table is generating a number series
The “tablename” nested table is selecting the field we’d like to parse and the join key (id in this example) to relate it back to other tables, as needed, for later analysis
The inequality join is telling Snowflake to join the number series to a given description when a number (n) is <= the number of distinct words in that description
The SPLIT_PART function then breaks out each individual word into its own data record

How to Use This Pattern:

Consider underlying data volume and DB performance
Limit to recent events to spot trending topics or keywords (e.g. product X mentioned 50% today than yesterday, with over Y people talking about it)
Use Looker’s Scheduling & Alerting features to proactively monitor and spot trends
Pair with Looker’s Affinity Analysis Block to analyze associations between words (e.g. are customers who mention topic A also more likely to mention topic B?)

alex_dinsmoor

Same Pattern for MySQL

select id, SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.title, ' ', numbers.n+1), ' ', -1) as parsed_title
from (
  SELECT 
    p0.n 
    + p1.n*2 
    + p2.n * POWER(2,2) 
    + p3.n * POWER(2,3)
    + p4.n * POWER(2,4)
    + p5.n * POWER(2,5)
    + p6.n * POWER(2,6)
    + p7.n * POWER(2,7) 
    as n
  FROM 
    (SELECT 0 as n UNION SELECT 1) p0,
    (SELECT 0 as n UNION SELECT 1) p1,
    (SELECT 0 as n UNION SELECT 1) p2,
    (SELECT 0 as n UNION SELECT 1) p3,
    (SELECT 0 as n UNION SELECT 1) p4,
    (SELECT 0 as n UNION SELECT 1) p5,
    (SELECT 0 as n UNION SELECT 1) p6,
    (SELECT 0 as n UNION SELECT 1) p7
) numbers 
INNER JOIN (
  select id, title
  from `schema`.tablename 
  limit 1000
 ) as tablename
on CHAR_LENGTH(tablename.title) - CHAR_LENGTH(REPLACE(tablename.title, ' ', '')) >= numbers.n
order by id, n

guptakash_ops

Same Pattern for MySQL

select id, SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.title, ' ', numbers.n+1), ' ', -1) as parsed_title

When trying this using substring_index function in redshift it returns an error :
ERROR: function substring_index(character varying, "unknown", integer) does not exist

I am using this:
select id, SUBSTRING_INDEX(SUBSTRING_INDEX(cast(tablename.title as text), ' ', cast(numbers.n+1 as integer)), ' ', -1) as parsed_title

Can someone help understand what I am missing here?

fernando_marque

Same Pattern for MySQL

select id, SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.title, ' ', numbers.n+1), ' ', -1) as parsed_title

When trying this using substring_index function in redshift it returns an error :
ERROR: function substring_index(character varying, "unknown", integer) does not exist

I am using this:
select id, SUBSTRING_INDEX(SUBSTRING_INDEX(cast(tablename.title as text), ' ', cast(numbers.n+1 as integer)), ' ', -1) as parsed_title

Can someone help understand what I am missing here?

I believe that Redshift does not support SUBSTRING_INDEX. However, it is possible to use:

SPLIT_PART(tablename.title, ' ', numbers.n)

tkraupper

Awesome. Thanks!

Word Frequency Analysis / Word Clouds: Analytical Pattern