Splitting Strings into Rows in the Absence of Table-Generating Functions

  • 12 May 2022
  • 1 reply
  • 3182 views

Userlevel 5

Looker will not be updating this content, nor guarantees that everything is up-to-date. 

Introduction


Most SQL dialects don't support robust table-generating functions. There are a few common use cases for table-generating functions, such as creating a sequence of numbers or dates, un-nesting a JSON object or array into rows, or splitting a string on a delimiter into rows.

In this article, we'll show an example for the last use case: Taking a string containing n distinct comma-separated values and splitting it into n unique rows.

Setup


Suppose we have a table of products containing id, name, and tags:
 

products

id name tags
123 Scrunch Cloth Pants Set Pants, Throwback, 80s
1287 Le Suit Citrus Breeze Skirt Suit Skirts, Dresses, Business Casual
486 MaxStudio Plaid Romper Jumpsuits, Casual, Rompers


Our goal is to un-nest the tags to create a table to map products to tags. We will assume that a table of numbers already exists in the database, though this can be created using this pattern.

Approach


Two functions are useful here:

  • The regexp_count function, which tells us how many times a particular pattern is found in our string.
  • The split_part function, which takes a string, splits it on some delimiter, and returns the first, second, ... , nth value specified from the split string.

NOTE: These functions are not available in all dialects.

Here's what we can do:

  1. Use regexp_count to determine how many instances of our delimiter (", ") are found for each string.
  2. Add one to that value, telling us how many values are separated by our delimiter.
  3. Join in our numbers table on an inequality so that we fan out our tagged products n times, where n is the value returned from regexp_count for each product.
  4. Use the split_part function and our sequence of numbers to get the 1st, 2nd, nth tags for each product.

SQL


Here's an example using Amazon Redshift. Make sure to replace view_or_derived_table_name in the JOIN with the name of your view or derived table:

In Redshift, the third argument in a split_part can only be an integer that is greater than 0. Make sure to check your database dialect for any differences.
SELECT row_number() OVER(order by 1) AS product_tag_id
, products.id as product_id
, split_part(products.tags, ', ', numbers.num) AS tag_name
FROM products
JOIN ${view_or_derived_table_name.SQL_TABLE_NAME} as numbers
ON numbers.num <= regexp_count(products.tags, ',\\s') + 1
WHERE numbers.num > 0
product_tag_id product_id tag_name
1 123 Throwback
2 123 Pants
3 123 80s
4 1287 Skirts
5 1287 Dresses
6 1287 Business Casual
7 486 Casual
8 486 Jumpsuits
9 486 Rompers


For MySQL, you could use SUBSTRING_INDEX instead of split_part:

SELECT
DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(vals, ',', n.digit+1), ',', -1) val
FROM
(select "Skirts, Dresses, Business Casual" as vals) tt1
INNER JOIN
(SELECT 0 digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6) n
ON LENGTH(REPLACE(vals, ',' , '')) <= LENGTH(vals)-n.digit;

That's it!


1 reply

i am trying to use cross apply for this transformation, but its not working could someone please help

select * from
table name
cross apply string_split(columnname, ' , ‘ )

i tried even this but not working

select *,t2.value from
table name as t1
cross apply string_split(t1.columnname, ' , ‘ ) as t2

i am facing syntax error near “apply”

Reply