Introduction
Most SQL dialects don't support robust table-generating functions. There are a few common use cases for table-generating functions, such as creating a sequence of numbers or dates, un-nesting a JSON object or array into rows, or splitting a string on a delimiter into rows.
In this article, we'll show an example for the last use case: Taking a string containing n distinct comma-separated values and splitting it into n unique rows.
Setup
Suppose we have a table of products containing id
, name
, and tags
:
products
id | name | tags |
---|---|---|
123 | Scrunch Cloth Pants Set | Pants, Throwback, 80s |
1287 | Le Suit Citrus Breeze Skirt Suit | Skirts, Dresses, Business Casual |
486 | MaxStudio Plaid Romper | Jumpsuits, Casual, Rompers |
Our goal is to un-nest the tags to create a table to map products to tags. We will assume that a table of numbers already exists in the database, though this can be created using this pattern.
Approach
Two functions are useful here:
- The
regexp_count
function, which tells us how many times a particular pattern is found in our string. - The
split_part
function, which takes a string, splits it on some delimiter, and returns the first, second, ... , nth value specified from the split string.
NOTE: These functions are not available in all dialects.
Here's what we can do:
- Use
regexp_count
to determine how many instances of our delimiter (", ") are found for each string. - Add one to that value, telling us how many values are separated by our delimiter.
- Join in our numbers table on an inequality so that we fan out our tagged products n times, where n is the value returned from
regexp_count
for each product. - Use the
split_part
function and our sequence of numbers to get the 1st, 2nd, nth tags for each product.
SQL
Here's an example using Amazon Redshift. Make sure to replace view_or_derived_table_name
in the JOIN
with the name of your view or derived table:
In Redshift, the third argument in a split_part
can only be an integer that is greater than 0. Make sure to check your database dialect for any differences.
SELECT row_number() OVER(order by 1) AS product_tag_id , products.id as product_id , split_part(products.tags, ', ', numbers.num) AS tag_name FROM products JOIN ${view_or_derived_table_name.SQL_TABLE_NAME} as numbers ON numbers.num <= regexp_count(products.tags, ',\\s') + 1 WHERE numbers.num > 0
product_tag_id | product_id | tag_name |
---|---|---|
1 | 123 | Throwback |
2 | 123 | Pants |
3 | 123 | 80s |
4 | 1287 | Skirts |
5 | 1287 | Dresses |
6 | 1287 | Business Casual |
7 | 486 | Casual |
8 | 486 | Jumpsuits |
9 | 486 | Rompers |
For MySQL, you could use SUBSTRING_INDEX
instead of split_part
:
SELECT DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(vals, ',', n.digit+1), ',', -1) val FROM (select "Skirts, Dresses, Business Casual" as vals) tt1 INNER JOIN (SELECT 0 digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6) n ON LENGTH(REPLACE(vals, ',' , '')) <= LENGTH(vals)-n.digit;
That's it!