Tags and Attributes
Objects often have lists of attributes associated with them. Stack Overflow posts are tagged with the subjects they are about. Github issues are labeled with product areas they affect. Users may be tagged with different attributes over time to track the campaigns they were contacted by. Even words in a blog or news headline can be used as tags for an article or a post.
With tag data, analysts might want to ask questions such as:
- "What are the most common tags in my data?"
- "How much of my data is tagged a specific way?"
- "Which tags appear most frequently together?"
- "Is there any trend in price/percentage/views/purchases/etc. between the tags that occur most frequently?"
These questions can be difficult to answer with SQL. In this article, we will address two ways to model a pattern in LookML to make finding insights about tagged data much easier.
Two Implementations
Many SQL databases have an array type that can be used to store an array of strings in a particular data record. (An array is a variable that contains a list of values.) Postgres, Presto, Snowflake, BigQuery, and Athena all support arrays. They will all use the first implementation method outlined below.
Note: Redshift and MySQL don't support arrays and will use the technique in the second implementation, described below.
Implementation #1: Stack Overflow Data and SPLIT()
We will start with a very simple model of Stack Overflow data. Stack Overflow is a place where developers can ask questions and get answers. Posts generally contain the questions, and each post is tagged with subjects. Subject tags allow experts to easily find and answer questions that fall under their areas of expertise.
Here is a basic Stack Overflow data table:
Looking at these records, we can see that the tags are contained in a single string. For example, these are the tags for a question that reads "How to build EclipseLink":
java | maven | jpa | eclipselink
Analysts might want to ask questions of this data, such as "What are the most common tags in articles that are also tagged with java
?" and "How many articles are tagged with both java
and maven
?" The LookML model below will help us develop a model to answer these questions.
The Model
The stackoverflow_posts
Explore joins together the stackoverflow_posts
view, which contains post data, and the tag
view, which contains tag data.
In order to analyze tag co-occurrence and answer the questions above, the tag
view needs to be self-joined. The SPLIT()
function in the sql
join parameter makes this possible by breaking apart the array tags.
Note: Best practice is to divide views into separate view files and place Explores into a model file.
explore: stackoverflow_posts { join: tag { sql: LEFT JOIN UNNEST(SPLIT(${stackoverflow_posts.tags},'|')) as tag ;; relationship: one_to_many } join: tag2 { from: tag sql: LEFT JOIN UNNEST(SPLIT(${stackoverflow_posts.tags},'|')) as tag2 ;; relationship: one_to_many sql_where: ${tag.name} <> ${tag2.name} ;; } } view: stackoverflow_posts { sql_table_name: bigquery-public-data.stackoverflow.stackoverflow_posts ;; dimension: id { primary_key: yes } dimension: title {} dimension: tags {} dimension: owner_display_name {} dimension: score { type: number } dimension_group: creation { type: creation_time sql: ${TABLE}.creation_date ;; } measure: count { type: count drill_fields: [id, creation_time, title, tags, score] } measure: average_score { type: average sql: ${score} ;; } } view: tag { dimension: name { sql: ${TABLE} ;; } }
With this model, we can effectively and easily answer our questions in Looker:
"What are the most common tags in articles that are also tagged with java
?"
"How many articles are tagged with both java
and maven
?"
Getting Fancy: Comparing SQL Dialects
We can use the model above to get even deeper insights into the Stack Overflow data — such as determining which SQL dialects most often appear together in posts. We can build an Explore to answer this question by filtering multiple tags for a set of SQL dialects and then pivoting. We can also drill on the count measure for additional granularity:
The SQL Behind the Scenes
Let's take a deeper dive into what's happening behind the model. Take the basic Stack Overflow table from the first example:
The generated SQL for the basic Stack Overflow data table is fairly simple:
SELECT stackoverflow_posts.id AS stackoverflow_posts_id, stackoverflow_posts.tags AS stackoverflow_posts_tags, stackoverflow_posts.title AS stackoverflow_posts_title FROM `bigquery-public-data.stackoverflow.stackoverflow_posts` AS stackoverflow_posts WHERE ((stackoverflow_posts.tags IS NOT NULL AND LENGTH(stackoverflow_posts.tags) <> 0 )) ORDER BY 1 DESC LIMIT 500
Arrays, SPLIT
, and UNNEST
An array is a variable that contains a list of values.
Different dialects have different names for functions that convert strings into arrays. Snowflake and BigQuery have SPLIT
, and Postgres has string_to_array
. All of these functions will parse a string and return an array. For example, in BigQuery and Snowflake:
SELECT SPLIT('java|maven|jpa|eclipselink','|')
returns
['java','maven','jpa','eclipselink']
Once the data is in an array, we can turn it into a table function to fan out the table and join the data back in:
SELECT stackoverflow_posts.id AS stackoverflow_posts_id, tag AS tag_name, stackoverflow_posts.tags AS stackoverflow_posts_tags, stackoverflow_posts.title AS stackoverflow_posts_title FROM `bigquery-public-data.stackoverflow.stackoverflow_posts` AS stackoverflow_posts LEFT JOIN UNNEST(SPLIT(stackoverflow_posts.tags,'|')) as tag WHERE ((stackoverflow_posts.tags IS NOT NULL AND LENGTH(stackoverflow_posts.tags) <> 0 )) GROUP BY 1,2,3,4 ORDER BY 1 DESC LIMIT 500
Notice that this is identical to the first Stack Overflow SQL except for two lines:
tag AS tag_name
and
LEFT JOIN UNNEST(SPLIT(stackoverflow_posts.tags,'|')) as tag
The first change is simply a dimension reference to tag
. The interesting stuff happens in the second line. The SPLIT
function turns the string field tags
into an array of tag names. UNNEST
joins the array as if it were a table. The effect is to repeat each row for each element of the array. If we run this query, we see the same data as before, except each article is repeated for each tag:
Implementation #2: Redshift, MySQL, and Other SQL Dialects That Don't Support Arrays
Arrays can be simulated using a numbers table with a SPLIT_PART()
function. SPLIT_PART()
functions behave similarly to the SPLIT
functions in SQL dialects that support arrays, except that they only return a single string value (generally the Nth one). In order to fan out the results like we did above, we will need a numbers table to cross-join.
An Example: Words in Product Names
We are going to use basic e-commerce data, which contains product information such as product name, brand, category, department, retail price, and SKU, as displayed in the basic table below:
What if we want to look at how frequently certain words occur in product names, correlate those words with the products' prices, or search for words that appear together and their rates of co-occurrence? The model below leverages a numbers table to break out each individual word in the product name field — allowing for deeper insights into the data.
The Data Model
connection: "demonew_events_ecommerce" explore: products { join: word { type: cross relationship: one_to_many sql_where: ${word.name} <> '' ;; } join: word2 { from: word type: cross relationship: one_to_many sql_where: ${word2.name} <> '' AND ${word.name} <> ${word2.name} ;; } } explore: numbers16 {} view: products { sql_table_name: products ;; dimension: id { primary_key: yes } dimension: brand {} dimension: category {} dimension: department {} dimension: name {} dimension: retail_price { type: number } dimension: sku {} measure: count { type: count drill_fields: [id, name, brand, category, retail_price] } measure: average_price { type: average sql: ${retail_price} ;; } } view: word { extends: [numbers16] dimension: name { sql: SPLIT_PART(${products.name},' ', ${num}::integer) ;; } } view: numbers16 { derived_table: { sql: SELECT row_number() OVER () num FROM (SELECT 1 n UNION SELECT 2 n ) t1, (SELECT 1 n UNION SELECT 2 n ) t2, (SELECT 1 n UNION SELECT 2 n ) t3, (SELECT 1 n UNION SELECT 2 n ) t4 ;; } dimension: num { hidden: yes } }
From this model, we can determine which words in product names occur most frequently, and which words are associated with those products that cost more on average:
This type of analysis shows clear results: for example, products that contain the word Women's cost more than products that contain the word Men's.
We can also determine the co-occurrence of words in product names:
Getting Fancy: Comparing Brands
Now that we've scratched the surface, what if we want to make a more complex analysis — such as comparing which words in product names occur most frequently, and which of those products cost more on average, across brands? We can build an Explore from the model above to answer this question, first by filtering the data by the brands we'd like to compare, and then by pivoting on those brands:
Conclusion
Looker and LookML make it much easier to perform tag and attribute analysis to gain actionable insights about your data.