BigQuery supports nested records within tables. These nested records can be a single record or contain repeated values.
In the example below, each person has a single phone number, but may have lived in multiple cities:
Why Use Nested Records?
There are a few advantages to nested records when scanning over a distributed dataset. Most significantly, they do not require joins. This means computations can be faster and scan much less data than if you had to rejoin the extra data each time you use it.
Nested structures are essentially pre-joined tables. Because BigQuery data is stored in columns, if you do not reference the nested column, there is no added expense to the query. If you do reference the nested column, the logic is identical to a co-located join.
The other advantage is that nested structures avoid repeating data that would have to be repeated in a wide, denormalized table. In other words, for a person who has lived in five cities, a wide denormalized table would contain all of their information in five rows (one for each of the cities they have lived in). In a nested structure, the repeated information only takes one row, since the array of five cities can be contained in a single row and un-nested when needed.
The LookML
The example below is the LookML for the Explores and views we can create from the schema above. There are three views: persons
, persons_cities_lived
, and persons_phone_number
. The Explore appears identical to an Explore written with normal tables.
Note: While all of the components (views and Explore) are written in one code bock below, it is best practice to place views in individual view files, and place Explores and connection:
specification in the model file.
connection: "bigquery_publicdata_standard_sql"
explore: persons {
# Repeated nested object
join: persons_cities_lived {
view_label: "Persons: Cities Lived:"
sql: LEFT JOIN UNNEST(persons.citiesLived) as persons_cities_lived ;;
relationship: one_to_many
}
# Non repeated nested object
join: persons_phone_number {
view_label: "Persons: Phone:"
sql: LEFT JOIN UNNEST([${persons.phoneNumber}]) as persons_phone_number ;;
relationship: one_to_one
}
}
view: persons {
sql_table_name: `bigquery-samples.nested.persons_living` ;;
dimension: id {
primary_key: yes
sql: ${TABLE}.fullName ;;
}
dimension: fullName {label: "Full Name"}
dimension: kind {}
dimension: age {type:number}
dimension: gender {}
measure: average_age {
type: average
sql: ${age} ;;
drill_fields: [fullName,age]
}
measure: count {
type: count
drill_fields: [fullName, cities_lived.place_count, age]
}
dimension: citiesLived {hidden:yes}
dimension: phoneNumber {hidden:yes}
}
view: persons_phone_number {
dimension: areaCode {label: "Area Code"}
dimension: number {}
}
view: persons_cities_lived {
dimension: id {
primary_key: yes
sql: CONCAT(CAST(${persons.fullName} AS STRING),'|', CAST(${place} AS STRING)) ;;
}
dimension: place {}
dimension: numberOfYears {
label: "Number Of Years"
type: number
}
measure: place_count {
type: count
drill_fields: [place, persons.count]
}
measure: total_years {
type: sum
sql: ${numberOfYears} ;;
drill_fields: [persons.fullName, persons.age, place, numberOfYears]
}
}
We will discuss each component of this example in detail.
Views
Each nested record is written as a view. For example, the phoneNumber
view simply declares the dimensions that appear in the record:
view: persons_phone_number {
dimension: areaCode {label: "Area Code"}
dimension: number {}
}
The persons_cities_lived
view is more complex. As shown above, we define the dimensions that appear in the record (numberOfYears
and place
), but we also define some measures. The measures and drill_fields
are defined as usual, as if this data were in its own table. The only real difference is that we declare id
as a primary_key
so that aggregates are properly calculated.
view: persons_cities_lived {
dimension: id {
primary_key: yes
sql: CONCAT(CAST(${persons.fullName} AS STRING),'|', CAST(${place} AS STRING)) ;;
}
dimension: place {}
dimension: numberOfYears {
label: "Number Of Years"
type: number
}
measure: place_count {
type: count
drill_fields: [place, persons.count]
}
measure: total_years {
type: sum
sql: ${numberOfYears} ;;
drill_fields: [persons.fullName, persons.age, place, numberOfYears]
}
}
Record Declarations
In the view that contains the subrecords (in this case persons
), we need to declare the records. These will be used when creating the joins. We hide these LookML fields with the hidden:
parameter because we won't need them when exploring the data.
view: persons {
...
dimension: citiesLived {hidden:yes}
dimension: phoneNumber {hidden:yes}
...
}
Joins
Nested records in BigQuery are ARRAY
s of STRUCT
s. Instead of Joining with a sql_on:
parameter, the join relationship is built into the table. In this case, we use the sql:
join parameter so that we can use the UNNEST
operator. Other than that difference, UNNEST
ing an ARRAY
s of STRUCT
s is exactly like joining a table.
In the case of non-repeated records, we can simply use a STRUCT
; we turn that into an ARRAY
of STRUCT
s by placing it in square brackets. While this may be a little weird to look at, there seems be be no performance penalty and this keeps things clean and simple.
explore: persons {
# Repeated nested object
join: persons_cities_lived {
view_label: "Persons: Cities Lived:"
sql: LEFT JOIN UNNEST(persons.citiesLived) as persons_cities_lived ;;
relationship: one_to_many
}
# Non repeated nested object
join: persons_phone_number {
view_label: "Persons: Phone:"
sql: LEFT JOIN UNNEST([${persons.phoneNumber}]) as persons_phone_number ;;
relationship: one_to_one
}
}
Joins for Arrays Without Unique Keys for Each Row
While it is best to have identifiable natural keys in the data, or surrogate keys created in the ETL process, this is not always possible. For example, you could encounter a situation where some arrays don't have a relative unique key for the row. This is where WITH OFFSET
can come in handy in join syntax.
For example, a column representing a person might load multiple times if the person has lived in multiple cities — Chicago, Denver, San Francisco, etc. It can be difficult to create a primary key on the unnested row if a date or other identifiable natural key is not provided to distinguish the person's tenure in each city. This is where WITH OFFSET
can provide a relative row number (0,1,2,3) for each unnested row. This approach guarantees a unique key on the unnested row:
explore: persons { # Repeated nested Object join: persons_cities_lived { view_label: "Persons: Cities Lived:" sql: LEFT JOIN UNNEST(persons.citiesLived) as persons_cities_lived WITH OFFSET as person_cities_lived_offset;; relationship: one_to_many } } view: persons_cities_lived { dimension: id { primary_key: yes sql: CONCAT(CAST(${persons.fullName} AS STRING),'|', CAST(${offset} AS STRING)) ;; } dimension: offset { type: number sql: person_cities_lived_offset;; } }
Unnesting Simple Repeated Values
Nested data in BigQuery can also be simple values, such as integers or strings. To unnest arrays of simple values, we can use a similar approach as above, using the UNNEST
operator in a join.
The example below unnests a given array of integers, unresolved_skus
:
explore: impressions { join: impressions__unresolved_sku { sql: LEFT JOIN UNNEST(unresolved_skus) AS impressions__unresolved_sku ;; relationship: one_to_many } } view: impressions__unresolved_sku { dimension: sku { type: string sql: ${TABLE} ;; } }
The sql
parameter for the array of integers, unresolved_skus
, is represented as ${TABLE}
. This directly references the table of values itself, which is then unnested in the explore
.