BigQuery is an analytics engine optimized to crunch pre-joined (or nested) data. Sub-relations make sense in analytical scenarios because we don't have to deal with joins over large datasets. Just imagine daily year-over-year comparisons over the last three years, aggregating terabytes of data, with joins adding another layer of complexity on top.
A sub-relation, or sub-table, is usually implemented as an array of structs. The array, a list-like data type, provides the rows; the struct, similar to a map or dictionary, provides the columns. The sub-schema is consistent throughout the table, in contrast to JSON types, which can change their schema from row to row.
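As a rough illustration of the array-of-structs idea, here is a minimal Python sketch. The field names (`event_name`, `items`, `item_id`, `price`) are assumptions modeled loosely on a GA4-like export, not an exact schema.

```python
# Hypothetical GA4-like rows: one dict per event, with a nested "items" sub-table.
events = [
    {
        "event_name": "purchase",
        "items": [  # the array provides the rows of the sub-table
            {"item_id": "sku_1", "price": 10.0},  # each struct provides the columns
            {"item_id": "sku_2", "price": 5.5},
        ],
    },
    {"event_name": "page_view", "items": []},  # an event without items
]

# Collect the column layout of every nested row across all events.
sub_schemas = {tuple(item.keys()) for event in events for item in event["items"]}
print(sub_schemas)  # a single entry: every nested row shares the same columns
```

In this sketch the set contains exactly one column layout, which is the property that distinguishes a typed sub-table from free-form JSON.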
The only other engine going down this route of nested data seems to be AWS Redshift Spectrum. Yet if we want to use Google Analytics (GA) data in another system, we almost always have to un-nest the data into flat tables, because functions to aggregate or modify arrays of structs are quite limited there. Most analytical database engines seem to optimize for design principles (normal forms) that were made for transactional use cases, even though storage formats like Parquet are perfectly capable of representing nested data. So for all the Trinos, DuckDBs, SparkSQLs, and ClickHouses out there: let's create some sad flat tables.
The most important thing to understand is that we don't want to merely "flatten" the table. A row in a table is supposed to have meaning. One row in GA4 data represents one event. If we just cross join with every array we see, we'll be in trouble. Even if we only left join the items array, plenty of events don't have items, so what do we end up with? A weird scope mix: a row can mean an event, or it can mean an item in an event. We don't want that.
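To make the scope problem concrete, here is a small Python sketch that mimics a left join of events with their items array. The sample rows, field names, and the helper `left_join_items` are assumptions for illustration only.

```python
# Hypothetical events; only the purchase carries items.
events = [
    {"event_id": 1, "event_name": "page_view", "items": []},
    {"event_id": 2, "event_name": "purchase",
     "items": [{"item_id": "sku_1"}, {"item_id": "sku_2"}]},
]

def left_join_items(events):
    """Mimic a LEFT JOIN of each event with its items array."""
    rows = []
    for event in events:
        # An empty array still yields one row, with a missing item (like NULL).
        for item in event["items"] or [None]:
            rows.append({
                "event_id": event["event_id"],
                "event_name": event["event_name"],
                "item_id": item["item_id"] if item else None,
            })
    return rows

flat = left_join_items(events)

# Two events went in, three rows come out: counting rows no longer counts events.
print(len(events), len(flat))
# Rows with item_id=None mean "an event"; the others mean "an item in an event".
print([row["item_id"] for row in flat])
```

Any count or sum over `flat` now has to know which kind of row it is looking at.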
What we really want is to build multiple tables, each with its own scope: events, items, custom parameters, item parameters, and user properties: