May 31, 2024google-bigquery google-cloud sql

Intermediate

Life's a garden. Dig it.

Table Partition¶

One of BigQuery's key features is its ability to partition a table into small chunks of data. In many cases, table partitions can improve query performance and reduce costs.

Consider a table of sales

| sale_id|sale_date  | customer_id|product | amount|
|-------:|:----------|-----------:|:-------|------:|
|       1|2022-07-26 |           1|G       |   4.98|
|       6|NULL       |           4|U       |   2.18|
|       2|2022-07-26 |           2|I       |  19.67|
|       4|2022-07-29 |           3|J       |   4.73|
|       5|2022-07-30 |           4|W       |  10.68|
|       3|2022-07-27 |           3|I       |   7.01|

To fetch sales on 2022-07-27, BigQuery has to scan the entire sale_date column. However, if we set up the table with a daily partition on sale_date, the records will be partitioned into chunks like this

| sale_id|sale_date  | customer_id|product | amount|
|-------:|:----------|-----------:|:-------|------:|
|       1|2022-07-26 |           1|G       |   4.98|
|       2|2022-07-26 |           2|I       |  19.67|
----------------------------------------------------
|       3|2022-07-27 |           3|I       |   7.01|
----------------------------------------------------
|       4|2022-07-29 |           3|J       |   4.73|
----------------------------------------------------
|       5|2022-07-30 |           4|W       |  10.68|
----------------------------------------------------
|       6|NULL       |           4|U       |   2.18|

Now, BigQuery doesn't need to can the entire table to identify sales on a particular date.

Notes¶

A table can have zero or one partition.
Partitions can be set on sequential data types (Dates, Timestamps, or Integers).
You can make a partition with a PARTITION BY clause.
A partition table can have up to 4,000 partitions.

Table Cluster¶

Another option to improve performance on a BigQuery table is to assign it a cluster. A cluster improves performance by pre-sorting the data inside a table. For example, clustering the sales table by sale_id might order the data like this.

| sale_id|sale_date  | customer_id|product | amount|
|-------:|:----------|-----------:|:-------|------:|
|       1|2022-07-26 |           1|G       |   4.98|
|       2|2022-07-26 |           2|I       |  19.67|
|       3|2022-07-27 |           3|I       |   7.01|
|       4|2022-07-29 |           3|J       |   4.73|
|       5|2022-07-30 |           4|W       |  10.68|
|       6|NULL       |           4|U       |   2.18|

Now, fetching rows WHERE sale_id IN (1,4) will be fast and efficient.

Notes

Tables can have a partition and a cluster (even on the same field).
A cluster can be set on a collection of multiple columns.
A cluster can be defined with a CLUSTER BY clause containing a clustering_column_list.