Ben Gorman

Ben Gorman

Life's a garden. Dig it.

Table Partition

One of BigQuery's key features is its ability to partition a table into small chunks of data. In many cases, table partitions can improve query performance and reduce costs.

Consider a table of sales

| sale_id|sale_date  | customer_id|product | amount|
|-------:|:----------|-----------:|:-------|------:|
|       1|2022-07-26 |           1|G       |   4.98|
|       6|NULL       |           4|U       |   2.18|
|       2|2022-07-26 |           2|I       |  19.67|
|       4|2022-07-29 |           3|J       |   4.73|
|       5|2022-07-30 |           4|W       |  10.68|
|       3|2022-07-27 |           3|I       |   7.01|

To fetch sales on 2022-07-27, BigQuery has to scan the entire sale_date column. However, if we set up the table with a daily partition on sale_date, the records will be partitioned into chunks like this

| sale_id|sale_date  | customer_id|product | amount|
|-------:|:----------|-----------:|:-------|------:|
|       1|2022-07-26 |           1|G       |   4.98|
|       2|2022-07-26 |           2|I       |  19.67|
----------------------------------------------------
|       3|2022-07-27 |           3|I       |   7.01|
----------------------------------------------------
|       4|2022-07-29 |           3|J       |   4.73|
----------------------------------------------------
|       5|2022-07-30 |           4|W       |  10.68|
----------------------------------------------------
|       6|NULL       |           4|U       |   2.18|

Now, BigQuery doesn't need to can the entire table to identify sales on a particular date.

Notes

  • A table can have zero or one partition.
  • Partitions can be set on sequential data types (Dates, Timestamps, or Integers).
  • You can make a partition with a PARTITION BY clause.
  • A partition table can have up to 4,000 partitions.

Table Cluster

Another option to improve performance on a BigQuery table is to assign it a cluster. A cluster improves performance by pre-sorting the data inside a table. For example, clustering the sales table by sale_id might order the data like this.

| sale_id|sale_date  | customer_id|product | amount|
|-------:|:----------|-----------:|:-------|------:|
|       1|2022-07-26 |           1|G       |   4.98|
|       2|2022-07-26 |           2|I       |  19.67|
|       3|2022-07-27 |           3|I       |   7.01|
|       4|2022-07-29 |           3|J       |   4.73|
|       5|2022-07-30 |           4|W       |  10.68|
|       6|NULL       |           4|U       |   2.18|

Now, fetching rows WHERE sale_id IN (1,4) will be fast and efficient.

Notes

  • Tables can have a partition and a cluster (even on the same field).
  • A cluster can be set on a collection of multiple columns.
  • A cluster can be defined with a CLUSTER BY clause containing a clustering_column_list.