Database Table Partitioning - Why?
There are not two common reasons why RDBMS DBA’s partition database tables, but five. The two rote answer reasons usually cited for implementing (horizontal) partitioning of database tables relate to slice and dice (or using the marketing phrase from Oracle when discussing partitioning, “Divide and Conquer”) for performance reasons. The second rote reason is partition management.
Within this small blog article, I will briefly discuss these two concepts, but also point out the other three reasons why database partitioning is used, at least from my perspective as a database developer in the UK/DE/CH.
For readers unfamiliar with database partitioning concepts, a good primer to Oracle database partitioning may be found on DBAzine. Partitioning concepts for PostgreSQL are on the other hand are described well in the reference PostgreSQL documentation. The Microsoft documentation describing SQL Server partitioning is also very readable.
The five reasons are:
Considering database partitioning from solely a performance perspective, there are two considerations; Slice and Dice/Divide and Conquer aka Faster Queries, and removal of or judicious choice of database Indexes. The underlying theory is that in, for example,
WHERE predicateContainingPartitionedColumn in (something1, something2, something3)
that the database query optimizer* may choose to search fewer than the total number of table partitions available thereby reducing the work and time necessary to satisfy the query result set. Note I am specifically referring to SQL queries executed in the absence of database indexes. It is of course possible to both index and partition tables, but maintaining database indexes can be a resource expensive task degrading system performance during the ingestion of large amounts of data (eg. repeated ETL loading bursts or the collation and/or calculation of large amounts of strip data). For this reason, it is common for database partitioned table implementations to be devoid of a significant number of indexes on partitioned tables to improve (sic) loading performance. In data-warehouses, it is also not uncommon for these tables to lack foreign key references, not only for performance reasons, but partition management reasons too vide infra).
* some DBMS implementations do not require the optimizer to choose a specific table partition, but expose it under developer control. For example, Oracle allows a developer to override the optimizer choice of partition and allows the developer to specify it explicitly.
WHERE columnValue in (something1, something2, something3)
PostgreSQL on the other hand implements a parent-child approach to partitioning, where the partitioned data has its own table name, so querying a specific partition requires no additional SELECT PARTITION clause, it just like executing an SQL query against a plain ol’ table.
Furthermore, with PostgreSQL, just in case it isn’t that obvious
“Partitions” can be dropped etc (see Partition Management below) with simple DROP TABLE DDL
Different indexes and much else can be placed on each of the separate partitions, for whatever technical or business reason. This is just not possible with many other vendor DBMS implementations!
The code below demonstrates creation of a parent and partitioned tables using PostgreSQL 10.1. The DDL syntax is markedly different in prior versions, namely version 9.6.4.
CREATE TABLE position(id INT, positionID VARCHAR(20), quantity INT, positionDate DATE NOT NULL)
PARTITION BY RANGE (positionDate);
CREATE TABLE position20171122
PARTITION OF position
FOR VALUES FROM ('2017-11-22') TO ('2017-11-23');
CREATE TABLE position20171123
PARTITION OF position
FOR VALUES FROM ('2017-11-23') TO ('2017-11-24');
CREATE TABLE position20171124
PARTITION OF position
FOR VALUES FROM ('2017-11-24') TO ('2017-11-25');
The underlying concept in both Oracle and PostgreSQL is the same however, IO and CPU resource filtering is reduced to a smaller subset of rows/files, rather than full table/complete database file/row scans.
Partition Management (aka. Data Aging)
Some data is not stored forever, for example preaggregated time-series data such as a trading position is useful for a short period, retained for slightly longer to support recent albeit historical database queries, and then discarded as its purpose has been outlived.
In large banking and telecoms organisations (the exposure I have to partitioning technologies in the past 18 months), efficient mechanisms of discarding this data do not involve slow resource expensive operations such as DELETE FROM …. DML statements, but simple DROP TABLE PARTITION DDL variants. The execution of DML can be slow and result in full table or index scans. The execution of DDL is more often than not instant, the DBMS overhead small and without extensive logging, and maintenance of the underlying database ‘files’ released/deleted at file system level.
Often this type of partition management is implemented as a scheduled task, invoked to age out data greater than, say, 120 days old, and invoked using workload automation and scheduling technologies such as Control-M. Data can use automagically disappear, as required, quickly, and without the database growing and growing full of unused stale data. Notably vide ante foreign key references are often missing on partitioned tables so as to not choke nor lock tables/rows during delete/truncate/drop cascade events. Put another way, adopting a partitioning architecture sometimes forgoes referential integrity.
Database table partitioning is frequently used to manage infrastructure; commonly used or recent table partitions being placed on local fast and cached drives, and infrequently used or required data on slower less expensive hardware. This data-tiering approach can be best summed up as a database (tuning) function f(performance, cost, and infrastructure resource).
In some non-federated DBMS vendor implementations, one or more table partitions may be offline/unavailable and the database instance will remain online, serving some executed queries and not others. The approach does have some restrictions, but enables high-availability or staged data availability in 24/7 systems.
Last but not least, database table partitioning may be implemented for reasons of jurisdictional law. Each of the four options above relates to resource consumption, whether that be a SQL query execution performance, quickly aging out data, or efficiently distributing the underlying data across the silicon or iron. Table partitioning as a legal requirement does not have this technical focus.
The EU (and UK/CH) have a plethora of local laws requiring that (personal) data regarding its citizens are not stored (and on occasion transmitted) outside of its jurisdiction. Database partitioning allows for a database physical schema change to address this type of jurisdictional law.
— Published by Mike, 16:39:17 02 December 2017 (CET)