Column migration from INT to BIGINT in PostgreSQL
by Jan Holas · 25 Aug 2021

As the primary storage for the transactional data of the Zemanta One platform we use PostgreSQL, for its robustness, flexibility and time-proven stability. Over the years the platform grew a lot, both in features and in user base, and together with that so did our database tables. At the beginning of the project nobody anticipated the sheer amount of data that would eventually be stored, and all the tables' primary key ID columns were set to INT, the usual go-to data type for such a column. Taking up 4 bytes, it can store values from -2,147,483,648 to +2,147,483,647; since the column holds an auto-incremented value starting at 1, only the positive half matters. That still looks like a big number - over 2 billion records - but some of our high volume tables recently got dangerously close to this limit and we had to act.
We needed to change this column's data type so it can store larger numbers. In PostgreSQL's case that is BIGINT - at 8 bytes it ranges from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 - over 9 quintillion possible values - that has to be enough.
Migration plan
PostgreSQL has a DDL query for exactly that, right?
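A minimal sketch of that query, assuming a table named my_table with an id column (the names are illustrative), would look something like this:

    -- The straightforward way: change the column type in place.
    ALTER TABLE my_table ALTER COLUMN id TYPE BIGINT;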
Wait! If you run this on your production database, you and your users will probably have a bad time. There is a catch - a change like this forces the database engine to rewrite the entire table. This is due to the binary incompatibility of the two data types - the data is stored on disk in sequential blocks with no free space in between to extend values to the larger type. During the rewrite the engine acquires the very aggressive ACCESS EXCLUSIVE lock that essentially prevents any other operation on the table, including reads. This operation and its lock can easily last a couple of hours (depending on table size and hardware) and cause complete downtime, as other processes will not be able to acquire the resource. At Zemanta we strive for maximum availability so as not to inconvenience our clients, and such a large maintenance window was unacceptable.
We had to come up with a different solution - one that would possibly allow us to mitigate the INT overflow problem without any downtime. In this article we will show you how.
Let’s break it up
To achieve the same goal we can split the process into a couple of independent operations that are by themselves very quick or don’t require such a lock.
Add a new column
The first thing we need to do is to create a new BIGINT column with a temporary name.
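A sketch of this step, again assuming my_table and using new_id as the temporary name:

    -- Add the new column; it is nullable and has no default, so this is instant.
    ALTER TABLE my_table ADD COLUMN new_id BIGINT;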
Creating a nullable column with no default value is a quick operation.
Keep them in sync
The next two steps will help us reach and keep a consistent state with the original column - we create a procedure that copies the value from the original column into the new one, and a trigger that runs it automatically on every INSERT or UPDATE.
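A minimal sketch of such a procedure and trigger (the function and trigger names are made up for the example):

    -- Mirror id into new_id for every inserted or updated row.
    CREATE OR REPLACE FUNCTION sync_new_id() RETURNS trigger AS $$
    BEGIN
        NEW.new_id := NEW.id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER my_table_sync_new_id
        BEFORE INSERT OR UPDATE ON my_table
        FOR EACH ROW
        EXECUTE PROCEDURE sync_new_id();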
Backfill
Now we need to backfill the existing data - copy the values from the original column into the new one. This will be the most time-consuming operation and can run for hours. That is fine, though, because we copy the data in small, quick transactions, so the table stays accessible to other processes the whole time.
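One possible sketch of a single backfill batch - in practice a script repeats this with increasing ranges until the whole table is covered:

    -- Copy one range of rows in its own short transaction.
    UPDATE my_table
    SET new_id = id
    WHERE id >= 1 AND id < 10001
      AND new_id IS NULL;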
After all the values are copied, the original and the new column hold the same data, and thanks to the trigger they also stay in sync as new data comes in.
Column constraints
As the next step we will do a couple of tricks that will later help us create a primary key (PK) constraint on this column. In essence a PK couples two constraints - the data inside the column has to be unique and not null. Dropping the old PK and creating a new one at this point would also hold the ACCESS EXCLUSIVE lock for a longer period, because it would have to build the UNIQUE index and scan the table to check and set NOT NULL on the column. Since PostgreSQL 9.1 we can provide an already existing unique index when creating the PK, so let's build it first separately:
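For example (the index name is illustrative):

    -- Build the unique index ahead of time, without blocking writes.
    CREATE UNIQUE INDEX CONCURRENTLY my_table_new_id_idx ON my_table (new_id);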
The crucial part here is the CONCURRENTLY keyword - while it makes the index build take longer, it only acquires the weaker SHARE UPDATE EXCLUSIVE lock, which still allows both reads and data changes. Without it, the index creation would acquire a SHARE lock, which allows reads but not data changes.
The next thing we need to solve is the NOT NULL constraint. Under the hood, creating a PK runs ALTER COLUMN SET NOT NULL, which triggers a full table scan while locking writes and would not be a quick operation. Since PostgreSQL 12 this query can skip the full scan if a valid CHECK constraint already guarantees that the column contains no NULLs. In earlier versions there is unfortunately no way to avoid the scan (but it is still a much shorter lock than rewriting the table). We can create the constraint in two steps.
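The first step, sketched with an illustrative constraint name:

    -- Step 1: add the constraint without checking existing rows.
    ALTER TABLE my_table
        ADD CONSTRAINT my_table_new_id_not_null CHECK (new_id IS NOT NULL) NOT VALID;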
When we create the constraint as NOT VALID, it is enforced only for new or updated rows; it does not check the existing ones, which would again lock the table for writes. We check those later and validate the constraint with a separate query that only acquires the SHARE UPDATE EXCLUSIVE lock, allowing both writes and reads.
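The second step then looks like this:

    -- Step 2: verify the existing rows; writes and reads stay allowed.
    ALTER TABLE my_table VALIDATE CONSTRAINT my_table_new_id_not_null;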
Column switchover
At this point we have everything ready for the final stage. Up until now the steps didn't interfere with the original column, so the process could be reversed or started over. The following step makes irreversible changes. In one transaction we need to run a series of queries:
- drop the old PK
- create the new PK
- transfer the existing auto increment sequence
- drop the old column
- rename the temporary one (new_id) to the original name (id).
If the previous steps finished successfully, this will be a quick operation, as it is mostly just catalogue changes and we have already covered the PK prerequisites. Since we drop the old column we also need to reanalyze the table so the query planner has correct statistics to produce optimal plans. We can do this in 3 steps with increasing statistics targets (a lower statistics target results in a faster ANALYZE step) to have at least some usable statistics right away.
One other thing to keep in mind is that this only works if the id column is not used as a foreign key in another table. To support that case as well, the previous steps would also have to be done for the foreign key column in each referencing table, the switchover would have to be applied to both tables, and the foreign keys re-created. Here we will assume the simple case without foreign key constraints.
The process looks like this:
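Here is a sketch of the whole switchover, under the same illustrative names (my_table, new_id, the my_table_pkey constraint, the my_table_id_seq sequence and the my_table_new_id_idx index are assumptions about the schema):

    BEGIN;

    -- The sync trigger is no longer needed and would break after the rename.
    DROP TRIGGER my_table_sync_new_id ON my_table;

    -- Drop the old primary key.
    ALTER TABLE my_table DROP CONSTRAINT my_table_pkey;

    -- Create the new primary key on top of the prebuilt unique index.
    ALTER TABLE my_table
        ADD CONSTRAINT my_table_pkey PRIMARY KEY USING INDEX my_table_new_id_idx;

    -- Transfer the existing auto increment sequence to the new column
    -- (on PostgreSQL 10+ the sequence itself may also need ALTER SEQUENCE ... AS BIGINT).
    ALTER SEQUENCE my_table_id_seq OWNED BY my_table.new_id;
    ALTER TABLE my_table
        ALTER COLUMN new_id SET DEFAULT nextval('my_table_id_seq');

    -- Drop the old column and take over its name.
    ALTER TABLE my_table DROP COLUMN id;
    ALTER TABLE my_table RENAME COLUMN new_id TO id;

    COMMIT;

    -- Reanalyze in three passes, from a cheap low target up to the default one.
    SET default_statistics_target TO 1;
    ANALYZE my_table;
    SET default_statistics_target TO 10;
    ANALYZE my_table;
    SET default_statistics_target TO DEFAULT;
    ANALYZE my_table;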
Conclusion
Voila! The id column is now BIGINT, without any downtime or data loss. In practice we put these queries into a script, which gives us better reusability and limits the human error of running the queries manually. In this script we also use a couple of useful queries, worth mentioning here, that check the table state before continuing the execution, for extra safety.
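As an illustration of what such checks can look like (not the exact queries from our script):

    -- Has the backfill really finished? This should return no rows.
    SELECT 1 FROM my_table WHERE new_id IS NULL LIMIT 1;

    -- Is the concurrently built index valid and ready to be used for the PK?
    SELECT indisvalid, indisready
    FROM pg_index
    WHERE indexrelid = 'my_table_new_id_idx'::regclass;

    -- What is the current data type of the column?
    SELECT data_type
    FROM information_schema.columns
    WHERE table_name = 'my_table' AND column_name = 'id';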