Migrating data
This is Part 1 of a guide on migrating from PostgreSQL to ClickHouse. Using a practical example, it demonstrates how to efficiently carry out the migration with a real-time replication (CDC) approach. Many of the concepts covered are also applicable to manual bulk data transfers from PostgreSQL to ClickHouse.
Dataset
As an example dataset to show a typical migration from Postgres to ClickHouse, we use the Stack Overflow dataset documented here. This contains every post, vote, user, comment, and badge that has occurred on Stack Overflow from 2008 to Apr 2024. The PostgreSQL schema for this data is shown below:
DDL commands to create the tables in PostgreSQL are available here.
This schema, while not necessarily the most optimal, exploits a number of popular PostgreSQL features, including primary keys, foreign keys, partitioning, and indexes.
We will migrate each of these concepts to their ClickHouse equivalents.
For those users who wish to populate this dataset into a PostgreSQL instance to test migration steps, we have provided the data in pg_dump format for download with the DDL, and subsequent data load commands are shown below:
While small for ClickHouse, this dataset is substantial for Postgres. The above represents a subset covering the first three months of 2024.
While our example results use the full dataset to show performance differences between Postgres and Clickhouse, all steps documented below are functionally identical with the smaller subset. Users wanting to load the full dataset into Postgres see here. Due to the foreign constraints imposed by the above schema, the full dataset for PostgreSQL only contains rows that satisfy referential integrity. A Parquet version, with no such constraints, can be easily loaded directly into ClickHouse if needed.
Migrating data
Real time replication (CDC)
Refer to this guide to set up ClickPipes for PostgreSQL. The guide is covering many different types of source Postgres instances.
With CDC approach using ClickPipes or PeerDB, each tables in the PostgreSQL database are automatically replicated in ClickHouse.
To handle updates and deletes in near real-time, ClickPipes maps Postgres tables to ClickHouse using ReplacingMergeTree engine, specifically designed to handle updates and deletes in ClickHouse. You can find more information on how the data gets replicated to ClickHouse using ClickPipes here. It is important to note that replication using CDC creates duplicated rows in ClickHouse when replicating updates or deletes operations. See techniques using the FINAL modifier for handling those in ClickHouse.
Let's have a look on how the table users is created in ClickHouse using ClickPipes.
Once set up, ClickPipes starts migrating all data from PostgreSQL to ClickHouse. Depending on the network and size of the deployments, this should take only a few minutes for the Stack Overflow dataset.
Manual bulk load with periodic updates
Using a manual approach, the initial bulk load of the dataset can be achieved via:
- Table functions - Using the Postgres table function in ClickHouse to SELECTdata from Postgres andINSERTit into a ClickHouse table. Relevant to bulk loads up to datasets of several hundred GB.
- Exports - Exporting to intermediary formats such as CSV or SQL script file. These files can then be loaded into ClickHouse from either the client via the INSERT FROM INFILEclause or using object storage and their associated functions i.e. s3, gcs.
When loading data manually from PostgreSQL, you need to first create the tables in ClickHouse. Refer to this Data Modeling documentation to that also uses the Stack Overflow dataset to optimize the table schema in ClickHouse.
Data types between PostgreSQL and ClickHouse might differ. To establish the equivalent types for each of the table columns, we can use the DESCRIBE command with the Postgres table function. The following command describe the table posts in PostgreSQL, modify it according to your environment:
For an overview of data type mapping between PostgreSQL and ClickHouse, refer to the appendix documentation.
The steps for optimizing the types for this schema are identical to if the data has been loaded from other sources e.g. Parquet on S3. Applying the process described in this alternate guide using Parquet results in the following schema:
We can populate this with a simple INSERT INTO SELECT, reading the data from PostgresSQL and inserting into ClickHouse:
Incremental loads can, in turn, be scheduled. If the Postgres table only receives inserts and an incrementing id or timestamp exists, users can use the above table function approach to load increments, i.e. a WHERE clause can be applied to the SELECT.  This approach may also be used to support updates if these are guaranteed to update the same column. Supporting deletes will, however, require a complete reload, which may be difficult to achieve as the table grows.
We demonstrate an initial load and incremental load using the CreationDate (we assume this gets updated if rows are updated)..
ClickHouse will push down simple
WHEREclauses such as=,!=,>,>=,<,<=, and IN to the PostgreSQL server. Incremental loads can thus be made more efficient by ensuring an index exists on columns used to identify the change set.
A possible method to detect UPDATE operations when using query replication is using the
XMINsystem column (transaction IDs) as a watermark - a change in this column is indicative of a change and therefore can be applied to the destination table. Users employing this approach should be aware thatXMINvalues can wrap around and comparisons require a full table scan, making tracking changes more complex.
