BigQuery to ClickHouse migration guide setup

There are two recommended ways to migrate your project from BigQuery to ClickHouse:

Using ClickPipes, a managed CDC solution available in ClickHouse Cloud
By exporting your BigQuery tables to a GCS bucket and importing the data in ClickHouse

Both of the migration guides in this section covering the two approaches above use the Stack Overflow dataset, as an example dataset to show a typical migration from BigQuery to ClickHouse Cloud. The dataset contains every post, vote, user, comment, and badge that has occurred on Stack Overflow from 2008 to Apr 2024.

The BigQuery schema for this data is shown below:

Pre-requisites

Before proceeding, make sure you have the following:

A Google Cloud account

Create a new project

Go to the Google Cloud Console.
Click on the project picker at the top of the page next to the Google Cloud logo.
Click on New project.
Fill in the project name, for example "BigQueryClickHouse."
Optionally, select a parent resource, or leave it as No organisation.
Click on Create.

Create table DDL and insert data

From the console landing page, select the project you just created.
You should see "You are working in" along with the name of the project you just created.
Make note of the Project-ID. You will need this in a later step.
Select the Run a query in BigQuery button to open the studio.
Now select the SQL query button.
Enter the following DDL commands in the query editor, replacing project-ID below with the project ID that you took note of in step 3 above:

Data for these tables is made available in Parquet format in a GCS bucket for convenience.

CREATE SCHEMA `project-ID.stackoverflow`;

CREATE TABLE stackoverflow.posts (
    Id INTEGER,
    PostTypeId INTEGER,
    AcceptedAnswerId INTEGER,
    CreationDate TIMESTAMP,
    Score INTEGER,
    ViewCount INTEGER,
    Body STRING,
    OwnerUserId INTEGER,
    OwnerDisplayName STRING,
    LastEditorUserId INTEGER,
    LastEditorDisplayName STRING,
    LastEditDate TIMESTAMP,
    LastActivityDate TIMESTAMP,
    Title STRING,
    Tags STRING,
    AnswerCount INTEGER,
    CommentCount INTEGER,
    FavoriteCount INTEGER,
    ContentLicense STRING,
    ParentId STRING,
    CommunityOwnedDate TIMESTAMP,
    ClosedDate TIMESTAMP
);

LOAD DATA OVERWRITE stackoverflow.posts (
    Id INTEGER,
    PostTypeId INTEGER,
    AcceptedAnswerId INTEGER,
    CreationDate TIMESTAMP,
    Score INTEGER,
    ViewCount INTEGER,
    Body STRING,
    OwnerUserId INTEGER,
    OwnerDisplayName STRING,
    LastEditorUserId INTEGER,
    LastEditorDisplayName STRING,
    LastEditDate TIMESTAMP,
    LastActivityDate TIMESTAMP,
    Title STRING,
    Tags STRING,
    AnswerCount INTEGER,
    CommentCount INTEGER,
    FavoriteCount INTEGER,
    ContentLicense STRING,
    ParentId STRING,
    CommunityOwnedDate TIMESTAMP,
    ClosedDate TIMESTAMP
)
FROM FILES (
  format = 'Parquet',
  uris = ['gs://clickhouse-public-datasets/stackoverflow/parquet/posts/*.parquet']);

CREATE TABLE stackoverflow.votes (
    Id INTEGER,
    PostId INTEGER,
    VoteTypeId INTEGER,
    CreationDate TIMESTAMP,
    UserId INTEGER,
    BountyAmount INTEGER
);

LOAD DATA OVERWRITE stackoverflow.votes (
    Id INTEGER,
    PostId INTEGER,
    VoteTypeId INTEGER,
    CreationDate TIMESTAMP,
    UserId INTEGER,
    BountyAmount INTEGER
)
FROM FILES (
  format = 'Parquet',
  uris = ['gs://clickhouse-public-datasets/stackoverflow/parquet/votes/*.parquet']);

CREATE TABLE stackoverflow.comments (
    Id INTEGER,
    PostId INTEGER,
    Score INTEGER,
    Text STRING,
    CreationDate TIMESTAMP,
    UserId INTEGER,
    UserDisplayName STRING
);

LOAD DATA OVERWRITE stackoverflow.comments (
    Id INTEGER,
    PostId INTEGER,
    Score INTEGER,
    Text STRING,
    CreationDate TIMESTAMP,
    UserId INTEGER,
    UserDisplayName STRING
)
FROM FILES (
  format = 'Parquet',
  uris = ['gs://clickhouse-public-datasets/stackoverflow/parquet/comments/*.parquet']);

CREATE TABLE stackoverflow.users (
    Id INTEGER,
    Reputation STRING,
    CreationDate TIMESTAMP,
    DisplayName STRING,
    LastAccessDate TIMESTAMP,
    AboutMe STRING,
    Views INTEGER,
    UpVotes INTEGER,
    DownVotes INTEGER,
    WebsiteUrl STRING,
    Location STRING,
    AccountId INTEGER
);

LOAD DATA OVERWRITE stackoverflow.users (
    Id INTEGER,
    Reputation STRING,
    CreationDate TIMESTAMP,
    DisplayName STRING,
    LastAccessDate TIMESTAMP,
    AboutMe STRING,
    Views INTEGER,
    UpVotes INTEGER,
    DownVotes INTEGER,
    WebsiteUrl STRING,
    Location STRING,
    AccountId INTEGER
)
FROM FILES (
  format = 'Parquet',
  uris = ['gs://clickhouse-public-datasets/stackoverflow/parquet/users/*.parquet']);

CREATE TABLE stackoverflow.badges (
    Id INTEGER,
    UserId INTEGER,
    Name STRING,
    Date TIMESTAMP,
    Class INTEGER,
    TagBased BOOL
);

LOAD DATA OVERWRITE stackoverflow.badges (
    Id INTEGER,
    UserId INTEGER,
    Name STRING,
    Date TIMESTAMP,
    Class INTEGER,
    TagBased INTEGER
)
FROM FILES (
  format = 'Parquet',
  uris = ['gs://clickhouse-public-datasets/stackoverflow/parquet/badges/*.parquet']);

CREATE TABLE stackoverflow.postlinks (
    Id INTEGER,
    CreationDate TIMESTAMP,
    PostId INTEGER,
    RelatedPostId INTEGER,
    LinkTypeId INTEGER
);

LOAD DATA OVERWRITE stackoverflow.postlinks (
    Id INTEGER,
    CreationDate TIMESTAMP,
    PostId INTEGER,
    RelatedPostId INTEGER,
    LinkTypeId INTEGER
)
FROM FILES (
  format = 'Parquet',
  uris = ['gs://clickhouse-public-datasets/stackoverflow/parquet/postlinks/*.parquet']);

CREATE TABLE stackoverflow.posthistory (
    Id INTEGER,
    PostHistoryTypeId INTEGER,
    PostId INTEGER,
    RevisionGUID STRING,
    CreationDate TIMESTAMP,
    UserId INTEGER,
    Text STRING,
    ContentLicense STRING,
    Comment STRING,
    UserDisplayName STRING
);

LOAD DATA OVERWRITE stackoverflow.posthistory
-- (
--     Id INTEGER,
--     PostHistoryTypeId INTEGER,
--     PostId INTEGER,
--     RevisionGUID STRING,
--     CreationDate TIMESTAMP,
--     UserId INTEGER,
--     Text STRING,
--     ContentLicense STRING,
--     Comment STRING,
--     UserDisplayName STRING
-- )
FROM FILES (
  format = 'Parquet',
  uris = ['gs://clickhouse-public-datasets/stackoverflow/parquet/posthistory/2008.parquet']);

Click the Run button to execute the statements. It will take about six minutes to complete.

When the queries have finished executing you should see a BigQuery resource in the tab to the left of the query editor. If you click on the resource name to expand it, you should now see the stackoverflow schema and 7 tables.

You have successfully created the tables that we'll be using in these migration guides.

Create a GCS bucket for staging

Note

This step is required regardless of which migration method you will be using.

The initial load process requires a user-provided Google Cloud Storage (GCS) bucket for staging. Refer to the Google Cloud documentation on how to do so.

You will need the bucket's gsutil URI if you intend to use ClickPipes (e.g. gs://bigquery-clickhouse), or the bucket's Cloud Console URL (e.g. https://console.cloud.google.com/storage/browser/bigquery-clickhouse) if you intend to import the data directly.

You can find these by clicking the name of your bucket, and then selecting the Configuration tab. They're listed in a table under the header Overview.

Create a service account file and key

ClickPipes authenticates to your Google Cloud project using a service account key. We recommend creating a dedicated service account with the minimum required set of permissions to allow ClickPipes to export data from BigQuery, load it into the staging GCS bucket, and read it into ClickHouse.

To create a service account:

Select IAM and admin from the navigation menu in the Google Cloud console
Select Service accounts
Click Create service account

Service account permissions

The following service account permissions are required:

BigQuery

The service account must have the following BigQuery roles:

To further scope access, we recommend using IAM conditions to restrict the resources the role has access to. For example, you can restrict the dataViewer role to the specific dataset containing the tables you want to sync:

resource.name.startsWith("projects/<PROJECT_ID>/datasets/<DATASET_NAME>")

Cloud Storage

The service account must have the following Cloud Storage roles:

To further scope access, we recommend using IAM conditions to restrict the resources the role has access to. For example, you can restrict the objectAdmin and bucketViewer roles to the dedicated bucket created for ClickPipes syncs.

resource.name.startsWith("projects/_/buckets/<BUCKET_NAME>")

Create access keys

Depending on the migration approach you plan to use, you will need to create one of the following credentials:

Migration approach	Credential type	Documentation	Notes
ClickPipes for CDC	Service key	Create a service key	Make sure to select JSON rather than P12 as the Key type.
GCS bulk-load	`HMAC` key	Create an `HMAC` key	-

Next steps

After completing the steps in this setup guide, you can now proceed with one of the following migration guides depending on which approach you would like to take:

Pre-requisites​

Create a new project​

Create table DDL and insert data​

Create a GCS bucket for staging​

Create a service account file and key​

Service account permissions​

BigQuery​

Cloud Storage​

Create access keys​

Next steps​