AlloyDB for PostgreSQL - Source Connector

Postgres

This page contains the setup guide and reference information for the Postgres source connector for CDC and non-CDC workflows.

When to use Postgres with CDC

Configure Postgres with CDC if:

  • You need a record of deletions
  • Your table has a primary key but doesn't have a reasonable cursor field for incremental syncing (updated_at). CDC allows you to sync your table incrementally

If your goal is to maintain a snapshot of your table in the destination but the limitations prevent you from using CDC, consider using non-CDC incremental sync [Coming Soon] and occasionally reset the data and re-sync.

If your dataset is small and you just want a snapshot of your table in the destination, consider using Full Refresh replication for your table instead of CDC.

Prerequisites

  • Use Postgres v9.3.x or above for non-CDC workflows and Postgres v10 or above for CDC workflows

Setup guide

Step 1: (Optional) Create a dedicated read-only user

We recommend creating a dedicated read-only user for better permission control and auditing. Alternatively, you can use an existing Postgres user in your database.

To create a dedicated user, run the following command:

CREATE USER <user_name> PASSWORD 'your_password_here';

Grant access to the relevant schema:

GRANT USAGE ON SCHEMA <schema_name> TO <user_name>

NOTE

To replicate data from multiple Postgres schemas, re-run the command to grant access to all the relevant schemas. 

Grant the user read-only access to the relevant tables:

GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO <user_name>;

Allow users to see tables created in the future:

ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO <user_name>;

Additionally, if you plan to configure CDC for the Postgres source connector, grant REPLICATION permissions to the user:

ALTER USER <user_name> REPLICATION;

Syncing a subset of columns​

Currently, there is no way to sync a subset of columns using the Postgres source connector:

  • When setting up a connection, you can only choose which tables to sync, but not columns.
  • If the user can only access a subset of columns, the connection check will pass. However, the data sync will fail with a permission-denied exception.

The workaround for partial table syncing is to create a view on the specific columns, and grant the user read access to that view:

CREATE VIEW <view_name> as SELECT <columns> FROM <table>;
GRANT SELECT ON TABLE <view_name> IN SCHEMA <schema_name> to <user_name>;

Note: The workaround works only for non-CDC setups since CDC requires data to be in tables and not views.

Step 2: Set up the Postgres connector in Connect

  1. Navigate back to Connect and create a new Source.
  2. On the Set up the source page, select Postgres from the Source type dropdown.
  3. Enter a name for your source.
  4. For the Host, Port, and DB Name, enter the hostname, port number, and name for your Postgres database.
  5. List the Schemas you want to sync.
    NOTE
    The schema names are case-sensitive. The 'public' schema is set by default. Multiple schemas may be used at one time. No schemas set explicitly - will sync all of existing.
  6. For User and Password, enter the username and password you created in Step 1.
  7. To customize the JDBC connection beyond common options, specify additional supported JDBC URL parameters as key-value pairs separated by the symbol & in the JDBC URL Parameters (Advanced) field.
    Example: key1=value1&key2=value2&key3=value3
    These parameters will be added at the end of the JDBC URL that the Connect will use to connect to your Postgres database.
    The connector now supports connectTimeout and defaults to 60 seconds. Setting connectTimeout to 0 seconds will set the timeout to the longest time available.
    Note: Do not use the following keys in JDBC URL Params field as they will be overwritten: currentSchema, user, password, ssl, and sslmode.
    DANGER
    This is an advanced configuration option. Users are advised to use it with caution.
  8. For SSL Modes, select:
    • disable to disable encrypted communication between Connect and the source
    • allow to enable encrypted communication only when required by the source
    • prefer to allow unencrypted communication only when the source doesn't support encryption
    • require to always require encryption. Note: The connection will fail if the source doesn't support encryption.
    • verify-ca to always require encryption and verify that the source has a valid SSL certificate
    • verify-full to always require encryption and verify the identity of the source
  9. For Replication Method, select Standard or Logical CDC from the dropdown. Refer to Configuring Postgres connector with Change Data Capture (CDC) for more information.
  10. For SSH Tunnel Method, select:
    • No Tunnel for a direct connection to the database
    • SSH Key Authentication to use an RSA Private as your secret for establishing the SSH tunnel
    • Password Authentication to use a password as your secret for establishing the SSH tunnel

Refer to Connect via SSH Tunnel for more information. 13. Click Set up source.

Connect via SSH Tunnel​

You can connect to a Postgres instance via an SSH tunnel.

When using an SSH tunnel, you are configuring Connect to connect to an intermediate server (also called a bastion or a jump server) that has direct access to the database. Connect connects to the bastion and then asks the bastion to connect directly to the server.

To connect to a Postgres instance via an SSH tunnel:

  1. While setting up the Postgres source connector, from the SSH tunnel dropdown, select:
    • SSH Key Authentication to use a private as your secret for establishing the SSH tunnel
    • Password Authentication to use a password as your secret for establishing the SSH Tunnel
  2. For SSH Tunnel Jump Server Host, enter the hostname or IP address for the intermediate (bastion) server that Connect will connect to.
  3. For SSH Connection Port, enter the port on the bastion server. The default port for SSH connections is 22.
  4. For SSH Login Username, enter the username to use when connecting to the bastion server. Note: This is the operating system username and not the Postgres username.
  5. For authentication:
    • If you selected SSH Key Authentication, set the SSH Private Key to the private Key that you are using to create the SSH connection.
    • If you selected Password Authentication, enter the password for the operating system user to connect to the bastion server. Note: This is the operating system password and not the Postgres password.

Generating a Private Key​

The connector supports any SSH compatible key format such as RSA or Ed25519. To generate an RSA key, for example, run:

ssh-keygen -t rsa -m PEM -f myuser_rsa

The command produces the private key in PEM format and the public key remains in the standard format used by the authorized_keys file on your bastion server. Add the public key to your bastion host to the user you want to use with Connect. The private key is provided via copy-and-paste to the Connect connector configuration screen to allow it to log into the bastion server.

Configuring Postgres connector with Change Data Capture (CDC)

Connect uses logical replication of the Postgres write-ahead log (WAL) to incrementally capture deletes using a replication plugin. To learn more about how Connect implements CDC, refer to Change Data Capture (CDC)

CDC Considerations

  • Incremental sync is only supported for tables with primary keys. For tables without primary keys, use Full Refresh sync.
  • Data must be in tables and not views. If you require data synchronization from a view, you would need to create a new connection with Standard as Replication Method.
  • The modifications you want to capture must be made using DELETE/INSERT/UPDATE. For example, changes made using TRUNCATE/ALTER will not appear in logs and therefore in your destination.
  • Schema changes are not supported automatically for CDC sources. Reset and resync data if you make a schema change.
  • The records produced by DELETE statements only contain primary keys. All other data fields are unset.
  • Log-based replication only works for master instances of Postgres. CDC cannot be run from a read-replica of your primary database.
  • Using logical replication increases disk space used on the database server. The additional data is stored until it is consumed.
    • Set frequent syncs for CDC to ensure that the data doesn't fill up your disk space.
    • If you stop syncing a CDC-configured Postgres instance with Connect, delete the replication slot. Otherwise, it may fill up your disk space.

CONNECTOR CONFIGURATION ARE SUPPORTED ONLY ON PRIMARY/MASTER DB HOST/SERVERS. DO NOT POINT CONNECTOR CONFIGURATION TO REPLICA DB HOSTS, IT WILL NOT WORK.. :::

Setting up CDC for Postgres​

Connect requires a replication slot configured only for its use. Only one source should be configured that uses this replication slot. See Setting up CDC for Postgres for instructions.

Step 1: Enable logical replication​

To enable logical replication on bare metal, VMs (EC2/GCE/etc), or Docker, configure the following parameters in the postgresql.conf file for your Postgres database:

To enable logical replication on AWS Postgres RDS or Aurora​:

  1. Go to the Configuration tab for your DB cluster.
  2. Find your cluster parameter group. Either edit the parameters for this group or create a copy of this parameter group to edit. If you create a copy, change your cluster's parameter group before restarting.
  3. Within the parameter group page, search for rds.logical_replication. Select this row and click Edit parameters. Set this value to 1.
  4. Wait for a maintenance window to automatically restart the instance or restart it manually.

To enable logical replication on Azure Database for Postgres​:

Change the replication mode of your Postgres DB on Azure to logical using the Replication menu of your PostgreSQL instance in the Azure Portal. Alternatively, use the Azure CLI to run the following command:

az postgres server configuration set --resource-group group --server-name server --name azure.replication_support --value logical

 

az postgres server restart --resource-group group --name server

Step 3: Create replication slot​

Connect currently supports pgoutput plugin only. To create a replication slot called connect_slot using pgoutput, run:

SELECT pg_create_logical_replication_slot('connect_slot', 'pgoutput');

Step 4: Create publications and replication identities for tables​

For each table you want to replicate with CDC, add the replication identity (the method of distinguishing between rows) first:

To use primary keys to distinguish between rows for tables that don't have a large amount of data per row, run:

ALTER TABLE tbl1 REPLICA IDENTITY DEFAULT;

In case your tables use data types that support TOAST and have very large field values, use:

ALTER TABLE tbl1 REPLICA IDENTITY FULL;

After setting the replication identity, run:

CREATE PUBLICATION connect_publication FOR TABLE <tbl1, tbl2, tbl3>;`

The publication name is customizable. Refer to the Postgres docs if you need to add or remove tables from your publication in the future.

You must add the replication identity before creating the publication. Otherwise, ALTER/UPDATE/DELETE statements may fail if Postgres cannot determine how to uniquely identify rows. Also, the publication should include all the tables and only the tables that need to be synced. Otherwise, data from these tables may not be replicated correctly. :::

 

Step 5: [Optional] Set up initial waiting time

DANGER

This is an advanced feature. Use it if absolutely necessary.

The Postgres connector may need some time to start processing the data in the CDC mode in the following scenarios:

  • When the connection is set up for the first time and a snapshot is needed
  • When the connector has a lot of change logs to process

The connector waits for the default initial wait time of 5 minutes (300 seconds). Setting the parameter to a longer duration will result in slower syncs, while setting it to a shorter duration may cause the connector to not have enough time to create the initial snapshot or read through the change logs. The valid range is 120 seconds to 1200 seconds.

If you know there are database changes to be synced, but the connector cannot read those changes, the root cause may be insufficient waiting time. In that case, you can increase the waiting time (example: set to 600 seconds) to test if it is indeed the root cause. On the other hand, if you know there are no database changes, you can decrease the wait time to speed up the zero record syncs.

Step 6: Set up the Postgres source connector

In Step 2 of the connector setup guide, enter the replication slot and publication you just created.

Supported sync modes

The Postgres source connector supports the following sync modes:

Supported cursors

  • TIMESTAMP
  • TIMESTAMP_WITH_TIMEZONE
  • TIME
  • TIME_WITH_TIMEZONE
  • DATE
  • BIT