Recovery Process
Scenarios:
The primary node goes down:
- Check node status:
psql -h 192.168.122.20 -p 9999 -U pgpool postgres -c "show pool_nodes"
Password for user pgpool:
node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------|----------|------|--------|-----------|-----------|---------|---------|------------|-------------------|-------------------|-------------------|------------------------|---------------------
0 | db01 | 5432 | down | down | 0.333333 | standby | unknown | 0 | false | 0 | | | 2024-08-13 18:54:03
1 | db02 | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2024-08-13 19:15:39
2 | db03 | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 1984 | | | 2024-08-13 19:22:17
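The same status is visible over the pcp interface; an equivalent spot-check for node 0, assuming the default pcp port 9898:
pcp_node_info -h 192.168.122.20 -p 9898 -U pgpool -n 0 -W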
Fix steps:
1- Bring back the old primary so that its PostgreSQL server runs again as a standby, using pg_rewind or pg_basebackup (a start-and-verify sketch follows the flag table below):
su - postgres
- With pg_rewind:
/usr/pgsql-16/bin/pg_ctl -D /var/lib/pgsql/16/data -m immediate stop
/usr/pgsql-16/bin/pg_rewind -D /var/lib/pgsql/16/data --source-server='host=db02 dbname=postgres user=postgres port=5432'
- With pg_basebackup:
rm -rf /var/lib/pgsql/16/data/*
${PGHOME}/bin/pg_basebackup -h db02 -U replicator -p 5432 -D /var/lib/pgsql/16/data -X stream -R -C -S db01
Note:
- If you recover the node manually, check that postgresql.auto.conf contains primary_conninfo and primary_slot_name, or use the -R flag with pg_rewind (or -R -C -S with pg_basebackup) to create a correct primary_conninfo and primary_slot_name.
- Or copy the provided online recovery scripts (recovery_1st_stage, pgpool_remote_start) to the new primary so that pcp_recovery_node and pcp_attach_node can online-recover the old primary. (recovery_1st_stage creates myrecovery.conf, appends the conninfo and slot name to it, and deletes postgresql.auto.conf.)
- Or copy the myrecovery.conf file from another standby node, modify it to match the new cluster state, and remove postgresql.auto.conf. E.g.:
rm -rf /var/lib/pgsql/16/data/postgresql.auto.conf
cat << EOF > /var/lib/pgsql/16/data/myrecovery.conf
primary_conninfo = 'host=db02 port=5432 user=replicator application_name=db01 passfile=''/var/lib/pgsql/.pgpass'''
recovery_target_timeline = 'latest'
primary_slot_name = 'db01'
EOF
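myrecovery.conf is only picked up if postgresql.conf includes it; in the pgpool sample layout the recovery_1st_stage script appends an include_if_exists line for it. A quick check, assuming that layout:
grep include_if_exists /var/lib/pgsql/16/data/postgresql.conf
# expected: include_if_exists = '/var/lib/pgsql/16/data/myrecovery.conf'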
- Also pay attention to the replication slot: if it was not created on the new primary, you need to create it manually.
psql -h 192.168.122.20 -p 5432 -U postgres postgres -c "select * from pg_create_physical_replication_slot('db01');"
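If a slot with that name is already left over from before the failover, creating it again raises an error; an inactive leftover can be dropped first. A hedged one-liner, to be run only when the slot shows active = f in pg_replication_slots:
psql -h 192.168.122.20 -p 5432 -U postgres postgres -c "select pg_drop_replication_slot('db01');"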
flag | usage
---------|---------
-R, --write-recovery-conf | Creates a standby.signal file and appends connection settings to the postgresql.auto.conf file in the target directory.
-C, --create-slot | Specifies that the replication slot named by the --slot option should be created before starting the backup; an error is raised if the slot already exists. pg_basebackup only.
-S slotname, --slot=slotname | Can only be used together with -X stream; causes WAL streaming to use the specified replication slot. pg_basebackup only.
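Whichever method you used, finish by starting the recovered node and confirming it runs as a standby; a minimal check, assuming the paths from above (run as postgres on db01):
/usr/pgsql-16/bin/pg_ctl -D /var/lib/pgsql/16/data start
/usr/pgsql-16/bin/psql -p 5432 -c "select pg_is_in_recovery();"   # should return t on a standby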
2- Now make sure the old primary is replicating from the new primary and is up to date:
Connect to the live system (not the node that's down) and run the following:
psql -h 192.168.122.20 -p 9999 -U pgpool postgres -c "select * from pg_stat_replication;"
Password for user pgpool:
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
------|----------|------------|------------------|----------------|-----------------|-------------|-------------------------------|--------------|-----------|------------|------------|------------|------------|-----------|-----------|------------|---------------|------------|-------------------------------|
4580 | 16388 | replicator | walreceiver | 192.168.122.10 | | 58096 | 2024-08-16 16:13:15.475243+02 | | streaming | 0/16000148 | 0/16000148 | 0/16000148 | 0/16000148 | | | | 0 | async | 2024-08-16 16:18:16.193877+02
4195 | 16388 | replicator | walreceiver | 192.168.122.12 | | 45700 | 2024-08-16 16:05:47.754722+02 | | streaming | 0/16000148 | 0/16000148 | 0/16000148 | 0/16000148 | | | | 0 | async | 2024-08-16 16:18:16.174538+02
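The LSN columns should match across the standbys. To quantify any lag in bytes instead of comparing LSNs by eye, a small helper query using standard catalog functions, run directly against the current primary (db02 here):
psql -h db02 -p 5432 -U postgres postgres -c "select application_name, client_addr, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes from pg_stat_replication;"
# 0 means the standby has replayed everything the primary has written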
psql -h 192.168.122.20 -p 9999 -U pgpool postgres -c "select * from pg_replication_slots;"
Password for user pgpool:
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | conflicting
-----------|--------|-----------|--------|----------|-----------|--------|------------|------|--------------|-------------|---------------------|------------|---------------|-----------|-------------
db03 | | physical | | | f | t | 4195 | | | 0/16000148 | | reserved | | f |
db01 | | physical | | | f | t | 4580 | | | 0/16000148 | | reserved | | f |
3- Reattach the node to pgpool:
pcp_attach_node -n 0 -U pgpool -W
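Alternatively, if the recovery_1st_stage and pgpool_remote_start scripts from the note above are installed, pcp_recovery_node performs the restore and the attach in one go; a sketch assuming the default pcp port 9898:
pcp_recovery_node -h 192.168.122.20 -p 9898 -n 0 -U pgpool -W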
psql -h 192.168.122.20 -p 9999 -U pgpool postgres -c "show pool_nodes"
node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | db01 | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2024-08-17 18:13:32 |
1 | db02 | 5432 | up | up | 0.333333 | primary | primary | 5 | true | 0 | | | 2024-08-17 17:33:50 |
2 | db03 | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2024-08-17 17:43:36 |