Recovery Process

Scenarios:

The primary node goes down:

  • Check node status:
psql -h 192.168.122.20 -p 9999 -U pgpool postgres -c "show pool_nodes"    
Password for user pgpool:
node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------|----------|------|--------|-----------|-----------|---------|---------|------------|-------------------|-------------------|-------------------|------------------------|---------------------
0 | db01 | 5432 | down | down | 0.333333 | standby | unknown | 0 | false | 0 | | | 2024-08-13 18:54:03
1 | db02 | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2024-08-13 19:15:39
2 | db03 | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 1984 | | | 2024-08-13 19:22:17
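
The same check can also be done per node through the PCP interface; a minimal sketch, assuming the default pcp port 9898 and a PCP user named pgpool:

pcp_node_info -h 192.168.122.20 -p 9898 -U pgpool -n 0 -W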

Fix steps:

1- Bring the old primary back up so that its PostgreSQL server is running again:

Use pg_rewind or pg_basebackup:

su - postgres

  • with pg_rewind (a --write-recovery-conf variant is sketched after this list):
/usr/pgsql-16/bin/pg_ctl -D /var/lib/pgsql/16/data -m immediate stop
/usr/pgsql-16/bin/pg_rewind -D /var/lib/pgsql/16/data --source-server='host=db02 dbname=postgres user=postgres port=5432'
  • with pg_basebackup:
rm -rf /var/lib/pgsql/16/data/*
${PGHOME}/bin/pg_basebackup -h db02 -U replicator -p 5432 -D /var/lib/pgsql/16/data -X stream -R -C -S db01
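
On PostgreSQL 13 and later (16 here), pg_rewind also accepts --write-recovery-conf (-R), which appends primary_conninfo to postgresql.auto.conf and creates standby.signal in one step; primary_slot_name may still have to be set separately. A minimal variant of the command above:

/usr/pgsql-16/bin/pg_rewind -D /var/lib/pgsql/16/data --source-server='host=db02 dbname=postgres user=postgres port=5432' --write-recovery-conf --progress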

Note

  • If you recover the node manually, check that postgresql.auto.conf contains primary_conninfo & primary_slot_name, or use the -R flag with pg_rewind, or -R -C -S with pg_basebackup, to create a correct primary_conninfo and primary_slot_name.
  • Or copy the provided online recovery scripts (recovery_1st_stage, pgpool_remote_start) to the new primary so that pcp_recovery_node & pcp_attach_node can be used to recover the old primary online (see the sketch after this list). (recovery_1st_stage creates myrecovery.conf, appends the conninfo and slot name to it, and deletes postgresql.auto.conf.)
  • Or copy the myrecovery.conf file from another standby node, modify it to match the new cluster state, and remove postgresql.auto.conf, e.g.:
rm -rf /var/lib/pgsql/16/data/postgresql.auto.conf 

cat << EOF > /var/lib/pgsql/16/data/myrecovery.conf
primary_conninfo = 'host=db02 port=5432 user=replicator application_name=db01 passfile=''/var/lib/pgsql/.pgpass'''
recovery_target_timeline = 'latest'
primary_slot_name = 'db01'
EOF
  • Also pay attention to the replication slot: if it was not created on the new primary, you need to create it manually.
psql -h 192.168.122.20 -p 5432 -U postgres postgres -c "select * from pg_create_physical_replication_slot('db01');"
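
If the online recovery scripts from the second note are installed, step 1 can instead be driven entirely from pgpool: pcp_recovery_node runs recovery_1st_stage on the primary, starts the recovered node via pgpool_remote_start and re-attaches it. A minimal sketch, assuming the default pcp port 9898:

pcp_recovery_node -h 192.168.122.20 -p 9898 -U pgpool -n 0 -W
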
Flag reference (pg_rewind / pg_basebackup):

flag                         | usage
-----------------------------|------
-R, --write-recovery-conf    | Creates a standby.signal file and appends connection settings to the postgresql.auto.conf file in the target directory.
-C, --create-slot            | Specifies that the replication slot named by the --slot option should be created before starting the backup. An error is raised if the slot already exists. (pg_basebackup only)
-S slotname, --slot=slotname | This option can only be used together with -X stream. It causes WAL streaming to use the specified replication slot. (pg_basebackup only)
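
If the node was recovered manually (e.g. pg_rewind without -R plus the copied myrecovery.conf above), make sure a standby.signal file exists and start PostgreSQL on the old primary before running the checks below. A minimal sketch, assuming postgresql.conf already includes myrecovery.conf (as set up by the recovery scripts) and the same data directory as above:

touch /var/lib/pgsql/16/data/standby.signal
/usr/pgsql-16/bin/pg_ctl -D /var/lib/pgsql/16/data start
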
2- Now make sure the old primary is replicating from the new primary and is up to date:

Connect to the live system (not the node that's down) and run the following:

 psql -h 192.168.122.20 -p 9999 -U pgpool postgres -c "select * from pg_stat_replication;"
Password for user pgpool:

pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
------|----------|------------|------------------|----------------|-----------------|-------------|-------------------------------|--------------|-----------|------------|------------|------------|------------|-----------|-----------|------------|---------------|------------|-------------------------------|
4580 | 16388 | replicator | walreceiver | 192.168.122.10 | | 58096 | 2024-08-16 16:13:15.475243+02 | | streaming | 0/16000148 | 0/16000148 | 0/16000148 | 0/16000148 | | | | 0 | async | 2024-08-16 16:18:16.193877+02
4195 | 16388 | replicator | walreceiver | 192.168.122.12 | | 45700 | 2024-08-16 16:05:47.754722+02 | | streaming | 0/16000148 | 0/16000148 | 0/16000148 | 0/16000148 | | | | 0 | async | 2024-08-16 16:18:16.174538+02
psql -h 192.168.122.20 -p 9999 -U pgpool postgres -c "select * from pg_replication_slots;"
Password for user pgpool:

slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | conflicting
-----------|--------|-----------|--------|----------|-----------|--------|------------|------|--------------|-------------|---------------------|------------|---------------|-----------|-------------
db03 | | physical | | | f | t | 4195 | | | 0/16000148 | | reserved | | f |
db01 | | physical | | | f | t | 4580 | | | 0/16000148 | | reserved | | f |
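
To double-check from the recovered node itself, connect to it directly (bypassing pgpool) and confirm it is running as a standby; pg_is_in_recovery() should return t:

psql -h db01 -p 5432 -U postgres postgres -c "select pg_is_in_recovery();"
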
3- Reattach the node to pgpool:

pcp_attach_node -n 0 -U pgpool -W

psql -h 192.168.122.20 -p 9999 -U pgpool postgres -c "show pool_nodes"

node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------|----------|------|--------|-----------|-----------|---------|---------|------------|-------------------|-------------------|-------------------|------------------------|---------------------
0 | db01 | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2024-08-17 18:13:32
1 | db02 | 5432 | up | up | 0.333333 | primary | primary | 5 | true | 0 | | | 2024-08-17 17:33:50
2 | db03 | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2024-08-17 17:43:36

This way the cluster is resynced again.