I have to solve the PostgreSQL HA and Redundancy few weeks ago. It has been written a lot about this topic, but I was not able to find some guide describing pgpool-II and repmgr. After reading some documents I build the solution which I'm going to describe.
In short it contains the Master/Slave DB Streaming replication and pgpool load distribution and HA. The replication "part" is managed by repmgr.
Here is the network diagram:
- Master PostgreSQL database installation - cz01-psql01:
- Slave PostgreSQL database installation - cz01-psql02:
- pgpool server installation (common for primary/secondary node) - cz01-pgpool0{1,2}:
- Primary pgpool server installation - cz01-pgpool01:
- Secondary pgpool server installation - cz01-pgpool02:
Now the all 4 server should be configured according the picture mentioned above. To be sure everything is working properly I decided to do various tests by stopping/starting the databases to see how all components are ready for outages. In the next pare there will be a lot of outputs of logs+commands which can be handy for troubleshooting in the future and which will test the proper configuration.
All the files modified and used for the configuration above can be found here.
Testing
Check the cluster status
cz01-pgpool02 ~ # ssh postgres@cz01-psql02.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show" Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. [2014-10-23 14:39:40] [INFO] repmgr connecting to database Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf Role | Connection String * master | host=cz01-psql01.example.com user=repmgr dbname=repmgr standby | host=cz01-psql02.example.com user=repmgr dbname=repmgr cz01-pgpool02 ~ # pcp_node_count 1 localhost 9898 admin password123 2 cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0 cz01-psql01.example.com 5432 1 0.500000 cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1 cz01-psql02.example.com 5432 1 0.500000
Check if the replication is working
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -c "create database mydb" CREATE DATABASE cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -l | grep mydb mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 | cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql01.example.com -w -l | grep mydb mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 | cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql02.example.com -w -l | grep mydb mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
Check what is the primapry pgpool (it has 2 IPs)
cz01-pgpool02 ~ # ssh -q cz01-pgpool01 "ip a s" 1: lo:mtu 16436 qdisc noqueue state UNKNOWN link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo 2: eth0: mtu 1500 qdisc pfifo_fast state UP qlen 1000 link/ether 00:50:56:91:0a:86 brd ff:ff:ff:ff:ff:ff inet 10.32.243.157/25 brd 10.32.243.255 scope global eth0 inet 10.32.243.250/24 brd 10.32.243.255 scope global eth0:0
Stop the Master DB
cz01-pgpool02 ~ # date && ssh root@cz01-psql01.example.com "service postgresql-9.3 stop" Thu Oct 23 14:43:09 CEST 2014 Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts. Stopping postgresql-9.3 service: [ OK ]
- The pgpool is monitoring both master and slave databases if they are responding. If one of them is not responding the pgpool executes the "failover_stream.sh" file. This script is responsible for promoting the slave to be a new master. The result is that the read-only slave will become read/write master. In the diagram below I used the red colour to see the changes which were done when slave was promoted to master.
pgpool01 logs right after the master was stopped
cz01-pgpool01 ~ # cat /var/log/local0 ... 2014-10-23T14:43:11.547651+02:00 cz01-pgpool01 pgpool[23301]: connect_inet_domain_socket: getsockopt() detected error: Connection refused 2014-10-23T14:43:11.547678+02:00 cz01-pgpool01 pgpool[23301]: make_persistent_db_connection: connection to cz01-psql01.example.com(5432) failed 2014-10-23T14:43:11.562642+02:00 cz01-pgpool01 pgpool[23301]: check_replication_time_lag: could not connect to DB node 0, check sr_check_user and sr_check_password 2014-10-23T14:43:12.327080+02:00 cz01-pgpool01 pgpool[23257]: connect_inet_domain_socket: getsockopt() detected error: Connection refused 2014-10-23T14:43:12.327127+02:00 cz01-pgpool01 pgpool[23257]: make_persistent_db_connection: connection to cz01-psql01.example.com(5432) failed 2014-10-23T14:43:12.332564+02:00 cz01-pgpool01 pgpool[23257]: connect_inet_domain_socket: getsockopt() detected error: Connection refused 2014-10-23T14:43:12.332661+02:00 cz01-pgpool01 pgpool[23257]: make_persistent_db_connection: connection to cz01-psql01.example.com(5432) failed 2014-10-23T14:43:12.332740+02:00 cz01-pgpool01 pgpool[23257]: health check failed. 0 th host cz01-psql01.example.com at port 5432 is down 2014-10-23T14:43:12.332822+02:00 cz01-pgpool01 pgpool[23257]: set 0 th backend down status 2014-10-23T14:43:12.332920+02:00 cz01-pgpool01 pgpool[23257]: wd_start_interlock: start interlocking 2014-10-23T14:43:12.348543+02:00 cz01-pgpool01 pgpool[23257]: wd_assume_lock_holder: become a new lock holder 2014-10-23T14:43:12.372682+02:00 cz01-pgpool01 pgpool[23264]: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received but lock holder exists already 2014-10-23T14:43:13.369470+02:00 cz01-pgpool01 pgpool[23257]: starting degeneration. shutdown host cz01-psql01.example.com(5432) 2014-10-23T14:43:13.369521+02:00 cz01-pgpool01 pgpool[23257]: Restart all children 2014-10-23T14:43:13.369544+02:00 cz01-pgpool01 pgpool[23257]: execute command: /etc/pgpool-II-93/failover_stream.sh 0 cz01-psql02.example.com 2014-10-23T14:43:15.916452+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node_repeatedly: waiting for finding a primary node 2014-10-23T14:43:15.933033+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node: primary node id is 1 2014-10-23T14:43:15.937969+02:00 cz01-pgpool01 pgpool[23257]: wd_end_interlock: end interlocking 2014-10-23T14:43:16.963481+02:00 cz01-pgpool01 pgpool[23257]: failover: set new primary node: 1 2014-10-23T14:43:16.963534+02:00 cz01-pgpool01 pgpool[23257]: failover: set new master node: 1 2014-10-23T14:43:17.051441+02:00 cz01-pgpool01 pgpool[23301]: worker process received restart request 2014-10-23T14:43:17.055720+02:00 cz01-pgpool01 pgpool[23257]: failover done. shutdown host cz01-psql01.example.com(5432) 2014-10-23T14:43:18.059487+02:00 cz01-pgpool01 pgpool[23300]: pcp child process received restart request 2014-10-23T14:43:18.064463+02:00 cz01-pgpool01 pgpool[23257]: PCP child 23300 exits with status 256 in failover() 2014-10-23T14:43:18.064493+02:00 cz01-pgpool01 pgpool[23257]: fork a new PCP child pid 26164 in failover() 2014-10-23T14:43:18.064499+02:00 cz01-pgpool01 pgpool[23257]: worker child 23301 exits with status 256 2014-10-23T14:43:18.065907+02:00 cz01-pgpool01 pgpool[23257]: fork a new worker child pid 26165
psql01 (masted db) logs right after the master was stopped
cz01-psql01 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log ... 2014-10-23 14:43:10 CEST [18254]: [6-1] user=,db= LOG: received fast shutdown request 2014-10-23 14:43:10 CEST [18254]: [7-1] user=,db= LOG: aborting any active transactions 2014-10-23 14:43:10 CEST [24392]: [13-1] user=admin,db=postgres FATAL: terminating connection due to administrator command 2014-10-23 14:43:10 CEST [24392]: [14-1] user=admin,db=postgres LOG: disconnection: session time: 0:03:01.366 user=admin database=postgres host=10.32.243.157 port=53814 2014-10-23 14:43:10 CEST [24386]: [13-1] user=admin,db=postgres FATAL: terminating connection due to administrator command 2014-10-23 14:43:10 CEST [24386]: [14-1] user=admin,db=postgres LOG: disconnection: session time: 0:03:08.732 user=admin database=postgres host=10.32.243.157 port=53812 2014-10-23 14:43:10 CEST [18266]: [2-1] user=,db= LOG: autovacuum launcher shutting down 2014-10-23 14:43:10 CEST [18263]: [27-1] user=,db= LOG: shutting down 2014-10-23 14:43:10 CEST [18263]: [28-1] user=,db= LOG: checkpoint starting: shutdown immediate 2014-10-23 14:43:10 CEST [18263]: [29-1] user=,db= LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.002 s, total=0.479 s; sync files=1, longest=0.002 s, average=0.002 s 2014-10-23 14:43:10 CEST [18263]: [30-1] user=,db= LOG: database system is shut down 2014-10-23 14:43:11 CEST [18438]: [3-1] user=repmgr,db=[unknown] LOG: disconnection: session time: 0:48:37.268 user=repmgr database= host=10.32.243.148 port=50909
psql02 (slave db) logs right after the master was stopped
cz01-psql02 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log ... 2014-10-23 14:43:11 CEST [18031]: [2-1] user=,db= LOG: replication terminated by primary server 2014-10-23 14:43:11 CEST [18031]: [3-1] user=,db= DETAIL: End of WAL reached on timeline 1 at 0/61000090. 2014-10-23 14:43:11 CEST [18031]: [4-1] user=,db= FATAL: could not send end-of-streaming message to primary: no COPY in progress 2014-10-23 14:43:11 CEST [18030]: [5-1] user=,db= LOG: record with zero length at 0/61000090 2014-10-23 14:43:11 CEST [24120]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.157 port=52991 2014-10-23 14:43:11 CEST [24120]: [2-1] user=admin,db=postgres LOG: connection authorized: user=admin database=postgres 2014-10-23 14:43:11 CEST [24120]: [3-1] user=admin,db=postgres LOG: disconnection: session time: 0:00:00.010 user=admin database=postgres host=10.32.243.157 port=52991 2014-10-23 14:43:13 CEST [24121]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.158 port=53841 2014-10-23 14:43:13 CEST [24121]: [2-1] user=admin,db=postgres LOG: connection authorized: user=admin database=postgres 2014-10-23 14:43:13 CEST [24121]: [3-1] user=admin,db=postgres LOG: disconnection: session time: 0:00:00.009 user=admin database=postgres host=10.32.243.158 port=53841 2014-10-23 14:43:13 CEST [23435]: [9-1] user=admin,db=postgres LOG: disconnection: session time: 0:03:11.742 user=admin database=postgres host=10.32.243.157 port=52915 2014-10-23 14:43:13 CEST [23438]: [5-1] user=admin,db=postgres LOG: disconnection: session time: 0:03:04.381 user=admin database=postgres host=10.32.243.157 port=52917 2014-10-23 14:43:13 CEST [24129]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.148 port=54332 2014-10-23 14:43:13 CEST [24129]: [2-1] user=repmgr,db=repmgr LOG: connection authorized: user=repmgr database=repmgr 2014-10-23 14:43:13 CEST [24129]: [3-1] user=repmgr,db=repmgr LOG: statement: WITH pg_version(ver) AS (SELECT split_part(version(), ' ', 2)) SELECT split_part(ver, '.', 1), split_part(ver, '.', 2) FROM pg_version 2014-10-23 14:43:13 CEST [24129]: [4-1] user=repmgr,db=repmgr LOG: duration: 2.850 ms 2014-10-23 14:43:13 CEST [24129]: [5-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_is_in_recovery() 2014-10-23 14:43:13 CEST [24129]: [6-1] user=repmgr,db=repmgr LOG: duration: 0.339 ms 2014-10-23 14:43:13 CEST [24129]: [7-1] user=repmgr,db=repmgr LOG: statement: SELECT id, conninfo FROM "repmgr_pgsql_cluster".repl_nodes WHERE cluster = 'pgsql_cluster' and not witness 2014-10-23 14:43:13 CEST [24129]: [8-1] user=repmgr,db=repmgr LOG: duration: 2.634 ms 2014-10-23 14:43:13 CEST [24130]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.148 port=54334 2014-10-23 14:43:13 CEST [24130]: [2-1] user=repmgr,db=repmgr LOG: connection authorized: user=repmgr database=repmgr 2014-10-23 14:43:13 CEST [24130]: [3-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_is_in_recovery() 2014-10-23 14:43:13 CEST [24130]: [4-1] user=repmgr,db=repmgr LOG: duration: 1.347 ms 2014-10-23 14:43:13 CEST [24129]: [9-1] user=repmgr,db=repmgr LOG: statement: SELECT setting FROM pg_settings WHERE name = 'data_directory' 2014-10-23 14:43:13 CEST [24130]: [5-1] user=repmgr,db=repmgr LOG: disconnection: session time: 0:00:00.024 user=repmgr database=repmgr host=10.32.243.148 port=54334 2014-10-23 14:43:13 CEST [24129]: [10-1] user=repmgr,db=repmgr LOG: duration: 4.954 ms 2014-10-23 14:43:13 CEST [24129]: [11-1] user=repmgr,db=repmgr LOG: disconnection: session time: 0:00:00.067 user=repmgr database=repmgr host=10.32.243.148 port=54332 2014-10-23 14:43:13 CEST [18021]: [6-1] user=,db= LOG: received fast shutdown request 2014-10-23 14:43:13 CEST [18021]: [7-1] user=,db= LOG: aborting any active transactions 2014-10-23 14:43:13 CEST [18032]: [28-1] user=,db= LOG: shutting down 2014-10-23 14:43:13 CEST [18032]: [29-1] user=,db= LOG: restartpoint starting: shutdown immediate 2014-10-23 14:43:13 CEST [18032]: [30-1] user=,db= LOG: restartpoint complete: wrote 6 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.002 s, total=0.008 s; sync files=6, longest=0.000 s, average=0.000 s 2014-10-23 14:43:13 CEST [18032]: [31-1] user=,db= LOG: recovery restart point at 0/61000028 2014-10-23 14:43:13 CEST [18032]: [32-1] user=,db= DETAIL: last completed transaction was at log time 2014-10-23 14:40:08.829957+02 2014-10-23 14:43:13 CEST [18032]: [33-1] user=,db= LOG: database system is shut down 2014-10-23 14:43:14 CEST [24141]: [1-1] user=,db= LOG: database system was shut down in recovery at 2014-10-23 14:43:13 CEST 2014-10-23 14:43:14 CEST [24141]: [2-1] user=,db= LOG: database system was not properly shut down; automatic recovery in progress 2014-10-23 14:43:14 CEST [24141]: [3-1] user=,db= LOG: consistent recovery state reached at 0/61000090 2014-10-23 14:43:14 CEST [24141]: [4-1] user=,db= LOG: record with zero length at 0/61000090 2014-10-23 14:43:14 CEST [24141]: [5-1] user=,db= LOG: redo is not required 2014-10-23 14:43:14 CEST [24141]: [6-1] user=,db= LOG: checkpoint starting: end-of-recovery immediate 2014-10-23 14:43:14 CEST [24141]: [7-1] user=,db= LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.000 s, total=0.005 s; sync files=0, longest=0.000 s, average=0.000 s 2014-10-23 14:43:14 CEST [24145]: [1-1] user=,db= LOG: autovacuum launcher started 2014-10-23 14:43:14 CEST [24133]: [5-1] user=,db= LOG: database system is ready to accept connections 2014-10-23 14:43:15 CEST [24148]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=[local] 2014-10-23 14:43:15 CEST [24148]: [2-1] user=postgres,db=postgres LOG: connection authorized: user=postgres database=postgres 2014-10-23 14:43:15 CEST [24148]: [3-1] user=postgres,db=postgres LOG: disconnection: session time: 0:00:00.019 user=postgres database=postgres host=[local] 2014-10-23 14:43:15 CEST [24149]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.148 port=54335 2014-10-23 14:43:15 CEST [24149]: [2-1] user=repmgr,db=repmgr LOG: connection authorized: user=repmgr database=repmgr 2014-10-23 14:43:15 CEST [24149]: [3-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_is_in_recovery() 2014-10-23 14:43:15 CEST [24149]: [4-1] user=repmgr,db=repmgr LOG: duration: 2.252 ms 2014-10-23 14:43:15 CEST [24149]: [5-1] user=repmgr,db=repmgr LOG: disconnection: session time: 0:00:00.030 user=repmgr database=repmgr host=10.32.243.148 port=54335
failover_stream.sh output log created once the script was executed on the primary pgpool
cz01-pgpool01 / # cat /tmp/failover_stream.sh.log Thu Oct 23 14:43:13 CEST 2014 Failed node: 0 + /usr/bin/ssh -T -l postgres cz01-psql02.example.com '/usr/pgsql-9.3/bin/repmgr -f /var/lib/pgsql/repmgr/repmgr.conf standby promote 2>/dev/null 1>/dev/null <&-' Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. + exit 0
Check the cluster status
cz01-pgpool02 ~ # ssh postgres@cz01-psql02.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show" Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. [2014-10-23 14:49:07] [INFO] repmgr connecting to database [2014-10-23 14:49:07] [ERROR] Connection to database failed: could not connect to server: Connection refused Is the server running on host "cz01-psql01.example.com" (10.32.243.147) and accepting TCP/IP connections on port 5432? Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf Role | Connection String FAILED | host=cz01-psql01.example.com user=repmgr dbname=repmgr * master | host=cz01-psql02.example.com user=repmgr dbname=repmgr cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0 cz01-psql01.example.com 5432 3 0.500000 cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1 cz01-psql02.example.com 5432 1 0.500000
Check if everything is working
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -c "drop database mydb" DROP DATABASE cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -l | grep mydb cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql02.example.com -w -l | grep mydb
Make the "old master" to be a "new slave" cz01-psql01
cz01-pgpool02 ~ # ssh cz01-psql01.example.com 'service postgresql-9.3 stop; su - postgres -c "repmgr -D /var/lib/pgsql/9.3/data -d repmgr -p 5432 -U repmgr -R postgres --verbose --force standby clone cz01-psql02.example.com"; service postgresql-9.3 start;' Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts. Stopping postgresql-9.3 service: [ OK ] [2014-10-23 14:52:08] [ERROR] Did not find the configuration file './repmgr.conf', continuing [2014-10-23 14:52:08] [NOTICE] repmgr Destination directory /var/lib/pgsql/9.3/data provided, try to clone everything in it. [2014-10-23 14:52:08] [INFO] repmgr connecting to master database [2014-10-23 14:52:08] [INFO] repmgr connected to master, checking its state [2014-10-23 14:52:09] [INFO] Successfully connected to primary. Current installation size is 188 MB Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. [2014-10-23 14:52:09] [NOTICE] Starting backup... [2014-10-23 14:52:09] [WARNING] directory "/var/lib/pgsql/9.3/data" exists but is not empty [2014-10-23 14:52:09] [INFO] standby clone: master control file '/var/lib/pgsql/9.3/data/global/pg_control' [2014-10-23 14:52:09] [INFO] standby clone: master control file '/var/lib/pgsql/9.3/data/global/pg_control' [2014-10-23 14:52:09] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql02.example.com:/var/lib/pgsql/9.3/data/global/pg_control /var/lib/pgsql/9.3/data/global' Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. receiving incremental file list pg_control 8192 100% 7.81MB/s 0:00:00 (xfer#1, to-check=0/1) sent 102 bytes received 236 bytes 676.00 bytes/sec total size is 8192 speedup is 24.24 [2014-10-23 14:52:09] [INFO] standby clone: master data directory '/var/lib/pgsql/9.3/data' [2014-10-23 14:52:09] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete --exclude=pg_xlog* --exclude=pg_log* --exclude=pg_control --exclude=*.pid postgres@cz01-psql02.example.com:/var/lib/pgsql/9.3/data/* /var/lib/pgsql/9.3/data' Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. receiving incremental file list deleting base/16528/pg_filenode.map deleting base/16528/PG_VERSION deleting base/16528/12890 deleting base/16528/12888 ... ... ... pg_stat_tmp/db_16413.stat 6089 100% 849.47kB/s 0:00:00 (xfer#16, to-check=2/1542) pg_stat_tmp/global.stat 1026 100% 125.24kB/s 0:00:00 (xfer#17, to-check=1/1542) sent 3810 bytes received 94096 bytes 39162.40 bytes/sec total size is 198017139 speedup is 2022.52 [2014-10-23 14:52:11] [INFO] standby clone: master config file '/var/lib/pgsql/9.3/data/postgresql.conf' [2014-10-23 14:52:11] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql02.example.com:/var/lib/pgsql/9.3/data/postgresql.conf /var/lib/pgsql/9.3/data' Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. receiving incremental file list sent 11 bytes received 80 bytes 60.67 bytes/sec total size is 20561 speedup is 225.95 [2014-10-23 14:52:12] [INFO] standby clone: master hba file '/var/lib/pgsql/9.3/data/pg_hba.conf' [2014-10-23 14:52:12] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql02.example.com:/var/lib/pgsql/9.3/data/pg_hba.conf /var/lib/pgsql/9.3/data' Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. receiving incremental file list sent 11 bytes received 76 bytes 174.00 bytes/sec total size is 4812 speedup is 55.31 [2014-10-23 14:52:12] [INFO] standby clone: master ident file '/var/lib/pgsql/9.3/data/pg_ident.conf' [2014-10-23 14:52:12] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql02.example.com:/var/lib/pgsql/9.3/data/pg_ident.conf /var/lib/pgsql/9.3/data' Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. receiving incremental file list sent 11 bytes received 78 bytes 178.00 bytes/sec total size is 1636 speedup is 18.38 [2014-10-23 14:52:12] [NOTICE] Finishing backup... NOTICE: pg_stop_backup complete, all required WAL segments have been archived [2014-10-23 14:52:13] [INFO] repmgr requires primary to keep WAL files 000000010000000000000062 until at least 000000010000000000000062 [2014-10-23 14:52:13] [NOTICE] repmgr standby clone complete [2014-10-23 14:52:13] [NOTICE] HINT: You can now start your postgresql server [2014-10-23 14:52:13] [NOTICE] for example : pg_ctl -D /var/lib/pgsql/9.3/data start Opening configuration file: ./repmgr.conf Starting postgresql-9.3 service: [ OK ]
- Once the slave was promoted to the new master, the DB administrator should check what exactly happen to the failed master database. Once the problem is being analyzed it's necessary not to start the "failed" database but make it working as slave. There is currently master db therefore every other database needs to be the slave one. The command above setup a new slave with data replication from the running master. The pgpool needs to be notified as well, that the "new" slave is up and running ready for read only queries (the commands will follow). The diagram below is showing the current status and the red color is showing the additional changes.
Logs right after the new slave (cz01-psql01) was configured+started
cz01-psql01 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log ... 2014-10-23 14:52:14 CEST [26707]: [1-1] user=,db= LOG: database system was interrupted; last known up at 2014-10-23 14:52:09 CEST 2014-10-23 14:52:14 CEST [26707]: [2-1] user=,db= LOG: entering standby mode 2014-10-23 14:52:14 CEST [26708]: [1-1] user=,db= LOG: started streaming WAL from primary at 0/62000000 on timeline 1 2014-10-23 14:52:14 CEST [26707]: [3-1] user=,db= LOG: redo starts at 0/62000028 2014-10-23 14:52:14 CEST [26707]: [4-1] user=,db= LOG: consistent recovery state reached at 0/620000F0 2014-10-23 14:52:14 CEST [26698]: [5-1] user=,db= LOG: database system is ready to accept read only connections
Logs right after the new slave (cz01-psql01) was configured+started
cz01-psql02 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log ... 2014-10-23 14:52:08 CEST [25481]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.147 port=52477 2014-10-23 14:52:08 CEST [25481]: [2-1] user=repmgr,db=repmgr LOG: connection authorized: user=repmgr database=repmgr 2014-10-23 14:52:08 CEST [25481]: [3-1] user=repmgr,db=repmgr LOG: statement: WITH pg_version(ver) AS (SELECT split_part(version(), ' ', 2)) SELECT split_part(ver, '.', 1), split_part(ver, '.', 2) FROM pg_version 2014-10-23 14:52:08 CEST [25481]: [4-1] user=repmgr,db=repmgr LOG: duration: 2.790 ms 2014-10-23 14:52:08 CEST [25481]: [5-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_is_in_recovery() 2014-10-23 14:52:08 CEST [25481]: [6-1] user=repmgr,db=repmgr LOG: duration: 0.451 ms 2014-10-23 14:52:08 CEST [25481]: [7-1] user=repmgr,db=repmgr LOG: statement: SELECT true FROM pg_settings WHERE name = 'wal_level' AND setting = 'hot_standby' 2014-10-23 14:52:09 CEST [25481]: [8-1] user=repmgr,db=repmgr LOG: duration: 7.175 ms 2014-10-23 14:52:09 CEST [25481]: [9-1] user=repmgr,db=repmgr LOG: statement: SELECT true FROM pg_settings WHERE name = 'wal_keep_segments' AND setting::integer >= '5000'::integer 2014-10-23 14:52:09 CEST [25481]: [10-1] user=repmgr,db=repmgr LOG: duration: 3.971 ms 2014-10-23 14:52:09 CEST [25481]: [11-1] user=repmgr,db=repmgr LOG: statement: SELECT true FROM pg_settings WHERE name = 'archive_mode' AND setting = 'on' 2014-10-23 14:52:09 CEST [25481]: [12-1] user=repmgr,db=repmgr LOG: duration: 3.191 ms 2014-10-23 14:52:09 CEST [25481]: [13-1] user=repmgr,db=repmgr LOG: statement: SELECT true FROM pg_settings WHERE name = 'hot_standby' AND setting = 'on' 2014-10-23 14:52:09 CEST [25481]: [14-1] user=repmgr,db=repmgr LOG: duration: 3.147 ms 2014-10-23 14:52:09 CEST [25481]: [15-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_tablespace_location(oid) spclocation FROM pg_tablespace WHERE spcname NOT IN ('pg_default', 'pg_global') 2014-10-23 14:52:09 CEST [25481]: [16-1] user=repmgr,db=repmgr LOG: duration: 2.541 ms 2014-10-23 14:52:09 CEST [25481]: [17-1] user=repmgr,db=repmgr LOG: statement: SELECT name, setting FROM pg_settings WHERE name IN ('data_directory', 'config_file', 'hba_file', 'ident_file', 'stats_temp_directory') 2014-10-23 14:52:09 CEST [25481]: [18-1] user=repmgr,db=repmgr LOG: duration: 3.842 ms 2014-10-23 14:52:09 CEST [25481]: [19-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_size_pretty(SUM(pg_database_size(oid))::bigint) FROM pg_database 2014-10-23 14:52:09 CEST [25481]: [20-1] user=repmgr,db=repmgr LOG: duration: 16.913 ms 2014-10-23 14:52:09 CEST [25481]: [21-1] user=repmgr,db=repmgr LOG: statement: SET synchronous_commit TO OFF 2014-10-23 14:52:09 CEST [25481]: [22-1] user=repmgr,db=repmgr LOG: duration: 0.282 ms 2014-10-23 14:52:09 CEST [25481]: [23-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_xlogfile_name(pg_start_backup('repmgr_standby_clone_1414068729')) 2014-10-23 14:52:09 CEST [24142]: [3-1] user=,db= LOG: checkpoint starting: force wait 2014-10-23 14:52:09 CEST [25489]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.158 port=53963 2014-10-23 14:52:09 CEST [25489]: [2-1] user=admin,db=postgres LOG: connection authorized: user=admin database=postgres 2014-10-23 14:52:09 CEST [25489]: [3-1] user=admin,db=postgres LOG: disconnection: session time: 0:00:00.008 user=admin database=postgres host=10.32.243.158 port=53963 2014-10-23 14:52:09 CEST [25490]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.157 port=53120 2014-10-23 14:52:09 CEST [25490]: [2-1] user=admin,db=template1 LOG: connection authorized: user=admin database=template1 2014-10-23 14:52:09 CEST [25490]: [3-1] user=admin,db=template1 LOG: disconnection: session time: 0:00:00.008 user=admin database=template1 host=10.32.243.157 port=53120 2014-10-23 14:52:09 CEST [24142]: [4-1] user=,db= LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.000 s, total=0.391 s; sync files=1, longest=0.000 s, average=0.000 s 2014-10-23 14:52:09 CEST [25481]: [24-1] user=repmgr,db=repmgr LOG: duration: 403.777 ms 2014-10-23 14:52:11 CEST [25481]: [25-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_tablespace_location(oid) spclocation FROM pg_tablespace WHERE spcname NOT IN ('pg_default', 'pg_global') 2014-10-23 14:52:11 CEST [25481]: [26-1] user=repmgr,db=repmgr LOG: duration: 0.514 ms 2014-10-23 14:52:12 CEST [25481]: [27-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_xlogfile_name(pg_stop_backup()) 2014-10-23 14:52:13 CEST [25481]: [28-1] user=repmgr,db=repmgr LOG: duration: 1005.802 ms 2014-10-23 14:52:13 CEST [25481]: [29-1] user=repmgr,db=repmgr LOG: disconnection: session time: 0:00:04.406 user=repmgr database=repmgr host=10.32.243.147 port=52477 2014-10-23 14:52:14 CEST [25524]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.147 port=52484 2014-10-23 14:52:14 CEST [25524]: [2-1] user=repmgr,db=[unknown] LOG: replication connection authorized: user=repmgr
There is nothing new in pgpool after the new slave (cz01-psql01) was configured+started
cz01-pgpool01 / # tail -f /var/log/local0
Check the cluster status
cz01-pgpool02 ~ # ssh postgres@cz01-psql02.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show" Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. [2014-10-23 14:58:23] [INFO] repmgr connecting to database Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf Role | Connection String standby | host=cz01-psql01.example.com user=repmgr dbname=repmgr * master | host=cz01-psql02.example.com user=repmgr dbname=repmgr cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0 cz01-psql01.example.com 5432 3 0.500000 cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1 cz01-psql02.example.com 5432 1 0.500000
Check if everything is working
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -c "create database mydb" CREATE DATABASE cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -l | grep mydb mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 | cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql01.example.com -w -l | grep mydb mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 | cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql02.example.com -w -l | grep mydb mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
Reinitialize the slave in pgpool to be ready for read only queries
cz01-pgpool02 ~ # pcp_detach_node 0 localhost 9898 admin password123 0 cz01-pgpool02 ~ # pcp_attach_node 0 localhost 9898 admin password123 0
logfile right after the slave was enabled for RO queries
cz01-pgpool01 / # cat /var/log/local0 ... 2014-10-23T15:55:38.440946+02:00 cz01-pgpool01 pgpool[23264]: send_failback_request: fail back 0 th node request from pid 23264 2014-10-23T15:55:38.442034+02:00 cz01-pgpool01 pgpool[23257]: wd_start_interlock: start interlocking 2014-10-23T15:55:38.455732+02:00 cz01-pgpool01 pgpool[23257]: wd_assume_lock_holder: become a new lock holder 2014-10-23T15:55:38.459908+02:00 cz01-pgpool01 pgpool[23264]: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received but lock holder exists already 2014-10-23T15:55:38.963952+02:00 cz01-pgpool01 pgpool[23257]: starting fail back. reconnect host cz01-psql01.example.com(5432) 2014-10-23T15:55:38.970549+02:00 cz01-pgpool01 pgpool[23257]: Do not restart children because we are failbacking node id 0 hostcz01-psql01.example.com port:5432 and we are in streaming replication mode 2014-10-23T15:55:38.974760+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node_repeatedly: waiting for finding a primary node 2014-10-23T15:55:39.018890+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node: primary node id is 1 2014-10-23T15:55:39.024830+02:00 cz01-pgpool01 pgpool[23257]: wd_end_interlock: end interlocking 2014-10-23T15:55:40.048487+02:00 cz01-pgpool01 pgpool[23257]: failover: set new primary node: 1 2014-10-23T15:55:40.048514+02:00 cz01-pgpool01 pgpool[23257]: failover: set new master node: 0 2014-10-23T15:55:40.048520+02:00 cz01-pgpool01 pgpool[23257]: failback done. reconnect host cz01-psql01.example.com(5432) 2014-10-23T15:55:40.050908+02:00 cz01-pgpool01 pgpool[26165]: worker process received restart request 2014-10-23T15:55:41.051525+02:00 cz01-pgpool01 pgpool[26164]: pcp child process received restart request 2014-10-23T15:55:41.056496+02:00 cz01-pgpool01 pgpool[23257]: PCP child 26164 exits with status 256 in failover() 2014-10-23T15:55:41.056543+02:00 cz01-pgpool01 pgpool[23257]: fork a new PCP child pid 1392 in failover() 2014-10-23T15:55:41.056565+02:00 cz01-pgpool01 pgpool[23257]: worker child 26165 exits with status 256 2014-10-23T15:55:41.057839+02:00 cz01-pgpool01 pgpool[23257]: fork a new worker child pid 1393
Check pgpool stratus the slave should have a good value now
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0 cz01-psql01.example.com 5432 1 0.500000 cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1 cz01-psql02.example.com 5432 1 0.500000
Check if everything is working fine
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -c "drop database mydb" DROP DATABASE cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -l | grep mydb cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql01.example.com -w -l | grep mydb cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql02.example.com -w -l | grep mydb
Stop the master DB (original slave) - the new master should be promoted from slave automatically
cz01-pgpool02 ~ # ssh cz01-psql02.example.com 'service postgresql-9.3 stop' Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. Stopping postgresql-9.3 service: [ OK ]
- Here is another example what may happen if original slave server, promoted to new master fails. Again the slave is automatically promoted by pgpool to the new master and "original master" (later slave) is master again. Changes are using blue color in the diagram below.
logs right after the master (original slave) was stopped
cz01-psql02 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log ... 2014-10-23 16:01:30 CEST [24133]: [6-1] user=,db= LOG: received fast shutdown request 2014-10-23 16:01:30 CEST [24133]: [7-1] user=,db= LOG: aborting any active transactions 2014-10-23 16:01:30 CEST [1991]: [7-1] user=admin,db=postgres FATAL: terminating connection due to administrator command 2014-10-23 16:01:30 CEST [1991]: [8-1] user=admin,db=postgres LOG: disconnection: session time: 0:02:13.805 user=admin database=postgres host=10.32.243.157 port=53988 2014-10-23 16:01:30 CEST [1986]: [7-1] user=admin,db=postgres FATAL: terminating connection due to administrator command 2014-10-23 16:01:30 CEST [1986]: [8-1] user=admin,db=postgres LOG: disconnection: session time: 0:02:19.271 user=admin database=postgres host=10.32.243.157 port=53982 2014-10-23 16:01:30 CEST [24145]: [2-1] user=,db= LOG: autovacuum launcher shutting down 2014-10-23 16:01:30 CEST [24142]: [37-1] user=,db= LOG: shutting down 2014-10-23 16:01:30 CEST [24142]: [38-1] user=,db= LOG: checkpoint starting: shutdown immediate 2014-10-23 16:01:30 CEST [24142]: [39-1] user=,db= LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.008 s, total=0.449 s; sync files=1, longest=0.008 s, average=0.008 s 2014-10-23 16:01:30 CEST [24142]: [40-1] user=,db= LOG: database system is shut down 2014-10-23 16:01:31 CEST [25524]: [3-1] user=repmgr,db=[unknown] LOG: disconnection: session time: 1:09:17.350 user=repmgr database= host=10.32.243.147 port=52484
logs right after the master (original slave) was stopped
cz01-pgpool01 / # cat /var/log/local0 ... 2014-10-23T16:01:33.199817+02:00 cz01-pgpool01 pgpool[23257]: connect_inet_domain_socket: getsockopt() detected error: Connection refused 2014-10-23T16:01:33.200803+02:00 cz01-pgpool01 pgpool[23257]: make_persistent_db_connection: connection to cz01-psql02.example.com(5432) failed 2014-10-23T16:01:33.200989+02:00 cz01-pgpool01 pgpool[23257]: health check failed. 1 th host cz01-psql02.example.com at port 5432 is down 2014-10-23T16:01:33.201148+02:00 cz01-pgpool01 pgpool[23257]: set 1 th backend down status 2014-10-23T16:01:33.201280+02:00 cz01-pgpool01 pgpool[23257]: wd_start_interlock: start interlocking 2014-10-23T16:01:33.207742+02:00 cz01-pgpool01 pgpool[23264]: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received it 2014-10-23T16:01:33.213943+02:00 cz01-pgpool01 pgpool[23264]: wd_send_response: failover request from other pgpool is canceled because it's while switching 2014-10-23T16:01:33.229107+02:00 cz01-pgpool01 pgpool[1393]: connect_inet_domain_socket: getsockopt() detected error: Connection refused 2014-10-23T16:01:33.229133+02:00 cz01-pgpool01 pgpool[1393]: make_persistent_db_connection: connection to cz01-psql02.example.com(5432) failed 2014-10-23T16:01:33.233566+02:00 cz01-pgpool01 pgpool[1393]: check_replication_time_lag: could not connect to DB node 1, check sr_check_user and sr_check_password 2014-10-23T16:01:33.716568+02:00 cz01-pgpool01 pgpool[23257]: starting degeneration. shutdown host cz01-psql02.example.com(5432) 2014-10-23T16:01:33.716597+02:00 cz01-pgpool01 pgpool[23257]: Restart all children 2014-10-23T16:01:36.720365+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node_repeatedly: waiting for finding a primary node 2014-10-23T16:01:36.735887+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node: primary node id is 0 2014-10-23T16:01:36.736038+02:00 cz01-pgpool01 pgpool[23257]: wd_end_interlock: end interlocking 2014-10-23T16:01:37.243499+02:00 cz01-pgpool01 pgpool[23257]: failover: set new primary node: 0 2014-10-23T16:01:37.243536+02:00 cz01-pgpool01 pgpool[23257]: failover: set new master node: 0 2014-10-23T16:01:37.332464+02:00 cz01-pgpool01 pgpool[1393]: worker process received restart request 2014-10-23T16:01:37.335636+02:00 cz01-pgpool01 pgpool[23257]: failover done. shutdown host cz01-psql02.example.com(5432) 2014-10-23T16:01:38.338656+02:00 cz01-pgpool01 pgpool[1392]: pcp child process received restart request 2014-10-23T16:01:38.345478+02:00 cz01-pgpool01 pgpool[23257]: PCP child 1392 exits with status 256 in failover() 2014-10-23T16:01:38.345510+02:00 cz01-pgpool01 pgpool[23257]: fork a new PCP child pid 2517 in failover() 2014-10-23T16:01:38.345516+02:00 cz01-pgpool01 pgpool[23257]: worker child 1393 exits with status 256 2014-10-23T16:01:38.345899+02:00 cz01-pgpool01 pgpool[23257]: fork a new worker child pid 2518
logs right after the master (original slave) was stopped
cz01-psql01 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log ... 2014-10-23 16:01:31 CEST [26708]: [2-1] user=,db= LOG: replication terminated by primary server 2014-10-23 16:01:31 CEST [26708]: [3-1] user=,db= DETAIL: End of WAL reached on timeline 1 at 0/64000090. 2014-10-23 16:01:31 CEST [26708]: [4-1] user=,db= FATAL: could not send end-of-streaming message to primary: no COPY in progress 2014-10-23 16:01:31 CEST [26707]: [5-1] user=,db= LOG: record with zero length at 0/64000090
Check status
cz01-pgpool02 ~ # ssh -q postgres@cz01-psql01.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show" [2014-10-23 16:05:16] [INFO] repmgr connecting to database Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf Role | Connection String * master | host=cz01-psql01.example.com user=repmgr dbname=repmgr FAILED | host=cz01-psql02.example.com user=repmgr dbname=repmgr [2014-10-23 16:05:16] [ERROR] Connection to database failed: could not connect to server: Connection refused Is the server running on host "cz01-psql02.example.com" (10.32.243.148) and accepting TCP/IP connections on port 5432?
Check the functionality
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -c "drop database mydb" DROP DATABASE cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -l | grep mydb cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql01.example.com -w -l | grep mydb
Reinitialize the stopped master (original slave) to be slave again - original state
cz01-pgpool02 ~ # ssh cz01-psql02.example.com 'service postgresql-9.3 stop; su - postgres -c "repmgr -D /var/lib/pgsql/9.3/data -d repmgr -p 5432 -U repmgr -R postgres --verbose --force standby clone cz01-psql01.example.com"; service postgresql-9.3 start;' Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts. Stopping postgresql-9.3 service: [ OK ] [2014-10-23 16:07:11] [ERROR] Did not find the configuration file './repmgr.conf', continuing [2014-10-23 16:07:11] [NOTICE] repmgr Destination directory /var/lib/pgsql/9.3/data provided, try to clone everything in it. [2014-10-23 16:07:11] [INFO] repmgr connecting to master database [2014-10-23 16:07:11] [INFO] repmgr connected to master, checking its state [2014-10-23 16:07:11] [INFO] Successfully connected to primary. Current installation size is 188 MB Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts. [2014-10-23 16:07:11] [NOTICE] Starting backup... [2014-10-23 16:07:11] [WARNING] directory "/var/lib/pgsql/9.3/data" exists but is not empty [2014-10-23 16:07:11] [INFO] standby clone: master control file '/var/lib/pgsql/9.3/data/global/pg_control' [2014-10-23 16:07:11] [INFO] standby clone: master control file '/var/lib/pgsql/9.3/data/global/pg_control' [2014-10-23 16:07:11] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql01.example.com:/var/lib/pgsql/9.3/data/global/pg_control /var/lib/pgsql/9.3/data/global' Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts. receiving incremental file list pg_control 8192 100% 7.81MB/s 0:00:00 (xfer#1, to-check=0/1) sent 102 bytes received 235 bytes 674.00 bytes/sec total size is 8192 speedup is 24.31 [2014-10-23 16:07:12] [INFO] standby clone: master data directory '/var/lib/pgsql/9.3/data' [2014-10-23 16:07:12] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete --exclude=pg_xlog* --exclude=pg_log* --exclude=pg_control --exclude=*.pid postgres@cz01-psql01.example.com:/var/lib/pgsql/9.3/data/* /var/lib/pgsql/9.3/data' Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts. receiving incremental file list backup_label 222 100% 216.80kB/s 0:00:00 (xfer#1, to-check=1239/1241) backup_label.old 222 100% 216.80kB/s 0:00:00 (xfer#2, to-check=1238/1241) recovery.done 102 100% 99.61kB/s 0:00:00 (xfer#3, to-check=1232/1241) base/ base/1/ base/1/pg_internal.init 116404 100% 155.93kB/s 0:00:00 (xfer#4, to-check=1225/1490) base/12896/ base/12896/pg_internal.init 116404 100% 153.41kB/s 0:00:00 (xfer#5, to-check=804/1542) base/16386/ base/16386/pg_internal.init 116404 100% 151.97kB/s 0:00:00 (xfer#6, to-check=559/1542) base/16413/ base/16413/pg_internal.init 116404 100% 150.56kB/s 0:00:00 (xfer#7, to-check=301/1542) deleting pg_stat/global.stat deleting pg_stat/db_16413.stat deleting pg_stat/db_16386.stat deleting pg_stat/db_12896.stat deleting pg_stat/db_1.stat deleting pg_stat/db_0.stat global/ global/12789 8192 100% 7.81MB/s 0:00:00 (xfer#8, to-check=30/1542) global/12791 16384 100% 15.62MB/s 0:00:00 (xfer#9, to-check=27/1542) global/12792 16384 100% 15.62MB/s 0:00:00 (xfer#10, to-check=26/1542) global/12892 8192 100% 7.81MB/s 0:00:00 (xfer#11, to-check=17/1542) global/12894 16384 100% 15.62MB/s 0:00:00 (xfer#12, to-check=16/1542) global/12895 16384 100% 5.21MB/s 0:00:00 (xfer#13, to-check=15/1542) global/pg_internal.init 12784 100% 2.44MB/s 0:00:00 (xfer#14, to-check=13/1542) pg_clog/0000 8192 100% 1.56MB/s 0:00:00 (xfer#15, to-check=12/1542) pg_notify/ pg_stat/ pg_stat_tmp/ pg_stat_tmp/db_0.stat 2540 100% 310.06kB/s 0:00:00 (xfer#16, to-check=6/1542) pg_stat_tmp/db_1.stat 1864 100% 227.54kB/s 0:00:00 (xfer#17, to-check=5/1542) pg_stat_tmp/db_12896.stat 3047 100% 371.95kB/s 0:00:00 (xfer#18, to-check=4/1542) pg_stat_tmp/db_16386.stat 4230 100% 458.98kB/s 0:00:00 (xfer#19, to-check=3/1542) pg_stat_tmp/db_16413.stat 6089 100% 660.70kB/s 0:00:00 (xfer#20, to-check=2/1542) pg_stat_tmp/global.stat 1026 100% 100.20kB/s 0:00:00 (xfer#21, to-check=1/1542) sent 5452 bytes received 93077 bytes 65686.00 bytes/sec total size is 198017139 speedup is 2009.73 [2014-10-23 16:07:13] [INFO] standby clone: master config file '/var/lib/pgsql/9.3/data/postgresql.conf' [2014-10-23 16:07:13] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql01.example.com:/var/lib/pgsql/9.3/data/postgresql.conf /var/lib/pgsql/9.3/data' Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts. receiving incremental file list sent 11 bytes received 80 bytes 60.67 bytes/sec total size is 20561 speedup is 225.95 [2014-10-23 16:07:14] [INFO] standby clone: master hba file '/var/lib/pgsql/9.3/data/pg_hba.conf' [2014-10-23 16:07:14] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql01.example.com:/var/lib/pgsql/9.3/data/pg_hba.conf /var/lib/pgsql/9.3/data' Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts. receiving incremental file list sent 11 bytes received 76 bytes 174.00 bytes/sec total size is 4812 speedup is 55.31 [2014-10-23 16:07:14] [INFO] standby clone: master ident file '/var/lib/pgsql/9.3/data/pg_ident.conf' [2014-10-23 16:07:14] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql01.example.com:/var/lib/pgsql/9.3/data/pg_ident.conf /var/lib/pgsql/9.3/data' Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts. receiving incremental file list sent 11 bytes received 78 bytes 178.00 bytes/sec total size is 1636 speedup is 18.38 [2014-10-23 16:07:14] [NOTICE] Finishing backup... NOTICE: pg_stop_backup complete, all required WAL segments have been archived [2014-10-23 16:07:15] [INFO] repmgr requires primary to keep WAL files 000000010000000000000065 until at least 000000010000000000000065 [2014-10-23 16:07:15] [NOTICE] repmgr standby clone complete [2014-10-23 16:07:15] [NOTICE] HINT: You can now start your postgresql server [2014-10-23 16:07:15] [NOTICE] for example : pg_ctl -D /var/lib/pgsql/9.3/data start Opening configuration file: ./repmgr.conf Starting postgresql-9.3 service: [ OK ]
- Again - after the db administrator find out the cause why master went down he needs to initialize the failed master as a slave (by running command above). Then everything is like before the testing - original master/slave state. The green color is used for showing up the changes in the diagram.
Check status
cz01-pgpool02 ~ # ssh -q postgres@cz01-psql02.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show" [2014-10-23 16:08:23] [INFO] repmgr connecting to database Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf Role | Connection String * master | host=cz01-psql01.example.com user=repmgr dbname=repmgr standby | host=cz01-psql02.example.com user=repmgr dbname=repmgr cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0 cz01-psql01.example.com 5432 1 0.500000 cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1 cz01-psql02.example.com 5432 3 0.500000
Reenable the slave to be able to receive read-only requests
cz01-pgpool02 ~ # pcp_detach_node 0 localhost 9898 admin password123 1 cz01-pgpool02 ~ # pcp_attach_node 0 localhost 9898 admin password123 1
Logs right after reenabling the slave again
cz01-pgpool01 / # cat /var/log/local0 ... 2014-10-23T16:09:06.675035+02:00 cz01-pgpool01 pgpool[23264]: send_failback_request: fail back 1 th node request from pid 23264 2014-10-23T16:09:06.675868+02:00 cz01-pgpool01 pgpool[23257]: wd_start_interlock: start interlocking 2014-10-23T16:09:06.691133+02:00 cz01-pgpool01 pgpool[23257]: wd_assume_lock_holder: become a new lock holder 2014-10-23T16:09:06.698706+02:00 cz01-pgpool01 pgpool[23264]: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received but lock holder exists already 2014-10-23T16:09:07.200639+02:00 cz01-pgpool01 pgpool[23257]: starting fail back. reconnect host cz01-psql02.example.com(5432) 2014-10-23T16:09:07.205234+02:00 cz01-pgpool01 pgpool[23257]: Do not restart children because we are failbacking node id 1 hostcz01-psql02.example.com port:5432 and we are in streaming replication mode 2014-10-23T16:09:07.209510+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node_repeatedly: waiting for finding a primary node 2014-10-23T16:09:07.225224+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node: primary node id is 0 2014-10-23T16:09:07.228223+02:00 cz01-pgpool01 pgpool[23257]: wd_end_interlock: end interlocking 2014-10-23T16:09:08.244875+02:00 cz01-pgpool01 pgpool[23257]: failover: set new primary node: 0 2014-10-23T16:09:08.244902+02:00 cz01-pgpool01 pgpool[23257]: failover: set new master node: 0 2014-10-23T16:09:08.244908+02:00 cz01-pgpool01 pgpool[23257]: failback done. reconnect host cz01-psql02.example.com(5432) 2014-10-23T16:09:08.247112+02:00 cz01-pgpool01 pgpool[2518]: worker process received restart request 2014-10-23T16:09:09.248473+02:00 cz01-pgpool01 pgpool[2517]: pcp child process received restart request 2014-10-23T16:09:09.252461+02:00 cz01-pgpool01 pgpool[23257]: PCP child 2517 exits with status 256 in failover() 2014-10-23T16:09:09.252492+02:00 cz01-pgpool01 pgpool[23257]: fork a new PCP child pid 3080 in failover() 2014-10-23T16:09:09.252499+02:00 cz01-pgpool01 pgpool[23257]: worker child 2518 exits with status 256 2014-10-23T16:09:09.253362+02:00 cz01-pgpool01 pgpool[23257]: fork a new worker child pid 3081
Logs right after reenabling the slave again
cz01-pgpool02 ~ # ssh -q postgres@cz01-psql02.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show" [2014-10-23 16:10:33] [INFO] repmgr connecting to database Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf Role | Connection String * master | host=cz01-psql01.example.com user=repmgr dbname=repmgr standby | host=cz01-psql02.example.com user=repmgr dbname=repmgr cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0 cz01-psql01.example.com 5432 1 0.500000 cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1 cz01-psql02.example.com 5432 1 0.500000Huuh... It was a lot of copy & paste work. Anyway it's here if somebody needs it ;-)
Wish i had read this before. It would save a day of mine. Great post all around.
ReplyDeletewaw...nice share ...
ReplyDeleteThanks for all
ReplyDeleteI'v got a pb with de réplication since the master
[root@psql02 BDD]# su - postgres -c "repmgr -D /BDD -d repmgr -p 5432 -U repmgr -R postgres --verbose standby clone psql01.example.com --force"
Opening configuration file: ./repmgr.conf
[2015-04-02 10:50:13] [ERROR] Did not find the configuration file './repmgr.conf', continuing
[2015-04-02 10:50:13] [NOTICE] repmgr Destination directory /BDD provided, try to clone everything in it.
[2015-04-02 10:50:13] [INFO] repmgr connecting to master database
[2015-04-02 10:50:13] [INFO] repmgr connected to master, checking its state
[2015-04-02 10:50:13] [INFO] Successfully connected to primary. Current installation size is 26 MB
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
[2015-04-02 10:50:13] [INFO] Can not connect to the remote host (psql01.example.com)
[2015-04-02 10:50:13] [ERROR] repmgr: Aborting, remote host psql01.example.com is not reachable.
The pb is Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
I can connect with ssh to the master since the slave without password (rsa)
I've you got an idea ?
Thanks for your answer
Could you show what happens when pgpool1 server dies? How do the apps know to go to pgpool2's IP address?
ReplyDeleteHello.
ReplyDeleteAll the traffic from the applications is going to the VIP 10.32.243.158. If the mater pgpool01 dies the VIP is "moved" to the pgpool02 automatically and then all traffic is handled by new master pgpool02.
Regards
PetrR
Hi, Is the delegate IP a VIP.
DeleteHello ,
ReplyDeleteThank you very much for this guide , it works like a charm . But I'd like to ask one more question and will need help on how to do it .
Scenario : psql01-master , psql02-slave are the 2 postgresql nodes
1) Is the postgresql.conf file of both postgresql nodes above have the exact same configurations ?
2) If psql01-master goes down , pgpool detects it and run the failover_stream.sh to make psql02-slave as the new master node and works . But my concern is after psql01-master comes back online , it will also show as "master" , not "standby" . And hence both psql01-master and psql02-slave is shown as master when I run the below command
ssh postgres@xyz@example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr/repmgr.conf cluster show"
Role | connection string
* master | host=psql01-master.example.com user=repmgr dbname=repmgr
*master | host=psql02-slave.example.com user=repmgr dbname=repmgr
Hello.
Delete1) Yes I do have postgresql.conf on both PSQL servers the same.
2) You should never start the "failed master", because then as you mentioned below - there will be two masters :-(
If the master goes down you should not start it, but you should keep it down. You should not start it, but make it slave like it's described above.
-> Therefore it's good idea not to start the PGSQL during the boot.
Regards
PetrR
Hello Petr ,
ReplyDeleteYes I totally agree , we shouldn't start postgresql service at boot time . I've checked all possible scenarios and all works fine :-) . Thank you very much . But we're facing some issues with the PGPool cluster , the scenario is as follows :
pgpool01 is the one which has the floating IP , pgpool02 is therefore the secondary node
psql01 is the postgresql master and psql02 is the postgresql slave
psql01 goes down and hence psql02 will be promoted to a master automatically using pgpool01 successfully
Now pgpool01 goes down and hence pgpool02 gets the floating IP , no issue with that . BUT still pgpool02 thinks that psql01 is the postgresql master and psql02 is the slave , so when we create a database using the pgpool floating IP it gives an error indicating that its a read-only transaction , so therefore we restarted pgpool service in pgpool02 node assuming that it will detect the current state of the postgresql nodes but still it shows as 'read-only transaction' when we try to create a database using the pgpool floating IP .
So I guess pgpool02 doesn't fetch the master ID and slave ID of the current setup. Do you know any workaround for this please ?
Thank you
Pasan Opanayaka
Hello Pasan
DeleteI have the same issue. Have you got any solution ?
BR
Anuradha
Ran into a issue making the old master into the new slave and resolved the issue by adding the --rsync-only flag to the command which was added in repmgr v3
ReplyDeletessh cz01-psql01.example.com 'service postgresql-9.3 stop; su - postgres -c "repmgr -D /var/lib/pgsql/9.3/data -d repmgr -p 5432 -U repmgr -R postgres --verbose --rsync-only --force standby clone cz01-psql02.example.com"; service postgresql-9.3 start;
Thank you Travis.
ReplyDeleteWhen I was writing the post it was working fine, but maybe it has changed with some versions. I also didn't specified the exact versions of the software used above.
Thanks again.
PetrR
Hello
ReplyDeleteThanks for great tutorial.
I have same issue mentioned above by Pasan.
Is there a any solution you have ?
Thank you.
BR
Hello Anuradha,
ReplyDeleteSorry still I didn't get any solution for that issue.
Thanks,
Pasan
hi,
ReplyDeleteIs there a way to automate replication of data from running machine to recovered machine?
Regards
Hello, thanks for the very good tutorial.
ReplyDeleteI'm building the same thing as this concept in ubuntu 14.04 with postgresql 9.5.
But i got a lot of problems, does pgpool01 and pgpool2 also have to install postgresql?
Or can you help me find the same tutorial topics for ubuntu 14.04?
thank you
Hello.
DeleteIf you are building "new" environment I can recommend you to use the latest Ubuntu 16.04.
As far as I remember you do not need to install postgresql on "pgpools".
Running something similar on another linux distro should not be a problem.
Unfortunately this "how-to" is pretty old and I do not longer using it, so I can not help you with troubleshooting.
See you
Hello Petr.
ReplyDeleteThanks for great tutorial.
I would like to know if you have ever implemented the postgre-BDR (Multi-Master) and if so, is there any link where I can find an example of its architecture solution. Something like the the picture mentioned above. With the load balancer ...
Thanks a lot
Hello.
ReplyDeleteI'm sorry, but I do not have any experience with "Multi-Master" in PSQL.
PetrR