PostgreSQL cluster monitoring

For monitoring database cluster, various 3rd party monitoring tools exists that are leverage PostgreSQL logging, events and specific queries. However, this section will not cover any external tool, but will rather give a slight overview upon PostgreSQL built in monitoring mechanisms,


Database logs

By default, database logs are located in data folder (e.g. PGSQL-data/pg_log) and is rolled over based on both log file size and file age. Log configuration could be changed in postgres.conf file. Log files should be checked whenever the recovery and/or fail-over is taking place, as well as when the database had started, to verify the process is behaving as expected.

In the context of Streaming Replications used for synchronizing databases between primary and standby databases, when starting the standby database , the following message should appear in the database log:

LOG: entering standby mode
LOG: redo starts at 0/10000028
LOG: consistent recovery state reached at 0/10000130
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/11000000 on timeline 3

It signifies, that all the changes from the primary database has been successfully applied, and the database is started in hot-standby mode, accepting read only connections.

When the primary database fails and the automatic fail-over is setup, the following logs will be recorder on the standby database:

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 3 at 0/1100B158.
FATAL: could not send end-of-streaming message to primary: no COPY in progress
LOG: invalid record length at 0/1100B158: wanted 24, got 0
FATAL: could not connect to the primary server: could not connect to server: Connection refused (0x0000274D/10061)
Is the server running on host "10.0.124.177" and accepting
TCP/IP connections on port 5432?
...
... after some time when the automatic fail-over promotes this standby database server to be a new primary database ...
...
LOG: received promote request
LOG: redo done at 0/1100B0E8
LOG: last completed transaction was at log time 2019-09-10 12:36:18.353551+02
LOG: selected new timeline ID: 4
LOG: archive recovery complete
LOG: database system is ready to accept connections


More about database logging could be seen in official PostreSQL documentation here https://www.postgresql.org/docs/current/runtime-config-logging.html

Monitor database role

To check, which database is currently the primary database, execute the following query

select pg_is_in_recovery();

If the result is true, queried database is standby; if false, it is primary.

Optionally, pgmetrics tool could be used to list a handy overview of database current status also with various details not only about replication.

pgmetrics.exe -h <IP_address> -U <user_name>

images/download/attachments/284927728/image2019-9-5_15-19-29-version-1-modificationdate-1633099049860-api-v2.png

Queries for primary database

To get list of all standby servers as well as their current replication performance, run the following query:

select
pid,
client_addr,
application_name,
usename,
state,
sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) sending_lag,
pg_wal_lsn_diff(sent_lsn, flush_lsn) receiving_lag,
pg_wal_lsn_diff(flush_lsn, replay_lsn) replaying_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) total_lag
from pg_stat_replication;
  • sending_lag could indicate heavy load on primary

  • receiving_lag could indicate network issues or replica under heavy load

  • replaying_lag could indicate replica under heavy load


To find out which file is currently used for write ahead log, execute this query:

select pg_walfile_name(pg_current_wal_lsn());

Queries for standby database

To check delay on the standby, execute the following query:

SELECT
CASE
WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
THEN 0
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
END AS log_delay;

The result is the number of seconds between the transaction was aborted or committed on primary and the time it was received on the standby. If the replication did not started or failed to start, this query will return NULL.

Please note, that if the system time of these servers differs, the resulted number may be misleading.

Automated monitoring

Automated monitoring is based on checking all the databases from any server on the network (will refer to this server as Monitoring server) in the predefined intervals and automatically log Windows Log Events in the following cases:

  • primary server is inactive

  • any standby server is inactive

  • more than one primary server is active

This is achieved by Windows Scheduler invoking Powershell script which is using the same monitoring tools as described in this section above - executing SQL queries. Script could be scheduled on any server in the network, if the following preconditions have to be met:

  • server has a network visibility to monitored databases, with allowed connection to the database ports

  • PostrgreSQL CLI utilities has to be in PATH system environment variable (Dispatcher Paragon with embedded PostreSQL has CLI tools at '<PATH_TO_STANDBY_DATABASE_ROOT_FOLDER>\bin' folder. Default installation path of PostgreSQL CLI utilities is C:\Program Files\PostgreSQL\11\bin ).

images/download/attachments/284927728/automatic_monitoring-version-1-modificationdate-1633099049843-api-v2.png

Script location

Script are located in the full installation package <installation package>\Complementary Solutions/PostgreSQL/pg-failover.zip

  1. For executing the script, a new database user with limited privileges has to exists. Its possible to reuse existed user, but its recommended to create a specific one for monitoring purposes only. On the primary database, create new user "cluster_monitor" (user will be replicated to the standby database automatically).

    CREATE USER cluster_monitor ENCRYPTED PASSWORD '<CLUSTER_MONITOR_USER_PASSWORD>';
  2. Add the IP address of the Monitoring server to the pg_hba.conf file on both primary and standby database.

    host postgres cluster_monitor <IP_ADDRESS_OF_THE_MONITOR_SERVER>/32 md5
  3. n the monitoring server add an entry to the pgpass.conf file for each database server. Default location of pgpass.conf is in %APPDATA%\postgresql\pgpass.conf (where %APPDATA% refers to the Application Data subdirectory in the user's profile) . This is standard PostgreSQL security feature - for more details refer to the official documentation: https://www.postgresql.org/docs/current/libpq-pgpass.html

    <PRIMARY_DATABASE_IP_ADDRESS>:<DATABASE_PORT>:postgres:cluster_monitor:<CLUSTER_MONITOR_USER_PASSWORD>
    <STANDBY_DATABASE_IP_ADDRESS>:<DATABASE_PORT>:postgres:cluster_monitor:<CLUSTER_MONITOR_USER_PASSWORD>

    PATH Variable

    Please note, for the script to work properly the System Environment Variable must be set properly. The script needs to access pg_isready.exe and psql.exe, so those files must be on the monitoring server and the proper path must be set in PATH system environment variables. Sometimes a restart of the Windows machine is required in order to force through the changes.

  4. Copy the PowerShell script PgMonitor.ps1 to the Monitoring server

  5. On the Monitoring server

    1. create a new scheduled task in the Task Scheduler

      images/download/attachments/284927728/image2019-9-6_12-50-32-version-1-modificationdate-1633099049837-api-v2.png
    2. In "General" tab, fill the name, description, check "Run whether user is logged on or not" and "Run with highest privileges".

      images/download/attachments/284927728/create_task_monitoring-version-1-modificationdate-1633099049787-api-v2.PNG


    3. In "Triggers" tab, create new trigger, that will run every 5 minutes and never expires.
      images/download/attachments/284927728/edit_trigger-version-1-modificationdate-1633099049747-api-v2.PNG

    4. In the "Actions" tab, create a new action that will start a script. Use Powershell as program to start and add following arguments. At least 2 IP addresses are required.

      -File "<path_to_script>\PgMonitor.ps1" <PRIMARY_DATABASE_IP_ADDRESS>:<DATABASE_PORT>,STANDBY_DATABASE_IP_ADDRESS:<DATABASE_PORT> cluster_monitor

      images/download/attachments/284927728/image2019-9-9_11-27-52-version-1-modificationdate-1633099049807-api-v2.png

  6. Save the scheduled task and trigger it manually to ensure it will run in predefined interval. Wait till the task runs at least once and check Windows Event Application log for errors. In the case of properly configured task, no event is listed there.

  7. (Optional) if you are not sure if the script is invoked correctly, you can turn on logging by add log parameter to turn logging on. Example: -log "C:\logs\PgMonitor.log"

    -File "<path_to_script>\PgMonitor.ps1" <PRIMARY_DATABASE_IP_ADDRESS>:<DATABASE_PORT>,STANDBY_DATABASE_IP_ADDRESS:<DATABASE_PORT> cluster_monitor "C:\logs\PgMonitor.log"

When automatic monitoring script detects the failure, it logs a Warning Windows event in Event Log.

images/download/attachments/284927728/image2019-9-23_12-10-19-version-1-modificationdate-1633099049800-api-v2.png

The following events are logged to the Windows Event Log.

EventID

Title

Description

Level

Source

1

Environment check failed

Some of the PostgreSQL CLI utilities were not found on PATH or the invalid arguments have been provided. See the error message which tells you what happen.

Error

PostgreSQLMonitorScript

2

Any of the standby database is not responding

Number of active standby databases differs from the number of all provided databases minus one primary database.

Warning

PostgreSQLMonitorScript

3

Primary database is not responding

The primary database is not responding.

Warning

PostgreSQLMonitorScript

4

More than one primary database has been detected

More than one active primary database exists.

Warning

PostgreSQLMonitorScript

Windows Event

Please note, that the PgMonitor.ps1 script creates only a Windows Event. Its up to each administrator to setup a subsequent notification. Also there is no internal memory/history of the previous runs. This means the scheduler will execute the PgMonitor.ps1 script every time (according to configured interval), and in case the error, the Windows Event Log is created until the detected issue is fixed or the scheduled task is disabled/removed.

It is recommend to add a custom action on the event ID 2, 3 and 4 (for example to send an email to the administrators).