[Postgres extension] pg_auto_failover: high availability and automated failover

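As part of the Citus team (Citus scales out Postgres horizontally, but that's not all we work on), I've been working on pg_auto_failover for quite some time now, and I'm excited that we have now released pg_auto_failover as open source, to give you automated failover and high availability!

When designing pg_auto_failover, our goal was to provide an easy-to-set-up business continuity solution for Postgres that tolerates the failure of any one node in the system. The documentation chapter about the pg_auto_failover architecture includes the following:

It is important to understand that pg_auto_failover is optimized for business continuity. In the event of losing a single node, pg_auto_failover is able to continue the PostgreSQL service, and prevents any data loss when doing so, thanks to PostgreSQL synchronous replication.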

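Introduction to pg_auto_failover

The pg_auto_failover solution for Postgres is meant to provide an easy-to-set-up and reliable automated failover solution. This solution includes software-driven decision making for when to implement a failover in production.

The most important part of any automated failover system is its decision-making policy, and we have a full documentation chapter online covering the pg_auto_failover fault tolerance mechanisms.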

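When using pg_auto_failover, multiple active agents are deployed to keep track of the properties of your production Postgres setup:

The monitor, itself a Postgres database equipped with the pg_auto_failover extension, registers the active Postgres nodes and checks their health.
Each Postgres node registered in the pg_auto_failover monitor must also run a local agent, the pg_autoctl run service.
Each managed Postgres service has two Postgres nodes set up together in the same group. A single monitor setup can manage as many Postgres groups as needed.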

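With such a deployment, the monitor connects to every registered node on a regular basis (every 20 seconds by default) and records success or failure in its pgautofailover.node table.

In addition to that, the pg_autoctl run service on each Postgres node checks that Postgres is running and watches the pg_stat_replication statistics for the other node. This Postgres system view allows our local agent to discover the state of the network connectivity between the primary and the standby node. The local agent reports the state of each node to the monitor every 5 seconds, unless a transition is needed, in which case it reports immediately.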

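The pg_auto_failover monitor makes decisions based on the known state of both nodes in a group, and only follows a finite state machine (FSM) that we carefully designed to ensure that the nodes converge. In particular, the FSM only makes progress once the pg_autoctl agent reports that it has successfully implemented the decided transition to a new state. The architecture documentation section about the failover logic contains pictures of the FSM that we use to drive automated failover decisions in pg_auto_failover.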
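To make the "progress only after the agent confirms" rule concrete, here is a minimal Python sketch. It is not the pg_auto_failover implementation: the `Node` and `ToyMonitor` classes and their methods are hypothetical names made up for this illustration, and the real FSM has more states and rules. Only the state names are taken from the logs shown in this article.

```python
from dataclasses import dataclass


@dataclass
class Node:
    reported_state: str  # last state the local agent reported
    goal_state: str      # state the monitor wants the node to reach


class ToyMonitor:
    """Hypothetical, heavily simplified model of the monitor side of the FSM."""

    def __init__(self):
        self.nodes = {}

    def register(self, name, state):
        self.nodes[name] = Node(reported_state=state, goal_state=state)

    def assign(self, name, new_goal):
        """Assign a new goal state only once the previous one has been reached."""
        node = self.nodes[name]
        if node.reported_state != node.goal_state:
            return False  # previous transition not confirmed yet: no progress
        node.goal_state = new_goal
        return True

    def node_active(self, name, reported_state):
        """Record what the agent reports and answer with the current goal state."""
        node = self.nodes[name]
        node.reported_state = reported_state
        return node.goal_state


monitor = ToyMonitor()
monitor.register("node_b", "wait_standby")
print(monitor.assign("node_b", "catchingup"))       # True: accepted
print(monitor.assign("node_b", "secondary"))        # False: "catchingup" not confirmed yet
print(monitor.node_active("node_b", "catchingup"))  # catchingup: the agent confirms
print(monitor.assign("node_b", "secondary"))        # True: accepted now
```

The point of the sketch is the guard in `assign`: a new goal state is refused until the agent has reported reaching the previous one, which is what keeps the two sides from drifting apart.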

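pg_auto_failover Quick Start

Again, see the Quick Start section of the pg_auto_failover documentation for more details. When trying the project for the first time, the easiest approach is to create a monitor, then register a primary Postgres instance, then register a secondary Postgres instance.

Below are some shell commands that implement a simple deployment, all on localhost, for project discovery purposes.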

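Monitor

In a first terminal, terminal tab, screen, or tmux window, run the following commands to create a monitor. This includes initializing a Postgres cluster with initdb, installing our pg_auto_failover extension, and opening connection privileges in the HBA file.

First, we prepare the environment in the terminal: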

$ mkdir -p /tmp/pg_auto_failover/test
$ export PGDATA=/tmp/pg_auto_failover/test/monitor

Then we can create the monitor Postgres instance on localhost, on local port 6000, using the PGDATA environment setting we just prepared:

$ pg_autoctl create monitor --nodename localhost --pgport 6000
12:12:53 INFO  Initialising a PostgreSQL cluster at "/tmp/pg_auto_failover/test/monitor"
12:12:53 INFO  Now using absolute pgdata value "/private/tmp/pg_auto_failover/test/monitor" in the configuration
12:12:53 INFO   /Applications/Postgres.app/Contents/Versions/10/bin/pg_ctl --pgdata /tmp/pg_auto_failover/test/monitor --options "-p 6000" --options "-h *" --wait start
12:12:53 INFO  Granting connection privileges on 192.168.1.0/24
12:12:53 INFO  Your pg_auto_failover monitor instance is now ready on port 6000.
12:12:53 INFO  pg_auto_failover monitor is ready at postgres://autoctl_node@localhost:6000/pg_auto_failover
12:12:53 INFO  Monitor has been succesfully initialized.

Now we can display the connection string to the monitor again:

$ pg_autoctl show uri
postgres://autoctl_node@localhost:6000/pg_auto_failover

Postgres primary node

In another terminal (tab, window, whichever way you usually handle that), now create a primary PostgreSQL instance:

$ export PGDATA=/tmp/pg_auto_failover/test/node_a
$ pg_autoctl create postgres --nodename localhost --pgport 6001 --dbname test --monitor postgres://autoctl_node@localhost:6000/pg_auto_failover
12:15:27 INFO  Registered node localhost:6001 with id 1 in formation "default", group 0.
12:15:27 INFO  Writing keeper init state file at "/Users/dim/.local/share/pg_autoctl/tmp/pg_auto_failover/test/node_a/pg_autoctl.init"
12:15:27 INFO  Successfully registered as "single" to the monitor.
12:15:28 INFO  Initialising a PostgreSQL cluster at "/tmp/pg_auto_failover/test/node_a"
12:15:28 INFO  Now using absolute pgdata value "/private/tmp/pg_auto_failover/test/node_a" in the configuration
12:15:28 INFO  Postgres is not running, starting postgres
12:15:28 INFO   /Applications/Postgres.app/Contents/Versions/10/bin/pg_ctl --pgdata /private/tmp/pg_auto_failover/test/node_a --options "-p 6001" --options "-h *" --wait start
12:15:28 INFO  CREATE DATABASE test;
12:15:29 INFO  FSM transition from "init" to "single": Start as a single node
12:15:29 INFO  Initialising postgres as a primary
12:15:29 INFO  Transition complete: current state is now "single"
12:15:29 INFO  Keeper has been succesfully initialized.

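This command registers the PostgreSQL instance with the monitor, creates the instance with pg_ctl initdb, prepares some connection privileges for the monitor health checks, and creates a database named test for you. It then performs the first transition ordered by the monitor, from the state INIT to the state SINGLE.

We are still testing for now, so we start the pg_autoctl run service interactively in the terminal. In a production setup, this would go into a system service started at boot time, such as systemd.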

$ pg_autoctl run
12:17:07 INFO  Managing PostgreSQL installation at "/tmp/pg_auto_failover/test/node_a"
12:17:07 INFO  pg_autoctl service is starting
12:17:07 INFO  Calling node_active for node default/1/0 with current state: single, PostgreSQL is running, sync_state is "", WAL delta is -1.

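The last line repeats every 5 seconds, which shows that the primary node is healthy and can connect to the monitor as expected. It is now in the SINGLE state, which will change as soon as a new Postgres node joins the group.

Postgres secondary node

Now it's time to create a secondary Postgres instance, in yet another terminal: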

$ export PGDATA=/tmp/pg_auto_failover/test/node_b
$ pg_autoctl create postgres --nodename localhost --pgport 6002 --dbname test --monitor postgres://autoctl_node@localhost:6000/pg_auto_failover
12:21:08 INFO  Registered node localhost:6002 with id 5 in formation "default", group 0.
12:21:09 INFO  Writing keeper init state file at "/Users/dim/.local/share/pg_autoctl/tmp/pg_auto_failover/test/node_b/pg_autoctl.init"
12:21:09 INFO  Successfully registered as "wait_standby" to the monitor.
12:21:09 INFO  FSM transition from "init" to "wait_standby": Start following a primary
12:21:09 INFO  Transition complete: current state is now "wait_standby"
12:21:14 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
12:21:14 INFO  The primary node returned by the monitor is localhost:6001
12:21:14 INFO  Initialising PostgreSQL as a hot standby
12:21:14 INFO  Running /Applications/Postgres.app/Contents/Versions/10/bin/pg_basebackup -w -h localhost -p 6001 --pgdata /tmp/pg_auto_failover/test/backup -U pgautofailover_replicator --write-recovery-conf --max-rate 100M --wal-method=stream --slot pgautofailover_standby ...
12:21:14 INFO  pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/2000028 on timeline 1
pg_basebackup: starting background WAL receiver
32041/32041 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/20000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
12:21:14 INFO  Postgres is not running, starting postgres
12:21:14 INFO   /Applications/Postgres.app/Contents/Versions/10/bin/pg_ctl --pgdata /tmp/pg_auto_failover/test/node_b --options "-p 6002" --options "-h *" --wait start
12:21:15 INFO  PostgreSQL started on port 6002
12:21:15 WARN  Contents of "/tmp/pg_auto_failover/test/node_b/postgresql-auto-failover.conf" have changed, overwriting
12:21:15 INFO  Transition complete: current state is now "catchingup"
12:21:15 INFO  Now using absolute pgdata value "/private/tmp/pg_auto_failover/test/node_b" in the configuration
12:21:15 INFO  Keeper has been succesfully initialized.

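This time, the registration with the monitor returned the state WAIT_STANDBY, which drives pg_autoctl to create a secondary node. That is because a server already exists in the group and is currently SINGLE. In parallel, the monitor assigns the goal state WAIT_PRIMARY to the primary node, where the local pg_autoctl agent retrieves the node name and port of the new node from the monitor database and opens pg_hba.conf for replication. Once that is done, the secondary node continues with pg_basebackup, installs a recovery.conf file, starts the local Postgres service, and informs the monitor that it has reached the goal state.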
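The registration rule described above can be sketched in a few lines of Python. This is an illustration only: `register_node` and the dictionary that stands in for a group are hypothetical, and the sketch covers just the two cases described in this article (an empty group, and a group whose only node is in the single state).

```python
def register_node(group, name):
    """Register a node in a group (dict: node name -> assigned state).

    Returns the state assigned to the new node. Hypothetical helper, not
    part of pg_auto_failover.
    """
    if not group:
        # First node of the group: it starts on its own.
        group[name] = "single"
    else:
        # A node already exists: it is asked to prepare for a standby,
        # and the newcomer waits until that is done.
        for existing, state in group.items():
            if state == "single":
                group[existing] = "wait_primary"
        group[name] = "wait_standby"
    return group[name]


group = {}
print(register_node(group, "node_a"))  # single
print(register_node(group, "node_b"))  # wait_standby
print(group["node_a"])                 # wait_primary
```

This mirrors the two log lines above: node_a was "Successfully registered as single", while node_b was registered as wait_standby.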

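We are still in CATCHING_UP, though. This means that automated failover is not possible yet. To be able to orchestrate a failover, we need to run the local service on the new node. It monitors Postgres health and replication status, and reports to the monitor every 5 seconds: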

$ pg_autoctl run
12:26:26 INFO  Calling node_active for node default/5/0 with current state: catchingup, PostgreSQL is running, sync_state is "", WAL delta is -1.
12:26:26 INFO  FSM transition from "catchingup" to "secondary": Convinced the monitor that I'm up and running, and eligible for promotion again
12:26:26 INFO  Transition complete: current state is now "secondary"
12:26:26 INFO  Calling node_active for node default/5/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.

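Now the new node is in the SECONDARY state and keeps reporting to the monitor, ready to promote the local Postgres instance when the monitor takes that decision.

Automated and manual failover with pg_auto_failover

All it takes to set up a PostgreSQL cluster with automated failover using pg_auto_failover is two commands per node: first pg_autoctl create ... to create the node, then pg_autoctl run to run the local service that implements the transitions decided by the monitor.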

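To witness a failover, the easiest way is to stop the pg_autoctl run service (using ^C in the terminal where it runs, or pg_autoctl stop --pgdata ... from anywhere else), and then also stop the Postgres instance with pg_ctl -D ... stop.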

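When only Postgres is stopped, the pg_autoctl run service detects the situation as abnormal and first tries to restart Postgres. A failover is considered appropriate only when Postgres fails to start 3 times in a row, with the default pg_auto_failover parameters.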
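The "restart first, fail over only after 3 failed attempts" policy can be illustrated with a small Python sketch. The function name and the `try_start` callback are hypothetical and only stand in for whatever the local agent really does; the default of 3 attempts is the figure quoted above.

```python
from typing import Callable


def restart_or_failover(try_start: Callable[[], bool], max_attempts: int = 3) -> str:
    """Try to restart Postgres locally before giving up.

    `try_start` is a hypothetical callback that returns True when Postgres
    came back up. Returns "running" on success, "failover" otherwise.
    """
    for _ in range(max_attempts):
        if try_start():
            return "running"
    return "failover"


# Postgres comes back on the third attempt: no failover is needed.
attempts = iter([False, False, True])
print(restart_or_failover(lambda: next(attempts)))  # running

# Postgres never comes back: a failover is considered appropriate.
print(restart_or_failover(lambda: False))           # failover
```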

Another way to inject a failover condition is to politely ask the monitor to orchestrate one for you:

$ psql postgres://autoctl_node@localhost:6000/pg_auto_failover
> select pgautofailover.perform_failover();

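Connection strings for applications and clients

The whole setup runs as what pg_auto_failover calls a formation. The default formation is named default and contains a single group of two Postgres instances. The idea is to have a single entry point for connecting applications to any given formation. To get the connection string to our Postgres service managed by pg_auto_failover, issue the following command, for example in the monitor terminal: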

$ pg_autoctl show uri --formation default
postgres://localhost:6002,localhost:6001/test?target_session_attrs=read-write

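We are using the multi-host feature of libpq here. It works with any modern Postgres driver that is based on libpq (which is most of them), and other native drivers, such as the JDBC Postgres driver, are known to implement the same facility.

Of course, it works with psql: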

$ psql 'postgres://localhost:6002,localhost:6001/test?target_session_attrs=read-write'
psql (12devel, server 10.7)
Type "help" for help.

test# select pg_is_in_recovery();
 pg_is_in_recovery
═══════════════════
 f
(1 row)

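With such a connection string, the connection driver connects to the first host and checks whether it accepts writes. If it does not, the driver connects to the second host and checks again. That is because we said that we want target_session_attrs to be read-write.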
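The host selection just described can be simulated without a database. The Python sketch below is only a model of the behavior, not how libpq is implemented: `parse_hosts`, `pick_host`, and the `accepts_writes` callback are hypothetical names, and the callback replaces the real step of connecting to a host and checking whether it accepts writes.

```python
from urllib.parse import parse_qs, urlsplit


def parse_hosts(uri):
    """Split a multi-host connection URI into (host, port) pairs and parameters.

    Simplified: no IPv6 literals; a missing port falls back to 5432.
    """
    parts = urlsplit(uri)
    netloc = parts.netloc.rsplit("@", 1)[-1]  # drop "user@" if present
    hosts = []
    for item in netloc.split(","):
        host, _, port = item.partition(":")
        hosts.append((host, int(port) if port else 5432))
    return hosts, parse_qs(parts.query)


def pick_host(uri, accepts_writes):
    """Return the first host, in URI order, that satisfies target_session_attrs."""
    hosts, params = parse_hosts(uri)
    want_rw = params.get("target_session_attrs", ["any"])[0] == "read-write"
    for host in hosts:
        if not want_rw or accepts_writes(host):
            return host
    return None


uri = "postgres://localhost:6002,localhost:6001/test?target_session_attrs=read-write"

# Before a failover, node_a (port 6001) is the primary.
print(pick_host(uri, lambda h: h == ("localhost", 6001)))  # ('localhost', 6001)

# After a failover, node_b (port 6002) accepts writes instead.
print(pick_host(uri, lambda h: h == ("localhost", 6002)))  # ('localhost', 6002)
```

The application keeps the same connection string in both cases; only the answer of each host changes.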

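With this core Postgres feature, we implement high availability on the client side: in the event of a failover, our node_b becomes the primary, and the application now needs to target node_b for writes. That is done automatically at the connection driver level.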

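High availability, fault tolerance, and business continuity

So pg_auto_failover is all about business continuity, and uses a single standby server for each primary Postgres server.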

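In a classic HA setup for Postgres, we rely on synchronous replication with two standby servers for each primary server. This is the expected architecture when you want to reach zero or near-zero RTO and RPO objectives.

The idea behind using two standby nodes per primary is that you can lose any standby server and still know that the data is available in two different places, so you can still happily accept writes. This is a very good property to have in many production setups, and it is the target of other existing Postgres HA tools.

In some cases, the best production trade-off differs from what current Postgres HA tools support. Sometimes it is acceptable to face a service interruption when a disaster recovery procedure has to be run, because the assessment of the risks involved in that case fits the production budget, the expected SLA, or a combination of both.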

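Not every project needs more than 99.95% availability, let alone going the last mile to reach a 99.999% target. Also, while IoT and some other use cases (such as a huge user base) require HA solutions that scale from terabytes to petabytes of data, many projects are meant for smaller audiences and data sets. When you have gigabytes of data, or even tens of gigabytes of data, disaster recovery timings are no longer impossible to swallow, depending on your SLA terms.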
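To put those availability targets in perspective, the following Python snippet converts an availability percentage into the downtime it allows per year. It is plain arithmetic, assuming a 365-day year; the function name is made up for this example.

```python
def downtime_minutes_per_year(availability_percent):
    """Downtime allowed per year, in minutes, assuming a 365-day year."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return (1 - availability_percent / 100) * minutes_per_year


print(round(downtime_minutes_per_year(99.95), 1))   # 262.8 minutes, about 4.4 hours
print(round(downtime_minutes_per_year(99.999), 2))  # 5.26 minutes
```

A 99.95% target leaves room for a few hours of disaster recovery per year, while 99.999% leaves only a few minutes.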

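Data availability

pg_auto_failover uses PostgreSQL synchronous replication to guarantee that no data is lost at the time of the failover operation. The synchronous replication feature of Postgres ensures that by the time a client application receives the COMMIT message from Postgres, the data has already made it to our secondary node.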

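pg_auto_failover works correctly when facing the loss of any ONE node in the system. If you lose the primary and then also lose the secondary, nothing is left except your backups. When using pg_auto_failover, you still have to set up a proper disaster recovery solution for the cases where you lose more than one server at a time. Yes, that happens.

Also watch out for the infamous "file system is full" condition: since we tend to deploy servers with similar specifications, it likes to hit the primary and the secondary at the same time…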

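Conclusion

The whole Citus team here at Microsoft is excited about the open source release of the pg_auto_failover extension. We released pg_auto_failover under the Postgres open source license, so that you can enjoy our contribution in exactly the same capacity as when you deploy Postgres. The project is fully open, and everyone is welcome to participate and contribute on our GitHub repository at https://github.com/citusdata/pg_auto_failover. We follow the Microsoft Open Source Code of Conduct and make sure that everyone is welcome and heard.

My hope is that, thanks to pg_auto_failover, many of you will now be able to deploy Postgres in production with an automated failover solution.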
