* [pmg-devel] [PATCH pmg-api] fix #5189: cluster: do not synchronize statistics that already exist
@ 2024-01-24 13:56 Friedrich Weber
2024-02-21 14:09 ` Stoiko Ivanov
0 siblings, 1 reply; 2+ messages in thread
From: Friedrich Weber @ 2024-01-24 13:56 UTC (permalink / raw)
To: pmg-devel
After restoring a backup from a cluster on a fresh node with
statistics, and then creating a cluster, the following can happen
(node 1 being master and node 2 being a node): `ClusterInfo` on node
1 has no record about the last-synchronized `CStatistic` row id of
node 2. Thus, pmgmirror on node 1 initializes the record with -1 and
tries to synchronize *all* `CStatistic` rows with cid 2 from node 2.
But (some of) these rows may already exist on cid 1, because they
were part of the backup, so pmgmirror on node 1 triggers a Postgres
unique constraint violation, statistics synchronization on node 1
fails, and node 1 remains in the "synchronizing" state.
Fix this as follows: If `ClusterInfo` does not have a record on the
last-synchronized `CStatistic` row id of some node cid, initialize it
with the maximum row id that exists in the local `CStatistic` for that
node cid, or with -1 if the local `CStatistic` has no row for that
node cid. For this, use a SQL query that is already used in
DBTools::update_client_clusterinfo. As a result, pmgmirror will only
synchronize rows with row ids larger than the maximum row id from the
backup, and avoid running into a unique constraint violation.
Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
---
Notes:
Tested that the steps from #5189 [1] do not produce a constraint
violation anymore, and both cluster nodes eventually become "active".
Also tested that this patch doesn't break anything in the "restore
without statistics" scenario.
However, not 100% sure I haven't overlooked any edge cases, so
additional testing is very welcome!
[1] https://bugzilla.proxmox.com/show_bug.cgi?id=5189
src/PMG/Cluster.pm | 3 +--
src/PMG/DBTools.pm | 15 +++++++++++++++
2 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/src/PMG/Cluster.pm b/src/PMG/Cluster.pm
index 015e66a..3bf4c41 100644
--- a/src/PMG/Cluster.pm
+++ b/src/PMG/Cluster.pm
@@ -698,8 +698,7 @@ sub sync_statistic_db {
my $count;
- PMG::DBTools::create_clusterinfo_default(
- $ldb, $rcid, 'lastid_CStatistic', -1, undef);
+ PMG::DBTools::create_clusterinfo_statistic_default($ldb, $rcid);
do { # get new values
diff --git a/src/PMG/DBTools.pm b/src/PMG/DBTools.pm
index 9e133bc..6e682f8 100644
--- a/src/PMG/DBTools.pm
+++ b/src/PMG/DBTools.pm
@@ -1139,6 +1139,21 @@ sub create_clusterinfo_default {
$sth->finish();
}
+sub create_clusterinfo_statistic_default {
+ my ($dbh, $rcid) = @_;
+
+ my $sth = $dbh->prepare("SELECT * FROM ClusterInfo WHERE " .
+ "CID = ? AND Name = 'lastid_CStatistic'");
+ $sth->execute($rcid);
+ if (!$sth->fetchrow_hashref()) {
+ # initialize to maximum CStatistic rid locally present for $rcid (if
+ # any), might already exist if the node was restored from backup
+ $dbh->do("INSERT INTO ClusterInfo (cid, name, ivalue) SELECT $rcid, 'lastid_CStatistic', " .
+ "COALESCE (max (rid), -1) FROM CStatistic WHERE cid = $rcid");
+ }
+ $sth->finish();
+}
+
sub read_int_clusterinfo {
my ($dbh, $rcid, $name) = @_;
--
2.39.2
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: [pmg-devel] [PATCH pmg-api] fix #5189: cluster: do not synchronize statistics that already exist
2024-01-24 13:56 [pmg-devel] [PATCH pmg-api] fix #5189: cluster: do not synchronize statistics that already exist Friedrich Weber
@ 2024-02-21 14:09 ` Stoiko Ivanov
0 siblings, 0 replies; 2+ messages in thread
From: Stoiko Ivanov @ 2024-02-21 14:09 UTC (permalink / raw)
To: Friedrich Weber; +Cc: pmg-devel
Thanks for finding this and directly tackling it! much appreciated
The approach works in general - and from looking through the code I think
resetting this on every sync makes more sense than once upon cluster-join
(e.g. the clusterinfo table gets truncated/wiped when restoring a backup)
I managed to reproduce the issue and verify the fix - question is why not
take the same approach for the cmailstore sync (from a quick glance the
only other table that gets synced by id not mtime) - and afair we had some
users who had to wait for their quarantine-lifetime to get their cluster
in sync again)
If you want to look even further into the cluster-stack - one of the more
common sources of constraint violation errors in a pmg cluster is removing
a cluster-node by deleting it's stanza from the config (or removing the
cluster.conf, and then rejoining that node, or a fresh one (the cid will
clash again):
https://forum.proxmox.com/threads/ha-cluster-warning-at-creation.122289/
https://forum.proxmox.com/threads/master-aus-postgresdb-entfernen.85255/
afaict this removing and joining a fresh node might be needed to get the
cmailstore sync to fail.
On Wed, 24 Jan 2024 14:56:26 +0100
Friedrich Weber <f.weber@proxmox.com> wrote:
> After restoring a backup from a cluster on a fresh node with
> statistics, and then creating a cluster, the following can happen
> (node 1 being master and node 2 being a node): `ClusterInfo` on node
> 1 has no record about the last-synchronized `CStatistic` row id of
> node 2. Thus, pmgmirror on node 1 initializes the record with -1 and
> tries to synchronize *all* `CStatistic` rows with cid 2 from node 2.
> But (some of) these rows may already exist on cid 1, because they
> were part of the backup, so pmgmirror on node 1 triggers a Postgres
> unique constraint violation, statistics synchronization on node 1
> fails, and node 1 remains in the "synchronizing" state.
>
> Fix this as follows: If `ClusterInfo` does not have a record on the
> last-synchronized `CStatistic` row id of some node cid, initialize it
> with the maximum row id that exists in the local `CStatistic` for that
> node cid, or with -1 if the local `CStatistic` has no row for that
> node cid. For this, use a SQL query that is already used in
> DBTools::update_client_clusterinfo. As a result, pmgmirror will only
> synchronize rows with row ids larger than the maximum row id from the
> backup, and avoid running into a unique constraint violation.
>
> Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
> ---
>
> Notes:
> Tested that the steps from #5189 [1] do not produce a constraint
> violation anymore, and both cluster nodes eventually become "active".
> Also tested that this patch doesn't break anything in the "restore
> without statistics" scenario.
>
> However, not 100% sure I haven't overlooked any edge cases, so
> additional testing is very welcome!
>
> [1] https://bugzilla.proxmox.com/show_bug.cgi?id=5189
>
> src/PMG/Cluster.pm | 3 +--
> src/PMG/DBTools.pm | 15 +++++++++++++++
> 2 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/src/PMG/Cluster.pm b/src/PMG/Cluster.pm
> index 015e66a..3bf4c41 100644
> --- a/src/PMG/Cluster.pm
> +++ b/src/PMG/Cluster.pm
> @@ -698,8 +698,7 @@ sub sync_statistic_db {
>
> my $count;
>
> - PMG::DBTools::create_clusterinfo_default(
> - $ldb, $rcid, 'lastid_CStatistic', -1, undef);
> + PMG::DBTools::create_clusterinfo_statistic_default($ldb, $rcid);
>
> do { # get new values
>
> diff --git a/src/PMG/DBTools.pm b/src/PMG/DBTools.pm
> index 9e133bc..6e682f8 100644
> --- a/src/PMG/DBTools.pm
> +++ b/src/PMG/DBTools.pm
> @@ -1139,6 +1139,21 @@ sub create_clusterinfo_default {
> $sth->finish();
> }
>
> +sub create_clusterinfo_statistic_default {
> + my ($dbh, $rcid) = @_;
> +
> + my $sth = $dbh->prepare("SELECT * FROM ClusterInfo WHERE " .
> + "CID = ? AND Name = 'lastid_CStatistic'");
> + $sth->execute($rcid);
> + if (!$sth->fetchrow_hashref()) {
> + # initialize to maximum CStatistic rid locally present for $rcid (if
> + # any), might already exist if the node was restored from backup
> + $dbh->do("INSERT INTO ClusterInfo (cid, name, ivalue) SELECT $rcid, 'lastid_CStatistic', " .
> + "COALESCE (max (rid), -1) FROM CStatistic WHERE cid = $rcid");
> + }
> + $sth->finish();
> +}
> +
> sub read_int_clusterinfo {
> my ($dbh, $rcid, $name) = @_;
>
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2024-02-21 14:09 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-24 13:56 [pmg-devel] [PATCH pmg-api] fix #5189: cluster: do not synchronize statistics that already exist Friedrich Weber
2024-02-21 14:09 ` Stoiko Ivanov
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal