From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by lists.proxmox.com (Postfix) with ESMTPS id E93A194107 for ; Wed, 21 Feb 2024 15:09:52 +0100 (CET) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id C241218ECA for ; Wed, 21 Feb 2024 15:09:22 +0100 (CET) Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [94.136.29.106]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by firstgate.proxmox.com (Proxmox) with ESMTPS for ; Wed, 21 Feb 2024 15:09:22 +0100 (CET) Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1]) by proxmox-new.maurer-it.com (Proxmox) with ESMTP id C8A1E444AE for ; Wed, 21 Feb 2024 15:09:21 +0100 (CET) Date: Wed, 21 Feb 2024 15:09:20 +0100 From: Stoiko Ivanov To: Friedrich Weber Cc: pmg-devel@lists.proxmox.com Message-ID: <20240221150920.5301e870@rosa.proxmox.com> In-Reply-To: <20240124135626.217720-1-f.weber@proxmox.com> References: <20240124135626.217720-1-f.weber@proxmox.com> X-Mailer: Claws Mail 4.1.1 (GTK 3.24.38; x86_64-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-SPAM-LEVEL: Spam detection results: 0 AWL 0.084 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record T_SCC_BODY_TEXT_LINE -0.01 - URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more information. [cluster.pm, proxmox.com, dbtools.pm] Subject: Re: [pmg-devel] [PATCH pmg-api] fix #5189: cluster: do not synchronize statistics that already exist X-BeenThere: pmg-devel@lists.proxmox.com X-Mailman-Version: 2.1.29 Precedence: list List-Id: Proxmox Mail Gateway development discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 21 Feb 2024 14:09:53 -0000 Thanks for finding this and directly tackling it! much appreciated The approach works in general - and from looking through the code I think resetting this on every sync makes more sense than once upon cluster-join (e.g. the clusterinfo table gets truncated/wiped when restoring a backup) I managed to reproduce the issue and verify the fix - question is why not take the same approach for the cmailstore sync (from a quick glance the only other table that gets synced by id not mtime) - and afair we had some users who had to wait for their quarantine-lifetime to get their cluster in sync again) If you want to look even further into the cluster-stack - one of the more common sources of constraint violation errors in a pmg cluster is removing a cluster-node by deleting it's stanza from the config (or removing the cluster.conf, and then rejoining that node, or a fresh one (the cid will clash again): https://forum.proxmox.com/threads/ha-cluster-warning-at-creation.122289/ https://forum.proxmox.com/threads/master-aus-postgresdb-entfernen.85255/ afaict this removing and joining a fresh node might be needed to get the cmailstore sync to fail. On Wed, 24 Jan 2024 14:56:26 +0100 Friedrich Weber wrote: > After restoring a backup from a cluster on a fresh node with > statistics, and then creating a cluster, the following can happen > (node 1 being master and node 2 being a node): `ClusterInfo` on node > 1 has no record about the last-synchronized `CStatistic` row id of > node 2. Thus, pmgmirror on node 1 initializes the record with -1 and > tries to synchronize *all* `CStatistic` rows with cid 2 from node 2. > But (some of) these rows may already exist on cid 1, because they > were part of the backup, so pmgmirror on node 1 triggers a Postgres > unique constraint violation, statistics synchronization on node 1 > fails, and node 1 remains in the "synchronizing" state. > > Fix this as follows: If `ClusterInfo` does not have a record on the > last-synchronized `CStatistic` row id of some node cid, initialize it > with the maximum row id that exists in the local `CStatistic` for that > node cid, or with -1 if the local `CStatistic` has no row for that > node cid. For this, use a SQL query that is already used in > DBTools::update_client_clusterinfo. As a result, pmgmirror will only > synchronize rows with row ids larger than the maximum row id from the > backup, and avoid running into a unique constraint violation. > > Signed-off-by: Friedrich Weber > --- > > Notes: > Tested that the steps from #5189 [1] do not produce a constraint > violation anymore, and both cluster nodes eventually become "active". > Also tested that this patch doesn't break anything in the "restore > without statistics" scenario. > > However, not 100% sure I haven't overlooked any edge cases, so > additional testing is very welcome! > > [1] https://bugzilla.proxmox.com/show_bug.cgi?id=5189 > > src/PMG/Cluster.pm | 3 +-- > src/PMG/DBTools.pm | 15 +++++++++++++++ > 2 files changed, 16 insertions(+), 2 deletions(-) > > diff --git a/src/PMG/Cluster.pm b/src/PMG/Cluster.pm > index 015e66a..3bf4c41 100644 > --- a/src/PMG/Cluster.pm > +++ b/src/PMG/Cluster.pm > @@ -698,8 +698,7 @@ sub sync_statistic_db { > > my $count; > > - PMG::DBTools::create_clusterinfo_default( > - $ldb, $rcid, 'lastid_CStatistic', -1, undef); > + PMG::DBTools::create_clusterinfo_statistic_default($ldb, $rcid); > > do { # get new values > > diff --git a/src/PMG/DBTools.pm b/src/PMG/DBTools.pm > index 9e133bc..6e682f8 100644 > --- a/src/PMG/DBTools.pm > +++ b/src/PMG/DBTools.pm > @@ -1139,6 +1139,21 @@ sub create_clusterinfo_default { > $sth->finish(); > } > > +sub create_clusterinfo_statistic_default { > + my ($dbh, $rcid) = @_; > + > + my $sth = $dbh->prepare("SELECT * FROM ClusterInfo WHERE " . > + "CID = ? AND Name = 'lastid_CStatistic'"); > + $sth->execute($rcid); > + if (!$sth->fetchrow_hashref()) { > + # initialize to maximum CStatistic rid locally present for $rcid (if > + # any), might already exist if the node was restored from backup > + $dbh->do("INSERT INTO ClusterInfo (cid, name, ivalue) SELECT $rcid, 'lastid_CStatistic', " . > + "COALESCE (max (rid), -1) FROM CStatistic WHERE cid = $rcid"); > + } > + $sth->finish(); > +} > + > sub read_int_clusterinfo { > my ($dbh, $rcid, $name) = @_; >