From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <s.ivanov@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id E93A194107
 for <pmg-devel@lists.proxmox.com>; Wed, 21 Feb 2024 15:09:52 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id C241218ECA
 for <pmg-devel@lists.proxmox.com>; Wed, 21 Feb 2024 15:09:22 +0100 (CET)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [94.136.29.106])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS
 for <pmg-devel@lists.proxmox.com>; Wed, 21 Feb 2024 15:09:22 +0100 (CET)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id C8A1E444AE
 for <pmg-devel@lists.proxmox.com>; Wed, 21 Feb 2024 15:09:21 +0100 (CET)
Date: Wed, 21 Feb 2024 15:09:20 +0100
From: Stoiko Ivanov <s.ivanov@proxmox.com>
To: Friedrich Weber <f.weber@proxmox.com>
Cc: pmg-devel@lists.proxmox.com
Message-ID: <20240221150920.5301e870@rosa.proxmox.com>
In-Reply-To: <20240124135626.217720-1-f.weber@proxmox.com>
References: <20240124135626.217720-1-f.weber@proxmox.com>
X-Mailer: Claws Mail 4.1.1 (GTK 3.24.38; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-SPAM-LEVEL: Spam detection results:  0
 AWL 0.084 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 DMARC_MISSING             0.1 Missing DMARC policy
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 T_SCC_BODY_TEXT_LINE    -0.01 -
 URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See
 http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more
 information. [cluster.pm, proxmox.com, dbtools.pm]
Subject: Re: [pmg-devel] [PATCH pmg-api] fix #5189: cluster: do not
 synchronize statistics that already exist
X-BeenThere: pmg-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox Mail Gateway development discussion
 <pmg-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pmg-devel>, 
 <mailto:pmg-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pmg-devel/>
List-Post: <mailto:pmg-devel@lists.proxmox.com>
List-Help: <mailto:pmg-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pmg-devel>, 
 <mailto:pmg-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Wed, 21 Feb 2024 14:09:53 -0000

Thanks for finding this and directly tackling it! much appreciated

The approach works in general - and from looking through the code I think
resetting this on every sync makes more sense than once upon cluster-join
(e.g. the clusterinfo table gets truncated/wiped when restoring a backup)

I managed to reproduce the issue and verify the fix - question is why not
take the same approach for the cmailstore sync (from a quick glance the
only other table that gets synced by id not mtime) - and afair we had some
users who had to wait for their quarantine-lifetime to get their cluster
in sync again)

If you want to look even further into the cluster-stack - one of the more
common sources of constraint violation errors in a pmg cluster is removing
a cluster-node by deleting it's stanza from the config (or removing the
cluster.conf, and then rejoining that node, or a fresh one (the cid will
clash again):
https://forum.proxmox.com/threads/ha-cluster-warning-at-creation.122289/
https://forum.proxmox.com/threads/master-aus-postgresdb-entfernen.85255/

afaict this removing and joining a fresh node might be needed to get the
cmailstore sync to fail.


On Wed, 24 Jan 2024 14:56:26 +0100
Friedrich Weber <f.weber@proxmox.com> wrote:

> After restoring a backup from a cluster on a fresh node with
> statistics, and then creating a cluster, the following can happen
> (node 1 being master and node 2 being a node): `ClusterInfo` on node
> 1 has no record about the last-synchronized `CStatistic` row id of
> node 2. Thus, pmgmirror on node 1 initializes the record with -1 and
> tries to synchronize *all* `CStatistic` rows with cid 2 from node 2.
> But (some of) these rows may already exist on cid 1, because they
> were part of the backup, so pmgmirror on node 1 triggers a Postgres
> unique constraint violation, statistics synchronization on node 1
> fails, and node 1 remains in the "synchronizing" state.
> 
> Fix this as follows: If `ClusterInfo` does not have a record on the
> last-synchronized `CStatistic` row id of some node cid, initialize it
> with the maximum row id that exists in the local `CStatistic` for that
> node cid, or with -1 if the local `CStatistic` has no row for that
> node cid. For this, use a SQL query that is already used in
> DBTools::update_client_clusterinfo. As a result, pmgmirror will only
> synchronize rows with row ids larger than the maximum row id from the
> backup, and avoid running into a unique constraint violation.
> 
> Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
> ---
> 
> Notes:
>     Tested that the steps from #5189 [1] do not produce a constraint
>     violation anymore, and both cluster nodes eventually become "active".
>     Also tested that this patch doesn't break anything in the "restore
>     without statistics" scenario.
>     
>     However, not 100% sure I haven't overlooked any edge cases, so
>     additional testing is very welcome!
>     
>     [1] https://bugzilla.proxmox.com/show_bug.cgi?id=5189
> 
>  src/PMG/Cluster.pm |  3 +--
>  src/PMG/DBTools.pm | 15 +++++++++++++++
>  2 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/src/PMG/Cluster.pm b/src/PMG/Cluster.pm
> index 015e66a..3bf4c41 100644
> --- a/src/PMG/Cluster.pm
> +++ b/src/PMG/Cluster.pm
> @@ -698,8 +698,7 @@ sub sync_statistic_db {
>  
>      my $count;
>  
> -    PMG::DBTools::create_clusterinfo_default(
> -	$ldb, $rcid, 'lastid_CStatistic', -1, undef);
> +    PMG::DBTools::create_clusterinfo_statistic_default($ldb, $rcid);
>  
>      do { # get new values
>  
> diff --git a/src/PMG/DBTools.pm b/src/PMG/DBTools.pm
> index 9e133bc..6e682f8 100644
> --- a/src/PMG/DBTools.pm
> +++ b/src/PMG/DBTools.pm
> @@ -1139,6 +1139,21 @@ sub create_clusterinfo_default {
>      $sth->finish();
>  }
>  
> +sub create_clusterinfo_statistic_default {
> +    my ($dbh, $rcid) = @_;
> +
> +    my $sth = $dbh->prepare("SELECT * FROM ClusterInfo WHERE " .
> +	"CID = ? AND Name = 'lastid_CStatistic'");
> +    $sth->execute($rcid);
> +    if (!$sth->fetchrow_hashref()) {
> +	# initialize to maximum CStatistic rid locally present for $rcid (if
> +	# any), might already exist if the node was restored from backup
> +	$dbh->do("INSERT INTO ClusterInfo (cid, name, ivalue) SELECT $rcid, 'lastid_CStatistic', " .
> +		  "COALESCE (max (rid), -1) FROM CStatistic WHERE cid = $rcid");
> +    }
> +    $sth->finish();
> +}
> +
>  sub read_int_clusterinfo {
>      my ($dbh, $rcid, $name) = @_;
>