all lists on lists.proxmox.com
 help / color / mirror / Atom feed
* [pmg-devel] [PATCH v2 pmg-api] fix #5189: cluster: avoid sync errors for statistics and quarantine
@ 2024-02-22 16:35 Friedrich Weber
  2024-02-22 22:01 ` [pmg-devel] applied: " Stoiko Ivanov
  0 siblings, 1 reply; 2+ messages in thread
From: Friedrich Weber @ 2024-02-22 16:35 UTC (permalink / raw)
  To: pmg-devel

After restoring a backup from a cluster on a fresh node with
statistics, and then creating a cluster, the following can happen
(node 1 being master and node 2 being a node): `ClusterInfo` on node
1 has no record about the last-synchronized `CStatistic` row id of
node 2. Thus, pmgmirror on node 1 initializes the record with -1 and
tries to synchronize *all* `CStatistic` rows with cid 2 from node 2.
But (some of) these rows may already exist on cid 1, because they
were part of the backup, so pmgmirror on node 1 triggers a Postgres
unique constraint violation, statistics synchronization on node 1
fails, and node 1 remains in the "synchronizing" state.

Fix this as follows: When a new node is added to a cluster, the master
now initializes its `ClusterInfo` record of the last-synchronized
`CStatistic` row id for that node cid with the maximum row id that
exists in the local `CStatistic` for that node cid, or with -1 if the
local `CStatistic` has no row for that node cid. This is valid because
the newly-added node copies the master's `CStatistic` table during
cluster join.

Do the same for the `CMailStore` table, where a similar sync error
could happen e.g. if the table has rows for both node cids, node 2 is
shut down and manually deleted from the cluster.conf, the maxcid is
manually reset to 1, and a fresh node is joined to the cluster and
gets assigned cid 2.

Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
---

Notes:
    v1 -> v2:
    - initialize master `ClusterInfo` record for `CStatistic` once when
      joining a new node instead of initializing on the first cluster
      sync, as discussed with Stoiko off-list
    - do the same for `CMailStore`, where a similar sync error can happen
    
    As a side effect of this change, `ClusterInfo` on node 2 will now also
    have `lastid_CStatistic` and `lastid_CMailStore` rows for its own cid
    2 after join, because the database is copied over *after* the master
    runs `update_master_clusterinfo`, and `update_client_clusterinfo` on
    node 2 only deletes rows with the master cid 1. However, if I
    understand the cluster sync code correctly (and that's a big if),
    these additional rows should not have any effect, as node 2 will
    only ever read and update `ClusterInfo` rows for other cids, never
    for its own cid.
    
    v1: https://lists.proxmox.com/pipermail/pmg-devel/2024-January/002657.html

 src/PMG/DBTools.pm | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/PMG/DBTools.pm b/src/PMG/DBTools.pm
index 6112566..8770d06 100644
--- a/src/PMG/DBTools.pm
+++ b/src/PMG/DBTools.pm
@@ -1132,6 +1132,14 @@ sub update_master_clusterinfo {
 	$dbh->do ("INSERT INTO ClusterInfo (cid, name, ivalue) select $clientcid, 'lastmt_$table', " .
 		  "EXTRACT(EPOCH FROM now())::INTEGER");
     }
+
+    my @lastid_tables = ('CStatistic', 'CMailStore');
+
+    for my $table (@lastid_tables) {
+        $dbh->do("INSERT INTO ClusterInfo (cid, name, ivalue) " .
+	    "SELECT $clientcid, 'lastid_$table', COALESCE (max (rid), -1) FROM $table " .
+	    "WHERE cid = $clientcid");
+    }
 }
 
 sub update_client_clusterinfo {
-- 
2.39.2





^ permalink raw reply	[flat|nested] 2+ messages in thread

* [pmg-devel] applied: [PATCH v2 pmg-api] fix #5189: cluster: avoid sync errors for statistics and quarantine
  2024-02-22 16:35 [pmg-devel] [PATCH v2 pmg-api] fix #5189: cluster: avoid sync errors for statistics and quarantine Friedrich Weber
@ 2024-02-22 22:01 ` Stoiko Ivanov
  0 siblings, 0 replies; 2+ messages in thread
From: Stoiko Ivanov @ 2024-02-22 22:01 UTC (permalink / raw)
  To: Friedrich Weber; +Cc: pmg-devel

Thanks for the patch, thinking this through again, and the exceptional
commit-message!

looked it through, applied it in my reproducer and it fixes the issue - so
applied it

still 2 comments inline:
On Thu, 22 Feb 2024 17:35:35 +0100
Friedrich Weber <f.weber@proxmox.com> wrote:

> After restoring a backup from a cluster on a fresh node with
> statistics, and then creating a cluster, the following can happen
> (node 1 being master and node 2 being a node): `ClusterInfo` on node
> 1 has no record about the last-synchronized `CStatistic` row id of
> node 2. Thus, pmgmirror on node 1 initializes the record with -1 and
> tries to synchronize *all* `CStatistic` rows with cid 2 from node 2.
> But (some of) these rows may already exist on cid 1, because they
> were part of the backup, so pmgmirror on node 1 triggers a Postgres
> unique constraint violation, statistics synchronization on node 1
> fails, and node 1 remains in the "synchronizing" state.
> 
> Fix this as follows: When a new node is added to a cluster, the master
> now initializes its `ClusterInfo` record of the last-synchronized
> `CStatistic` row id for that node cid with the maximum row id that
> exists in the local `CStatistic` for that node cid, or with -1 if the
> local `CStatistic` has no row for that node cid. This is valid because
Would it maybe make sense to scan the existing "lastid_tables" (Cstatistic
and CMailStore) when the cluster is created - and use the maximal value of
both as the maxcid value in the cluster-conf?
(I guess we will then eventually see 5-digit cluster-ids, but it might
help for the few cases of broken clusters due to constraint violations)

Alternatively we could make `pmgcm delete X` update the cid-columns for all
relevant tables, replacing X with a special value (probably 0 should work,
but did not check that). Some kind of idempotent cleanup function yielding
a state where joining a node with the same id just works(tm)

but that's mostly a note to future us, after we see the next actual issue
with cluster-joining that we can reproduce.


> the newly-added node copies the master's `CStatistic` table during
> cluster join.
> 
> Do the same for the `CMailStore` table, where a similar sync error
> could happen e.g. if the table has rows for both node cids, node 2 is
> shut down and manually deleted from the cluster.conf, the maxcid is
> manually reset to 1, and a fresh node is joined to the cluster and
> gets assigned cid 2.
> 
> Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
> ---
> 
> Notes:
>     v1 -> v2:
>     - initialize master `ClusterInfo` record for `CStatistic` once when
>       joining a new node instead of initializing on the first cluster
>       sync, as discussed with Stoiko off-list
>     - do the same for `CMailStore`, where a similar sync error can happen
>     
>     As a side effect of this change, `ClusterInfo` on node 2 will now also
>     have `lastid_CStatistic` and `lastid_CMailStore` rows for its own cid
>     2 after join, because the database is copied over *after* the master
>     runs `update_master_clusterinfo`, and `update_client_clusterinfo` on
>     node 2 only deletes rows with the master cid 1. However, if I
>     understand the cluster sync code correctly (and that's a big if),
>     these additional rows should not have any effect, as node 2 will
>     only ever read and update `ClusterInfo` rows for other cids, never
>     for its own cid.
That's my current thinking too - regarding a node not syncing it's own
information (and updating ClusterInfo alongside it) - short of someone
randomly incrementing/decrementing the id's in the cluster.conf...



>     
>     v1: https://lists.proxmox.com/pipermail/pmg-devel/2024-January/002657.html
> 
>  src/PMG/DBTools.pm | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/src/PMG/DBTools.pm b/src/PMG/DBTools.pm
> index 6112566..8770d06 100644
> --- a/src/PMG/DBTools.pm
> +++ b/src/PMG/DBTools.pm
> @@ -1132,6 +1132,14 @@ sub update_master_clusterinfo {
>  	$dbh->do ("INSERT INTO ClusterInfo (cid, name, ivalue) select $clientcid, 'lastmt_$table', " .
>  		  "EXTRACT(EPOCH FROM now())::INTEGER");
>      }
> +
> +    my @lastid_tables = ('CStatistic', 'CMailStore');
> +
> +    for my $table (@lastid_tables) {
> +        $dbh->do("INSERT INTO ClusterInfo (cid, name, ivalue) " .
> +	    "SELECT $clientcid, 'lastid_$table', COALESCE (max (rid), -1) FROM $table " .
> +	    "WHERE cid = $clientcid");
> +    }
>  }
>  
>  sub update_client_clusterinfo {





^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2024-02-22 22:02 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-22 16:35 [pmg-devel] [PATCH v2 pmg-api] fix #5189: cluster: avoid sync errors for statistics and quarantine Friedrich Weber
2024-02-22 22:01 ` [pmg-devel] applied: " Stoiko Ivanov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal