[PVE-User] replication failures

* [PVE-User] replication failures
@ 2025-05-26  3:40 Randy Bush
  2025-05-26 15:11 ` DERUMIER, Alexandre
  0 siblings, 1 reply; 5+ messages in thread
From: Randy Bush @ 2025-05-26  3:40 UTC (permalink / raw)
  To: ProxMox Users

three-node debian-12 8.4.1 zfs raidz2 ssd cluster, maybe 20 vms; all vms
replicate every 15 minutes (*/15) to the next node to the right.

on one and only one of a couple of similar clusters, and on only one
particular node, we're getting replication failures of the nature of

    2025-05-26T00:16:17.643854+00:00 vm21 pvescheduler[2641364]: command 'zfs destroy images/vm-107-disk-0@__replicate_107-0_1748217943__' failed: got timeout
    2025-05-26T00:16:37.218095+00:00 vm21 pvescheduler[2641364]: 107-0: got unexpected replication job error - command 'zfs snapshot images/vm-107-disk-0@__replicate_107-0_1748218563__' failed: got timeout

five to 15 times a day.  zfs load?  flaky disk (smartmon reports
nothing)?  weak ether?  moon in klutz?

how do folk diagnose?
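
One rough first step is to check whether the timeouts cluster at particular times of day (a scrub, a backup window, a burst of replication jobs). A minimal sketch, assuming the log lines have the shape quoted above; the function name `count_timeouts_by_hour` is made up for illustration:

```sh
# Tally pvescheduler "got timeout" lines per hour.
# Reads log lines on stdin; assumes the first field is an ISO 8601
# timestamp like the ones quoted above (2025-05-26T00:16:17.643854+00:00).
count_timeouts_by_hour() {
    awk '/pvescheduler/ && /got timeout/ { n[substr($1, 1, 13)]++ }
         END { for (h in n) print h, n[h] }' | sort
}
```

Feed it whatever log those lines came from, for example `journalctl -t pvescheduler -o short-iso-precise | count_timeouts_by_hour` if they live in the journal (an assumption about the setup). Counts that line up with a scrub or backup schedule would point at load rather than hardware.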

randy

_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user



* Re: [PVE-User] replication failures
  2025-05-26  3:40 [PVE-User] replication failures Randy Bush
@ 2025-05-26 15:11 ` DERUMIER, Alexandre
  2025-05-26 20:02   ` Randy Bush
  0 siblings, 1 reply; 5+ messages in thread
From: DERUMIER, Alexandre @ 2025-05-26 15:11 UTC (permalink / raw)
  To: pve-user

How long does it take if you run the destroy command manually?
(zfs destroy images/vm-107-disk-0@__replicate_107-0_1748217943__)


(maybe the timeout in the code is too short ?)
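
For a one-off check, the shell's own `time` keyword is enough (`time zfs destroy ...`). If several commands need comparing across nodes, a small wrapper can print the elapsed time alongside the exit status. This is only a sketch: `timed` is a hypothetical helper, and `date +%s%N` assumes GNU date, which Debian ships.

```sh
# Run a command, then report wall-clock milliseconds and exit status on stderr.
timed() {
    start=$(date +%s%N)
    "$@"
    status=$?
    end=$(date +%s%N)
    echo "elapsed_ms=$(( (end - start) / 1000000 )) exit=$status" >&2
    return "$status"
}

# Example on a throwaway snapshot ("timing_test" is a made-up name):
#   timed zfs snapshot images/vm-107-disk-0@timing_test
#   timed zfs destroy  images/vm-107-disk-0@timing_test
```

A destroy that takes seconds by hand on the affected node but is near-instant on the others would support the too-short-timeout theory.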




* Re: [PVE-User] replication failures
  2025-05-26 15:11 ` DERUMIER, Alexandre
@ 2025-05-26 20:02   ` Randy Bush
  2025-05-27  9:07     ` DERUMIER, Alexandre
  0 siblings, 1 reply; 5+ messages in thread
From: Randy Bush @ 2025-05-26 20:02 UTC (permalink / raw)
  To: DERUMIER, Alexandre; +Cc: pve-user

> How long does it take if you run the destroy command manually?
> (zfs destroy images/vm-107-disk-0@__replicate_107-0_1748217943__)

picked the latest

    # zfs destroy images/vm-107-disk-0@__replicate_107-0_1748284321__
    could not find any snapshots to destroy; check snapshot names.

> (maybe the timeout in the code is too short ?)

as it seems to have actually happened, perhaps this is the case.

though only this node on this cluster.  hmmmm.

randy


* Re: [PVE-User] replication failures
  2025-05-26 20:02   ` Randy Bush
@ 2025-05-27  9:07     ` DERUMIER, Alexandre
  2025-05-29  6:09       ` dorsy via pve-user
  0 siblings, 1 reply; 5+ messages in thread
From: DERUMIER, Alexandre @ 2025-05-27  9:07 UTC (permalink / raw)
  To: randy; +Cc: pve-user


> picked the latest
>
>     # zfs destroy images/vm-107-disk-0@__replicate_107-0_1748284321__
>     could not find any snapshots to destroy; check snapshot names.

>> (maybe the timeout in the code is too short ?)

> as it seems to have actually happened, perhaps this is the case.

yes, it could be the destroy task taking too long (and finishing
correctly in the background after the timeout on the pve side)

I really don't know what the timeout in the code is
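
One way to tell a destroy that merely outlived the caller's timeout from one that really failed is to look for the snapshot a little while after the error is logged. A minimal sketch that relies only on `zfs list`; `snapshot_listed` is a made-up helper name:

```sh
# Succeeds if the exact snapshot name given as $1 appears on stdin,
# where stdin carries one snapshot name per line.
snapshot_listed() {
    grep -Fxq -- "$1"
}

# Example (snapshot name taken from the log in the first message):
#   zfs list -H -r -t snapshot -o name images/vm-107-disk-0 |
#       snapshot_listed 'images/vm-107-disk-0@__replicate_107-0_1748217943__' &&
#       echo "still there" || echo "gone, so the destroy completed"
```

If the snapshot is consistently gone a minute after a "got timeout" line, the command itself is probably fine and only slow.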

* Re: [PVE-User] replication failures
  2025-05-27  9:07     ` DERUMIER, Alexandre
@ 2025-05-29  6:09       ` dorsy via pve-user
  0 siblings, 0 replies; 5+ messages in thread
From: dorsy via pve-user @ 2025-05-29  6:09 UTC (permalink / raw)
  To: pve-user; +Cc: dorsy

From: dorsy <dorsyka@yahoo.com>
To: pve-user@lists.proxmox.com
Subject: Re: [PVE-User] replication failures
Date: Thu, 29 May 2025 08:09:10 +0200
Message-ID: <76c48531-292d-458d-b756-e82a36a94339@yahoo.com>

'zpool history' could be a source of info on what actually happened
to the snapshots, and when.
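
The history for a busy pool is long, so it helps to cut it down to the snapshot create and destroy entries for the one dataset. A hedged sketch, assuming the default `zpool history` layout of timestamp followed by the command line; `snapshot_events` is a hypothetical name:

```sh
# Print the `zfs snapshot` and `zfs destroy` entries that mention one dataset.
# Reads `zpool history` output on stdin; $1 is the dataset name.
# Assumes the default layout: "<timestamp> zfs <subcommand> <args...>".
snapshot_events() {
    awk -v ds="$1@" '$2 == "zfs" && ($3 == "snapshot" || $3 == "destroy") && index($0, ds)'
}

# Example (the pool name is the first component of the dataset path):
#   zpool history images | snapshot_events images/vm-107-disk-0
```

Comparing those timestamps with the pvescheduler errors should show whether each destroy was recorded shortly after PVE gave up on it.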

On 5/27/2025 11:07 AM, DERUMIER, Alexandre wrote:
>>> picked the latest
>>>
>>>      # zfs destroy images/vm-107-disk-0@__replicate_107-0_1748284321__
>>>      could not find any snapshots to destroy; check snapshot names.
>> (maybe the timeout in the code is too short ?)
>>> as it seems to have actually happened, perhaps this is the case.
> yes, it could be the destroy task taking too long (and finishing
> correctly in the background after the timeout on the pve side)
>
> I really don't know what the timeout in the code is

-- 
Regards,
Dorotovics László
system administrator
IKRON Fejlesztő és Szolgáltató Kft.
Registered office: 6721 Szeged, Szilágyi utca 5-1.



