From mboxrd@z Thu Jan 1 00:00:00 1970
From: Roland <devzero@web.de>
To: Christian Schoepplein
Cc: pve-user@lists.proxmox.com
Date: Wed, 31 May 2023 16:55:21 +0200
Message-ID: <70805eb7-a98e-8169-dfe0-93bfe2f67de5@web.de>
Subject: Re: [PVE-User] Proxmox and glusterfs: VMs get corupted

> What else could I try? Would it perhaps make sense to switch the
> image format from qcow2 to raw?

yes, i would switch to raw for a test, and turn discard off (a rough sketch
of the commands is below).

> we chose qcow2 mainly because of
> the snapshots and the space savings

same here. works quite well for us, although we have qcow2 on top of zfs
datasets...

> We chose glusterfs because it seemed the least complicated to us and
> because we have some respect for e.g. Ceph.

that is exactly how i feel about it. it is why i have steered clear of ceph
so far and have been eyeing glusterfs for a while, but my impression is that
in the proxmox context (or anywhere else, for that matter) it is rather
exotic (otherwise someone might have answered here:
https://forum.proxmox.com/threads/glusterfs-sharding-afr-question.118554/ ).
for that reason we have done without shared storage so far.

when it comes to SAN and shared storage i am a bit of a burnt child: in the
past i even had SANs including SAN virtualization with IBM SVC on my plate,
and i have never had so much trouble with IT, nor slept so badly.

we currently have local storage only, rely on cold-standby spare hardware
and replicate the local storages with sanoid to a standby system.

i would still prefer a simple, easy-to-maintain online/redundancy solution.
i cannot warm to ceph, and frankly, when i read about the troubles and
challenges people have with it, i am glad NOT to have it on my plate.
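coming back to the raw question: in case it helps, a rough sketch of how the
switch could look from the command line (untested and written from memory;
VM ID, storage name, paths and size are taken from your mails, so adjust as
needed and keep a backup of the image first):

  # shut down the VM, then convert the image on the gluster mount
  qm shutdown 200
  qemu-img convert -p -f qcow2 -O raw \
      /mnt/pve/gfs_vms/images/200/vm-200-disk-0.qcow2 \
      /mnt/pve/gfs_vms/images/200/vm-200-disk-0.raw

  # attach the raw image instead of the qcow2 one, this time without discard=on
  qm set 200 --scsi0 gfs_vms:200/vm-200-disk-0.raw,aio=threads,size=10444M
  qm start 200

if i remember correctly, "Move disk" in the GUI (or qm move-disk with
--format raw) can also do such a conversion for you.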
regards
roland

On 31.05.23 at 16:23, Christian Schoepplein wrote:
> Hello Roland,
>
> thanks for your reply and the tips.
>
> I have now written several hundred large and very large files to
> /mnt/pve/gfs_vms and compared the md5 sums, no problem at all. Not when
> reading them back either.
>
> When I set aio to threads, it unfortunately feels like it gets even worse
> with the broken VMs. I have the following in the VM config:
>
> scsi0: gfs_vms:200/vm-200-disk-0.qcow2,discard=on,aio=threads,size=10444M
>
> Is that correct? According to the process list it seems to be right:
>
> root 1708993 4.3 1.7 3370764 1174016 ? Sl 15:32 1:40
> /usr/bin/kvm -id 200 -name testvm,debug-threads=on -no-shutdown
> -chardev socket,id=qmp,path=/var/run/qemu-server/200.qmp,server=on,wait=off
> -mon chardev=qmp,mode=control
> -chardev socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5
> -mon chardev=qmp-event,mode=control -pidfile /var/run/qemu-server/200.pid
> -daemonize -smbios type=1,uuid=0da99a1f-a9ac-4999-a6c4-203cd39ff72e
> -smp 1,sockets=1,cores=1,maxcpus=1 -nodefaults
> -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg
> -vnc unix:/var/run/qemu-server/200.vnc,password=on
> -cpu kvm64,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep -m 2048
> -object memory-backend-ram,id=ram-node0,size=2048M
> -numa node,nodeid=0,cpus=0,memdev=ram-node0
> -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg
> -device vmgenid,guid=dc4109a1-7b6f-4735-9685-ca50a38744e2
> -device usb-tablet,id=tablet,bus=ehci.0,port=1
> -chardev socket,id=serial0,path=/var/run/qemu-server/200.serial0,server=on,wait=off
> -device isa-serial,chardev=serial0 -device VGA,id=vga,bus=pcie.0,addr=0x1
> -chardev socket,path=/var/run/qemu-server/200.qga,server=on,wait=off,id=qga0
> -device virtio-serial,id=qga0,bus=pci.0,addr=0x8
> -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on
> -iscsi initiator-name=iqn.1993-08.org.debian:01:cbb6926f959d
> -drive file=gluster://gluster1.linova.de/gfs_vms/images/200/vm-200-cloudinit.qcow2,if=none,id=drive-ide2,media=cdrom,aio=io_uring
> -device ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2
> -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5
> -drive file=gluster://gluster1.linova.de/gfs_vms/images/200/vm-200-disk-0.qcow2,if=none,id=drive-scsi0,aio=threads,discard=on,format=qcow2,cache=none,detect-zeroes=unmap
> -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=101
> -netdev type=tap,id=net0,ifname=tap200i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on
> -device virtio-net-pci,mac=5E:1F:9A:04:D6:6C,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=1024
> -machine type=q35+pve0
>
> I will now try the whole thing again with a local storage backend, but I
> expect it to work with that.
>
> Unfortunately a colleague set up the gluster stuff, so if that turns out
> to be the cause, I will probably have to look into it more closely...
>
> We chose glusterfs because it seemed the least complicated to us and
> because we have some respect for e.g. Ceph.
>
> What else could I try? Would it perhaps make sense to switch the image
> format from qcow2 to raw? We chose qcow2 mainly because of the snapshots
> and the space savings; if that does not work properly with glusterfs, we
> may have to reconsider that as well.
>
> So far I have only ever run virtual machines with libvirt, without central
> storage. So a lot of new and rather complex topics are coming together at
> the moment :-(. I would therefore be glad about any tip for a sensible
> setup :-).
>
> Ciao and thanks,
>
> Christian
>
>
> On Tue, May 30, 2023 at 06:46:51PM +0200, Roland wrote:
>> if /mnt/pve/gfs_vms is a writeable path from inside the pve host, did you
>> check whether there is also corruption when reading/writing large files
>> there, comparing md5sums after the copy?
>>
>> furthermore, i remember there was a gluster/qcow2 issue with aio=native
>> some years ago, could you retry with aio=threads for the virtual disks?
>>
>> regards
>> roland
>>
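one thing i only noticed when re-reading this: your md5 test goes through the
FUSE mount under /mnt/pve/gfs_vms, while qemu accesses the images via
libgfapi (the gluster:// URLs in the kvm command line above), so the two
paths are not quite the same. a rough, untested way to exercise the libgfapi
path as well (hostname, volume and directory taken from your output, the test
image name is made up) could be:

  qemu-img create -f qcow2 gluster://gluster1.linova.de/gfs_vms/images/200/testimage.qcow2 5G
  qemu-img check gluster://gluster1.linova.de/gfs_vms/images/200/testimage.qcow2

if images created that way also check out clean, the libgfapi path is
probably fine too.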
>> On 30.05.23 at 18:32, Christian Schoepplein wrote:
>>> Hi,
>>>
>>> we are testing the current proxmox version with a glusterfs storage
>>> backend and have a strange issue with files getting corrupted inside the
>>> virtual machines. For whatever reason, from one moment to the next
>>> binaries can no longer be executed, scripts are damaged and so on. In
>>> the logs I get errors like this:
>>>
>>> May 30 11:22:36 ns1 dockerd[1234]: time="2023-05-30T11:22:36.874765091+02:00" level=warning msg="Running modprobe bridge br_netfilter failed with message: modprobe: ERROR: could not insert 'bridge': Exec format error\nmodprobe: ERROR: could not insert 'br_netfilter': Exec format error\ninsmod /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko \ninsmod /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko \n, error: exit status 1"
>>>
>>> On such a broken system, file reports the following:
>>>
>>> root@ns1:~# file /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko
>>> /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko: data
>>> root@ns1:~#
>>>
>>> On a normal system it looks like this:
>>>
>>> root@gluster1:~# file /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko
>>> /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko: ELF 64-bit LSB
>>> relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=1084f7cfcffbd4c607724fba287c0ea7fc5775
>>> root@gluster1:~#
>>>
>>> Not only kernel modules are affected. I have seen the same behaviour
>>> with scripts, icinga check modules, the sendmail binary and so on; I
>>> think it is totally random :-(.
>>>
>>> We have the problem with newly installed VMs, with VMs cloned from a
>>> template created on our proxmox host, and with VMs which we used before
>>> with libvirtd and migrated to our new proxmox machine. So IMHO it cannot
>>> be related to the way we create new virtual machines...
>>>
>>> We are using the following software:
>>>
>>> root@proxmox1:~# pveversion -v
>>> proxmox-ve: 7.4-1 (running kernel: 5.15.104-1-pve)
>>> pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
>>> pve-kernel-5.15: 7.4-1
>>> pve-kernel-5.15.104-1-pve: 5.15.104-2
>>> pve-kernel-5.15.102-1-pve: 5.15.102-1
>>> ceph-fuse: 15.2.17-pve1
>>> corosync: 3.1.7-pve1
>>> criu: 3.15-1+pve-1
>>> glusterfs-client: 9.2-1
>>> ifupdown2: 3.1.0-1+pmx3
>>> ksm-control-daemon: 1.4-1
>>> libjs-extjs: 7.0.0-1
>>> libknet1: 1.24-pve2
>>> libproxmox-acme-perl: 1.4.4
>>> libproxmox-backup-qemu0: 1.3.1-1
>>> libproxmox-rs-perl: 0.2.1
>>> libpve-access-control: 7.4-2
>>> libpve-apiclient-perl: 3.2-1
>>> libpve-common-perl: 7.3-4
>>> libpve-guest-common-perl: 4.2-4
>>> libpve-http-server-perl: 4.2-3
>>> libpve-rs-perl: 0.7.5
>>> libpve-storage-perl: 7.4-2
>>> libspice-server1: 0.14.3-2.1
>>> lvm2: 2.03.11-2.1
>>> lxc-pve: 5.0.2-2
>>> lxcfs: 5.0.3-pve1
>>> novnc-pve: 1.4.0-1
>>> proxmox-backup-client: 2.4.1-1
>>> proxmox-backup-file-restore: 2.4.1-1
>>> proxmox-kernel-helper: 7.4-1
>>> proxmox-mail-forward: 0.1.1-1
>>> proxmox-mini-journalreader: 1.3-1
>>> proxmox-widget-toolkit: 3.6.5
>>> pve-cluster: 7.3-3
>>> pve-container: 4.4-3
>>> pve-docs: 7.4-2
>>> pve-edk2-firmware: 3.20230228-2
>>> pve-firewall: 4.3-1
>>> pve-firmware: 3.6-4
>>> pve-ha-manager: 3.6.0
>>> pve-i18n: 2.12-1
>>> pve-qemu-kvm: 7.2.0-8
>>> pve-xtermjs: 4.16.0-1
>>> qemu-server: 7.4-3
>>> smartmontools: 7.2-pve3
>>> spiceterm: 3.2-2
>>> swtpm: 0.8.0~bpo11+3
>>> vncterm: 1.7-1
>>> zfsutils-linux: 2.1.9-pve1
>>> root@proxmox1:~#
>>>
>>> root@proxmox1:~# cat /etc/pve/storage.cfg
>>> dir: local
>>>         path /var/lib/vz
>>>         content rootdir,iso,images,vztmpl,backup,snippets
>>>
>>> zfspool: local-zfs
>>>         pool rpool/data
>>>         content images,rootdir
>>>         sparse 1
>>>
>>> glusterfs: gfs_vms
>>>         path /mnt/pve/gfs_vms
>>>         volume gfs_vms
>>>         content images
>>>         prune-backups keep-all=1
>>>         server gluster1.linova.de
>>>         server2 gluster2.linova.de
>>>
>>> root@proxmox1:~#
>>>
>>> The config of a typical VM looks like this:
>>>
>>> root@proxmox1:~# cat /etc/pve/qemu-server/101.conf
>>> #ns1
>>> agent: enabled=1,fstrim_cloned_disks=1
>>> boot: c
>>> bootdisk: scsi0
>>> cicustom: user=local:snippets/user-data
>>> cores: 1
>>> hotplug: disk,network,usb
>>> ide2: gfs_vms:101/vm-101-cloudinit.qcow2,media=cdrom,size=4M
>>> ipconfig0: ip=10.200.32.9/22,gw=10.200.32.1
>>> kvm: 1
>>> machine: q35
>>> memory: 2048
>>> meta: creation-qemu=7.2.0,ctime=1683718002
>>> name: ns1
>>> nameserver: 10.200.0.5
>>> net0: virtio=1A:61:75:25:C6:30,bridge=vmbr0
>>> numa: 1
>>> ostype: l26
>>> scsi0: gfs_vms:101/vm-101-disk-0.qcow2,discard=on,size=10444M
>>> scsihw: virtio-scsi-pci
>>> searchdomain: linova.de
>>> serial0: socket
>>> smbios1: uuid=e2f503fe-4a66-4085-86c0-bb692add6b7a
>>> sockets: 1
>>> vmgenid: 3be6ec9d-7cfd-47c0-9f86-23c2e3ce5103
>>>
>>> root@proxmox1:~#
>>>
>>> Our glusterfs storage backend consists of three servers, all running
>>> Ubuntu 22.04 and glusterfs version 10.1. There are no errors in the logs
>>> on the glusterfs hosts when a VM crashes, and because sometimes icinga
>>> plugins also get corrupted I get a fairly exact time range to search the
>>> logs for errors and warnings.
>>>
>>> However, I think it has something to do with our glusterfs setup.
>>> If I clone a VM from a template I get the following:
>>>
>>> root@proxmox1:~# qm clone 9000 200 --full --name testvm --description
>>> "testvm" --storage gfs_vms
>>> create full clone of drive ide2 (gfs_vms:9000/vm-9000-cloudinit.qcow2)
>>> Formatting 'gluster://gluster1.linova.de/gfs_vms/images/200/vm-200-cloudinit.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=4194304 lazy_refcounts=off refcount_bits=16
>>> [2023-05-30 16:18:17.753152 +0000] I [io-stats.c:3706:ios_sample_buf_size_configure] 0-gfs_vms: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
>>> [2023-05-30 16:18:17.876879 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:17.877606 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:17.878275 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-2: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:27.761247 +0000] I [io-stats.c:4038:fini] 0-gfs_vms: io-stats translator unloaded
>>> [2023-05-30 16:18:28.766999 +0000] I [io-stats.c:3706:ios_sample_buf_size_configure] 0-gfs_vms: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
>>> [2023-05-30 16:18:28.936449 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:28.937547 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:28.938115 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-2: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:38.774387 +0000] I [io-stats.c:4038:fini] 0-gfs_vms: io-stats translator unloaded
>>> create full clone of drive scsi0 (gfs_vms:9000/base-9000-disk-0.qcow2)
>>> Formatting 'gluster://gluster1.linova.de/gfs_vms/images/200/vm-200-disk-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=10951327744 lazy_refcounts=off refcount_bits=16
>>> [2023-05-30 16:18:39.962238 +0000] I [io-stats.c:3706:ios_sample_buf_size_configure] 0-gfs_vms: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
>>> [2023-05-30 16:18:40.084300 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:40.084996 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:40.085505 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-2: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:49.970199 +0000] I [io-stats.c:4038:fini] 0-gfs_vms: io-stats translator unloaded
>>> [2023-05-30 16:18:50.975729 +0000] I [io-stats.c:3706:ios_sample_buf_size_configure] 0-gfs_vms: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
>>> [2023-05-30 16:18:51.768619 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:51.769330 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:51.769822 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-2: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:19:00.984578 +0000] I [io-stats.c:4038:fini] 0-gfs_vms: io-stats translator unloaded
>>> transferred 0.0 B of 10.2 GiB (0.00%)
>>> [2023-05-30 16:19:02.030902 +0000] I [io-stats.c:3706:ios_sample_buf_size_configure] 0-gfs_vms: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
>>> transferred 112.8 MiB of 10.2 GiB (1.08%)
>>> transferred 230.8 MiB of 10.2 GiB (2.21%)
>>> transferred 340.5 MiB of 10.2 GiB (3.26%)
>>> ...
>>> transferred 10.1 GiB of 10.2 GiB (99.15%)
>>> transferred 10.2 GiB of 10.2 GiB (100.00%)
>>> transferred 10.2 GiB of 10.2 GiB (100.00%)
>>> [2023-05-30 16:19:29.804006 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:19:29.804807 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:19:29.805486 +0000] E [MSGID: 108006] [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-2: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:19:32.044693 +0000] I [io-stats.c:4038:fini] 0-gfs_vms: io-stats translator unloaded
>>> root@proxmox1:~#
>>>
>>> Is this message about the subvolumes being down normal, or might this be
>>> the reason for our strange problems?
>>>
>>> I have no idea how to further debug the problem, so any helpful idea or
>>> hint would be great. Please let me also know if I can provide more info
>>> regarding our setup.
>>>
>>> Ciao and thanks a lot,
>>>
>>> Schoepp
>>>
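PS, regarding the "All subvolumes are down" messages in the quoted log: i
cannot say from here whether they are harmless in your setup, but before
digging further i would at least check the volume and heal state directly on
the gluster nodes, roughly along these lines (standard gluster CLI; the exact
output depends on your gluster version):

  gluster volume info gfs_vms
  gluster volume status gfs_vms
  gluster volume heal gfs_vms info
  gluster volume heal gfs_vms info split-brain

if a brick shows up as offline, or files show up as needing heal while a
clone or a VM is running, that would be a good hint that the problem is on
the gluster side.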