From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5105f09e-2f15-02ee-dd41-a427a6262a91@proxmox.com>
Date: Fri, 27 Jan 2023 14:49:20 +0100
From: Fiona Ebner
To: Proxmox VE development discussion
Subject: [pve-devel] Script for bug #2874

The attached script allows monitoring the first sector of the bootdisk
for running VMs (all or a selection of IDs) for people affected by bug
#2874 [0]. The hope is to pinpoint when the sector gets corrupted to be
able to correlate the timing with operations that might cause it. The
script also dumps the contents, because it might help to see how the
sector gets corrupted.

Note that the script needs to be executed on each node and that you can
specify IDs for VMs not currently on that node, which is useful to catch
migrating VMs (or don't specify any IDs to monitor all running VMs).

The script parses the VM config to determine the boot disk, looks up the
path and uses qemu-img dd and base64 to save the contents of the first
512 bytes in a non-binary format and will dump the contents whenever
they change.

Example invocations:
# monitor all running VMs, check every 5 minutes
perl monitor-sector-zero.pl --interval 300
# only monitor 166 and 167, check every minute, log to file
perl monitor-sector-zero.pl 166 167 &> /path/to/file

Feedback from users and other developers is highly appreciated!

[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=2874

>From f.ebner@proxmox.com Fri Jan 27 14:58:15 2023
Date: Fri, 27 Jan 2023 14:57:38 +0100
From: Fiona Ebner
To: pve-devel@lists.proxmox.com
Subject: Re: [pve-devel] Script for bug #2874
References: <5105f09e-2f15-02ee-dd41-a427a6262a91@proxmox.com>
In-Reply-To: <5105f09e-2f15-02ee-dd41-a427a6262a91@proxmox.com>

On 27.01.23 at 14:49, Fiona Ebner wrote:
> The attached script allows monitoring the first sector of the bootdisk
> for running VMs (all or a selection of IDs) for people affected by bug
> #2874 [0]. The hope is to pinpoint when the sector gets corrupted to be
> able to correlate the timing with operations that might cause it. The
> script also dumps the contents, because it might help to see how the
> sector gets corrupted.
>
> Note that the script needs to be executed on each node and that you can
> specify IDs for VMs not currently on that node, which is useful to catch
> migrating VMs (or don't specify any IDs to monitor all running VMs).
>
> The script parses the VM config to determine the boot disk, looks up the
> path and uses qemu-img dd and base64 to save the contents of the first
> 512 bytes in a non-binary format and will dump the contents whenever
> they change.
>
> Example invocations:
> # monitor all running VMs, check every 5 minutes
> perl monitor-sector-zero.pl --interval 300
> # only monitor 166 and 167, check every minute, log to file
> perl monitor-sector-zero.pl 166 167 &> /path/to/file
>
> Feedback from users and other developers is highly appreciated!
>
> [0]: https://bugzilla.proxmox.com/show_bug.cgi?id=2874
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>
>

Well, apparently the attachment got removed. So here it is:

#!/usr/bin/perl

use strict;
use warnings;

use Getopt::Long qw(GetOptions);
use POSIX qw(strftime);

use PVE::Cluster;
use PVE::QemuConfig;
use PVE::QemuServer::Drive qw(drive_is_cdrom is_valid_drivename parse_drive);
use PVE::QemuServer::Helpers;
use PVE::Storage;
use PVE::Tools; # run_command is called by its full name below

# START OF HELPER FUNCTIONS

# Print a message prefixed with a timestamp and, if given, VM ID and volume ID.
sub pprint {
    my ($msg, $vmid, $volid) = @_;

    chomp($msg);

    my $time = strftime("%F %H:%M:%S", localtime);

    my $time_prefix = "$time - ";
    my $vmid_prefix = $vmid ? "$vmid - " : '';
    my $volid_prefix = $volid ? "$volid - " : '';

    print "$time_prefix$vmid_prefix$volid_prefix$msg\n";
}

my $fixed_vmlist;

# Return the IDs given on the command line, or else all QEMU VMs in the cluster.
sub get_vmids {
    return $fixed_vmlist if $fixed_vmlist;

    my $list = [];

    my $vmlist = PVE::Cluster::get_vmlist();

    for my $vmid (keys $vmlist->{ids}->%*) {
        next if $vmlist->{ids}->{$vmid}->{type} ne 'qemu';
        push $list->@*, $vmid;
    }

    return $list;
}

my $running = {};

# Check whether the VM runs on this node and log transitions in either direction.
sub update_running {
    my ($vmid) = @_;

    my $old_running = $running->{$vmid};

    $running->{$vmid} = eval { PVE::QemuServer::Helpers::vm_running_locally($vmid); };
    pprint("could not check if VM is running - $@", $vmid) if $@;

    pprint("stop monitoring - not running", $vmid) if !$running->{$vmid} && $old_running;
    pprint("start monitoring - now running", $vmid) if $running->{$vmid} && !$old_running;

    return $running->{$vmid};
}

# Return the volume ID of the first boot disk that is a valid, non-CD-ROM drive.
sub get_bootdisk_volid {
    my ($vmid) = @_;

    my $conf = PVE::QemuConfig->load_config($vmid);

    my $bootdisks = PVE::QemuServer::Drive::get_bootdisks($conf);

    for my $bootdisk ($bootdisks->@*) {
        next if !is_valid_drivename($bootdisk);
        next if !$conf->{$bootdisk};

        my $drive = parse_drive($bootdisk, $conf->{$bootdisk});
        next if !defined($drive);
        next if drive_is_cdrom($drive);

        my $volid = $drive->{file};
        next if !$volid;

        return $volid;
    }

    die "no bootdisk found in config\n";
}

my $errors = {};

# A VM is skipped after three consecutive failed checks.
sub should_skip {
    my ($vmid) = @_;

    return $errors->{$vmid} >= 3;
}

# END OF HELPER FUNCTIONS

my $interval = 60; # seconds between checks

GetOptions('interval=i' => \$interval);

if (scalar(@ARGV)) {
    $fixed_vmlist = [@ARGV];
    pprint("monitoring VMs " . join(',', sort {$a <=> $b} $fixed_vmlist->@*));
} else {
    pprint("no list of VMIDs provided - monitoring all VMs");
}

# Last seen base64-encoded first sector, keyed by VM ID.
my $contents = {};

while (1) {
    PVE::Cluster::cfs_update();
    my $storecfg = PVE::Storage::config();

    my $vmids = get_vmids();

    for my $vmid ($vmids->@*) {
        $errors->{$vmid} //= 0;

        next if should_skip($vmid);
        next if !update_running($vmid);

        eval {
            my $volid = get_bootdisk_volid($vmid);
            my $path = PVE::Storage::path($storecfg, $volid);

            # Read the first 512 bytes and encode them as a single base64 line.
            my $cmd = [
                ['qemu-img', 'dd', 'bs=512', 'count=1', "if=$path"],
                ['base64', '--wrap', '0'],
            ];

            my $content;
            PVE::Tools::run_command($cmd, outfunc => sub { $content = shift });
            die "no output\n" if !$content;

            if (!defined($contents->{$vmid})) {
                pprint("registered content for first sector", $vmid, $volid);
                print "$content\n";
                $contents->{$vmid} //= $content;
            }

            if ($content ne $contents->{$vmid}) {
                pprint("detected changed content for first sector!", $vmid, $volid);
                print "$content\n";
                $contents->{$vmid} = $content;
            }
        };
        if (my $err = $@) {
            pprint("can't determine content for first sector - $err", $vmid);
            $errors->{$vmid}++;
            pprint("too many errors - skipping from now on", $vmid) if should_skip($vmid);
        } else {
            # A successful check resets the consecutive-error counter.
            $errors->{$vmid} = 0;
        }
    }

    sleep $interval;
}
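
Since the script prints each sector as a single base64 line, it can help to decode a logged line when comparing dumps. Below is a minimal sketch of one way to do that. It is not part of monitor-sector-zero.pl, and the function name check_sector is made up for illustration. It assumes the boot disk carries an MBR or a GPT protective MBR, whose last two bytes are 0x55 0xAA; a disk without a partition table would legitimately fail this check.

```shell
# Hypothetical helper, not part of monitor-sector-zero.pl.
# Usage: check_sector <base64 line copied from the script's output>
# Assumption: the disk has an MBR or GPT protective MBR, so the sector
# is expected to end with the bytes 0x55 0xAA.
check_sector() (
    b64=$1

    # Number of bytes after decoding; a full sector dump should be 512.
    size=$(printf '%s' "$b64" | base64 -d | wc -c | tr -d ' ')
    if [ "$size" -ne 512 ]; then
        echo "unexpected size: $size bytes"
        exit 1
    fi

    # The last two bytes (offsets 510 and 511) as lowercase hex.
    sig=$(printf '%s' "$b64" | base64 -d | tail -c 2 | od -An -tx1 | tr -d ' \n')
    if [ "$sig" = "55aa" ]; then
        echo "boot signature present"
    else
        echo "boot signature missing (last bytes: $sig)"
        exit 1
    fi
)
```

To look at the whole sector rather than just the signature, the same line can be piped through `base64 -d | od -A d -t x1`, and two such dumps can then be compared with diff.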