* [pve-devel] Script for bug #2874
@ 2023-01-27 13:49 Fiona Ebner
From: Fiona Ebner @ 2023-01-27 13:49 UTC
  To: Proxmox VE development discussion

The attached script monitors the first sector of the boot disk of
running VMs (all of them or a selection of IDs) for people affected by
bug #2874 [0]. The hope is to pinpoint when the sector gets corrupted, so
that the timing can be correlated with operations that might cause it.
The script also dumps the sector's contents, because seeing how the
sector gets corrupted might help as well.

Note that the script needs to run on each node. You can specify IDs of
VMs that are not currently on that node, which is useful to catch
migrating VMs. If you don't specify any IDs, all running VMs are
monitored.

The script parses the VM config to determine the boot disk, looks up its
path and pipes the output of qemu-img dd through base64 to capture the
first 512 bytes as text. It prints the contents when a VM is first
checked and again whenever they change.
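
To make the capture step concrete, here is a minimal shell sketch of the
same idea. It is not taken from the script: sector_zero_b64 is a made-up
helper name, and it uses plain head instead of qemu-img dd, so it assumes
a raw image or block device. "-w 0" is the short form of "--wrap 0".

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: print the first 512 bytes of
# a raw image or block device as a single base64 line.
# Assumption: the input is raw; the script itself goes through qemu-img dd.
sector_zero_b64() {
    head -c 512 "$1" | base64 -w 0
}

# Usage sketch (the path is a placeholder):
#   sector_zero_b64 /path/to/raw-image
```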

Example invocations:
# monitor all running VMs, check every 5 minutes
perl monitor-sector-zero.pl --interval 300
# only monitor 166 and 167, check every minute (the default), log to file
perl monitor-sector-zero.pl 166 167 &> /path/to/file
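
Each dump in the output is a single base64 line, printed right after a
"registered content" or "detected changed content" message. One way to
take a quick look at such a line is sketched below. The helper names are
made up, and the check only tells you whether the decoded data still ends
in 0x55 0xAA, the usual MBR boot signature; whether that is the relevant
symptom in your case is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting one base64 dump line from the log.

# Print a hex dump of the decoded sector, with decimal offsets.
dump_hex() {
    printf '%s' "$1" | base64 -d | od -A d -t x1
}

# Succeed if the decoded data ends in 0x55 0xAA, i.e. bytes 510 and 511 of
# a 512-byte sector, where an MBR keeps its boot signature.
has_boot_signature() {
    sig=$(printf '%s' "$1" | base64 -d | tail -c 2 | od -A n -t x1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Usage sketch, with LINE holding a dump copied from the log:
#   dump_hex "$LINE"
#   has_boot_signature "$LINE" && echo "boot signature present"
```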

Feedback from users and other developers is highly appreciated!

[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=2874

From: Fiona Ebner <f.ebner@proxmox.com>
Date: Fri, 27 Jan 2023 14:57:38 +0100
Subject: Re: [pve-devel] Script for bug #2874

Well, apparently the attachment got removed. So here it is:

#!/usr/bin/perl

use strict;
use warnings;

use Getopt::Long qw(GetOptions);
use POSIX qw(strftime);

use PVE::Cluster;
use PVE::QemuConfig;
use PVE::QemuServer::Drive qw(drive_is_cdrom is_valid_drivename parse_drive);
use PVE::QemuServer::Helpers;
use PVE::Storage;
use PVE::Tools; # for run_command()

# START OF HELPER FUNCTIONS

sub pprint {
     my ($msg, $vmid, $volid) = @_;

     chomp($msg);

     my $time = strftime("%F %H:%M:%S", localtime);
     my $time_prefix = "$time - ";

     my $vmid_prefix = $vmid ? "$vmid - " : '';
     my $volid_prefix = $volid ? "$volid - " : '';

     print "$time_prefix$vmid_prefix$volid_prefix$msg\n";
}

my $fixed_vmlist;
sub get_vmids {
     return $fixed_vmlist if $fixed_vmlist;

     my $list = [];
     my $vmlist = PVE::Cluster::get_vmlist();
     for my $vmid (keys $vmlist->{ids}->%*) {
	next if $vmlist->{ids}->{$vmid}->{type} ne 'qemu';
	push $list->@*, $vmid;
     }
     return $list;
}

my $running = {};
sub update_running {
     my ($vmid) = @_;

     my $old_running = $running->{$vmid};
     $running->{$vmid} = eval { PVE::QemuServer::Helpers::vm_running_locally($vmid); };
     pprint("could not check if VM is running - $@", $vmid) if $@;

     pprint("stop monitoring - not running", $vmid) if !$running->{$vmid} && $old_running;
     pprint("start monitoring - now running", $vmid) if $running->{$vmid} && !$old_running;

     return $running->{$vmid};
}

sub get_bootdisk_volid {
     my ($vmid) = @_;

     my $conf = PVE::QemuConfig->load_config($vmid);
     my $bootdisks = PVE::QemuServer::Drive::get_bootdisks($conf);
     for my $bootdisk ($bootdisks->@*) {
	next if !is_valid_drivename($bootdisk);
	next if !$conf->{$bootdisk};

	my $drive = parse_drive($bootdisk, $conf->{$bootdisk});
	next if !defined($drive);
	next if drive_is_cdrom($drive);

	my $volid = $drive->{file};
	next if !$volid;
	return $volid;
     }
     die "no bootdisk found in config\n";
}

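# number of consecutive failed checks per VM, reset to 0 after each success
my $errors = {};
sub should_skip {
     my ($vmid) = @_;

     # give up on a VM after 3 failed checks in a row
     return $errors->{$vmid} >= 3;
}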

# END OF HELPER FUNCTIONS

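# flush every log line right away, so nothing is lost if output is redirected
# to a file and the script gets killed
$| = 1;

my $interval = 60;
GetOptions('interval=i' => \$interval) or die "usage: $0 [--interval <seconds>] [<vmid>...]\n";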

if (scalar(@ARGV)) {
     $fixed_vmlist = [@ARGV];
     pprint("monitoring VMs " . join(',', sort {$a <=> $b} $fixed_vmlist->@*));
} else {
     pprint("no list of VMIDs provided - monitoring all VMs");
}

my $contents = {};

while (1) {
     PVE::Cluster::cfs_update();

     my $storecfg = PVE::Storage::config();

     my $vmids = get_vmids();
     for my $vmid ($vmids->@*) {
	$errors->{$vmid} //= 0;
	next if should_skip($vmid);

	next if !update_running($vmid);

	eval {
	    my $volid = get_bootdisk_volid($vmid);
	    my $path = PVE::Storage::path($storecfg, $volid);

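	    # read the first 512 bytes via qemu-img and pipe them through base64,
	    # so that the sector ends up as a single line of printable text
	    my $cmd = [
		['qemu-img', 'dd', 'bs=512', 'count=1', "if=$path"],
		['base64', '--wrap', '0'],
	    ];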

	    my $content;
	    PVE::Tools::run_command($cmd, outfunc => sub { $content = shift });
	    die "no output\n" if !$content;

	    if (!defined($contents->{$vmid})) {
		# first successful read for this VM: remember it as the baseline
		pprint("registered content for first sector", $vmid, $volid);
		print "$content\n";
		$contents->{$vmid} = $content;
	    } elsif ($content ne $contents->{$vmid}) {
		# differs from the last recorded dump: log it and update the baseline
		pprint("detected changed content for first sector!", $vmid, $volid);
		print "$content\n";
		$contents->{$vmid} = $content;
	    }
	};
	if (my $err = $@) {
	    pprint("can't determine content for first sector - $err", $vmid);
	    $errors->{$vmid}++;
	    pprint("too many errors - skipping from now on", $vmid) if should_skip($vmid);
	} else {
	    $errors->{$vmid} = 0;
	}
     }

     sleep $interval;
}
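
To see how the sector changed, two dumps from the log can be decoded and
compared byte by byte. A rough sketch with a made-up helper name: cmp -l
prints, for every differing byte, its 1-based offset followed by the old
and new value in octal.

```shell
#!/bin/sh
# Hypothetical helper: list the bytes that differ between two base64 dumps.
# Returns 0 if the dumps are identical and 1 if they differ.
diff_dumps() {
    old=$(mktemp) || return 2
    new=$(mktemp) || { rm -f "$old"; return 2; }
    printf '%s' "$1" | base64 -d > "$old"
    printf '%s' "$2" | base64 -d > "$new"
    rc=0
    cmp -l "$old" "$new" || rc=$?
    rm -f "$old" "$new"
    return "$rc"
}

# Usage sketch, with OLD and NEW holding two dump lines copied from the log:
#   diff_dumps "$OLD" "$NEW"
```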



