From: Fabian Ebner
Date: Wed, 25 May 2022 10:10:43 +0200
To: pve-devel@lists.proxmox.com
Subject: Re: [pve-devel] [RFC/PATCH qemu] PVE-Backup: avoid segfault issues upon backup-cancel
References: <20220524113050.179182-1-f.ebner@proxmox.com>
In-Reply-To: <20220524113050.179182-1-f.ebner@proxmox.com>
List-Id: Proxmox VE development discussion

There might still be an edge case where completion and cancel race (I
haven't run into this in practice yet, but at first glance it seems
possible):

1. job_exit -> job_completed -> job_finalize_single starts
2. pvebackup_co_complete_stream gets spawned in the completion callback
3. job_finalize_single finishes -> the job's refcount hits zero -> the job
   is freed
4. qmp_backup_cancel comes in and locks backup_state.backup_mutex before
   pvebackup_co_complete_stream can remove the job from the di_list
5. qmp_backup_cancel/job_cancel_bh will operate on the already freed memory

It /would/ be fine if pvebackup_co_complete_stream were guaranteed to
run/take the backup_mutex before qmp_backup_cancel. It *is* spawned
earlier, so maybe it is, but I haven't looked into ordering guarantees for
coroutines yet, and it does have another yield point when taking
&backup_state.stat.lock, so I'm not so sure.

Possible fix: ref jobs when adding them to di_list and unref them when
removing them from di_list (instead of the ref/unref proposed in this
patch).

Yet another issue (not directly related, but thematically): in
create_backup_jobs_bh, in the error case, job_cancel_sync is called for
each job, but since it's a transaction, the first call will cancel and free
all jobs, which also leads to segfaults in scenarios where creation of a
non-first job fails. And even if the first one fails, the job_unref there
is also wrong, since the job was already freed during job_cancel_sync.
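
To illustrate the "ref when adding to di_list, unref when removing" idea, here is a minimal standalone sketch. It is a toy model, not QEMU code: ToyJob, ToyDevInfo and all toy_* functions are hypothetical stand-ins for the real job, the di_list entries and job_ref/job_unref, and there is no locking or coroutine handling. It only shows that a reference held by the list keeps the job alive after finalization has dropped its own reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a job: just a reference count. */
typedef struct ToyJob {
    int refcnt;
    int *freed_flag;            /* set to 1 when the job is freed (demo only) */
} ToyJob;

/* Hypothetical stand-in for a di_list entry that points at a job. */
typedef struct ToyDevInfo {
    ToyJob *job;
    struct ToyDevInfo *next;
} ToyDevInfo;

/* Create a job with one reference owned by the creator. */
static ToyJob *toy_job_new(int *freed_flag)
{
    ToyJob *job = malloc(sizeof(*job));
    if (!job) {
        abort();
    }
    job->refcnt = 1;
    job->freed_flag = freed_flag;
    return job;
}

static void toy_job_ref(ToyJob *job)
{
    job->refcnt++;
}

/* Drop one reference; free the job when the last one goes away. */
static void toy_job_unref(ToyJob *job)
{
    if (--job->refcnt == 0) {
        *job->freed_flag = 1;
        free(job);
    }
}

/* Adding a job to the list takes a reference on behalf of the list. */
static ToyDevInfo *toy_list_add(ToyDevInfo *head, ToyJob *job)
{
    ToyDevInfo *di = malloc(sizeof(*di));
    if (!di) {
        abort();
    }
    toy_job_ref(job);
    di->job = job;
    di->next = head;
    return di;
}

/* Removing a job from the list drops the list's reference. */
static ToyDevInfo *toy_list_remove(ToyDevInfo *head, ToyJob *job)
{
    ToyDevInfo **pp = &head;
    while (*pp) {
        ToyDevInfo *di = *pp;
        if (di->job == job) {
            *pp = di->next;
            toy_job_unref(di->job);
            free(di);
            break;
        }
        pp = &di->next;
    }
    return head;
}
```

In this model, a cancel path that walks the list between finalization (which drops the creator's reference) and the completion handler (which removes the entry) would still see a live job, because the list's own reference is only dropped in toy_list_remove.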