From: Stoiko Ivanov
To: pmg-devel@lists.proxmox.com
Date: Mon, 13 Mar 2023 22:23:48 +0100
Message-Id: <20230313212351.111977-6-s.ivanov@proxmox.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20230313212351.111977-1-s.ivanov@proxmox.com>
References: <20230313212351.111977-1-s.ivanov@proxmox.com>
Subject: [pmg-devel] [PATCH pmg-api 5/7] config: add spam option for extract_text
List-Id: Proxmox Mail Gateway development discussion

Add an option for toggling the configuration of the ExtractText SA
plugin (see [0]).

The config is copied from the module itself. The informational headers
were not added, as I don't see much gain from them apart from verifying
that the plugin is working.

The external dependencies the plugin needs in order to work are added as
Recommends, since it is a valid setup to not have them installed and
simply leave the option disabled.

[0] https://metacpan.org/pod/Mail::SpamAssassin::Plugin::ExtractText

Signed-off-by: Stoiko Ivanov
---
 debian/control            |  9 ++++++++-
 src/PMG/Config.pm         |  6 ++++++
 src/templates/v400.pre.in | 34 ++++++++++++++++++++++++++++++----
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/debian/control b/debian/control
index 93ad72c..d2ed7da 100644
--- a/debian/control
+++ b/debian/control
@@ -98,7 +98,14 @@ Depends: apt (>= 2~),
  ucf,
  ${misc:Depends},
  ${perl:Depends},
-Recommends: ifupdown2, proxmox-offline-mirror-helper
+Recommends: antiword,
+ docx2txt,
+ ifupdown2,
+ odt2txt,
+ poppler-utils,
+ proxmox-offline-mirror-helper,
+ tesseract-ocr,
+ unrtf
 Suggests: zfsutils-linux
 Description: Proxmox Mailgateway API Server Implementation
  This implements a REST API to configure Proxmox Mailgateway.
diff --git a/src/PMG/Config.pm b/src/PMG/Config.pm
index 5dcffb7..699a622 100755
--- a/src/PMG/Config.pm
+++ b/src/PMG/Config.pm
@@ -211,6 +211,11 @@ sub properties {
             minimum => 64,
             default => 256*1024,
         },
+        extract_text => {
+            description => "Extract text from attachments (doc, pdf, rtf, images) and scan for spam.",
+            type => 'boolean',
+            default => 0,
+        },
     };
 }
 
@@ -225,6 +230,7 @@ sub options {
         bounce_score => { optional => 1 },
         rbl_checks => { optional => 1 },
         maxspamsize => { optional => 1 },
+        extract_text => { optional => 1 },
     };
 }
 
diff --git a/src/templates/v400.pre.in b/src/templates/v400.pre.in
index 052e73e..4d68d6c 100644
--- a/src/templates/v400.pre.in
+++ b/src/templates/v400.pre.in
@@ -16,11 +16,37 @@
 # added to new files, named according to the release they're added in.
 ###########################################################################
 
+
+[% IF pmg.spam.extract_text %]
 # ExtractText - Extract text from documents or images for matching
-#
-# Requires manual configuration, see plugin documentation.
-#
-# loadplugin Mail::SpamAssassin::Plugin::ExtractText
+# informational headers and hits not configured
+loadplugin Mail::SpamAssassin::Plugin::ExtractText
+
+ifplugin Mail::SpamAssassin::Plugin::ExtractText
+
+  extracttext_external pdftotext /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 {} -
+  extracttext_use pdftotext .pdf application/pdf
+
+  # http://docx2txt.sourceforge.net
+  extracttext_external docx2txt /usr/bin/docx2txt {} -
+  extracttext_use docx2txt .docx application/docx
+
+  extracttext_external antiword /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
+  extracttext_use antiword .doc application/(?:vnd\.?)?ms-?word.*
+
+  extracttext_external unrtf /usr/bin/unrtf --nopict {}
+  extracttext_use unrtf .doc .rtf application/rtf text/rtf
+
+  extracttext_external odt2txt /usr/bin/odt2txt --encoding=UTF-8 {}
+  extracttext_use odt2txt .odt .ott application/.*?opendocument.*text
+  extracttext_use odt2txt .sdw .stw application/(?:x-)?soffice application/(?:x-)?starwriter
+
+  extracttext_external tesseract {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
+  extracttext_use tesseract .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)
+
+endif
+
+[% END %]
 
 # DecodeShortUrl - Check for shortened URLs
 #
-- 
2.30.2
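
Not part of the patch: since the extractors are only Recommends, a host may have the option enabled while some helper binaries are absent. The following is a minimal sketch for checking that by hand. The function name `check_extracttext_helpers` is made up for illustration and does not exist in pmg-api; the paths in the example call are the ones used in the template above.

```shell
#!/bin/sh
# Hypothetical helper (not part of pmg-api): print one status line per
# extractor binary and return non-zero if any of them is not executable.
check_extracttext_helpers() {
    missing=0
    for bin in "$@"; do
        if [ -x "$bin" ]; then
            printf 'ok      %s\n' "$bin"
        else
            printf 'missing %s\n' "$bin"
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the paths referenced in v400.pre.in:
# check_extracttext_helpers /usr/bin/pdftotext /usr/bin/docx2txt \
#     /usr/bin/antiword /usr/bin/unrtf /usr/bin/odt2txt /usr/bin/tesseract
```

The check only tests that each path is executable; it does not verify that the plugin can actually run the tool on an attachment.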