From: Stoiko Ivanov <s.ivanov@proxmox.com>
To: pmg-devel@lists.proxmox.com
Subject: [pmg-devel] [PATCH pmg-api 5/7] config: add spam option for extract_text
Date: Mon, 13 Mar 2023 22:23:48 +0100 [thread overview]
Message-ID: <20230313212351.111977-6-s.ivanov@proxmox.com> (raw)
In-Reply-To: <20230313212351.111977-1-s.ivanov@proxmox.com>
toggling the configuration options for the ExtractText SA plugin (see
[0]).
The config is copied from the module itself, the informational headers
were not added, as I don't see too much gain, apart from verifying
that the plugin is working.
the external dependencies for the plugin to work are added as
Recommends, as it is a possible config to not have them installed and
simply disable the option
[0] https://metacpan.org/pod/Mail::SpamAssassin::Plugin::ExtractText
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
---
debian/control | 9 ++++++++-
src/PMG/Config.pm | 6 ++++++
src/templates/v400.pre.in | 34 ++++++++++++++++++++++++++++++----
3 files changed, 44 insertions(+), 5 deletions(-)
diff --git a/debian/control b/debian/control
index 93ad72c..d2ed7da 100644
--- a/debian/control
+++ b/debian/control
@@ -98,7 +98,14 @@ Depends: apt (>= 2~),
ucf,
${misc:Depends},
${perl:Depends},
-Recommends: ifupdown2, proxmox-offline-mirror-helper
+Recommends: antiword,
+ docx2txt,
+ ifupdown2,
+ odt2txt,
+ poppler-utils,
+ proxmox-offline-mirror-helper,
+ tesseract-ocr,
+ unrtf
Suggests: zfsutils-linux
Description: Proxmox Mailgateway API Server Implementation
This implements a REST API to configure Proxmox Mailgateway.
diff --git a/src/PMG/Config.pm b/src/PMG/Config.pm
index 5dcffb7..699a622 100755
--- a/src/PMG/Config.pm
+++ b/src/PMG/Config.pm
@@ -211,6 +211,11 @@ sub properties {
minimum => 64,
default => 256*1024,
},
+ extract_text => {
+ description => "Extract text from attachments (doc, pdf, rtf, images) and scan for spam.",
+ type => 'boolean',
+ default => 0,
+ },
};
}
@@ -225,6 +230,7 @@ sub options {
bounce_score => { optional => 1 },
rbl_checks => { optional => 1 },
maxspamsize => { optional => 1 },
+ extract_text => { optional => 1 },
};
}
diff --git a/src/templates/v400.pre.in b/src/templates/v400.pre.in
index 052e73e..4d68d6c 100644
--- a/src/templates/v400.pre.in
+++ b/src/templates/v400.pre.in
@@ -16,11 +16,37 @@
# added to new files, named according to the release they're added in.
###########################################################################
+
+[% IF pmg.spam.extract_text %]
# ExtractText - Extract text from documents or images for matching
-#
-# Requires manual configuration, see plugin documentation.
-#
-# loadplugin Mail::SpamAssassin::Plugin::ExtractText
+# informational headers and hits not configured
+loadplugin Mail::SpamAssassin::Plugin::ExtractText
+
+ifplugin Mail::SpamAssassin::Plugin::ExtractText
+
+ extracttext_external pdftotext /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 {} -
+ extracttext_use pdftotext .pdf application/pdf
+
+ # http://docx2txt.sourceforge.net
+ extracttext_external docx2txt /usr/bin/docx2txt {} -
+ extracttext_use docx2txt .docx application/docx
+
+ extracttext_external antiword /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
+ extracttext_use antiword .doc application/(?:vnd\.?)?ms-?word.*
+
+ extracttext_external unrtf /usr/bin/unrtf --nopict {}
+ extracttext_use unrtf .doc .rtf application/rtf text/rtf
+
+ extracttext_external odt2txt /usr/bin/odt2txt --encoding=UTF-8 {}
+ extracttext_use odt2txt .odt .ott application/.*?opendocument.*text
+ extracttext_use odt2txt .sdw .stw application/(?:x-)?soffice application/(?:x-)?starwriter
+
+ extracttext_external tesseract {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
+ extracttext_use tesseract .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)
+
+endif
+
+[% END %]
# DecodeShortUrl - Check for shortened URLs
#
--
2.30.2
next prev parent reply other threads:[~2023-03-13 21:24 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-13 21:23 [pmg-devel] [PATCH pmg-api 0/7] adapt to SpamAssassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 1/7] ruledb: spam: adapt to spamassassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 2/7] templates: sync spamassassin templates with 4.0.0 upstream Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 3/7] templates: add template for spamassassin's v342.pre Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 4/7] templates: add template for spamassassin's v400.pre Stoiko Ivanov
2023-03-13 21:23 ` Stoiko Ivanov [this message]
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 6/7] templates: enable DecodeShortUrls for SpamAssassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 7/7] templates: enable DMARC plugin in v400.pre.in Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-gui 1/1] spamdetector: add extract_text option Stoiko Ivanov
2023-03-27 18:06 ` [pmg-devel] applied: " Thomas Lamprecht
2023-03-15 15:55 ` [pmg-devel] applied: [PATCH pmg-api 0/7] adapt to SpamAssassin 4.0.0 Thomas Lamprecht
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230313212351.111977-6-s.ivanov@proxmox.com \
--to=s.ivanov@proxmox.com \
--cc=pmg-devel@lists.proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox