From: Stoiko Ivanov <s.ivanov@proxmox.com>
To: pmg-devel@lists.proxmox.com
Subject: [pmg-devel] [PATCH pmg-api 5/7] config: add spam option for extract_text
Date: Mon, 13 Mar 2023 22:23:48 +0100
Message-ID: <20230313212351.111977-6-s.ivanov@proxmox.com>
In-Reply-To: <20230313212351.111977-1-s.ivanov@proxmox.com>
Add an option for toggling the configuration of the ExtractText SA
plugin (see [0]).
The config is copied from the module itself. The informational headers
were not added, as I don't see much gain in them apart from verifying
that the plugin is working.
The external dependencies the plugin needs are added as Recommends,
since it is a valid setup to not have them installed and simply leave
the option disabled.
[0] https://metacpan.org/pod/Mail::SpamAssassin::Plugin::ExtractText
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
---
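Note (not part of the commit message): a minimal sketch of how the new
option might look once enabled, assuming /etc/pmg/pmg.conf uses its usual
"section: NAME" layout and that the option lives in the spam section, as
the pmg.spam.extract_text reference in the template suggests. The fragment
is illustrative only and was not taken from a running system.

```
section: spam
	extract_text 1
```

With the option left at its default (0), the [% IF %] block in v400.pre.in
renders nothing, so the plugin is not loaded and the recommended helper
packages are not needed.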
debian/control | 9 ++++++++-
src/PMG/Config.pm | 6 ++++++
src/templates/v400.pre.in | 34 ++++++++++++++++++++++++++++++----
3 files changed, 44 insertions(+), 5 deletions(-)
diff --git a/debian/control b/debian/control
index 93ad72c..d2ed7da 100644
--- a/debian/control
+++ b/debian/control
@@ -98,7 +98,14 @@ Depends: apt (>= 2~),
ucf,
${misc:Depends},
${perl:Depends},
-Recommends: ifupdown2, proxmox-offline-mirror-helper
+Recommends: antiword,
+ docx2txt,
+ ifupdown2,
+ odt2txt,
+ poppler-utils,
+ proxmox-offline-mirror-helper,
+ tesseract-ocr,
+ unrtf
Suggests: zfsutils-linux
Description: Proxmox Mailgateway API Server Implementation
This implements a REST API to configure Proxmox Mailgateway.
diff --git a/src/PMG/Config.pm b/src/PMG/Config.pm
index 5dcffb7..699a622 100755
--- a/src/PMG/Config.pm
+++ b/src/PMG/Config.pm
@@ -211,6 +211,11 @@ sub properties {
minimum => 64,
default => 256*1024,
},
+ extract_text => {
+ description => "Extract text from attachments (doc, pdf, rtf, images) and scan for spam.",
+ type => 'boolean',
+ default => 0,
+ },
};
}
@@ -225,6 +230,7 @@ sub options {
bounce_score => { optional => 1 },
rbl_checks => { optional => 1 },
maxspamsize => { optional => 1 },
+ extract_text => { optional => 1 },
};
}
diff --git a/src/templates/v400.pre.in b/src/templates/v400.pre.in
index 052e73e..4d68d6c 100644
--- a/src/templates/v400.pre.in
+++ b/src/templates/v400.pre.in
@@ -16,11 +16,37 @@
# added to new files, named according to the release they're added in.
###########################################################################
+
+[% IF pmg.spam.extract_text %]
# ExtractText - Extract text from documents or images for matching
-#
-# Requires manual configuration, see plugin documentation.
-#
-# loadplugin Mail::SpamAssassin::Plugin::ExtractText
+# informational headers and hits not configured
+loadplugin Mail::SpamAssassin::Plugin::ExtractText
+
+ifplugin Mail::SpamAssassin::Plugin::ExtractText
+
+ extracttext_external pdftotext /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 {} -
+ extracttext_use pdftotext .pdf application/pdf
+
+ # http://docx2txt.sourceforge.net
+ extracttext_external docx2txt /usr/bin/docx2txt {} -
+ extracttext_use docx2txt .docx application/docx
+
+ extracttext_external antiword /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
+ extracttext_use antiword .doc application/(?:vnd\.?)?ms-?word.*
+
+ extracttext_external unrtf /usr/bin/unrtf --nopict {}
+ extracttext_use unrtf .doc .rtf application/rtf text/rtf
+
+ extracttext_external odt2txt /usr/bin/odt2txt --encoding=UTF-8 {}
+ extracttext_use odt2txt .odt .ott application/.*?opendocument.*text
+ extracttext_use odt2txt .sdw .stw application/(?:x-)?soffice application/(?:x-)?starwriter
+
+ extracttext_external tesseract {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
+ extracttext_use tesseract .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)
+
+endif
+
+[% END %]
# DecodeShortUrl - Check for shortened URLs
#
--
2.30.2
Thread overview: 11+ messages
2023-03-13 21:23 [pmg-devel] [PATCH pmg-api 0/7] adapt to SpamAssassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 1/7] ruledb: spam: adapt to spamassassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 2/7] templates: sync spamassassin templates with 4.0.0 upstream Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 3/7] templates: add template for spamassassin's v342.pre Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 4/7] templates: add template for spamassassin's v400.pre Stoiko Ivanov
2023-03-13 21:23 ` Stoiko Ivanov [this message]
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 6/7] templates: enable DecodeShortUrls for SpamAssassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 7/7] templates: enable DMARC plugin in v400.pre.in Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-gui 1/1] spamdetector: add extract_text option Stoiko Ivanov
2023-03-27 18:06 ` [pmg-devel] applied: " Thomas Lamprecht
2023-03-15 15:55 ` [pmg-devel] applied: [PATCH pmg-api 0/7] adapt to SpamAssassin 4.0.0 Thomas Lamprecht