public inbox for pmg-devel@lists.proxmox.com
 help / color / mirror / Atom feed
From: Stoiko Ivanov <s.ivanov@proxmox.com>
To: pmg-devel@lists.proxmox.com
Subject: [pmg-devel] [PATCH pmg-api 5/7] config: add spam option for extract_text
Date: Mon, 13 Mar 2023 22:23:48 +0100	[thread overview]
Message-ID: <20230313212351.111977-6-s.ivanov@proxmox.com> (raw)
In-Reply-To: <20230313212351.111977-1-s.ivanov@proxmox.com>

toggling the configuration options for the ExtractText SA plugin (see
[0]).

The config is copied from the module itself, the informational headers
were not added, as I don't see too much gain, apart from verifying
that the plugin is working.

the external dependencies for the plugin to work are added as
Recommends, as it is a possible config to not have them installed and
simply disable the option

[0] https://metacpan.org/pod/Mail::SpamAssassin::Plugin::ExtractText
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
---
 debian/control            |  9 ++++++++-
 src/PMG/Config.pm         |  6 ++++++
 src/templates/v400.pre.in | 34 ++++++++++++++++++++++++++++++----
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/debian/control b/debian/control
index 93ad72c..d2ed7da 100644
--- a/debian/control
+++ b/debian/control
@@ -98,7 +98,14 @@ Depends: apt (>= 2~),
          ucf,
          ${misc:Depends},
          ${perl:Depends},
-Recommends: ifupdown2, proxmox-offline-mirror-helper
+Recommends: antiword,
+            docx2txt,
+            ifupdown2,
+            odt2txt,
+            poppler-utils,
+            proxmox-offline-mirror-helper,
+            tesseract-ocr,
+            unrtf
 Suggests: zfsutils-linux
 Description: Proxmox Mailgateway API Server Implementation
  This implements a REST API to configure Proxmox Mailgateway.
diff --git a/src/PMG/Config.pm b/src/PMG/Config.pm
index 5dcffb7..699a622 100755
--- a/src/PMG/Config.pm
+++ b/src/PMG/Config.pm
@@ -211,6 +211,11 @@ sub properties {
 	    minimum => 64,
 	    default => 256*1024,
 	},
+	extract_text => {
+	    description => "Extract text from attachments (doc, pdf, rtf, images) and scan for spam.",
+	    type => 'boolean',
+	    default => 0,
+	},
     };
 }
 
@@ -225,6 +230,7 @@ sub options {
 	bounce_score => { optional => 1 },
 	rbl_checks => { optional => 1 },
 	maxspamsize => { optional => 1 },
+	extract_text => { optional => 1 },
     };
 }
 
diff --git a/src/templates/v400.pre.in b/src/templates/v400.pre.in
index 052e73e..4d68d6c 100644
--- a/src/templates/v400.pre.in
+++ b/src/templates/v400.pre.in
@@ -16,11 +16,37 @@
 # added to new files, named according to the release they're added in.
 ###########################################################################
 
+
+[% IF pmg.spam.extract_text %]
 # ExtractText - Extract text from documents or images for matching
-#
-# Requires manual configuration, see plugin documentation.
-#
-# loadplugin Mail::SpamAssassin::Plugin::ExtractText
+# informational headers and hits not configured
+loadplugin Mail::SpamAssassin::Plugin::ExtractText
+
+ifplugin Mail::SpamAssassin::Plugin::ExtractText
+
+  extracttext_external  pdftotext  /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 {} -
+  extracttext_use       pdftotext  .pdf application/pdf
+
+  # http://docx2txt.sourceforge.net
+  extracttext_external  docx2txt   /usr/bin/docx2txt {} -
+  extracttext_use       docx2txt   .docx application/docx
+
+  extracttext_external  antiword   /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
+  extracttext_use       antiword   .doc application/(?:vnd\.?)?ms-?word.*
+
+  extracttext_external  unrtf      /usr/bin/unrtf --nopict {}
+  extracttext_use       unrtf      .doc .rtf application/rtf text/rtf
+
+  extracttext_external  odt2txt    /usr/bin/odt2txt --encoding=UTF-8 {}
+  extracttext_use       odt2txt    .odt .ott application/.*?opendocument.*text
+  extracttext_use       odt2txt    .sdw .stw application/(?:x-)?soffice application/(?:x-)?starwriter
+
+  extracttext_external  tesseract  {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
+  extracttext_use       tesseract  .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)
+
+endif
+
+[% END %]
 
 # DecodeShortUrl - Check for shortened URLs
 #
-- 
2.30.2





  parent reply	other threads:[~2023-03-13 21:24 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-13 21:23 [pmg-devel] [PATCH pmg-api 0/7] adapt to SpamAssassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 1/7] ruledb: spam: adapt to spamassassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 2/7] templates: sync spamassassin templates with 4.0.0 upstream Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 3/7] templates: add template for spamassassin's v342.pre Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 4/7] templates: add template for spamassassin's v400.pre Stoiko Ivanov
2023-03-13 21:23 ` Stoiko Ivanov [this message]
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 6/7] templates: enable DecodeShortUrls for SpamAssassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 7/7] templates: enable DMARC plugin in v400.pre.in Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-gui 1/1] spamdetector: add extract_text option Stoiko Ivanov
2023-03-27 18:06   ` [pmg-devel] applied: " Thomas Lamprecht
2023-03-15 15:55 ` [pmg-devel] applied: [PATCH pmg-api 0/7] adapt to SpamAssassin 4.0.0 Thomas Lamprecht

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230313212351.111977-6-s.ivanov@proxmox.com \
    --to=s.ivanov@proxmox.com \
    --cc=pmg-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal