From: Stoiko Ivanov <s.ivanov@proxmox.com>
To: pmg-devel@lists.proxmox.com
Subject: [pmg-devel] [PATCH pmg-api 5/7] config: add spam option for extract_text
Date: Mon, 13 Mar 2023 22:23:48 +0100
Message-ID: <20230313212351.111977-6-s.ivanov@proxmox.com>
In-Reply-To: <20230313212351.111977-1-s.ivanov@proxmox.com>

Add a boolean option that toggles the configuration of the ExtractText
SA plugin (see [0]).

The config is copied from the module itself. The informational headers
were not added, as I don't see much gain from them apart from verifying
that the plugin is working.

The external dependencies the plugin needs are added as Recommends,
since not having them installed and simply leaving the option disabled
is a valid setup.

[0] https://metacpan.org/pod/Mail::SpamAssassin::Plugin::ExtractText

Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
---
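Not part of the commit: since the helper tools are only Recommends, a
quick check that they are present can be useful before enabling the
option. The following is a minimal sketch, not something shipped by
this patch. The paths are the ones used in the extracttext_external
lines of v400.pre.in below; the function name check_helpers and the
output format are made up for illustration.

```shell
#!/bin/sh
# Sketch only: report which ExtractText helper binaries are installed.
# check_helpers is a hypothetical helper, not part of pmg-api.
# It returns the number of missing helpers as its exit status.
check_helpers() {
    missing=0
    for bin in "$@"; do
        if [ -x "$bin" ]; then
            echo "ok       $bin"
        else
            echo "missing  $bin"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Paths as referenced by the template in this patch
# (pdftotext is shipped in poppler-utils, tesseract in tesseract-ocr).
check_helpers \
    /usr/bin/pdftotext \
    /usr/bin/docx2txt \
    /usr/bin/antiword \
    /usr/bin/unrtf \
    /usr/bin/odt2txt \
    /usr/bin/tesseract \
    || echo "some helpers are missing, install them or leave extract_text disabled"
```

A non-zero exit status of check_helpers means at least one helper is
absent, in which case the corresponding attachment types would not be
extracted.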
 debian/control            |  9 ++++++++-
 src/PMG/Config.pm         |  6 ++++++
 src/templates/v400.pre.in | 34 ++++++++++++++++++++++++++++++----
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/debian/control b/debian/control
index 93ad72c..d2ed7da 100644
--- a/debian/control
+++ b/debian/control
@@ -98,7 +98,14 @@ Depends: apt (>= 2~),
          ucf,
          ${misc:Depends},
          ${perl:Depends},
-Recommends: ifupdown2, proxmox-offline-mirror-helper
+Recommends: antiword,
+            docx2txt,
+            ifupdown2,
+            odt2txt,
+            poppler-utils,
+            proxmox-offline-mirror-helper,
+            tesseract-ocr,
+            unrtf
 Suggests: zfsutils-linux
 Description: Proxmox Mailgateway API Server Implementation
  This implements a REST API to configure Proxmox Mailgateway.
diff --git a/src/PMG/Config.pm b/src/PMG/Config.pm
index 5dcffb7..699a622 100755
--- a/src/PMG/Config.pm
+++ b/src/PMG/Config.pm
@@ -211,6 +211,11 @@ sub properties {
 	    minimum => 64,
 	    default => 256*1024,
 	},
+	extract_text => {
+	    description => "Extract text from attachments (doc, pdf, rtf, images) and scan for spam.",
+	    type => 'boolean',
+	    default => 0,
+	},
     };
 }
 
@@ -225,6 +230,7 @@ sub options {
 	bounce_score => { optional => 1 },
 	rbl_checks => { optional => 1 },
 	maxspamsize => { optional => 1 },
+	extract_text => { optional => 1 },
     };
 }
 
diff --git a/src/templates/v400.pre.in b/src/templates/v400.pre.in
index 052e73e..4d68d6c 100644
--- a/src/templates/v400.pre.in
+++ b/src/templates/v400.pre.in
@@ -16,11 +16,37 @@
 # added to new files, named according to the release they're added in.
 ###########################################################################
 
+
+[% IF pmg.spam.extract_text %]
 # ExtractText - Extract text from documents or images for matching
-#
-# Requires manual configuration, see plugin documentation.
-#
-# loadplugin Mail::SpamAssassin::Plugin::ExtractText
+# informational headers and hits not configured
+loadplugin Mail::SpamAssassin::Plugin::ExtractText
+
+ifplugin Mail::SpamAssassin::Plugin::ExtractText
+
+  extracttext_external  pdftotext  /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 {} -
+  extracttext_use       pdftotext  .pdf application/pdf
+
+  # http://docx2txt.sourceforge.net
+  extracttext_external  docx2txt   /usr/bin/docx2txt {} -
+  extracttext_use       docx2txt   .docx application/docx
+
+  extracttext_external  antiword   /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
+  extracttext_use       antiword   .doc application/(?:vnd\.?)?ms-?word.*
+
+  extracttext_external  unrtf      /usr/bin/unrtf --nopict {}
+  extracttext_use       unrtf      .doc .rtf application/rtf text/rtf
+
+  extracttext_external  odt2txt    /usr/bin/odt2txt --encoding=UTF-8 {}
+  extracttext_use       odt2txt    .odt .ott application/.*?opendocument.*text
+  extracttext_use       odt2txt    .sdw .stw application/(?:x-)?soffice application/(?:x-)?starwriter
+
+  extracttext_external  tesseract  {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
+  extracttext_use       tesseract  .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)
+
+endif
+
+[% END %]
 
 # DecodeShortUrl - Check for shortened URLs
 #
-- 
2.30.2





Thread overview (11+ messages):
2023-03-13 21:23 [pmg-devel] [PATCH pmg-api 0/7] adapt to SpamAssassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 1/7] ruledb: spam: adapt to spamassassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 2/7] templates: sync spamassassin templates with 4.0.0 upstream Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 3/7] templates: add template for spamassassin's v342.pre Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 4/7] templates: add template for spamassassin's v400.pre Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 5/7] config: add spam option for extract_text Stoiko Ivanov [this message]
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 6/7] templates: enable DecodeShortUrls for SpamAssassin 4.0.0 Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-api 7/7] templates: enable DMARC plugin in v400.pre.in Stoiko Ivanov
2023-03-13 21:23 ` [pmg-devel] [PATCH pmg-gui 1/1] spamdetector: add extract_text option Stoiko Ivanov
2023-03-27 18:06   ` [pmg-devel] applied: " Thomas Lamprecht
2023-03-15 15:55 ` [pmg-devel] applied: [PATCH pmg-api 0/7] adapt to SpamAssassin 4.0.0 Thomas Lamprecht
