Mar 14 2005

Site-wide Bayesian Filtering with SpamAssassin

Tag:markmaldony @ 15:28

1. Setting up Site-Wide Bayesian Filtering

In local.cf, tell SpamAssassin where to find the Bayesian database files:

bayes_path /etc/mail/spamassassin/bayes

This tells the system that the Bayesian filter database files will be /etc/mail/spamassassin/bayes_msgcount, _seen and _toks; the last component of bayes_path is a filename prefix, not a directory. Feel free to move them wherever you want.

Now start feeding the Bayesian filter spam and ham messages. Tell sa-learn to use /etc/mail/spamassassin as the configuration directory (i.e. where to find the bayes_msgcount, _seen and _toks files):

sa-learn --spam -C /etc/mail/spamassassin --showdots --dir /path/to/directory/full/of/spam/msgs
sa-learn --ham -C /etc/mail/spamassassin --showdots --dir /path/to/directory/full/of/ham/msgs

See SiteWideBayesFeedback for more tips on getting an entire site to feed back spam and ham messages into the Bayesian filter. Just use -C to make sure that the correct database files are used.
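One way to check that training actually reached the shared database is to read the message counters printed by sa-learn --dump magic. The sketch below assumes the usual dump layout, in which the count is the third whitespace-separated field on the "non-token data: nspam" and "nham" lines; bayes_counts and bayes_ready are hypothetical helper names. By default SpamAssassin does not apply Bayes scores until it has learned at least 200 spam and 200 ham messages (bayes_min_spam_num and bayes_min_ham_num).

```shell
#!/bin/sh
# Hypothetical helper: read `sa-learn --dump magic` output on stdin and
# print "nspam nham". Assumes the count is the third field on the
# "non-token data: nspam" / "non-token data: nham" lines.
bayes_counts() {
    awk '
        /non-token data: nspam$/ { spam = $3 }
        /non-token data: nham$/  { ham  = $3 }
        END { print spam + 0, ham + 0 }
    '
}

# Succeeds only when both counts reach the given minimum (default 200).
bayes_ready() {
    min="${1:-200}"
    set -- $(bayes_counts)
    [ "$1" -ge "$min" ] && [ "$2" -ge "$min" ]
}
```

For example: sa-learn -C /etc/mail/spamassassin --dump magic | bayes_ready && echo "enough mail learned"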

Also restart spamd if you’re running it already so that it will re-read local.cf and enable the Bayes filter:

kill $(ps axo %p%a | awk '/[s]pamd/ { print $1 }')
spamd -x -q -d -L -u nobody

(your spamd options may be different than mine)

You may run into permission problems. Make sure the Bayes files are readable and writable (chmod) by the group that your mail users share.
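A minimal sketch of that permission fix, assuming the database files share the bayes_path prefix shown earlier. The helper name share_bayes_files and the group you pass in are assumptions; substitute whatever group your mail users actually have in common.

```shell
#!/bin/sh
# Hypothetical helper: give one group read/write access to every file
# that starts with the bayes_path prefix.
#   $1 = bayes_path prefix, e.g. /etc/mail/spamassassin/bayes
#   $2 = group name (an assumption; pick the group your users share)
share_bayes_files() {
    prefix="$1"
    group="$2"
    for f in "$prefix"_*; do
        [ -e "$f" ] || continue          # glob matched nothing
        chgrp "$group" "$f" && chmod 660 "$f"
    done
}
```

For example: share_bayes_files /etc/mail/spamassassin/bayes mail. The directory itself will probably need to be writable by the same group as well, since lock and temporary files are created next to the database.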

If you are running spamd in setuid mode (it setuid’s to the user who ran spamc), you will probably need to set bayes_file_mode in local.cf. Otherwise, the Bayes file permissions will default to 0700 when the first caller causes updates, and subsequent callers will lack the permissions to open these files.

In local.cf (your settings may vary):

bayes_file_mode 0770

See Mail::SpamAssassin::Conf(3) for details.

1. How can my users feed back mail for the Bayesian learner?

If you want to set up site-wide use of Bayesian classification, you should set up a way for your users to send in misclassified mail to be “learned” from.

If you create mailboxes for false positives and false negatives, you can then run a cron job intermittently to learn all the mails in that mailbox as spam (or non-spam). Details on having your users redirect from mail clients to these mailboxes without mangling the headers are at ResendingMailWithHeaders.
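The cron-driven learning step could look roughly like the sketch below. It assumes the feedback mailboxes are plain mbox files; learn_mbox, SA_LEARN and SA_CONFIG are hypothetical names, and the mailbox paths in the usage note are placeholders.

```shell
#!/bin/sh
# Sketch of a feedback-learning job. All names and paths are assumptions.
SA_LEARN="${SA_LEARN:-sa-learn}"                  # command to run
SA_CONFIG="${SA_CONFIG:-/etc/mail/spamassassin}"  # same -C directory as above

# learn_mbox KIND FILE: feed one mbox file to sa-learn as "spam" or "ham".
learn_mbox() {
    kind="$1"
    mbox="$2"
    case "$kind" in
        spam|ham) ;;
        *) echo "usage: learn_mbox spam|ham FILE" >&2; return 2 ;;
    esac
    [ -s "$mbox" ] || return 0        # missing or empty: nothing to learn
    "$SA_LEARN" "--$kind" -C "$SA_CONFIG" --mbox "$mbox"
}
```

A wrapper script run from cron would then call, for instance, learn_mbox spam /var/mail/qqqspam and learn_mbox ham /var/mail/qqqnospam. sa-learn records which messages it has already learned, so running it over the same mailbox again should be harmless, only slower.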

2. Using Procmail with learning

For one approach, see ProcmailToForwardMail.

3. Submitting multiple messages at once

For those who want an “easier” way, one that also works with Outlook Express AND Outlook… (This also allows users to submit many messages at once.) This maintains full headers and bodies. (Or as best as I can tell - someone tell me if I’m wrong!)

Create a *new* mail message in Outlook/Express. Resize the windows so that you can see both your new message as well as the main O/OE window. Select the messages you want to send as Spam or Ham (probably not both in the same message) and drag them “into” the new message. This will send all the messages as attachments to the main email.

As an admin, I like to review all these submissions to be sure they are really valid to submit to Bayes for training. I open the mail account with IMAP - usually using OE, and drag the appropriate attached messages to the IMAP folder(s) I want to feed to sa-learn. This allows you to review all messages before they’re learned, and gives the users a pretty easy way to submit FP’s & FN’s as well as any other submissions you need.

(I also set up two “drop boxes” for mail - say qqqspam and qqqnospam - as reiterated below, make them difficult to “guess” as you don’t want spammers filling up your spam or ham drop.)

4. Using procmail to remove forwarding info

For MUAs (like Netscape/Mozilla) that do a good job of keeping the original headers intact, (almost) all you need to do is forward the email to the feedback account and strip off the headers added by the forward. See BayesFeedbackViaForwarding for details.

Pine bouncing, for instance, adds headers like:

ReSent-Date: Wed, 27 Oct 2004 08:57:14 -0500 (CDT)
ReSent-From: My Name
ReSent-To: myname+spam@company.com
ReSent-Subject: !!! How can you refuse such a sexy plus size single community!
ReSent-Message-ID:

which may be stripped with an appropriate .procmailrc / sed stanza:

:0fw: splitmsg.lck
| sed -e '/^ReSent-/ d'
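To see what such a filter does outside of procmail, here is a small variation that can be run on a saved message. Unlike the stanza above, which applies the sed expression to the whole message, this sketch limits the deletion to the header block (everything up to the first empty line). strip_resent_headers is a hypothetical name, and folded continuation lines of a ReSent- header are not handled.

```shell
#!/bin/sh
# Hypothetical filter: delete lines starting with "ReSent-" from the
# header block only (line 1 through the first empty line); the body is
# passed through unchanged.
strip_resent_headers() {
    sed -e '1,/^$/{/^ReSent-/d;}'
}
```

For example: strip_resent_headers < bounced.msg > cleaned.msg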

5. IMAP folders

Another option, and one that’s easier for most users to use, is to set up two public IMAP folders on your IMAP server, one for MissedSpam, one for NotSpam.

Then ask your users to move messages that SpamAssassin misses into the MissedSpam folder, and move messages that SpamAssassin marked incorrectly as spam into the NotSpam folder.

You can then run sa-learn from a cron job over those folders to update the Bayesian databases.
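As a rough sketch of that cron job: the code below assumes the IMAP server keeps the two folders on disk as Maildir directories under a common root, and that moved messages end up in each folder's cur subdirectory. FEEDBACK_ROOT, SA_LEARN, SA_CONFIG and learn_imap_folders are hypothetical names; adjust the layout to whatever your IMAP server actually uses.

```shell
#!/bin/sh
# Sketch: learn from the two shared folders. Paths are assumptions.
SA_LEARN="${SA_LEARN:-sa-learn}"
SA_CONFIG="${SA_CONFIG:-/etc/mail/spamassassin}"
FEEDBACK_ROOT="${FEEDBACK_ROOT:-/var/spool/imap/public}"

# Map each folder to the kind of mail it holds, then run sa-learn on it.
learn_imap_folders() {
    for pair in spam:MissedSpam ham:NotSpam; do
        kind="${pair%%:*}"                       # "spam" or "ham"
        dir="$FEEDBACK_ROOT/${pair#*:}/cur"      # folder's message directory
        [ -d "$dir" ] || continue                # skip folders that do not exist
        "$SA_LEARN" "--$kind" -C "$SA_CONFIG" --dir "$dir" || return 1
    done
}
```

Calling learn_imap_folders from a script in cron covers both folders in one run.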

Also see RemoteImapFolder.

6. How to set up site-wide aliases with Postfix where ham and spam can be sent for learning

The cookbook is available at http://jousset.org/pub/sa-postfix.en.html; it works fine with Postfix. Note: don’t call your aliases spam and ham unless you want spammers to flood the ham box.