POPfile: sorting the mail

When Sobig’s author unleashed his spam (and bounced email) plague on us last August it became clear I needed to automate my mail sorting process; I was spending far too many hours writing rules.  After checking out the sites for a couple of filtering products I’d heard of, I decided to see if POPfile met my needs.  I loaded it on my machine, spent a couple of hours making setup decisions, and did the necessary configuration of both POPfile and Eudora.

An essential fact:  While POPfile usually functions as a spam filter, its design supports sophisticated sorting of email into a large number of categories.  I’m using it as a mail sorter; the spam filter is important, but the software’s smart about all of my mail, and in a real sense the spam folder’s just another target for the sorter.
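To make the "spam is just another bucket" idea concrete, here is a toy sketch of a multi-bucket naive Bayes sorter that learns only from corrections. POPfile is generally described as a naive Bayes classifier, but this is not its code: the class name, method names, whitespace tokenizer, and smoothing are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class BucketSorter:
    """Toy multi-bucket naive Bayes sorter (hypothetical, not POPfile's code)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages trained

    def train(self, bucket, text):
        """Record a message under the bucket the user says it belongs in."""
        self.word_counts[bucket].update(text.lower().split())
        self.message_counts[bucket] += 1

    def classify(self, text):
        """Return the most probable bucket, or 'unclassified' before training."""
        if not self.message_counts:
            return "unclassified"
        words = text.lower().split()
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_messages = sum(self.message_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, n_messages in self.message_counts.items():
            counts = self.word_counts[bucket]
            total_words = sum(counts.values())
            # Log prior plus add-one-smoothed log likelihood of each word.
            score = math.log(n_messages / total_messages)
            for w in words:
                score += math.log((counts[w] + 1) / (total_words + len(vocab) + 1))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


sorter = BucketSorter()
sorter.train("spam", "cheap pills buy now")
sorter.train("auction", "your ebay bid was outbid")
sorter.train("sabr-l", "baseball box score research")

message = "new bid on baseball cards"
first_guess = sorter.classify(message)    # the sorter's guess before correction
sorter.train("auction", message)          # the correction: file it under auction
second_guess = sorter.classify(message)   # the guess after the correction
print(first_guess, "->", second_guess)
```

Nothing in the sketch treats the spam bucket specially; it competes with the other buckets on the same terms, and every correction is simply one more training example for the bucket the message should have gone to.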

Basic Information

I receive between 50 and 100 e-mails each day, and read about 60% of those (the unread ones are either duplicates or spam). I used to read about 85% of my mail; the change in percentage is largely because of the increasing spam load. (Eudora has a reporting function; these numbers have some relation to reality.) Perhaps 65% of the real mail has baseball content of some sort or other; the rest is on a wide range of topics.

These get sorted into a couple dozen categories; I tinker with these a bit, but they are essentially the same categories I used for sorting e-mail in 1995.  A large percentage of my mail originates from the Society for American Baseball Research list called SABR-L, which has its own folder; the remaining folders group mail in ways which largely reflect my mental prioritizations.  One folder, called “Lists,” is the target for mailing lists on miscellaneous topics.  I sometimes ignore SABR-L for months; I check my eBay mail daily.

After reading the POPfile documentation, I decided to see how well it sorted the total daily package.  I set up “buckets” to match the folders, replaced several hundred Eudora rules with twenty-five, and set about teaching POPfile how to sort things. This story begins on August 18.

Here’s my report….

First Thousand

Since you train POPfile by correcting its errors, the first few dozen messages are basically all errors and the first few hundred are unreliable.  I took an accounting after message 1,049, which arrived on September 30.

  • 104 (10.0%) were sent to the wrong bucket.
    • 90.0% were sent to the right bucket….
  • 207 (19.7%) were spam.
  • 25 (2.4%) were probably virus-laden.
  • 114 (10.9%) were bounced email.
  • PF had particular problems with the Auction bucket; it made 15 wrong guesses (11 false positives & 4 false negatives) in a category with only 11 total messages.
  • PF also had significant problems with the Vendor bucket, with eight sorting errors among only nine total messages.
  • The List category, which seems to me the most difficult to train, received 40 messages; PF generated 12 false positives and 4 false negatives.
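
For anyone who wants to redo the arithmetic, the percentages in this accounting are just each count divided by the 1,049 messages received. A minimal sketch, using the counts reported above (the function and variable names are made up for illustration):

```python
def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


TOTAL = 1049  # messages in the first accounting
counts = {
    "wrong bucket": 104,
    "spam": 207,
    "probable virus": 25,
    "bounced": 114,
}

for label, n in counts.items():
    print(f"{label}: {n} ({share(n, TOTAL)}%)")
```

Recomputed this way, the spam, virus, and bounce shares match the list; the wrong-bucket share comes out at 9.9%, within a tenth of a point of the 10.0% listed.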

Second Thousand

POPfile weathered its adolescence in the first half of October, and reached message 999 on October 18.

  • 41 (4.1%) were sent to the wrong bucket.
    • 95.9% were sent to the right bucket….
  • 249 (25.7%) were spam.
  • 5 (0.6%) were probably virus-laden.
  • 0 (0.0%) were bounced email.
  • PF stopped having problems with Auction; 30 messages, with no false positives and three false negatives.
  • PF’s Vendor bucket issues seemed to abate, with only five sorting errors among twenty-one total messages. Better, but still unacceptable.
  • The List category continued about as before: 55 messages, with 12 false positives and 2 false negatives.

Third Thousand

This test span ended November 4 at 1,008 messages.

  • 41 (4.1%) were sent to the wrong bucket.
    • 95.9% were sent to the right bucket….
  • 408 (40.7%) were spam. Wow!
  • 2 (0.2%) were probably virus-laden.
  • 2 (0.2%) were bounced email.
  • Auction was basically clean; 20 messages, with one false positive and one false negative.
  • PF’s Vendor bucket sort deteriorated, with thirteen sorting errors among twenty-nine total messages. Yucky.
  • The List category remains problematic: 51 messages, with 15 false positives and 3 false negatives. I suspect this will only improve if I split the category into logical sub-groups.

Since November 4

I’ve received 712 messages; 97.6% are being sorted correctly. Not bad, if you ask me. I’ll not give you a further breakdown ’til I reach 1,000.

POPfile’s principal author, John Graham-Cumming, announced a new version a couple weeks ago, which I’ve not yet installed. I’ll do that in a day or two.
