PopFile occasional report

Readers will recall that I’ve been using John Graham-Cumming’s PopFile as a spam filter/mail sorter for over a year.  Time, methinks, for another update; I last mentioned the program in July.  I covered the important background information about a year ago, and shan’t repeat it today; you may also want to check the PopFile cross-references below to see what I’ve said before.

I’m still using version 0.20.1, which puts me a couple editions behind.  Since I’m satisfied with my version’s performance and don’t want to fight my way through the Mac upgrade process I’ll likely stay here for a while; John and his team will need to add something compelling for me to change.  (A proper Macintosh install would be helpful….) 

This version’s a little slow on my machine but not to the point it bothers me; your mileage may well vary.

On to specific results, again organized as I’ve done in past entries:

The test period ended November 11, 2004, at 7,952 messages.

  • 93 (1.2%) were sent to the wrong bucket.
    • (Therefore) 98.8% were sent to the right bucket.
    • This is my first report which didn’t include significant training, so 99% looks like the “norm” for my system.  One way to read this stat is that I decide to reclassify about one message per day.  Better than writing rules….
  • 3,219 (40.5%) were spam.  (This is a decrease from the previous 46.6%, which would seem to merit comment.  Not sure what that comment should be, though.)
  • There are areas where the app has, well, issues:
    • 431 messages were auction-related, with 10 false positives and 3 false negatives.  (As you might surmise, I’m again active on eBay.)  There’s enough noise in auction e-mail that some errors are inevitable.  PopFile is very good, though, at spotting eBay and PayPal phishing messages.
    • The sorter has significant problems getting my mailing lists right (407 messages/14 false +/8 false -), mostly because they cover a wide range of territory.
      • On the other hand, last time there were 48 false positives; it’s learning….
    • Vendor mail (118/8 false +/9 false -) is another bucket with some problems.  Again that’s likely because I catch a number of types of messages there.
    • I’ve pretty much abandoned the effort to get the Change Detection mail into the right boxes, and am effectively treating the whole set as one mailbox.  It’s more trouble than it’s worth, I’ve apparently decided; the app’s just refusing to notice how those emails differ.  Since these are mainly baseball-related sites, the issue’s not currently important.  Next spring I may try something.
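
The headline figures above reduce to a few divisions, which can be checked with a minimal Python sketch. The function name `summarize` is purely illustrative, and the July 3 start date is an assumption (it is the end date of the previous report); POPFile itself produces none of this.

```python
from datetime import date

def summarize(total, misfiled, spam, start, end):
    """Return headline percentages for one test period."""
    days = (end - start).days
    return {
        "wrong_pct": round(100 * misfiled / total, 1),
        "right_pct": round(100 * (total - misfiled) / total, 1),
        "spam_pct": round(100 * spam / total, 1),
        # Misfiled messages are the ones that need manual reclassification.
        "corrections_per_day": round(misfiled / days, 2),
    }

# Counts from the report above; the start date is assumed, not stated.
report = summarize(7952, 93, 3219, date(2004, 7, 3), date(2004, 11, 11))
print(report)
```

Under that assumed 131-day span, the correction rate comes out somewhat under one message per day.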

On the whole, this is excellent performance, with some minor (and predictable) blind spots due to peculiarities that are as much mine as the program’s.  Except for the lack of a good loader for Apple systems, I can heartily recommend the program; the installation issues appear to be unique to the Mac platform, and shouldn’t trouble Windows or Linux users.  Prospective users shouldn’t expect perfection, and some effort is required to train PopFile about your mail system.  But it’s automatic, reliable, and quite impressive.

Living with POPFile

Time, I think, for a POPFile update.  It’s been a bit over three months, and over seven thousand messages, since I last discussed the program.  Quickly reviewed:  I started using the program in the wake of last August’s spam (virus) epidemic.  Right from the start I’ve used PF as a mail sorting program, not just a spam filter; basically, I replaced a few hundred rules with a couple dozen PF buckets.  POPFile’s very good, but not perfect, at the task; complications include categories which are quite similar, and categories which are catch-alls.  Creative spam and virus authors are likewise problematical.  Despite these confusions, I’m very satisfied–much more than I anticipated–with the program.  Now, if they’d just simplify the installation routine for Mac users.

Here’s a summary of the last three months’ usage, in the format I’ve used on prior reports:

The test period ended July 3, 2004, at 7,292 messages.

  • 168 (2.3%) were sent to the wrong bucket.
    • (Therefore) 97.7% were sent to the right bucket.
    • This percentage took a significant hit at the start of the baseball season, when a bunch of email sources came back to life.
  • 3,397 (46.6%) were spam.  (This is a significant increase, I’d say, from the previous 41.0%.)
    • A handful of these are from legitimate e-mail lists whose owners make it difficult to unsubscribe, but the impact is minimal.
  • Only 11 messages were auction-related; 3 of these were false negatives and 1 was a false positive.  I seem to have stopped hanging around eBay, at least for now.
  • The Vendor (100 messages/16 false +/11 false -) and Mailing List (402/48 f+/8 f-) categories, both of which are catch-alls, seem to show real improvement, though this is still a significant source of error.  The problem continues to be that “well-designed” spam looks superficially like these categories.
  • The problem I reported with e-mails from Change Detection still exists and remains annoying, but has improved; basically, PF sees several classes of messages as too similar to differentiate.  It’s pretty clear to me that the algorithm isn’t looking at the problem the way I think it should.

Every now and then a spammer finds a hole in this defense, but after a couple days PF has things sorted out again.  That’s how things should work.

For the record, I’m currently using POPFile version 0.20.1, which uses the BerkeleyDB for storage.  The developers moved to a SQL engine in March with version 0.21.0 (currently 0.21.1), but didn’t convince me a change was necessary; I’m unlikely to change until there’s a major upgrade.   Version 0.20 is slower than version 0.19 was, but not in ways which bother me.  Your mileage may vary, of course.

Thus my current report.  I remain very satisfied with the tool.

POPFile on PowerBook

You’ll perhaps recall that when I moved my e-mail to the PowerBook, I provisionally moved it into Apple’s mail.app.  That provisional decision has become permanent; for my purposes Mac Mail (with POPfile) is a fine application.

I resisted installing POPfile for several weeks–partly because I wanted to be more familiar with the Mac environment before installing something so far out of the ordinary, and partly because I wanted to give mail.app’s junk filter a test.  By January’s end, it was pretty clear that the Junk Mail filter doesn’t work as well as I’d like, and I missed POPfile’s more general mail sorting capabilities.  As I’ve mentioned before, I sort incoming mail into a couple dozen categories.  Teaching POPfile to recognize those categories lets me get by with two dozen rules, rather than a couple hundred.  Much better.

So I installed POPfile on the Mac on January 31.  While the process was harder than I’d have liked (I’d walk you through it, but John Graham-Cumming is aware of the problem and plans to simplify the Mac install), in the end I had a working installation.  Three thousand messages later, I’ve got POPfile trained again and continue to be delighted with the system.

I’ll again use the format I was using for last fall’s reports….

The test span ended March 25, 2004, at 3,028 messages.

  • 211 (7.0%) were sent to the wrong bucket.
    • (Therefore) 93.0% were sent to the right bucket.
    • Keep in mind that this is a new install, and the first several hundred messages are sacrifices to training….
  • 1,241 (41.0%) were spam.  (Basically no change from November.)
    • I’ve dropped the virus & bounced categories; they’re now counted as spam….
  • 110 messages were auction-related; 9 of these were false negatives and 9 were false positives.  That’s about like before; this will improve with training.
  • The Vendor (90 messages/15 false +/21 false -) and Mailing List (251/45 f+/19 f-) categories, both of which are catch-alls, need serious training.  This reflects my earlier experience.  The problem in both cases is that “well-designed” spam looks superficially like these categories.
  • There’s a rather odd behavior which wasn’t a problem in the previous installation:  I use Change Detection to track a number of web pages which really ought to have RSS feeds.  For some reason, POPfile’s having difficulty telling notifications about Blogs (4/20 f+/2 f-) from notifications about General Baseball (77/22 f+/29 f-)–which suggests it’s more aware of the similarities (which are numerous) than the differences (which I consider really blatant).  The really odd thing is that I’ve got several other Change Detection categories, which it’s handling well.  We’ll have to see how this plays out over the next few weeks, when the baseball sites get really active.
    • (This could, of course, turn out to be operator error.  But I think I’m smarter than that.)

Thus my early report.  Things are about where they were at 3,000 messages last time I installed PF, so I’m satisfied.  I’ll keep you informed.

A Dabbler’s PowerBook: finding my way

When I wasn’t shopping this weekend, I was trying to move past the “what a neat toy” phase with my new laptop. OS X is enough like XP to be familiar, and enough different to be both annoying and fascinating. That’s been covered elsewhere; I’ll likely leave it alone….

This is a powerful machine. For much of yesterday afternoon and evening I was:

  • making massive file transfers across my WiFi network,
  • copying and playing Christmas CDs, and
  • checking for useful advice, using Safari.

My desktop system, which isn’t a slouch, would have been (actually, was) stressed with all that activity. The Mac just puttered along. Pretty impressive.

Today’s efforts were largely devoted to moving my e-mail focus from Eudora on the PC to whatever I could get working on the laptop. That turned out to be Apple’s Mail, though I’m not committed to that decision. The transition was not a pretty effort; while it seems like this ought to be easy, it turns out to be exasperating. You can’t just tell the new program to import the old program’s files, so I tried a number of variations on “copy the files to the laptop and see if the client can import that.” Nothing worked well. Eudora’s instructions cover the basics but the results were pretty flakey. I’m also not convinced that Eudora’s Mac interface meets my needs, though I’ll likely give it another chance.

I can wholeheartedly endorse a format conversion tool called Emailchemy, by the way. A very fine piece of shareware.

Sometime soon I’ll need to solve PopFile on OS X. That promises to be interesting.

PopFile Revisited: another thousand messages received; a new version installed

Today we reached another thousand. After printing and resetting the report, I loaded POPfile’s new version. I’ll certainly keep you informed….

Continuing in the same format I used in my earlier note about PF:

Fourth Thousand

This test span ended November 21 at 1,000 messages.

  • 26 (2.6%) were sent to the wrong bucket.
    • 97.4% were sent to the right bucket….
  • 415 (41.5%) were spam. Again: Wow!
  • 4 (0.4%) were probably virus-laden.
  • 4 (0.4%) were bounced email.
  • Auction seems to be fully solved; 39 messages, with one false positive and one false negative.
  • The Vendor category may have finally improved: 14 messages; only three errors.
  • Lists looks better: Ten false positives and no false negatives associated with 51 messages.
  • A new category, created (with my new e-mail address) to service this weblog, had 8 errors–to go with seven messages. New categories are always problems….

All in all, that’s a rather impressive performance. The increasing spam count is also rather impressive; after all, I added this tool to my kit because the junk seemed to be getting out of hand.

The new POPfile version has a thoroughly-revamped back end, and some modifications to the code in the engine. We’ll see how it goes.

Jon Udell’s also talking about using Bayesian categorizers, at both a higher level of abstraction and greater detail. Worth a look.

POPfile: sorting the mail

When Sobig’s author unleashed his spam (and bounced email) plague on us last August it became clear I needed to automate my mail sorting process; I was spending far too many hours writing rules.  After checking out the sites for a couple filtering products I’d heard of, I decided to see if POPfile met my needs.  I loaded it on my machine, spent a couple hours making setup decisions, and did the necessary configuration of both POPfile and Eudora.

An essential fact:  While POPfile usually functions as a spam filter, its design supports sophisticated sorting of email into a large number of categories.  I’m using it as a mail sorter; the spam filter is important, but the software’s smart about all of my mail, and in a real sense the spam folder’s just another target for the sorter.
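
To make the "spam is just another bucket" idea concrete, here is a toy sketch of a Bayesian sorter with several buckets. It is a plain multinomial naive Bayes classifier with Laplace smoothing, written for illustration only; it is not POPFile's code, and the class name `BucketSorter`, the tokenizer, and the sample messages are all invented for the example.

```python
import math
import re
from collections import Counter, defaultdict

class BucketSorter:
    """Toy multinomial naive Bayes mail sorter (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages trained

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, bucket):
        """Teach the sorter that this message belongs in this bucket."""
        self.word_counts[bucket].update(self.tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text):
        """Return the most probable bucket, or None if nothing is trained."""
        if not self.message_counts:
            return None
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_messages = sum(self.message_counts.values())
        words = self.tokenize(text)
        best_bucket, best_score = None, -math.inf
        for bucket, n_messages in self.message_counts.items():
            counts = self.word_counts[bucket]
            n_words = sum(counts.values())
            # Log prior: how much mail this bucket has received so far.
            score = math.log(n_messages / total_messages)
            for word in words:
                # Laplace (add-one) smoothing so unseen words never zero out a bucket.
                score += math.log((counts[word] + 1) / (n_words + len(vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket

sorter = BucketSorter()
sorter.train("your ebay auction bid has been outbid", "auction")
sorter.train("cheap pills buy now limited offer", "spam")
sorter.train("sabr-l digest box score research question", "baseball")
guess = sorter.classify("you have been outbid on this auction")
print(guess)
```

Every bucket, spam included, is scored the same way, and correcting a wrong guess is just another call to `train` with the right bucket.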

Basic Information

I receive between 50 and 100 e-mails each day, and read about 60% of those (the unread ones are either duplicates or spam). I used to read about 85% of my mail; the change in percentage is largely because of the increasing spam load. (Eudora has a reporting function; these numbers have some relation to reality.) Perhaps 65% of the real mail has baseball content of some sort or other; the rest is on a wide range of topics.

These get sorted into a couple dozen categories; I tinker with these a bit, but they are essentially the same categories I used for sorting e-mail in 1995.  A large percentage of my mail originates from the Society for American Baseball Research list called SABR-L, which has its own folder; the remaining folders group mail in ways which largely reflect my mental prioritizations.  One folder, called “Lists,” is the target for mailing lists on miscellaneous topics.  I sometimes ignore SABR-L for months; I check my eBay mail daily.

After reading the POPfile documentation, I decided to see how well it sorted the total daily package.  I set up “buckets” to match the folders, replaced several hundred Eudora rules with twenty-five, and set about teaching POPfile how to sort things. This story begins on August 18.

Here’s my report….

First Thousand

Since you train POPfile by correcting its errors, the first few dozen messages are basically all errors and the first few hundred are unreliable.  I took an accounting after message 1,049, which arrived on September 30.

  • 104 (10.0%) were sent to the wrong bucket.
    • 90.0% were sent to the right bucket….
  • 207 (19.7%) were spam.
  • 25 (2.4%) were probably virus-laden.
  • 114 (10.9%) were bounced email.
  • PF had particular problems with the Auction bucket; it made 15 wrong guesses (11 false positives & 4 false negatives) in a category with only 11 total messages.
  • PF also had significant problems with the Vendor bucket, with eight sorting errors among only nine total messages.
  • The List category, which seems to me the most difficult to train, received 40 messages; PF generated 12 false positives and 4 false negatives.

Second Thousand

POPfile weathered its adolescence in the first half of October, and reached message 999 on October 18.

  • 41 (4.1%) were sent to the wrong bucket.
    • 95.9% were sent to the right bucket….
  • 249 (25.7%) were spam.
  • 5 (0.6%) were probably virus-laden.
  • 0 (0.0%) were bounced email.
  • PF stopped having problems with Auction; 30 messages, with no false positives and three false negatives.
  • PF’s Vendor bucket issues seemed to abate, with only five sorting errors among twenty-one total messages. Better, but still unacceptable.
  • The List category continued about as before: 55 messages, with 12 false positives and 2 false negatives.

Third Thousand

This test span ended November 4 at 1,008 messages.

  • 41 (4.1%) were sent to the wrong bucket.
    • 95.9% were sent to the right bucket….
  • 408 (40.7%) were spam. Wow!
  • 2 (0.2%) were probably virus-laden.
  • 2 (0.2%) were bounced email.
  • Auction was basically clean; 20 messages, with one false positive and one false negative.
  • PF’s Vendor bucket sort deteriorated, with thirteen sorting errors among twenty-nine total messages. Yucky.
  • The List category remains problematic: 51 messages, with 15 false positives and 3 false negatives. I suspect this will only improve if I split the category into logical sub-groups.
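
One way to compare the troublesome buckets across the three test spans is to divide each bucket's errors (false positives plus false negatives) by the number of messages in it. The sketch below does that for the Auction and List figures reported above; the helper name `errors_per_message` is hypothetical, and the ratio is only a rough yardstick, since false positives are by definition messages from other buckets and the ratio can exceed 1.

```python
# (messages in bucket, false positives, false negatives) for each test span,
# copied from the three reports above.
buckets = {
    "Auction": [(11, 11, 4), (30, 0, 3), (20, 1, 1)],
    "List": [(40, 12, 4), (55, 12, 2), (51, 15, 3)],
}

def errors_per_message(messages, false_pos, false_neg):
    """Total sorting errors relative to the bucket's message count."""
    return (false_pos + false_neg) / messages

for name, spans in buckets.items():
    ratios = [round(errors_per_message(*span), 2) for span in spans]
    print(name, ratios)
```

On these numbers Auction settles down quickly, while List stays at a quarter to a third of its own volume.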

Since November 4

I’ve received 712 messages; 97.6% are being sorted correctly. Not bad, if you ask me. I’ll not give you a further breakdown ’til I reach 1,000.

POPfile’s principal author, John Graham-Cumming, announced a new version a couple weeks ago, which I’ve not yet installed. I’ll do that in a day or two.