Living with POPFile

Time, I think, for a POPFile update.  It’s been a bit over three months, and over seven thousand messages, since I last discussed the program.  Quickly reviewed:  I started using the program in the wake of last August’s spam (virus) epidemic.  Right from the start I’ve used PF as a mail sorting program, not just a spam filter; basically, I replaced a few hundred rules with a couple dozen PF buckets.  POPFile’s very good, but not perfect, at the task; complications include categories which are quite similar, and categories which are catch-alls.  Creative spam and virus authors are likewise problematic.  Despite these confusions, I’m very satisfied with the program, much more than I anticipated.  Now, if they’d just simplify the installation routine for Mac users.

Here’s a summary of the last three months’ usage, in the format I’ve used on prior reports:


The test period ended July 3, 2004, at 7,292 messages.

  • 168 (2.3%) were sent to the wrong bucket.
    • (Therefore) 97.7% were sent to the right bucket.
    • This percentage took a significant hit at the start of the baseball season, when a bunch of email sources came back to life.
  • 3,397 (46.6%) were spam.  (This is a significant increase, I’d say, from the previous 41.0%.)
    • A handful of these are from legitimate e-mail lists whose owners make it difficult to unsubscribe, but the impact is minimal.
       
  • Only 11 messages were auction-related; 3 of these were false negatives and 1 was a false positive.  I seem to have stopped hanging around eBay, at least for now.
     
  • The Vendor (100 messages/16 false +/11 false -) and Mailing List (402/48 f+/8 f-) categories, both of which are catch-alls, seem to show real improvement, though they are still a significant source of error.  The problem continues to be that “well-designed” spam looks superficially like these categories.
  • The problem I reported with e-mails from Change Detection still exists and remains annoying, but has improved; basically, PF sees several classes of messages as too similar to differentiate.  It’s pretty clear to me that the algorithm isn’t looking at the problem the way I think it should.
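
For anyone who wants to check the arithmetic, the headline percentages above follow directly from the raw counts.  A minimal Python sketch; the variable names and the pct helper are mine, purely for illustration, and are not anything POPFile itself provides:

```python
# Raw counts quoted in the summary above.
total = 7292      # messages in the test period
misfiled = 168    # messages sent to the wrong bucket
spam = 3397       # messages classified as spam


def pct(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


error_rate = pct(misfiled, total)
accuracy = round(100 - error_rate, 1)
spam_share = pct(spam, total)

print(f"misfiled: {error_rate}%  correct: {accuracy}%  spam: {spam_share}%")
```

The per-bucket false positive and false negative counts are left out on purpose, since turning them into rates would mean guessing at the right denominators.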

Every now and then a spammer finds a hole in this defense, but after a couple days PF has things sorted out again.  That’s how things should work.


For the record, I’m currently using POPFile version 0.20.1, which uses BerkeleyDB for storage.  The developers moved to a SQL engine in March with version 0.21.0 (currently 0.21.1), but didn’t convince me a change was necessary; I’m unlikely to change until there’s a major upgrade.  Version 0.20 is slower than version 0.19 was, but not in ways which bother me.  Your mileage may vary, of course.


Thus my current report.  I remain very satisfied with the tool.