May 18

Something went wrong on the 18th….

Margie was tracking down a lost transaction this morning when she noticed a hole in the logs for May 18, and asked me to verify. I looked, and agreed: there was a four-hour stretch (basically all afternoon) when the web app wasn’t working. As near as we can tell, the back end was accepting transaction information from the web but wasn’t doing anything with it. While that’s technically harmless behavior (ignoring a couple of problem cases just as the system fell apart), it must have been pretty annoying to our customers.

If you were paying attention, you noticed that I mentioned three problems in that paragraph. In order of increasing significance:

  1. We lost a transaction. I know enough about that to judge it “just a bug,” and we’ll get it fixed. It’s a new bug, though.
  2. We had a non-functioning website for half a day. Steve’s decided that’s his problem. It’s probably related to the problem we’ve been chasing for the past month, but this is a new manifestation. We really need to get that component fixed.
  3. It took us three weeks to notice any of this. We obviously need to implement a better monitoring system. Jamie has taken responsibility for figuring this one out. One thing I need is a log analysis tool that fits these logs. Working on it….
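
For what it’s worth, the core of a gap-finding tool can be quite small. What follows is only a minimal sketch in Python, not the tool we will end up with. It assumes every log entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp, which is a guess at a format and not a description of our actual logs. The names `parse_timestamp` and `find_gaps`, the 30-minute threshold, and the sample lines are all made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed (hypothetical) log format: each entry starts with "YYYY-MM-DD HH:MM:SS".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19  # length of a timestamp in the format above


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
    except ValueError:
        return None


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (start, end) pairs where consecutive entries are more than
    `threshold` apart. Lines without a timestamp are skipped."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_timestamp(line)
        if current is None:
            continue  # continuation lines, stack traces, blank lines
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


if __name__ == "__main__":
    # Invented sample entries; the year and messages are placeholders.
    sample = [
        "2000-05-18 11:58:03 transaction accepted",
        "2000-05-18 12:01:47 transaction accepted",
        "2000-05-18 16:03:10 transaction accepted",
    ]
    for start, end in find_gaps(sample):
        print(f"no log entries between {start} and {end} ({end - start})")
```

Since `find_gaps` takes any iterable of lines, an open log file can be passed in directly. The threshold would need tuning against normal quiet periods such as nights and weekends; otherwise every slow stretch shows up as a hole.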

All in all, today was a really ugly day.