Checked my e-mail this morning, and found a note from Matt, our network administrator:
- We had a power outage last night; many servers were out of service for an unspecified time. The outage included the “back end” for our web app (you may recall that it’s actually a web front end on a client/server app; internal staff uses a graphic interface with pretty much the same functionality as the web interface).
- He believed most of the systems were functioning as of early morning.
I echoed the message to several folks and suggested that they verify things actually were running and keep their eyes open for problems. (This is SOP, of course.) Then I fired up Terminal Services, logged into our system’s app server, and checked Task Manager to see if anything looked odd. Nope. Checked the log: went down at 8:08 pm; recovered at 4:32 am. Long outage; hadn’t expected that.
Another e-mail, to the same people, reporting the length of the outage. Then logged into the GUI, did a test transaction; logged into the web interface and did the same. First approximation: everything is running.
James (the section supervisor) called to report that the GUI was not running in the unit. Neither were other systems, including Exchange. Clearly a sub-net crash, and beyond my ability to fix, so I suggested that he call Jamie, who’s our IT contact.
Back to Terminal Services; checked every server this time, and checked more carefully. Again checked Task Manager; glanced at all the tabs this time. Launched an app on each server just to see what happened. Checked to see what services were running on one server; checked Component Services on another; checked the MQ channel on a third. No unexpected behaviors; things looked good. Switched to PC Anywhere for the FileNet server; then (finally) checked in on the Siebel server (it’s in another server room, so not likely part of this problem, thank goodness).
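That kind of server-by-server sweep is really just a checklist runner: for each box, probe that its key service answers, and collect the results. Here is a minimal sketch in Python; the host names and port numbers are purely illustrative, not the ones from this environment:

```python
import socket

# Hypothetical checklist: each entry is a server and the port its key
# service should be listening on. Names and ports are illustrative only.
CHECKS = [
    ("app-server", 443),
    ("mq-server", 1414),
    ("filenet-server", 32776),
]

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def sweep(checks, probe=port_is_open):
    """Run the whole checklist; return a dict of host -> up (True) / down (False)."""
    return {host: probe(host, port) for host, port in checks}
```

A TCP connect only proves the listener is alive, of course; it doesn’t replace logging in and running a test transaction, which is why the manual checks above still matter.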
Touched base with James. He’d contacted Jamie, who’d learned that the network folks were already chasing problems on peripheral systems, including two crashed sub-nets. OK; second phase shows problems, but someone’s working on those.
Meantime, the folks around me were running through the same routines for the servers they watch. Alice is responsible for three systems, but only one was affected. Christine’s main damage was the (same) lost sub-net, so today she got off easy. From outside our group, I had a brief contact with Lucy, who runs another FileNet system and wanted my take on something; from Will, who was echoing me on his efforts to recover our COLD system (running OK in a different server room, but on the lost sub-net); and from Tina, who was coordinating with the network people and calling whichever of us she supposed might need help. We’ve all been here before; we know the drill, have checklists to guide our actions, have routines to follow.
Finally checked in with Mel, my boss, and reviewed the situation. We headed for the production unit to calm the waters. With all connections broken, Caroline, who’s James’ line supervisor, had sent her staff to breakfast, which was a good move. My main objective was to reassure everyone that someone was working on stuff. One of the network guys showed up about this time; we’d lost a router and he was waiting for parts to arrive. Looked like another hour. Yucky, but under control.
To the cafeteria: “Michele, I need breakfast.”
Back to my desk. Touched base with the IT web team, just so they’d get the word; then called our “back-end” vendor’s support guy Steve, for the same reason.
One last issue to pursue: Our monitors had failed. We should have already known there was an outage. The failure of the local monitor was no surprise; it wasn’t designed to survive a total outage. The remote monitor, though, was a bit perplexing: Turns out that Boulder knew they’d lost touch shortly after the power failed, but that the notification system had collapsed on our end. That’s now on my to-do list, of course.
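A remote monitor like the one in Boulder boils down to a heartbeat timeout plus a notification path. A minimal sketch, with hypothetical names and thresholds; the point of this morning’s lesson is in the comment, that the notification path must not share the monitored site’s failure modes:

```python
import time

# Hypothetical threshold: how long the remote site can be silent before
# we raise an alarm. The real monitor's timeout is unknown to me.
HEARTBEAT_TIMEOUT = 300  # seconds

def is_stale(last_heartbeat, now=None, timeout=HEARTBEAT_TIMEOUT):
    """True if we have gone longer than `timeout` without a heartbeat."""
    now = time.time() if now is None else now
    return (now - last_heartbeat) > timeout

def check_and_notify(last_heartbeat, notify, now=None):
    """Raise the alarm if the heartbeat is stale; return whether we alarmed.

    The moral of this outage: `notify` must run on infrastructure that does
    NOT depend on the monitored site's power or network, or the alert dies
    along with the thing it's supposed to report on.
    """
    if is_stale(last_heartbeat, now=now):
        notify("no heartbeat from remote site")
        return True
    return False
```

In this incident the staleness detection worked fine (Boulder knew within minutes); it was the equivalent of `notify` on our end that collapsed with the power.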
Finally the unit got back online. There was system tuning to do. These things don’t run themselves. Thus my morning….
After a fashion, we were victims of the weather. A State Highway near our office was under water, and closed; the detour took drivers down farm lanes and across a signed-but-not-signalled railroad crossing. A train hit a truck at this crossing, which knocked down a power pole. Fortunately, the injuries are reported to have been minor. I’d taken the same route earlier yesterday, and had been concerned about the poor visibility at the crossing. Obviously the concern was legitimate.
After lunch, I left a voicemail for Matt commending a job well done under trying circumstances. His team doesn’t get thanked very often.