Don't shoot the developer

You probably recall the big blackout that hit the northeastern United States last August 14th. This week the North American Electric Reliability Council released a report with the rather imposing title "August 14, 2003 Blackout: NERC Actions to Prevent and Mitigate the Impacts of Future Cascading Blackouts February 10, 2004". In the report, NERC provides a whole mess of things that should be done to make the electrical system more reliable, and orders specific corrective actions to be taken by FirstEnergy, Midwest Independent System Operator, and PJM, the three entities that seem to be most responsible for the events that launched the blackout.

Based on this report, some of the press is pointing fingers at the folks who developed GE Energy's XA/21 system for monitoring and managing electrical power systems. Among the things that FirstEnergy is ordered to do is to replace the XA/21 system. In the meantime, the report says, "Until the current energy management system is replaced, FE shall incorporate all fixes for the GE XA21 system known to be necessary to assure reliable and stable operation of critical reliability functions, and particularly to correct the alarm processor failure that occurred on August 14, 2003."

At this point things get a bit fuzzy, because various final reports aren't available yet, but it appears that an unpatched bug in the XA/21 system kept FirstEnergy from receiving alarms in August that things were wrong in their system. Aha, think some people, it's that darned software breaking down again that caused us to have to cancel our plans on that hot August night.

Comforting though it is to blame the software, I think in this case it's unjustified, or at best unproven. There are several places we might look before throwing the entire blame on GE's shoulders. For starters, unpatched software is hardly the only thing that FirstEnergy got dinged for by NERC. They didn't train their operators properly, they didn't bother to notify other systems when things started to go south (a violation of existing NERC policies), and they didn't generate electrical power that was up to national standards. They didn't even follow their own rules for how much power they should be generating. They also get criticized for an ineffective "vegetation management" program; apparently the proximate cause of the problems was not computer failures, but tree limbs boinking into lines carrying 345 kV. And these were "persistent problems" before August 14. The picture that emerges is one of a slipshod utility where pretty much everything was deteriorating.

Second, if the XA/21 system is so bad, how come we're not seeing it fail all over the place? I couldn't find any sales figures on GE Energy's Web site (hardly surprising), but given that they've got a dozen training courses scheduled in Florida between now and June I assume they're selling a few copies. Clearly there must have been some special circumstance at FirstEnergy to cause it to fail the way that it did, and who knows yet whether it was a coding fault, setup error, operator error, or what.

Finally, it seems like a violation of just plain good sense to use the same monitoring software on both the main and backup monitoring systems (if in fact that's what was done; it's a bit hard to tell from the information that has been released so far). Maybe there aren't any other good pieces of software on the market, but surely different software on the backup system would have been a more robust way to set things up.

If there's anything clear about the August 14 blackout, it's that it was a complex system failure. Software, training, and even tree limbs all played their part. So why do people focus on software as "the cause"? Perhaps it's just the natural reaction to too many crashes on their home PCs. Whatever the case, let's not jump to conclusions here, or try to reduce complex failures to a single line of code.

About the Author

Mike Gunderloy has been developing software for a quarter-century now, and writing about it for nearly as long. He walked away from a .NET development career in 2006 and has been a happy Rails user ever since. Mike blogs at A Fresh Cup.