When something goes wrong, the surrounding supporting infrastructure must suddenly accept a new load.
If that load-spike exceeds the supporting infrastructure’s capacity, it fails in turn, and you get a cascading failure. Extreme examples of cascading failures don’t stop failing until there is nothing left to fail. [nyt] writes about the blackout of 2003:
A surge of electricity to western New York and Canada touched off a series of power failures and enforced blackouts yesterday that left parts of at least eight states in the Northeast and the Midwest without electricity. The widespread failures provoked the evacuation of office buildings, stranded thousands of commuters and flooded some hospitals with patients suffering in the stifling heat.
I remember that day; my phone started ringing as soon as the power came back on; it was journalists asking “is this the Chinese?” What had actually happened was what is often described as a “software glitch” though I consider it more of a “design error.” The power grid shutdown was a result of systems behaving the way they were supposed to behave – the consequences, however, were unforeseen.
Wikipedia explains it better than I need to [wik] – the short form is that a high-voltage line began to draw an unexpected load because it had drooped into wet branches. The unexpected load caused the local station to switch over so that power was being drawn from the local grid instead of sent across the questionable line. The local grid systems’ software saw an unusual surge, which triggered their load alarms, and they re-routed power from nearby stations. Suddenly the smart-grid software everywhere saw a large, unusual load, and a wave of overload shutdowns swept through the whole system. My guess is that the initial “shut it down and draw current from elsewhere” smart-grid parameter was chosen to be within engineering tolerances (a factor of 2 or 3 of overhead) and the “fix” was to tweak the system parameters to make them less sensitive to a system-wide problem.

It seemed to me that an attack from the Chinese, or whoever, would first bring the system down, then tear down its telemetry and command/control. That would be bad. Remember when Terry Childs locked the City of San Francisco out of its own fiber-WAN backbone? [wired] He changed the administrators’ passwords to random junk and then left. A serious attack would look more like the Shamoon attacks against Saudi Aramco: master boot records and drive contents wiped – poof! all your computers are bricks. I suppose the most elegant cyberattack would be to trigger a cascading failure, but why bother? If your point is to demonstrate that you control someone’s system, smoking wreckage is as eloquent a message as any other.
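The dynamics are easy to see in a toy model. This is a sketch, not a grid simulation: stations sit in a chain, a failed station dumps its load onto its nearest surviving neighbors, and any neighbor pushed past its trip threshold fails in turn. All the numbers are invented for illustration; real grid protection is far more complicated than this.

```python
# Toy cascading-overload model. Every station name, load, and
# threshold here is made up; this only illustrates the shape of
# the failure wave, not how real grid protection works.

def simulate_cascade(n, load, capacity, first):
    """Return the set of stations that have failed once the cascade settles."""
    loads = [float(load)] * n
    failed = set()
    pending = [first]
    while pending:
        s = pending.pop(0)
        if s in failed:
            continue
        failed.add(s)
        # The nearest surviving neighbor on each side picks up the load.
        left = next((i for i in range(s - 1, -1, -1) if i not in failed), None)
        right = next((i for i in range(s + 1, n) if i not in failed), None)
        targets = [t for t in (left, right) if t is not None]
        if not targets:
            break  # nothing left to fail
        share = loads[s] / len(targets)
        loads[s] = 0.0
        for t in targets:
            loads[t] += share
            if loads[t] > capacity:
                pending.append(t)
    return failed

# Stations running at 70 units with a trip threshold of 100 (a 1.4x
# margin): one drooping line takes out the entire chain.
print(len(simulate_cascade(10, 70, 100, 0)))   # all 10 stations fail
# The same fault with roughly a factor-of-3 margin stops where it started.
print(len(simulate_cascade(10, 70, 200, 0)))   # only the initial failure
```

The interesting part is how binary the outcome is: below some margin the whole chain goes, above it the fault stays local, which is why a parameter “tweak” can be the difference between a nuisance trip and eight dark states.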
There are lots of places where cascading failures can occur; I used to spend a lot of subconscious time trying to identify them. Perhaps that’s why I am so worried about the command/control systems for nuclear weapons; there appear to be a lot of places where failures can escalate (basically, all the failure modes seem to boil down to ‘and then there’s a full exchange’).
Anyhow, I saw this and immediately recognized a cascading failure: [it gets interesting at 0:35]
You can clearly see what’s going on here: each rack is capable of holding a tremendous amount of weight bearing directly down through its supports. But as soon as even a fraction of that weight becomes a sideways push, or a pull on an upper corner, the whole situation unravels.
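A back-of-the-envelope number makes the point. Assume (these dimensions are mine, not from the video) a slender rack upright: a 5 m length of 50 mm square steel tube with 3 mm walls. Comparing the Euler buckling load for straight-down loading against the sideways tip load that yields the base:

```python
import math

# Back-of-the-envelope statics for a slender steel upright.
# All dimensions are assumed for illustration, not taken from the video.
E = 200e9        # Young's modulus of steel, Pa
sigma_y = 250e6  # yield stress of mild steel, Pa
L = 5.0          # upright length, m (assumed)
a, t = 0.05, 0.003  # square tube: 50 mm outside, 3 mm wall (assumed)

b = a - 2 * t
I = (a**4 - b**4) / 12                    # second moment of area, m^4
P_axial = math.pi**2 * E * I / L**2       # Euler buckling load, pinned ends
F_lateral = sigma_y * (I / (a / 2)) / L   # sideways tip load that yields the base

print(f"axial capacity   ~{P_axial/1000:.0f} kN")
print(f"lateral capacity ~{F_lateral:.0f} N ({100*F_lateral/P_axial:.1f}% of axial)")
```

With those assumed numbers the sideways capacity comes out at a few percent of the straight-down capacity – which is why a load that the rack carries effortlessly in compression brings everything down the moment it turns into a pull on a corner.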
I have to wonder if perhaps the racks were not installed correctly. With steel truss systems like that, it can be a matter of one bolt being missing.