I’ll likely be taking my next Amtrak trip starting on Friday; but a few days ago, a server for Amtrak’s Positive Train Control (PTC) system crashed and stayed down for at least three days, forcing the cancellation of basically all trains except those on Amtrak’s own Northeast Corridor. Scuttlebutt has it that the server is back up, and it seems that trains are departing from Chicago again.
But what really amazes me is that, apparently, there wasn’t a backup. The system I worked on before I retired was actually four complete systems: one for development (DEV), one for system integration testing (SIT), one for customer acceptance testing (CAT), and one for production (PROD). Each of those had three servers: one for the Web component, one for the database, and a third for running background tasks. Each of those servers had at least one backup. The DEV and SIT systems had one backup for each server; the CAT and PROD systems had four of each server constantly sharing data back and forth so that they were pretty much exact copies of each other. That’s the way you do it. It’s really old news and well understood.
Another possibility, which also wouldn’t surprise me, is that security was so lax that a hacker could have brought the whole thing down for a ransomware attack. If that’s what happened, they probably won’t admit it.
In any event, Amtrak could have done better by getting a server from the folks who set up my own little website. They’re much more professional, it would seem.
K says
Glad things seem to be up and running. And you’re right: redundancy is critical, as is security. It’s appalling how many financial managers are short-sighted about that. A decade ago I walked into a job where I was asked to code a solution. I was working on the dev server when something borked…and my boss came running in yelling that “everything” was down. But I was working in dev! Nope; dev and prod were the same server, just *named* dev and prod with a weak partition between them. There was “no money” to do it right, you see.
That was just a stupid internal website. It’s terrifying how everything is run on a shoestring and vulnerable to natural and targeted failures.
billseymour says
Ouch! I’m glad I never worked at a place like that.
I never brought our whole DEV system down, but I have made some really stupid mistakes, let’s say, more often than never.
Allison says
This doesn’t make any sense.
PTC is a safety-critical system, and as such, would be run by the host railroads, not AMTRAK. All locomotives (and trains) have to be equipped with it for it to make any sense, which means not just AMTRAK trains, but pretty much anything that runs on the tracks.
The fact that the NE corridor (which has long had a predecessor to PTC) was working fine also suggests that it was something else.
My guess would be that it was some other system, perhaps one that keeps track of where AMTRAK trains are (and how late they are!) for internal or external reporting purposes. If something like that failed, the trains would run fine (I hope!), but anything that involved AMTRAK operations systems would suddenly not know where anything was. (They could ask the host railroads, I suppose.)
billseymour says
Allison, my understanding is that the server wasn’t doing the PTC itself; it had something to do with communicating PTC data between Amtrak and the host railroads. Could it have been sending GPS data to the railroads?
In any event, trains were cancelled, and Amtrak blamed it on a server going down. The word I got from an Amtrak-related e-mail list is that it had something to do with PTC.