It was nearly two years ago, on October 30, 2015, when the rains began. Heavy rains weren’t new to Central Texas, but the rain that fell at the Austin-Bergstrom International Airport (ABIA) that day was heavy by anyone’s standards. In a matter of an hour, nearly 6 inches of rain fell at the airport, closing the runway and, more dramatically, flooding air traffic control operations, causing an immediate shutdown of all flights. Other areas around Central Texas were to receive more than 15 inches during the weekend that followed.
When I heard the news while sitting in an airport in Washington, D.C., I was perplexed. How could a rain event like this have such an impact on a critical piece of our infrastructure like an airport? Aren’t the air traffic controllers up in that tower, 50 feet off the ground? Surely their systems were in a location that was prepared for an event like this. Well, unbeknownst to me, not all Air Traffic Control operations take place in the top of the tower. In fact, most of the computer systems, and many of the workers, are at work in other buildings, leaving room at the top of the tower for those who are keeping ‘eyes on’ the planes themselves.
The flooding was historic by all accounts, and the impact on flights in and out of ABIA was felt for weeks to come. I was stranded and had to adjust my travel arrangements to arrive at another airport and drive the rest of the way back to Austin.
The question that always comes up in an incident like this is, “how could that have happened?” There’s a sense that any business’s core technology infrastructure must be backed up and redundant, and that all someone needs to do is ‘flip a switch’ to enable another system.
The reality is that this isn’t true. In fact, systems that provide that kind of capability are not only difficult to maintain, but prohibitively expensive. That appears to be the case in Air Traffic operations as well.
Now, mind you, this isn’t an article bemoaning the lack of preparedness of the Air Traffic System. In fact, it’s quite the opposite. I was impressed at how quickly officials could bring in a mobile air traffic control unit and get operations back up and running.
What matters, however, is being ready for disasters, and what you as a CIO should be thinking about before the actual incident occurs. In the end, your responsibility isn’t just Disaster Recovery (DR), it’s the overall ability to keep the business running, and that encompasses Disaster Recovery as well as other critical activities.
DR is what most folks immediately think of when it comes to being prepared for some sort of disaster. Whether it’s a fire in the server room, a large-scale power outage, or a significant hardware failure, Disaster Recovery is a key step in getting the business back up and operational quickly. These are the bread-and-butter activities that IT undertakes to get systems back up and online. While most folks just think of backups, restores, hardware replacement, and remote location cutovers, it also encompasses documentation and workflow processes on how you will respond. As a CIO, your team is likely already focused on the technical parts of the recovery process, but here are a few things to consider that go beyond the technology.
- Test your DR Process from end to end – Just running daily backups and looking at backup logs isn’t adequate. When was the last time you conducted a DR Test or simply a data restore process? What were the results? As the CIO, it’s critical that you ensure the team is conducting regular tests and validating your plans. It’s not a good feeling to find out a key backup process had been failing for months, or a process is out of date because a system is no longer in production.
- Check your documentation and make sure it’s relevant – Make sure you have adequate, and tested, documentation when you conduct a test. Don’t assume that the purpose of a test is to simply test the technology. You should ‘table top’ an annual exercise, and make sure that all the documentation available is up-to-date and valid. Technologies change, team members move to other roles, and documentation is usually one of the first casualties of change.
- Make sure your documentation is user-friendly – While step one is making sure the documentation is up-to-date, in many organizations this means going through reams of text-heavy paperwork. The best DR documentation consists of visual workflow diagrams and charts. Remember, new employees come on board all the time, so you need something that a new person can sit down and follow, even if they haven’t read the full plan.
- Test your DR Plan in conjunction with a Business Continuity Plan (BCP) – Disaster Recovery is just one part of a larger Business Continuity Plan (BCP). It’s important to test the DR portion of that plan, but if the test isn’t conducted in concert with the business, you can miss critical steps. Remember, as the CIO, you should be thinking not only about bringing systems back online, but also about helping the business recover and get back in operation. Coordination with business units on their downtime procedures, failover processes and other critical revenue-generating activities is key.
- Don’t overlook Crisis Management – While you may think that Public Relations and Crisis Management aren’t your responsibility, you do play a big part in the process. In the case of the Austin airport, quickly getting information out to the public was important. There were customers depending on the airlines to get them where they needed to be. A planned, well-executed Crisis Management strategy had to be in place to keep the situation under control.
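For the first item above, it can help to show your team what “more than looking at backup logs” might mean in practice. The following is only a minimal sketch in Python, not a description of any particular backup product: the function names (`file_digests`, `make_backup`, `verify_restore`) are hypothetical, and it assumes a simple file-based backup to a tar archive. It records a checksum for every file at backup time, then actually restores the archive to a scratch location and reports any file that is missing or no longer matches.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def make_backup(source_dir: Path, archive: Path) -> dict[str, str]:
    """Write source_dir to a gzip-compressed tar archive.

    Returns a manifest of checksums taken at backup time. The archive
    is assumed to live outside source_dir.
    """
    manifest = file_digests(source_dir)
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source_dir, arcname=".")
    return manifest


def verify_restore(archive: Path, manifest: dict[str, str]) -> list[str]:
    """Restore the archive to a scratch directory and compare checksums.

    Returns the relative paths that are missing or differ from the
    manifest. An empty list means the restore test passed.
    """
    with tempfile.TemporaryDirectory() as scratch:
        # The archive is our own backup, so it is treated as trusted input.
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        restored = file_digests(Path(scratch))
    return sorted(
        name for name, digest in manifest.items()
        if restored.get(name) != digest
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        source = Path(work) / "data"
        source.mkdir()
        (source / "orders.csv").write_text("id,total\n1,9.99\n")
        backup = Path(work) / "backup.tar.gz"
        recorded = make_backup(source, backup)
        failures = verify_restore(backup, recorded)
        print("restore test passed" if not failures else f"failed: {failures}")
```

The point of the sketch is the design choice rather than the tooling: the check restores the data and compares it against what was recorded at backup time, instead of trusting that a “successful” log entry means the backup is usable.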
In all the various roles and responsibilities you fill as a CIO, don’t let Disaster Recovery, Business Continuity and Crisis Management take a back seat to the daily grind. It’s easy to neglect them, but hard to deal with the consequences when the time comes to put your plans into action.