Delta Air Lines has been in the news a lot in the last couple of weeks for all the wrong reasons. News reports have been full of pictures and videos of massive lines at airports, people sleeping at gates, and sob stories from stranded travelers, all thanks to a massive computer outage affecting all of Delta’s systems. If you’re a regular Delta traveler, it’s a nightmare, and if you’re in IT for Delta, these are probably some of the worst days of your career.
Though initial reports blamed the problems on a power outage at the utility company serving Delta’s main data center, as more details have come out we have learned that this massive outage began with an internal computer outage, after which a backup system failed to take over.
While embracing the schadenfreude of this situation is tempting, for data professionals responsible for high availability (HA) and disaster recovery (DR), this is exactly the scenario that keeps us up at night. One mistake in the design or implementation of an HA or DR solution can inconvenience (or worse) the users of our systems and possibly be a resume-generating event for ourselves.
At Pragmatic Works, I am fortunate enough to work with a variety of HA and DR solutions implemented by a wide range of customers. With this emphasis in my work, the Delta situation has given me a few thoughts about how to keep customers and friends from encountering a similar situation. Here are three tips that I sincerely hope keep your organization from repeating Delta’s mistakes:
First, make sure your entire team understands both the design and the processes surrounding your environment. When something breaks in the middle of the night and you have implemented most traditional SQL Server HA/DR topologies, the “red alert” call will involve both DBA and network operations personnel. If each group understands the design of the components that belong to the other, troubleshooting time will decrease significantly. That leads directly to a shorter outage for your customers and more sleep for you!
Second, test every failure and failover process you have on a regular, frequent basis. If the team’s successful response to a failure is muscle memory by the time something actually breaks, that will also lead to a significantly shorter incident, happier customers, and far less stress for your database and network operations personnel.
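One lightweight way to keep that testing honest is to track when each failover process was last exercised and flag the ones that have gone stale. The sketch below is only an illustration: the `FailoverDrill` class, the drill names, and the 30- and 90-day intervals are all hypothetical, and a real team would pick its own list and thresholds.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class FailoverDrill:
    """One failover or recovery process the team has agreed to rehearse."""
    name: str
    last_tested: date
    max_age_days: int  # hypothetical policy: longest allowed gap between tests

    def is_overdue(self, today: date) -> bool:
        # A drill is overdue once more than max_age_days have passed.
        return today - self.last_tested > timedelta(days=self.max_age_days)


def overdue_drills(drills, today):
    """Return the names of drills that need to be rehearsed again."""
    return [d.name for d in drills if d.is_overdue(today)]


if __name__ == "__main__":
    # Illustrative entries only; replace with your own processes and dates.
    drills = [
        FailoverDrill("availability group manual failover", date(2016, 5, 1), 90),
        FailoverDrill("restore from backup at DR site", date(2016, 8, 1), 30),
    ]
    for name in overdue_drills(drills, date.today()):
        print(f"Overdue failover test: {name}")
```

A report like this could run on a schedule and go to both the DBA and network operations teams, so that neither group is surprised by a process nobody has practiced recently.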
Third, and finally, please keep abreast of all the availability and disaster recovery options your platforms provide. For SQL Server data professionals, it is important to keep in mind that even on-premises HA/DR designs can include Azure options. From SQL Server 2014 on, Microsoft has made it fairly straightforward to add an Azure replica to your Always On Availability Group. Beyond Always On, using Azure VMs, it’s possible to add a third data center to any SQL Server HA/DR design and topology in case Mother Nature or other entities lay waste to the best-laid plans for your data centers and their failover processes.
It makes sense to leverage Microsoft’s investments in its Azure data centers and get those machines to work for your business. As data professionals, none of us want to end up on the news because of a mistake in design or implementation. Evaluate all your options, implement a variety that makes sense for your business, and test your processes to ensure that even the most surprising event isn’t a surprise to your team.