If you work with AlwaysOn Availability Groups in a production environment long enough, you will inevitably encounter a situation that forces you to failover your AG. If you’ve just started working with AGs, that failover is likely followed by a developer or application administrator calling you in the middle of the night cheerfully telling you “something is not right with the database”.
The look on your face when somebody says your AlwaysOn AG is down.
I know this because it happened to me in my early days of supporting AlwaysOn AGs in production, and the reason was never because the database itself was not alive and available. Thankfully, while AlwaysOn is absolutely a marketing term, it’s also an accurate description of a properly configured environment. Fortunately, Microsoft provides us a lot of helpful information covering basic setup and configuration of an AlwaysOn AG.
What isn’t centrally provided and what would have prevented a few late night phone calls to my phone over the years was a list of tips to configure the other things an application needs from an AlwaysOn AG configuration to smoothly operate after a failover.
While this is more of a server administrator than DBA tip, since it’s likely the DBA’s phone that will ring if this is configured incorrectly, we need to take an interest in what’s happening across our operations center. Correctly establishing server quorum votes, in most environments, is the responsibility of a server administrator. Rest assured, though, when the on-call phone rings at 4 am you’re the one waking up if this is not configured correctly.
While a deep dive into this topic is a blog in and of itself, the short version of the tip is this: if your AG spans datacenters, make sure that each datacenter has an even number of quorum votes in the cluster and that the cluster has a file share witness configured at a third, separate site. Basically, a typical Windows cluster needs a majority of voting cluster nodes online to maintain normal cluster operations. If the votes are evenly spread between two datacenters and you have a witness at a third site breaking the tie, losing just one datacenter will not bring your cluster down. While you can force start a cluster after it loses quorum, that is going to create extra, manual work for you both you and your server administrators and that work will be done under the utmost pressure because your database will be inaccessible to the application via the AG listener even though the servers themselves may be healthy. Trust me when I say it’s better to learn this lesson reading a blog than being woken up in the middle of the night with somebody yelling “we’re down!”
An important, but often overlooked tip is to configure a read-only routing list for your AG listener. I typically do this as part of the initial build-out of the AG because, without this, you lose one of the coolest features of AlwaysOn availability groups: the ability to route read-only connections to replicas. That feature can free up the primary to provide much more stable performance to the application(s) it serves. It also greatly decreases the possibility that your phone rings with somebody complaining about read-only queries slowing down (or worse yet, blocking) read/write queries.
This tip continues our trip up the stack and (hopefully) solves a problem that is easy to solve but can be very difficult to figure out under pressure. Many of the applications I’ve worked on over the years had a variety of database logins (using both Windows authentication and SQL Server authentication). While this is not ideal, there are often technical and/or business reasons why there are multiple logins (using both types of authentication) that need access to the database. In an AlwaysOn AG, particularly one where you have added replicas after the initial establishment of the AG, the replica addition may not have included setting up all the logins the application needs. In that scenario the failover will appear to have worked fine and led to happy application admins and DBAs…until components of the application begin malfunctioning because they don’t have access to the replica (now primary) database. Trying to fix this under a lot of pressure may lead you on a wild goose chase or three (as the application errors or complaining users may not always point you in the right direction), but if you follow the steps found here, it will walk you through the process of synchronizing logins between two servers. This process should be one of the first things you do when setting up a new AG replica instance unless you really and truly enjoy off-hours phone calls about broken applications!
The last two tips are more focused on keeping your AG happy, healthy, and running smoothly which should help keep you, the DBA, happy and healthy as well. Tip #4 is a general SQL Server maintenance best practice but can take on a bit of added importance when supporting AlwaysOn availability groups. Make sure you have a transaction log backup strategy in place and being actively monitored on all databases, but especially the databases participating in the AG. While the discussion of whether or not you should perform these backups on your primary or replica(s) (AlwaysOn gives you the option to choose the location of these backups) is an in-depth topic best left for another blog entry, regular transaction log backups help make maintenance and recovery of an AG-participating database much easier. While shrinking a transaction log is rarely, if ever, recommended, it’s a much more complicated operation in an AlwaysOn AG. Beyond that, a transaction log whose size is out of control can make database recovery a much longer process simply because of the size of the files you are moving/restoring – which means you are spending many more hours with your phone ringing off the hook with unhappy users. That is something that we’d all like to avoid.
My final tip, #5, should keep your network administrators off your back and your AGs as synchronized as possible. It’s also a fairly simple thing to do. If you have a large database within your AG, make sure you stagger the index maintenance that is performed during your regular index maintenance. Whether you rely on a maintenance plan within SQL Server, homebuilt index maintenance scripts, or one of the many solid index maintenance solutions available within the SQL Server community, ensure that the transaction log traffic your index maintenance generates (which can be surprisingly large) is not causing your remote replicas to fall behind your primary database. If your index maintenance is causing those remote replicas to fall behind by seconds or, worse yet, minutes, simply stagger your index maintenance jobs throughout the off-hours or at least make sure the index maintenance on your large tables is spread throughout the time allotted to you for index maintenance. This means that not only are the users who are using the replicas for ad-hoc querying and/or reports likely receiving up-to-the-second data but, in the event of a failover, your exposure to data loss is as minimal as your network and server infrastructure will allow.
I hope these five tips do indeed help keep your AlwaysOn availability groups, application administrators, and users happy. That should mean some happier DBAs as well! I look forward to taking deeper dives into a couple of these topics in the not-too-distant future, but for now I look forward to your comments and questions!