Principles for building a basic Disaster Recovery zone

I was recently tasked with building a 'disaster recovery zone' for a customer, which was to serve two purposes:

  • Keep a synced offsite copy of Duplicity backups (which are made to an interim 'backup' server with a big disk, in the primary DC)
  • Run a 'hot spare' instance of the customer's myriad of complicated, interconnected applications, but in a reduced-capacity environment (i.e. not an identical replica of the primary DC, which has a multitude of loadbalancers, replication, filer and db clusters etc., along with various operational and staging infrastructure). Reduced capacity because of the considerable cost of running an H/A environment - the customer does not want to pay for two of them.

The experience was interesting not so much for the very technical specifics (e.g. what needs to be synced, how often, and which credentials/connectivity details for the intercommunicating apps need to change because the Secondary DC doesn't identically resemble the primary), but from a process point of view. I settled on a handful of guiding principles:

Keep it simple

During a disaster, complexity is not the friend of a stressed mind.

In my case this meant all my sync logic for the various things (assets, databases, credential modifications etc.) was handled in a single fabfile, shipped inside a Puppet module. The fabfile had a separate task (function) for each 'sync' type, all of which could be run together or individually. Separating the tasks made it easy to schedule different syncs at different frequencies, while keeping them all in the one script gives the engineer a single 'go to' source for understanding the DR process.
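As a rough illustration only, a minimal fabfile along these lines might look like the following. The hostnames, paths and task names are made up for the example (this is not the customer's actual setup), and it assumes classic Fabric 1.x:

    # fabfile.py - one task (function) per sync type, runnable together or individually.
    # Hostnames and paths are illustrative; assumes Fabric 1.x.
    from fabric.api import task, local

    BACKUP_HOST = "backup.primary.example.com"   # interim 'backup' server in the primary DC
    APP_HOST = "app01.primary.example.com"       # one of the application servers

    @task
    def sync_backups():
        """Pull the Duplicity backup sets down from the primary DC's backup server."""
        local("rsync -az --delete %s:/srv/duplicity/ /srv/duplicity/" % BACKUP_HOST)

    @task
    def sync_assets():
        """Pull application assets (uploads, static files and so on)."""
        local("rsync -az %s:/srv/app/shared/assets/ /srv/app/shared/assets/" % APP_HOST)

    @task
    def sync_databases():
        """Pull the latest database dumps, ready to load into the local DB."""
        local("rsync -az %s:/srv/dumps/ /srv/dumps/" % BACKUP_HOST)
        # credential/connection-string rewrites for the reduced-capacity
        # environment would follow here

    @task
    def sync_all():
        """Run every sync in one hit."""
        sync_backups()
        sync_assets()
        sync_databases()

Each sync can then be run on its own (fab sync_databases) or all at once (fab sync_all), and each can be cronned at whatever frequency suits the data.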

Get your Primary DC house in order first

Your life gets a whole lot easier if your Primary DC uses config management such as Puppet. It's much harder to forget a component your stack needs if your D/R zone's node .pp file is an amalgamation of your production nodes, with memory tuning etc. tweaked appropriately for the smaller environment. You can then focus on the trickier task of determining what to sync, how often, and from where.

Decide on the sync direction

Since the sync is one-way, I had the option of 'pushing' syncs from the Primary to the Secondary. I opted for a 'pull' approach instead, for a couple of reasons:

a) Pulling means the Secondary DC does the heavy lifting, syncing from the various remote locations itself (the cron sketch after this list shows what I mean). Pushing would mean running sync scripts scattered across those locations in the Primary DC, and those servers might also need firewall rules and network routes to reach the secondary D/R machine via the VPN. Certainly do-able - but not simple.

b) Pulling from one location means I can 'fence off' the primary DC easily. I don't want the primary DC pushing data the moment it believes the network (or whatever else failed) has been restored - especially from multiple locations, where guaranteeing the fence-off would be hard (unless I blocked higher up the stack, e.g. outbound at the edge gateway, I suppose).
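To give a flavour of what 'the Secondary does the heavy lifting' looks like, the pulls might be scheduled on the D/R machine itself along these lines. The schedule, user and paths are invented for the example; the point is that nothing at the Primary DC runs any sync logic at all:

    # /etc/cron.d/dr-sync on the D/R machine (illustrative schedule and paths)
    # Every pull originates here; the primary DC pushes nothing.
    */30 * * * *  root  cd /opt/dr && fab sync_databases
    0 * * * *     root  cd /opt/dr && fab sync_assets
    0 3 * * *     root  cd /opt/dr && fab sync_backups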

Ensure you can 'fence' off the Primary DC easily and quickly

My biggest fear with a one-way sync was that the Primary DC would recover and a sync job would immediately pull in changes that overwrite the data on the secondary (which, after a failover event, has now become the primary). I added 'FENCE' logic to my firewall configuration that, when set to true, did the following (a rough sketch follows the list):

a) Firewall off traffic to/from the Primary DC's IP ranges (both external and internal, since there's a VPN)
b) Stop the VPN
c) Disable the sync crons
d) Stop the puppet agent (so it can't undo the changes above)
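In my case this lived in the firewall configuration managed by Puppet, but as a rough standalone sketch the fence might boil down to something like the script below. The IP ranges, VPN service name and cron file path are all placeholders:

    #!/usr/bin/env python
    # fence.py - cut the D/R zone off from the Primary DC (illustrative sketch only).
    import subprocess

    PRIMARY_RANGES = ["203.0.113.0/24", "10.10.0.0/16"]  # external and internal (VPN) ranges

    def sh(cmd):
        print("+ " + cmd)
        subprocess.check_call(cmd, shell=True)

    def fence():
        # a) firewall off traffic to/from the Primary DC's ranges
        for net in PRIMARY_RANGES:
            sh("iptables -I INPUT -s %s -j DROP" % net)
            sh("iptables -I OUTPUT -d %s -j DROP" % net)
        # b) stop the VPN
        sh("service openvpn stop")
        # c) disable the sync crons
        sh("mv /etc/cron.d/dr-sync /etc/cron.d/dr-sync.fenced")
        # d) stop the puppet agent so it can't undo any of the above
        sh("puppet agent --disable 'DR fence active'")
        sh("service puppet stop")

    if __name__ == "__main__":
        fence()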

Don't let your D/R zone depend on your Primary zone

Don't leave anything on the D/R machine depending on the Primary DC (except for the syncing itself). For example, I initially had the D/R machine set up as a Puppet client of the Puppetmaster in the primary DC. Then I realised that in a failure scenario this would prevent us from pushing any further config changes to the D/R machine while the Primary DC is down. Instead, I put a Puppetmaster in the D/R zone, and Jenkins deploys the same Puppet modules/manifests to it. This way I could even add other D/R machines to the Secondary DC and use this main machine as the Puppetmaster for all of them.
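On the agent side this just means puppet.conf on each D/R machine points at the Puppetmaster in the D/R zone rather than the one in the primary DC - something like the following, with a made-up hostname:

    # /etc/puppet/puppet.conf on a D/R machine (hostname is illustrative)
    [agent]
        # the Puppetmaster now lives in the Secondary DC, not the Primary
        server = puppetmaster.dr.example.internal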

Test!

Fudge your /etc/hosts file to point at the D/R zone, do a sync, then FENCE. Make sure your services in the Disaster Recovery zone behave as expected. If any functionality is missing from start to finish of the expected user experience, you have left something out - even if you think it's not important.
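For example, a couple of entries like these in /etc/hosts on a test workstation (the hostnames and address are made up) let you drive the applications in the D/R zone as though DNS had already been failed over:

    # /etc/hosts on a test workstation - point the app's hostnames at the D/R zone
    198.51.100.20   www.example.com
    198.51.100.20   api.example.com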

Document everything

...including the steps you took to perform the test. This eventually forms a nice Disaster Recovery Procedure for the business, because the steps you took to test are effectively your failover steps. Documenting the sync frequencies etc. also helps other stakeholders (especially other engineers) understand the process. Early on, it also helps others spot what you have forgotten!





After I stood all this up, I thought about how I could abstract it into some sort of Disaster Recovery 'product'. But I soon realised that every business is different, so every business's disaster recovery mechanics (at a technical level, at least) are always going to be different. What needs to be synced, from where, and how often does the data change? What's involved in fencing off the primary DC? There's no one-size-fits-all here, and many of these questions are not technical issues but business decisions (particularly sync frequency).

However, the principles listed above are pretty generic, and I think they could be applied to most organisations thinking about how to build a cheap DIY Disaster Recovery zone that does the bare minimum to maintain business continuity.

One thing that still troubles me is the dependency on DNS to handle a 'real' failover. Yes, I could have loadbalancers to do instant cutover to the 'backup' VMs - but loadbalancers have to live somewhere too - usually in the Primary DC. What if they're down? I see no way around DNS being the 'final say' in failover, and what a horribly uncontrollable one it is, even with very short TTLs.

Do you have your own recommendations or lessons learnt? I'm convinced I've forgotten things, so am interested to hear what others do.
