'So, what is it you exactly do?' - Part six, high availability

In this segment of 'what do you do, sysadmin?' I'll cover the area of high availability, building infrastructure that can withstand failure, and preparing for worst case disasters.

My (client's) site: the most important thing ever

It seems High Availability, for me at least, is duking it out with Security as the 'hot area' of sysadmin in recent years. Every agency's customer seems convinced of the importance of their application and how vital it is to the rest of us. As a result, the slightest bit of downtime is viewed as abhorrent. This can be a vicious cycle: you have supported an application to the point that it is very stable, and the customer has got used to problems being rare. Or the customer has been fed 'cloud' FUD from somewhere else.

Fact of the matter is, unless you are operating major payment systems (i.e. you are a bank), or downtime is life-threatening, I don't honestly believe anyone needs high availability. Nonetheless, I seem to be regularly setting such environments up.

Tiers of tears

High Availability, as you probably know, from a sysadmin perspective is about breaking your infrastructure into 'tiers' of service (HTTP layer, Database layer, Volatile File layer, Search layer etc), and then trying to ensure that each of those layers can withstand failure (keep running at some level so as to ensure your application continues to behave as expected). Understanding the layers or tiers of service is vital in terms of identifying how your application can scale (though scalability is a separate topic) and where the weak points are.

Understanding how the tiers are structured also requires an understanding of how the application itself works. Drupal, for example, totally depends on its database layer. The codebase itself tends to be somewhat static (doesn't change except on deployments or upgrades). But it also has a volatile storage area (the 'files' directory), where uploaded data goes. Awkwardly, that volatile data tends to live within the same file structure as the static codebase (most people with some level of sophisticated deployment will ensure that the 'files' area lives outside of the codebase filesystem for this reason).

WordPress and Magento are very similar with their 'uploads' and 'media' directories respectively (and Magento's 'var' directory also).

Single Points of Failure

Once you start to understand how your application is structured, you begin to recognise its 'weak' points. We call this the act of identifying single points of failure. A database is a single point of failure in Drupal, because Drupal can't function without it. An inability to access or upload files won't bring the site down, but will probably make a right mess of its appearance, as well as causing the loss of expected functionality (uploading), so we may as well consider this a single point of failure too.

And of course, at a wider level, the server providing the HTTP service is also a single point of failure.

High availability is about mitigating Single Points of Failure, typically through load-balancing. Ironically, there are often many single points of failure :)
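
To make the idea concrete, here is a minimal sketch in Python of spotting single points of failure from an inventory of tiers. The tier names, host names and the `single_points_of_failure` helper are all made up for illustration. The only rule it applies is that a tier with fewer than two nodes cannot survive losing one.

```python
from typing import Dict, List


def single_points_of_failure(tiers: Dict[str, List[str]]) -> List[str]:
    """Return the names of tiers that are served by fewer than two nodes."""
    return sorted(name for name, nodes in tiers.items() if len(nodes) < 2)


# Hypothetical inventory: tier name -> hosts currently providing that tier.
inventory = {
    "http": ["web1", "web2"],
    "load_balancer": ["lb1"],
    "database": ["db1"],
    "files": ["nfs1"],
}

print(single_points_of_failure(inventory))
# ['database', 'files', 'load_balancer']
```

Note how adding a second web server only fixed one tier out of four, which is the irony mentioned above.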

Databases

Aside from putting load balancers in front of the web apps to distribute the requests and dynamically pull web apps in and out of the cluster when they fail/recover, database replication is probably the next focus of people's attention for 'high availability'. Proper database H/A extends beyond creating read-only slaves (which for Drupal, which likes to write a lot of the time, is really only useful for scaling reads, not H/A) to ensuring that you can write to (at least) another master if your primary master goes down.

It is possible to set up multi-master MySQL replication natively - but you have to have a method with which to switch automatically to the other master if the first one fails. You can do this by assigning a 'floating' IP address to the active master, with old stable tools (which, on the other hand, aren't actively developed anymore) like MySQL-MMM. Or you can be a bit more 'modern' in your approach and look at solutions such as Percona XtraDB Cluster - though these have caveats you should be aware of, especially regarding InnoDB deadlocks.
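
The switching decision behind a 'floating' IP can be sketched roughly as below. This is not how MySQL-MMM or any other tool actually implements it. It is a simplified illustration in Python, and `choose_active_master` and the `db1`/`db2` names are hypothetical. The assumption I have baked in is that the current holder keeps the IP as long as it is healthy, so a master that comes back does not cause a second switch.

```python
from typing import Dict, Optional


def choose_active_master(current: Optional[str],
                         healthy: Dict[str, bool]) -> Optional[str]:
    """Decide which master should hold the floating IP.

    The current holder keeps the IP while it is healthy. Otherwise the IP
    moves to the first healthy master (sorted by name), or to nobody
    (None) if no master is healthy.
    """
    if current is not None and healthy.get(current, False):
        return current
    for name in sorted(healthy):
        if healthy[name]:
            return name
    return None


active = choose_active_master(None, {"db1": True, "db2": True})
print(active)  # db1
active = choose_active_master(active, {"db1": False, "db2": True})
print(active)  # db2, because db1 failed
active = choose_active_master(active, {"db1": True, "db2": True})
print(active)  # db2 still, db1 recovering does not move the IP back
```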

File servers

The biggest issue I see with H/A solutions is the file layer. It seems common to just throw the 'volatile files' on an NFS server which all web apps can mount with an NFS client. This does work, but it leaves the NFS server as a single point of failure. If it goes down, so do those mount points, which go stale and leave a growing pile of locked up processes on the web servers (and increasing load).

There are solutions for making your NFS highly available, such as using DRBD with Pacemaker and Corosync as a cluster which can tolerate one of the servers failing (even more if you use a cluster filesystem like OCFS2). There's also Gluster, which I have invariably had problems with and so have avoided as much as I can. Other solutions I have yet to look at include Ceph. If you have a lot of money, you can look at something like a NetApp cluster, which I have been 'on the other side of' (never have had direct access to one myself, just the NFS mount point) and it seemed quite stable.

For now, most of the Drupal sites I use have quite a 'dumb' files layer which isn't being abused all that much - just occasional uploads of single, reasonably sized files. For me, a DRBD+Pacemaker+Corosync+NFS solution, so long as I have quorum and a STONITH solution (more on that below), gives me the amount of control and stability I require.

Load balancers

The same goes for load balancers as it does for file servers: a single load balancer gives you H/A at the web app layer, but itself remains a single point of failure. It doesn't matter if your backend web servers are all healthy: they are unreachable if your load balancer goes down. So make sure you have H/A at your load balancer layer too. Services like Linode's NodeBalancer and Rackspace's Cloud Load Balancers have this built in.
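
What a load balancer gives you at the web app layer, pulling backends out of rotation when they fail and back in when they recover, can be sketched like this. It is a toy round-robin pool in Python, not the behaviour of any particular product. `BackendPool` and the `web1`..`web3` names are invented for the example, and real load balancers decide up/down from their own health checks rather than being told.

```python
from itertools import cycle
from typing import Iterator, List, Set


class BackendPool:
    """Round-robin pool that skips backends currently marked as down."""

    def __init__(self, backends: List[str]) -> None:
        self.backends = list(backends)
        self.down: Set[str] = set()
        self._ring: Iterator[str] = cycle(self.backends)

    def mark_down(self, backend: str) -> None:
        self.down.add(backend)

    def mark_up(self, backend: str) -> None:
        self.down.discard(backend)

    def next_backend(self) -> str:
        # Try each backend at most once per call before giving up.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate not in self.down:
                return candidate
        raise RuntimeError("no healthy backends available")


pool = BackendPool(["web1", "web2", "web3"])
pool.mark_down("web2")
print([pool.next_backend() for _ in range(4)])
# ['web1', 'web3', 'web1', 'web3']
```

The pool itself lives in one process on one machine, which is exactly the single point of failure described above.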

If you are rolling your own, e.g. with HAProxy, you can use the 'floating IP' approach again, with something like keepalived to ensure your 'passive' HAProxy LB can become the 'active' LB by claiming the floating IP (which presumably your site resolves to via DNS, or is behind a NAT that does). The beauty, at least in cases where your session management is elsewhere in the stack, is that you don't have to worry about state enforcement or integrity (i.e. replication).
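
The active/passive decision can be pictured as a simple priority election. The sketch below is loosely in the spirit of VRRP priorities, but it is my own simplification in Python and not keepalived's actual implementation or configuration. The `LoadBalancer` class, `floating_ip_owner` function and the priority numbers are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoadBalancer:
    name: str
    priority: int        # higher wins (hypothetical values)
    healthy: bool = True


def floating_ip_owner(nodes: List[LoadBalancer]) -> Optional[str]:
    """Return the healthy node with the highest priority, or None."""
    candidates = [node for node in nodes if node.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda node: node.priority).name


lb1 = LoadBalancer("lb1", priority=150)
lb2 = LoadBalancer("lb2", priority=100)

print(floating_ip_owner([lb1, lb2]))  # lb1
lb1.healthy = False
print(floating_ip_owner([lb1, lb2]))  # lb2
```

Because neither node carries any state here, the passive one can take over without anything needing to be replicated first.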

Majority rules: the simple power of quorum

'Back in the day' we used to roll out DRBD 2-node clusters with Heartbeat. Frequently we saw 'death matches' when connectivity was severed between the two nodes and they fought it out to claim the 'Primary' role. This is quite disastrous for data integrity and I won't go into detail here - google 'STONITH' and you'll see what I mean.

Essentially, STONITH, an acronym for 'Shoot The Other Node In The Head', was invented to help with this problem. A crude process, it was a pluggable way of tearing down your infrastructure in the event of such a disaster, based on the philosophy that it was better to lose high availability than risk integrity issues. And it's true.

STONITH is still vital in such setups, but a major mitigating factor of many of the scenarios that can require STONITH is quorum. In most clusters, the communication layer is regularly making decisions about the health of the environment and whether all parties agree on who is the 'primary', who is the 'secondary' and so on. Pacemaker does this, as does the Percona XtraDB Cluster. Quorum simply means deliberately 'imbalancing' this decision-making behaviour so that there is always a majority consensus.

For example, in the past, if both nodes in a two-node cluster thought the other node was down, it was impossible for either node to know if it was correct or not. With quorum, it should be possible to get consensus on who is right about which node is down.

In the case of Pacemaker - even where it is backing a DRBD two-node cluster - you can accomplish this by having a third node in perpetual 'standby' mode. It will still contribute to quorum, while you don't have to worry about it trying to claim the actual resources (e.g. it doesn't have to run DRBD).
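
The majority rule itself is tiny. Here is a sketch in Python of why the third vote matters. `has_quorum` is a hypothetical helper, not a Pacemaker or Corosync API, and it assumes every node carries exactly one vote.

```python
def has_quorum(votes_seen: int, cluster_size: int) -> bool:
    """True if this partition holds a strict majority of the cluster's votes."""
    return votes_seen > cluster_size // 2


# Two-node cluster, link severed: each node only sees its own vote.
print(has_quorum(1, 2))  # False on both sides, so nobody can safely decide

# Three-node cluster (two DRBD nodes plus a standby voter), one node cut off.
print(has_quorum(2, 3))  # True: the side that can still see the voter
print(has_quorum(1, 3))  # False: the isolated node should not claim resources
```

With an odd number of votes, a clean split always leaves exactly one side with the majority.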

Nonetheless, even with quorum, there can still be scenarios, or more specifically series of events, which can result in a STONITH-like situation. So a STONITH device is still necessary and I would not deploy without one, regardless of how crude it is.

Plan for irony: high availability breeds failure

My biggest gripe with H/A, other than the fact that I don't believe most people really need it, is that as we have seen above, it is no quick fix. Proper H/A requires identifying all the single points of failure and mitigating them. This inevitably breeds complexity into the overall architecture. Just look again at the diagram at the start of the article. It very quickly becomes far from being a simple web app environment. The problem with this is that complexity, the number of moving parts, increases the chance of some component failing.

Sure, this is not a problem, you might say: we have accounted for single points of failure, therefore we can tolerate such failure. And this is partially true. But even a small failure can (should) result in monitoring systems becoming aware, incident reports being generated, investigation and response. It should also be noted that in many cases, failure of a component can result in the loss of H/A, in that should the remaining node (say, the passive HAProxy load balancer which is now primary) fail as well, then you lose your service altogether.

In other words, in common H/A environments you can only tolerate failure for a short period if you want to reclaim your redundancy. And due to the complexity, you should expect to see such failure occur more often than you might with a single server.
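
A rough bit of arithmetic shows why more moving parts means more incidents. The sketch below is in Python and assumes every component fails independently with the same probability in a given period, which is a big simplification. The 2% figure and the component counts are made up purely for illustration.

```python
def chance_of_any_failure(per_component: float, components: int) -> float:
    """Probability that at least one of `components` independent parts fails."""
    return 1.0 - (1.0 - per_component) ** components


# Hypothetical: each box has a 2% chance of failing in a given month.
print(f"{chance_of_any_failure(0.02, 1):.1%}")  # 2.0%  (single server)
print(f"{chance_of_any_failure(0.02, 8):.1%}")  # 14.9% (8-node H/A stack)
```

The H/A stack will usually survive each of those failures, but someone still has to be paged, investigate, and restore the lost redundancy.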

While you are in a mindset of availability and failure, there are, of course, other problematic areas to consider. Power failure at the data centre. Hypervisor compromise. Maybe the Autonomous System (AS) advertising the IP range your infrastructure is on gets hijacked by a rogue BGP announcement and even though your stuff is healthy, you simply can't route to it. This opens up a whole other kettle of fish - that being Disaster Recovery, and having hot spare capacity, ideally with an entirely different hosting provider. I have written about some basic principles of disaster recovery design earlier.

Sometimes read-only slaves are OK

This article has focused on Drupal-style applications and what is demanded of H/A for it to work. And I dismissed read-only slaves at a database layer as part of this.

However, of course there are systems where this is enough. For example, if you depend on LDAP for your infrastructure, it is pretty trivial to set up LDAP slave replicators in read-only mode - it's largely 'set and forget'. This is enough for me because writes to LDAP (e.g. changing a password, or adding a new user) are less frequent, therefore we might be able to tolerate a period where the LDAP writer (master) is down, while still not losing the availability of the LDAP database in the meantime (thanks to the slaves).
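
The trade-off looks something like the sketch below: reads survive the master being down, writes do not. This is plain Python illustrating the routing decision only, not an LDAP client. `pick_server`, `ServiceUnavailable` and the host names are hypothetical.

```python
from typing import List, Set


class ServiceUnavailable(Exception):
    """Raised when no suitable server is up for the requested operation."""


def pick_server(operation: str, master: str,
                replicas: List[str], up: Set[str]) -> str:
    """Choose a server for a 'read' or 'write' given the set of servers up."""
    if operation == "write":
        if master in up:
            return master
        raise ServiceUnavailable("writes need the master, which is down")
    # Reads prefer the read-only replicas and fall back to the master.
    for server in replicas + [master]:
        if server in up:
            return server
    raise ServiceUnavailable("no directory server is reachable")


up_now = {"ldap-ro1", "ldap-ro2"}  # the master "ldap-rw" is down
print(pick_server("read", "ldap-rw", ["ldap-ro1", "ldap-ro2"], up_now))
# ldap-ro1
```

Logins and lookups keep working in that state, while a password change would have to wait until the master is back.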

In other words, you have to think about the nature of the application, how it is used, and what level of 'availability' loss you can tolerate.

Summing up: swallow your (client's) pride

When I see really basic websites that aren't really all that important except to the owner's flimsy, needy ego, suffer complexity and increase the likelihood of failure due to a demand for H/A, I get really annoyed. Often it means the narrow-minded project manager or pre-sales are fixated on dredging money out of the client no matter what (or are already frightened of a high-strung panicky client), and therefore don't challenge the client when they make unreasonable requests for H/A during the negotiation phase. It gets bolted into the contract and then it lands in the engineer's lap, just like all the other bad design decisions do for the frontend developer.

Don't misunderstand me: this is not a hatred of H/A - and some solutions really do justify it and should design for it - but a frustration with the typical problems associated with web development agencies: the divide between cash/commission-driven account managers/project managers/pre-sales and the poor lackeys who have to actually do the engineering at the end of the day. We need a 'DevOps' for sales and tech teams, basically. CashOps? :)

I would encourage anyone with the ability to influence the decision making process with a client early on, to try and challenge the client to explain their perceived need for H/A. Describe to them how it can breed complexity into an otherwise simple application that can actually result in the problems they are trying to avoid.

Preach the beauty of simplicity and if you can make them see through their pride, they may just realise their site can tolerate the 20 minutes of outage at 3AM (which no-one was around to notice) before the on-call engineer wakes up and restarts Apache before going back to bed.
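
For a sense of scale, here is a back-of-the-envelope sketch in Python. It assumes one such 20 minute outage in a 30 day month. That frequency is my assumption purely for illustration, and `availability_percent` is a hypothetical helper.

```python
def availability_percent(downtime_minutes: float, period_days: int = 30) -> float:
    """Percentage of the period during which the site was up."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


print(f"{availability_percent(20):.2f}%")  # 99.95%
```

That is a number worth putting in front of the client before they sign up for the extra complexity.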

You won't always win. But a good client will end up respecting you for challenging them in terms of 'the right thing to do' - even if they disagree.

Coming up...

The concluding piece: Part 7, communication: the most important part of my job.