One-touch provisioning and auto-monitoring of new servers

I've recently been doing some work for the very clever gents at Code Enigma, on a couple of interesting projects:

1. an automated 'zero-touch' dev/stage/live deployment system for their enterprise Drupal applications (developers no longer need to ssh in to servers to do deployments)

2. automatic 'one-touch' provisioning and configuration of new cloud hosting services.

(More on the dev/stage/live zero-touch deployment soon :) )

This post is about how I went about creating a 'one touch' provisioning system that builds everything from the new server itself, to automatic monitoring of the machine once it's up.

Building new machines in the cloud with Madelon and the Libcloud API

The easy part was out of the way pretty quickly: we accomplished most of the automatic provisioning of new VPSes at Linode by way of my open-source tool Madelon, which talks to the Libcloud API to build new machines.
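For context, here is roughly what the Libcloud side of that looks like; a minimal sketch with a placeholder API key, plan, image and datacentre, rather than Madelon's actual code:

# Minimal Libcloud sketch: provision a new Linode VPS (placeholder values only).
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver
from libcloud.compute.base import NodeAuthPassword

conn = get_driver(Provider.LINODE)('MY-LINODE-API-KEY')

# Pick a plan, OS image and datacentre from what the provider offers.
size = [s for s in conn.list_sizes() if '1024' in s.name][0]
image = [i for i in conn.list_images() if 'Debian' in i.name][0]
location = conn.list_locations()[0]

node = conn.create_node(name='foobar', size=size, image=image,
                        location=location,
                        auth=NodeAuthPassword('change-me'))
print(node.name, node.public_ips)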

Once upon a time, Madelon was a fairly static bit of code that built VPSes and then immediately ran Puppet on them: similar to Vagrant, but for public-facing systems. However, that was pretty much all it did.

Much of the work puppetising Code Enigma's stack fed directly into refactoring Madelon to support 'hookable' submodules that perform all manner of tasks once the machine is built. The most significant hook runs the Puppet manifests against the new server so that it works as we want it to, out of the box, but other hooks are possible too.
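I won't reproduce Madelon's internals here, but the shape of the hook system is roughly this (class and function names are illustrative, not Madelon's real API):

# Illustrative sketch only; Madelon's real hook API may differ.
class Hook(object):
    """A task to run against a node once the machine has been built."""
    def run(self, node, roles):
        raise NotImplementedError

class PuppetHook(Hook):
    """Apply the Puppet manifests so the new server works out of the box."""
    def run(self, node, roles):
        # In reality this would trigger a Puppet run on the new node,
        # e.g. over SSH; details omitted here.
        print('puppet run on %s with roles %s' % (node, roles))

def run_hooks(node, roles, hooks):
    """Call each registered hook in turn after the build completes."""
    for hook in hooks:
        hook.run(node, roles)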

Building from a webform and Jenkins

This amount of automation was not enough, however. Someone still had to SSH in to the 'control' server and run Madelon by hand. So I built a webform using Drupal that allows a staff member to submit a request for a new server and choose what 'roles' the server will perform: for instance, Nginx server, MySQL server, Varnish server, and so on.

The results of the webform submission are emailed to a Jenkins server, where a script parses the output of the form and talks to the Jenkins API to create a new Jenkins job instructing our 'control' server, which runs Madelon, to begin building the machine.
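The parsing script itself isn't shown here, but the Jenkins end of such a script can be as simple as a POST to Jenkins' remote API. A simplified sketch, assuming a pre-existing parameterised job rather than creating one from scratch, with placeholder URL, credentials and parameter names:

# Hedged sketch: trigger a parameterised Jenkins build via its remote API.
import requests

JENKINS = 'https://jenkins.example.com'
AUTH = ('apiuser', 'api-token')  # placeholder credentials

def request_build(hostname, roles):
    """Ask Jenkins to kick off a Madelon build for a new server."""
    resp = requests.post(
        '%s/job/provision-server/buildWithParameters' % JENKINS,
        auth=AUTH,
        data={'HOSTNAME': hostname, 'ROLES': ','.join(roles)},
    )
    resp.raise_for_status()

request_build('foobar.com', ['nginxservers', 'mysqlservers'])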

You might wonder why Jenkins is involved here: surely the webform could easily communicate directly with Madelon.

While this is true, I wanted the job to run through Jenkins so that the progress of the build was less opaque and could be observed in real time from within Jenkins. Additionally, I wanted to use Jenkins' built-in notification methods to alert us to failed builds, or notify an IRC channel on success.

Letting Puppet do the work

Jenkins passes the required 'roles' for the server to Madelon; these are read by the Puppet hook, and Puppet applies the corresponding configuration to the new 'node'.
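The mechanics of that hook aren't shown in detail here, but the essence is mapping the requested roles onto Puppet classes and writing a node definition for the new machine into the manifest; roughly along these lines (the class names and manifest path are illustrative):

# Illustrative sketch: append a node definition with its role classes
# to the Puppet site manifest. Paths and class names are placeholders.
ROLE_CLASSES = {
    'nginxservers': 'role::nginx',
    'mysqlservers': 'role::mysql',
    'varnishservers': 'role::varnish',
}

def add_node(fqdn, roles, manifest='/etc/puppet/manifests/nodes.pp'):
    """Write a node block so the next Puppet run configures the new server."""
    includes = '\n'.join('  include %s' % ROLE_CLASSES[r] for r in roles)
    with open(manifest, 'a') as f:
        f.write("\nnode '%s' {\n%s\n}\n" % (fqdn, includes))

add_node('foobar.com', ['nginxservers', 'mysqlservers'])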

So far so good: we had achieved a system of webform > new functioning server in what can be less than two minutes!

But not enough!

Autodiscovery monitoring

We also run redundant Nagios and Munin monitoring nodes that alert me to any issues across the infrastructure from multiple vectors. Once a new machine was built, I still had to modify the Nagios and Munin configurations by hand to add the new host into the groups being monitored.

Once again, having a 'hook' system in Madelon helped me. It was relatively easy to write a hook that injected a new Munin host into the manifest, which would trigger Puppet into deploying the new host config snippet into /etc/munin/munin-conf.d/ on the (puppet-managed) monitoring servers. I use definitions for this in Puppet so it really is a three-liner in the manifest:

  munin::host { 'foobar.com':
    address => '12.34.56.78',
  }

which invokes this definition in host.pp:

define munin::host (
  $address
) {

  include munin

  # Drop a per-host config snippet onto the Munin master; the template
  # fills in the hostname and address.
  file { "/etc/munin/munin-conf.d/${name}.conf":
    ensure  => present,
    require => File['/etc/munin/munin-conf.d'],
    content => template('munin/munin-host.conf.erb'),
  }
}
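The rendered snippet that lands in /etc/munin/munin-conf.d/ on the monitoring server is just a standard Munin master host stanza, something along these lines:

[foobar.com]
    address 12.34.56.78
    use_node_name yes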

Munin is easy here because it only needs a small, isolated host-specific config file, and it will automatically start monitoring the new host on the next cron run.

Nagios, however, was looking much more difficult to automate.

Googling around, I found many people developing very complicated manifests to achieve automatic monitoring in Nagios.

What bothered me most about these published attempts was that, for every host added to Nagios, the 'service checks' (e.g. check_ping and so on) were duplicated over and over again, with _$host appended to keep each service check in a unique namespace.

This might be OK for a couple of hosts, but with lots and lots of hosts, that is a lot of practically identical service checks for Nagios to use! This is what hostgroups are for, and we were already using hostgroups to 'group' one service check against an array of hosts. Much simpler.

However, the hostgroups.cfg file kept its hosts as a comma-delimited list of 'members' in this format:

define hostgroup {
  hostgroup_name  nginxservers
  alias           Nginx servers
  members         foo1, foo2, bar1, bar2, example, example2
}

I could not think of a sane way of having a Madelon Nagios hook 'append' new hostnames to these lists. I could mess about with file I/O in Python, but it was going to be a real hack and very messy. I really wanted the Munin approach: just drop in *one* file specific to that host. That would be the best of both worlds: add the machine to the hostgroup, without duplicated service checks all over the place.

Then, totally by accident, I was browsing the Nagios docs and reading about the host definition, and for the first time in all these years I discovered that you can declare 'hostgroups' as a list within the *host*-specific file, and remove 'members' lists from hostgroup definitions altogether! In effect, this reverses the arrangement in the configuration: each host declares which hostgroups it belongs to, rather than each hostgroup declaring which hosts are its members.

Eureka!
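A host-specific file can now carry its own hostgroups line, with no members lists anywhere; something like this (generic-host here is just the stock Debian nagios3 host template):

define host {
  use              generic-host
  host_name        foobar.com
  alias            foobar.com
  address          12.34.56.78
  hostgroups       common, nginxservers, mysqlservers, varnishservers
}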

I could now create Nagios host definitions in Puppet like so:

  nagios::host { 'foobar.com':
    address    => '12.34.56.78',
    hostgroups => 'common, nginxservers, mysqlservers, varnishservers',
  }

With the host.pp definition looking like this:

define nagios::host (
  $address,
  $hostgroups
) {

  include nagios

  # Drop a host-specific config file into Nagios' conf.d; the template
  # fills in the address and the hostgroups this host belongs to.
  file { "/etc/nagios3/conf.d/host_${name}.cfg":
    content => template('nagios/host.cfg.erb'),
    owner   => 'root',
    group   => 'root',
    mode    => '644',
    require => Package['nagios3'],
    notify  => Service['nagios3'],
  }
}

These attributes are injected, via the host.cfg.erb template in my Puppet module, into a host-specific config file in Nagios.
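The template is little more than the host stanza shown earlier with the Puppet variables dropped in; a sketch of what host.cfg.erb can look like (the exact variable syntax depends on your Puppet version):

define host {
  use              generic-host
  host_name        <%= @name %>
  alias            <%= @name %>
  address          <%= @address %>
  hostgroups       <%= @hostgroups %>
}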

The Nagios hook in Madelon, like the Puppet hook, detects which 'roles' were requested for this server from the webform / Jenkins, and 'explodes' them into the hostgroups list seen above. In other words, my hostgroups in Nagios directly match 'roles' in Puppet, allowing the machine to be instantly monitored with the role-specific checks (e.g. check_mysql because it's in mysqlservers, check_http because it's in nginxservers).

Hey presto! We can now submit a webform, have a running server minutes later, configured with our Puppet manifests so users, firewalls, tuned MySQL configs, running services and so on are all in place, and have our Munin/Nagios servers 'auto-discover' the new node and start monitoring it as soon as it's up and running!

What's Next?

1. Hopefully open-source some of the hooks discussed here (the Nagios and Munin ones, etc.)

2. Improve! Remove some assumptions and the rather ugly methods for modifying the Puppet manifests in real time

3. Automatic 'audit' reports of what servers are used for, what sites and services they run, etc.

4. A 'one-click' system that does all the above as well as implements our existing zero-touch site deployment tools (currently a separate webform > Jenkins system - more on this later :) )

For more info on all the cool stuff Code Enigma are doing, I highly recommend you visit http://www.codeenigma.com. Keep an eye out for these guys: they are building amazing stuff for big clients, and it's been a pleasure to help make their lives easier by solving these deployment problems :)

To hear more about Code Enigma's use of this system and the lessons learnt, and maybe see the process in action, you should come to DrupalCamp Toulouse in November, where I'll be presenting it (Skyped in from Australia at some evil hour of the GMT+10 night :) ) with Greg Harvey from Code Enigma.

Finally, my business is building big infrastructure like this. If these ideas appeal to you and your dev team, feel free to contact me and request a quote!
