'So, what is it you exactly do?' - Part four, monitoring

Here's a scenario...

At 4:30am every Thursday (the sysadmin's local time), a server's load suddenly spikes: a full backup runs at that time, and thanks to international visitors it is not actually an off-peak time for traffic.

Users visiting a site on that server receive a flurry of 502 errors trying to load content - a form of application timeout, caused by the backup process taxing the CPU.

No-one notices until, two months later, those 502 errors turn into 503s: a disk has become full (said weekly backup never purges old backups), and a database has subsequently become corrupt.

The application fails to bootstrap the database but, due to a programming oversight, exposes the database credentials in the 503 error page.

A nefarious visitor observes this. A few weeks later, once the database corruption has been fixed, they leverage their way onto the system through a security vulnerability in the application - a temp folder in the codebase left chmod 777 - and take a copy of the database using those credentials.

The database is leaked on pastebin.com.

This didn't happen to me - I made the scenario up (phew!) - but I think my point is already clear. All of this may have been actionable much sooner with some decent monitoring. Let me explore the ways:

  • The regular CPU spike each Thursday: something a monitoring tool like Nagios would likely notice and alert on. The pattern of being woken by a Nagios push notification at 4:30am every week would soon prompt the sysadmin to investigate its nature in more detail.
  • The 502 errors would be detected by a log-analysis monitoring tool such as OSSEC, alerted via e-mail or surfaced in routine OSSEC-generated reports, and could be cross-referenced with the spike in load.
  • The disk filling up over a few months: a growing problem or trend, which humans recognise easily in graph form. Something like Munin, or another RRD-based graphing interface monitoring disk usage, would show this.
  • The 503 errors would be noticed by Nagios and OSSEC, as would the database corruption - and probably many other alerts/symptoms of a full disk.
  • The poor permissions on the application folder would have been flagged by OSSEC, long before anyone exploited them.
  • The intrusion by an attacker leveraging the database credentials would (possibly) be detected by OSSEC.
  • The leak of the database on pastebin.com may have been detectable via a threat monitoring tool such as Scumblr, allowing the company to initiate a remediation procedure rather than being caught on the back foot when the leak is picked up by the mainstream media.
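As a sketch of the first two points, the load and disk checks are standard Nagios plugins. The host name and thresholds below are illustrative placeholders, not recommendations:

```
# Hypothetical Nagios object definitions. check_load and check_disk are
# standard Nagios plugins; 'web01' and the thresholds are placeholders.
define service {
    use                 generic-service
    host_name           web01
    service_description CPU Load
    check_command       check_load!-w 5,4,3 -c 10,8,6
}

define service {
    use                 generic-service
    host_name           web01
    service_description Disk Usage
    check_command       check_disk!-w 20% -c 10% -p /
}
```

With checks like these in place, the 4:30am load spike and the slowly-filling disk both become pages or e-mails rather than surprises.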

It becomes pretty clear that a chain reaction of very unfortunate events can be largely avoided, or caught in time, through effective monitoring.

No surprise like a bad surprise

In 2012 I observed a series of websites get hacked through an application vulnerability. In at least one case, it meant that payment information may have been exposed. We all (myself included) were about two weeks late to the discovery, counting from the time the intrusion occurred. There was not enough monitoring in place to detect it - and that's when I discovered OSSEC. It felt unbearable to know that something like that could sit under my nose for weeks, until a chance encounter while performing another task on the server allowed me to spot it.

Monitoring and security

Of course, the above scenario, and the 'real life' case I mention, focus on security in terms of hacks and leaks. But as I mentioned in the previous article, security is also about availability (disks not getting full!), integrity (data not being corrupted, and effective rollover of backup systems so that the backups themselves don't become corrupt), and trust (the public's opinion of the company that got hacked through its negligence). This is especially important to consider within the scope of 'security' for ISO27001-certified companies (or those trying to obtain that certification). I believe monitoring can help protect against all of these issues.

Some more thoughts on specific tools

Do I implement and use all of the above tools for the agencies I consult to? Yes indeed I do - some clients use all of them, others a subset, depending on their requirements.

Nagios

A tool like Nagios is a staple for me. It is old, and there is a growing disdain for it among hipsteradmins who pray to the Bleeding Edge gods, but it has plenty of value for me, so I use it. I am not saying it is the best; I am saying it does everything I need it to do.

Nagios is notoriously difficult to configure - not so when you define your Nagios configuration through Puppet. 'But you still have to do the Puppet configuration', say the trolls. Not so when you have written bots that do it for you on demand. Say I want to spin up a LAMP-stack server at Linode or Rackspace: my business is all about providing tools that take that requirement and automate it. That includes not just provisioning the server, but automatically hooking the server and its services into Nagios - with all configuration managed in Puppet. Welcome to effective systems administration in 2015.
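As a sketch of that approach: Puppet's built-in nagios_* resource types, combined with exported resources, let every node describe its own checks, which the Nagios server then collects. The resource names here are illustrative:

```puppet
# On each managed node: export a Nagios service check for this host.
@@nagios_service { "check_http_${::fqdn}":
  use                 => 'generic-service',
  host_name           => $::fqdn,
  service_description => 'HTTP',
  check_command       => 'check_http',
}

# On the Nagios server: collect every exported service check.
Nagios_service <<| |>>
```

Spin up a new server, run Puppet, and it appears in Nagios on the next agent run - no hand-editing of Nagios config files at all.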

'Nagios looks like crap' - yes indeed, by default it does. It's customisable with alternative themes, though.

Munin

Munin is also vital to me because it records historic issues visually - a spike in CPU or RAM, a growing disk-usage problem, a strange pattern in network traffic. It helps me analyse the past, and predict the future. Most of the time it helps me prove to hosting companies that their disk latency is too high, or that a 'noisy neighbour' is causing high CPU steal time on my server over the last (say) 16 hours.

Scumblr

Scumblr is a threat monitoring tool released by Netflix's OSS team, to which I have contributed fixes and features. It is really useful, especially for businesses with a recognisable brand or product that needs protection. But I have also used it to find cases where developers accidentally published GitHub repositories as 'public' when they were meant to be private - or published credentials (e.g. AWS keys) in those repositories.
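Scumblr itself is configured through its web UI, but the underlying idea - pattern-matching public content for things that should never be public - is simple. A minimal sketch in Python, using the well-known 'AKIA' prefix of AWS access key IDs (the function name is mine, not Scumblr's):

```python
import re

# AWS access key IDs follow a well-known format: 'AKIA' followed by
# 16 upper-case alphanumeric characters.
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def find_leaked_keys(text):
    """Return anything in `text` that looks like an AWS access key ID."""
    return AWS_KEY_RE.findall(text)
```

Point something like this at pastebin dumps or public repository contents, and you have the kernel of a leak detector.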

I have written more on Scumblr, specifically with regards to Pastebin monitoring, on my blog.

ELK etc

And of course there are other products I haven't discussed here, such as New Relic for performance metrics, or Pingdom as an additional 'uptime' vector (don't rely on an internal Nagios instance to give an accurate indication of your service availability - or to send notifications - if your upstream router is down!). A friend of mine is working on RadAlert.io, a next-gen monitoring solution that features predictive metrics.

There's also the 'ELK stack', comprising Elasticsearch, Logstash and Kibana. This suite can be useful as a centralised logging system with a flexible UI, capable of producing real-time and trend data in a similar fashion to Munin, but for any sort of data you want to aggregate (not just metrics, but log messages too). I have used the ELK stack to observe patterns in WordPress intrusion attempts and SASL dictionary attacks via OSSEC.
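As an illustration of how little Logstash configuration that can take: COMBINEDAPACHELOG is a stock grok pattern shipped with Logstash, and a filter for Apache access logs can be as short as this (the rest of the pipeline - inputs and the Elasticsearch output - is omitted):

```
filter {
  grok {
    # Parse the raw line into named fields (clientip, response, etc).
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    # Use the request's own timestamp rather than ingestion time.
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
```

Once the fields are parsed, spotting a spike in 502s or a flood of requests from one client is a Kibana query rather than a grep.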

Security in monitoring third party systems

In the modern era, many businesses rely on a variety of third party tools in order to operate. You probably use Google Apps, or at least some of that suite. If you're in the business of developing software, you might also use GitHub or Bitbucket. Perhaps you have a third party project management system like Pivotal or Atlassian Jira. Maybe Twitter is an important communication channel between you and your customers. Many a business laments publicly when these services fall down. It's a fact, even if it shouldn't be: businesses depend on these major companies to operate. So if you depend on such a service, shouldn't you be monitoring it? This is something I've seen ISO27001 auditors point out.

Most reputable businesses like those I've listed will feature a public status page showing a healthy or degraded service (or subset of services). When building 'mash-up' monitoring dashboards for some of my customers, aggregating that status information is something I like to include.
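Many of these providers expose a Statuspage-style JSON endpoint (GET /api/v2/status.json), which makes aggregation straightforward. A minimal Python sketch - the endpoint URL is an example, and you should confirm each provider's actual status page URL:

```python
import json
import urllib.request

# Example endpoint only; confirm each provider's actual status page URL.
STATUS_PAGES = {
    "github": "https://www.githubstatus.com/api/v2/status.json",
}

def parse_indicator(payload):
    """Extract the overall indicator; 'none' means all systems operational."""
    return json.loads(payload)["status"]["indicator"]

def poll_all(pages=STATUS_PAGES):
    """Fetch each status page and map provider name to its indicator."""
    results = {}
    for name, url in pages.items():
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                results[name] = parse_indicator(resp.read().decode())
        except OSError:
            results[name] = "unreachable"
    return results
```

Feed the results into your dashboard (or even into Nagios as a passive check) and a third party outage stops being a mystery you diagnose by refreshing Twitter.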

Summing up

The amount of monitoring really depends on the business's needs. What seems certain, though, is that almost any form of monitoring can drastically bolster the security and stability of a business's technical internals - and the business's general understanding of them. The smallest finger on the lightest pulse is still an assurance that something's alive and beating. So it should be for the systems you need running at their optimal capability.

My observations have consistently been that adding OSSEC and/or an ELK stack for log analysis alone exposes previously 'hidden' issues, simply because no-one was proactively checking /var/log/syslog. These can include the discovery of occasional PHP memory-limit exhaustion on specific pages of an app (not seen previously by you, but maybe by an end user). Or it might identify an awareness issue: a developer running 'sudo chmod -R 777' on something, which might call for a bit of re-education - and a report to your CISO, because why did that developer have sudo anyway? (Someone added him to sudoers on your sick day, so you didn't know.)
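Catching that chmod in the first place is exactly what OSSEC's file integrity monitoring (syscheck) is for. A sketch of the relevant ossec.conf fragment - the path and frequency here are placeholders:

```xml
<!-- Check the web root every two hours; report permission and content changes. -->
<syscheck>
  <frequency>7200</frequency>
  <directories check_all="yes" report_changes="yes">/var/www</directories>
</syscheck>
```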

The mere awareness that 'there happens to be a spike in SSH brute-force attacks this week' doesn't necessarily change anything for your business, other than making you more keenly aware of the threat landscape you're operating in. That makes you more prepared and more educated as a sysadmin, which means you can make your staff (or customers, if you consult like I do) more prepared and more educated too.

And, of course, monitoring is not just about tools: it's not something to 'switch on' and walk away from. No monitoring system is of much use if no-one is able to use it or understand what it's saying. That's something I'll cover in part five: troubleshooting, because a massive part of my role is interpreting problems and railing against the irritating sense of 'mysteries' or 'ghosts in the machine' in an industry defined by science. I have seen sysadmins who aren't capable of effective troubleshooting, or who make common mistakes in the attempt. That can be the difference between a good sysadmin and one who is just costing an organisation money without solving problems.

Coming up

Part Five: troubleshooting, or 'ghosts in the machine'.
Part Six: high availability and disasters (sometimes the same thing :) ).
Part Seven: communication.