'So, what is it you exactly do?' - Part seven, communication

Admittedly there is an irony in having a massive delay between the last article in this series and this one, which happens to be about 'communication' :) Sorry for the silence. It seems some of these articles have inspired some interest in the work I do, and as a result, I have been run off my feet with new business. There are worse problems to have.

Working as a consulting sysadmin kind of flips the sysadmin stereotype on its head. A consultant can't really operate in the basement of some office complex, rarely emerging to see the light of day, and bark at colleagues and customers alike. As a consultant I work for many organisations and so need to be available and responsive to all of them. And I have to (try and) be polite, even if I (often?) fail!

In addition, modern agencies have broken down a lot of those stereotypical office cultures. The sysadmin is a vital member of a team who is far more 'customer-facing' than he or she used to be. For example, large parts of my work involve preparing the technical components of RFPs to help my customers win new business themselves.

All of this points to the heightened importance of good communication as a vital skill for sysadmins, or any sort of consultant.

Building trust - proactive communication in a reactive role

It's not all beautiful in 'modern agencies'. Despite the latest DevOps crazes, the sysadmin still has a largely reactive role, simply because someone has to respond when things break. Automatic response to issues isn't always possible. Frequently, even post-mortem analysis of an incident requires the subtlety of a human brain to detect patterns or filter through irrelevancies. See the previous article on Troubleshooting for more on this.

The reality of a reactive role means that someone has to communicate about an issue. This might be simply notifying constituents that there is a problem. It might be providing an ETA, or regular progress updates if the issue is taking longer than desired to resolve. And it might need an incident report afterward. In all of these cases, the sysadmin would do well to expect questions - sometimes justified, sometimes irrational, sometimes inane. Now the sysadmin has two problems :) Dealing with a technical problem, and dealing with a people problem.

More and more, I find my work revolves around the ability to communicate my way through these issues.

I see frequent on-boarding of new customers at different agencies, and there often seems to be an initial period of panic and mistrust from the new customer. They may have been burnt by a previous hosting experience. They may simply be a real piece of work, which is why their previous hosting company dropped them. The slightest issue, or even a simple routine task, is blown out of proportion almost immediately by the panicky client rep. Or an idea of theirs has already been whipped into a panicked frenzy before it's even presented to you.

This is all about a lack of trust, largely because not enough time has passed to develop that trust.

It would be nice if customers gave you the benefit of the doubt: that you know what you're doing and that you actually want to help! Sadly this often isn't the case. The customer perhaps feels they can't give the benefit of the doubt because the 'risk' to the project (or really, their personal attachment to it) feels too great. A vicious cycle ensues.

Effective communication can fix all this.

The sysadmin operates at the 'pointy' end of risk, as someone recently described to me, where things can and do go wrong. It follows that the sysadmin witnesses, or directly experiences, the angst of a client. Optimism, diligence, even humour, delivered through frequent communication, will reassure the customer. Eventually, through a 'broken record player' cycle of repeated tasks with a majority of successful results, strung together with consistent communication, trust is established.

Consistency: communicating about incidents

Cross-reference historical tickets with new issues to link similar or related tasks. This also reassures the customer that you are paying deep attention not just to their application, but to the history shared between you both. It also helps you in future when trying to debug something that has happened before.

I am a big fan of the incident report. It might just be an e-mail, or a ticket, or an entry in a spreadsheet or other system. It need only be based on a consistent template, like the following:

Start time: YYYY-MM-DD HH:MM
End time: YYYY-MM-DD HH:MM
Duration: (diff between the above)
Discovered by:
Resolved by:

Summary: [ONE-LINE SUMMARY OF ISSUE]

Details:
At HH:MM the monitoring system alerted on-call engineers to an issue with the BLAH BLAH.

Upon investigation, it was apparent that BLAH BLAH had caused a BLAH BLAH due to a misconfiguration in the BLAH.

Engineers applied an on-the-fly change to the BLAH and restarted the service. 

Normal service returned shortly thereafter.


Mitigation:
Developers and system engineers are investigating the ideal configuration, which will then be folded into config management to ensure this issue doesn't recur.

If you have any queries about this incident, please contact BLAH BLAH.

My instinct is to provide as much detail as possible in these reports. However, I learned a year or two into my career that less-technical customers are easily overwhelmed by the content of incident reports. Try to keep the reports consistent (this is why a template helps) so customers get used to knowing where to jump to if they want a summary or a detailed synopsis. It is OK to provide a technical explanation in the details, but avoid asking questions or placing blame. Cause and effect is the goal.

I also try not to stress about any Mitigation details. We are human. Sometimes problems are hard to understand. Sometimes they even need to occur again, preferably not at 3AM, so we can analyse them in real time. It is OK to state that engineers are 'continuing to monitor and investigate the root cause.' Ideally, once you figure it out, you can update the incident report.
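The header of the template above is mechanical enough to generate, which also helps keep it consistent from one report to the next. What follows is only a minimal sketch in Python, not a tool I am describing from production: the function names `format_duration` and `render_header` and the sample values are made up for illustration, and it assumes both timestamps use the YYYY-MM-DD HH:MM format from the template and share a time zone.

```python
from datetime import datetime

# Matches the YYYY-MM-DD HH:MM fields used in the template above.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def format_duration(start: str, end: str) -> str:
    """Return the gap between two template timestamps as hours and minutes."""
    started = datetime.strptime(start, TIME_FORMAT)
    ended = datetime.strptime(end, TIME_FORMAT)
    if ended < started:
        raise ValueError("end time is before start time")
    total_minutes = int((ended - started).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


def render_header(start: str, end: str, discovered_by: str,
                  resolved_by: str, summary: str) -> str:
    """Fill in the top of the report; Details and Mitigation stay hand-written."""
    return "\n".join([
        f"Start time: {start}",
        f"End time: {end}",
        f"Duration: {format_duration(start, end)}",
        f"Discovered by: {discovered_by}",
        f"Resolved by: {resolved_by}",
        "",
        f"Summary: {summary}",
    ])


if __name__ == "__main__":
    # Hypothetical incident that crosses midnight.
    print(render_header("2015-06-01 23:50", "2015-06-02 01:05",
                        "monitoring", "on-call engineer",
                        "Web service unavailable"))
```

Only the header is generated here. The Details and Mitigation sections still need a human to write the cause and effect.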

Finding the fine line: too much vs not enough

You can, of course, get into the opposite sort of trouble if you withhold too much technical info. It can depend on the situation.

One example where more information is useful is in the wake of some sort of security issue (such as Heartbleed, POODLE, Logjam, Shellshock etc). Even if your systems are not affected or you have quickly mitigated the risk, customers can end up making public announcements (or statements for auditors) that are simply incorrect, because you told them 'everything is fine' when you really meant 'we're going through the process of fixing things up now, I'll let you know when we're all clear'. A customer might proclaim publicly 'Great news, we're not affected by [EVIL AND CATCHY-NAMED VULN]!'. In certain circles, particularly infosec, which seems especially plagued by immaturity, this is asking for trouble.

In other words: sometimes you need to give more information than you normally would. Think of it not as explaining to your customer, but as explaining to the people who are going to ask your customer (i.e. their customers, their auditors, or the public if the customer has a significant public image or brand). Or simply as protecting your customer from themselves...

This is hard, but effective communication takes practice.

The plan and the rollback plan

When planning a particularly high-risk task like a big deployment or a migration, you know things can go wrong three-quarters of the way through and require some sort of backing out.

I find it helps to write a step-by-step plan for the task itself, but even more importantly, a rollback plan at the end. When things go wrong, it can be stressful. Your brain doesn't work as well when stressed as it does when calm, because it's receiving conflicting instinctive instructions to deal with the threat. Having a rollback plan doesn't just fix the issue: it spares you from having to think, taking even more pressure off your brain so it begins to calm down again.

Communicate both the plan and the rollback plan to your team prior to commencement of the task. Get a second opinion on what sort of tests should demonstrate a successful completion of the task. Find out if people need to be notified in the event of rollback.

Other people will have good ideas that you didn't expect, which you can then factor into your plan.

You might just find that this communication has spotted something that, when added or corrected, means not having to roll back at all!
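For tasks that are scripted, the same idea can be written down as data: each step carries its own way of being backed out, the plan and the rollback plan can be printed for the team before starting, and a failure triggers the rollback in reverse order. The Python below is only a rough sketch of that pattern under my own assumptions. `Step`, `describe` and `run` are hypothetical names, not part of any tool mentioned in this series.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str          # what the step does, in plain language
    undo_description: str     # how the step is backed out
    apply: Callable[[], None]
    undo: Callable[[], None]


def describe(steps: List[Step]) -> str:
    """Render the plan and the rollback plan as text to share before starting."""
    lines = ["Plan:"]
    lines += [f"  {i}. {s.description}" for i, s in enumerate(steps, 1)]
    lines.append("Rollback plan (reverse order):")
    lines += [f"  {i}. {s.undo_description}"
              for i, s in enumerate(reversed(steps), 1)]
    return "\n".join(lines)


def run(steps: List[Step]) -> bool:
    """Apply steps in order; if one fails, undo the completed ones in reverse.

    The failing step itself is not undone here. Whether it should be depends
    on how far it got, which this sketch does not try to model.
    """
    completed: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            print(f"FAILED: {step.description} ({exc}); rolling back")
            for done in reversed(completed):
                done.undo()
            return False
        completed.append(step)
    return True
```

The output of `describe` is the part worth sending to the team: it is the thing people can read, question and improve before anything is touched.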

Customers hate silence

Frequently I see a customer ask a question or lodge a ticket for something to be done, and the engineer or the developer commences work on it but doesn't say so. Any number of issues can later occur (such as being dragged into meetings) that leave the customer feeling that no-one has even noticed their request. The customer gets irate (especially if trust is not yet established, and thus expectations and reality are out of proportion).

One customer I know would assign me something to deploy during an agreed routine 'maintenance window' at the end of the week. I knew I didn't have to do anything for 5 days. But the customer would get nervous that I had not even seen the task and that it might be missed. This didn't make them intolerable, and it didn't take much effort or time for me to acknowledge the task in the ticket and put their anxieties to rest. It could only work in my favour to have a relaxed client if I had to give them bad news about some other problem an hour later :)

Without being overly verbose, it can help to leave a quick note: "No worries, I'm just jumping into a meeting but I think this is a quick fix. I'll look at it shortly/give it to Alice to take a look while I step out". Alice might reply "I hit a bit of an obstacle here because [something you didn't even remember as you were thinking about your imminent meeting]". Then it comes back to you: "oh, you're right, I forgot about that. Sorry Client, this will take a little longer than I initially thought. I've run out of time today but it's on my TODO for tomorrow."

Flood

Excessive communication is, of course, a problem too. It creates a 'switch-off' mode where you are ignored, not taken seriously, or misinterpreted due to sheer volume.

Again, this is just about practice, balance, and being dynamic. Different people require different amounts of communication to either keep them happy or make them understand something important.

I would, however, say that you are less likely to run into issues by communicating too much than by not communicating enough.

The TODO system

A few years ago I recall being very overworked and quite inefficient at getting through tasks. I was forgetting some tasks, jumping from some tasks to others before finishing them, and generally making myself and the people who depended on me pretty unhappy.

Tom Limoncelli's book 'Time Management for System Administrators', which is quite small (important for people short on time!) and excellent, features a TODO 'cycle' approach which really helped me rotate through tasks over weeks, moving things I didn't finish one day to the top of tomorrow's list. Also significant was his recommendation to always take a note of any new requests. We all know people have an annoying habit of sidling up to your desk while you're busy and asking for something. Sometimes demanding they use the ticket system is just a waste of your own time: you know these people won't leave until they are convinced you'll help them.

You would be surprised at the difference it can make to respond "OK, sure. I don't have time right this second as I need to finish this other task. But I'll note this down so it doesn't get forgotten." Ensure the person sees for themselves that you are writing it down. Absurd as this procedure may seem, it validates for them that you have acknowledged not just their issue, but them personally. It's also a lot less abrasive than barking "why can't you open a TICKET???!"

This is trickier in a distributed team, but that's where your ticket system comes in handy (and you were only going to add that item to the ticket system later, right?). Assign yourself a ticket and ensure the requester is notified of it, if your ticket system supports it. Who knows: they might even think they can save some time/effort in future and just make the ticket themselves. Stranger things have happened and will happen!
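The daily carry-over is simple enough to show in a few lines. This is a deliberately simplified Python sketch of the idea as I described it above (unfinished items go to the top of tomorrow's list, newly noted requests go underneath), not a faithful rendering of the full system in the book. `Task` and `plan_tomorrow` are names I have made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    title: str
    done: bool = False


def plan_tomorrow(today: List[Task], new_requests: List[str]) -> List[Task]:
    """Build tomorrow's list: unfinished tasks first, then newly noted requests."""
    # Fresh Task objects, so ticking things off tomorrow leaves today's record alone.
    carried_over = [Task(task.title) for task in today if not task.done]
    return carried_over + [Task(title) for title in new_requests]


if __name__ == "__main__":
    # Hypothetical day: one task finished, two not, one request noted at the desk.
    today = [Task("patch web01", done=True), Task("renew cert"), Task("RFP draft")]
    for task in plan_tomorrow(today, ["add DNS record"]):
        print("[ ]", task.title)
```

The point of writing the walk-up request into `new_requests` straight away is the same as writing it on paper in front of the person: it is recorded somewhere that tomorrow's list is built from.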

Summing up (this series)

You can take all the above with a grain of salt since it took so long for me to communicate this blog post! What do I know?

Well, I don't know a lot, but I hope this series has been interesting to other sysadmins, as well as anyone working in digital agencies, or even consultants of any sort. I think the nature of consultancy carries a lot of common elements across a wide range of job descriptions, technologies and industries. This is just a snapshot of how it fills up most of my time.

If you have questions or feedback you are welcome to comment here, in any of the previous articles in the series, or via the contact form.

Thanks!

Previous articles in this series

Part One - Continuous deployment
Part Two - Config Management
Part Three - Security
Part Four - Monitoring
Part Five - Troubleshooting
Part Six - High availability and disasters