#sysadminsucks

The Path to Peer Review

As a sysadmin, I’ve never been part of a team that did peer review well. When you’re doing a piece of work and you want to check you’re doing the right things, how do you get feedback?

Do you send an email out saying “can you look at this?”

Do you have someone look over your shoulder at your monitor?

Do you have discussions about what you’re going to do?

Sometimes I’ve done these things, and they’ve been partly effective - but usually they just get a reply of “Looks good to me!”

Most of the sysadmins I’ve worked with will just make a change because they want to, or were asked to, and don’t tell anyone. Or they want to be heros who surprise everyone by sweeping in with an amazing solution that they’ve been working on secretly.

I don’t want to work where people are trying to perform heroics. I hate surprises. Having done all those individualistic stupid things myself, I want to work in a team where we work together on problems out in the open. When you involve others, they are more engaged and feel part of the decision making process. And their feedback makes you produce better work.

My current role now has the best culture of reviewing work that I’ve experiences. But we had to create it for ourselves, and this is what we did.

We put all our work in version control

When I started this role, the first project I worked on was to move our configs into git. Most of those configs are stored on shared NFS volumes which are available on all hosts. Previously people made config changes by making a copy of the file they wanted to change and then made the change to the copy. Once that is ready they would take a backup of the production copy of the file, and copy their new version into place.

Importing the files into repos was generally straightforward, but sometimes there was automatically generated content that needed excluding in .gitignore. To deploy the repo we added a post-update hook on the git server that would run git pull from the correct network path.

But we also wanted to catch when people would change things outside of git and do so without overwriting their changes. This would allow us to identify who was changing things and make sure they knew the new git based process. To do this we added a pre-receive hook on the server that would run git status in the destination path and look for changes.

That was the first step in changing the way we worked. It wasn’t a big change, but it got everyone fairly comfortable with using git. The synchronous nature of the deployment was a big plus too, because all the feedback about success or failure would happen in your shell after running git push

Then we generated notifications

This was great progress, and we got more and more of our configs into this system over time. The next thing we wanted to do was to produce notifications about what’s changing. This would allow us to catch mistakes, find bad changes that broke things, and to understand who is making changes to what systems.

We did two things to achieve this - we made all these repos send email changelogs and diffs whenever someone pushed a change, and we created an IM bot that would publish details changes into a group chat root.

This was great to produce an audit trail of changes happening to the system. But whenever you saw a change break something, in hindsight you thought - well I knew that change would break, I could stop that before it was deployed.

Finally, we added Pull Requests

We knew we wanted to implement Pull Requests but we couldn’t do this with the git server software we were originally using.

Since we had started using Confluence and JIRA for our intranet and issue tracking, we moved all our git repos to Stash, which is Atlassian’s github-like git server. This provided us with the functionality we wanted.

On a single repo, we enabled a Pull Request workflow - no one could commit to master any more and everyone had to create a new branch and raise a PR for the changes they wanted merged.

We wrote up some documentation on how to use git branching and raise PRs via the stash web interface, and explained to everyone how it all worked.

We chose a repo that was relatively new, so the workflow was part of the process of learning to work with the new software. It was also one that everyone in the team would have to work with, so that everyone was exposed to the workflow as soon as possible.

To deploy the approved changes, we used the Jenkins plugin for Stash to notify Jenkins when there were changes to the repository. Jenkins then ran a job that did the same thing as our previous post-update hooks - ran git pull in the correct network location.

Running the deployment asynchronously like this felt like we were losing something - if there was a problem you were sent an email & IM from Jenkins, but this felt less urgent than a bunch of red text in your terminal. But for the benefit of review, this was a price worth paying.

In the interim we had moved the checks for changes being made outside of git into our alerting system, so we could catch them earlier than when the next person went to make a change. This meant we didn’t have to implement these checks as part of the git workflow - if there was still a problem we let the hook job fail, and it could be re-run from Jenkins once the cause was resolved.

Over time, we moved all the other repositories across to this workflow, starting with the repos with the highest number of risky changes. But, for a number of repos, we kept the old workflow with synchronous deployment hooks because all the changes being made were low risk and well established practises.

Initially, slowing down is hard

The hardest thing to adapt to was changing the perceived pace that people were working. Everyone was very used to being able to make the change they wanted immediately, and close that issue in JIRA straight away. That’s how we judged how long a piece of work takes, but that doesn’t take into account the time spent troubleshooting and doing unplanned work.

What we were doing was moving more of the work up front, where you can fix problems with less disruption. But making that adjustment to the way you work can be really hard because you perceive the process to be slower.

Everyone in our team is a good sport and willing to give things a go - but as much as we could try to explain the benefit of slowing down, you only realise how much better it is by doing it over and over, and that takes time.

Sometimes it’s necessary to move faster, and we manage that simply - if someone needs a review now, they ask someone and explain why it’s urgent. As a reviewer, if I’m busy I’ll get people to prioritise their requests by asking “I’m probably not going to be able to review everything in my queue today. Is there anything that needs looking at now?”

In time the benefits are demonstrated

By using Pull Requests we created an asynchronous feedback system. First you propose what you’re going to do in JIRA. Then you implement it how you think it should be done and create a PR. Then when a reviewer is available they’ll provide feedback or approve the change. You keep making updates to your PR and JIRA until the change is approved or declined.

With time, everyone experienced all of the following benefits of that feedback:

Catch mistakes before they are deployed

This was what we set out to do! There were breaking changes made before, and there are less of them now.

Learn about dependencies between components

A common type of feedback is “How is this change going to affect X?” - sometimes the requestor has already considered that and can explain the impact and any steps they took to deal with X. But if they haven’t considered it, they need to research it. That way they learn more about how things are connected and have a greater appreciation of how the system works as a whole.

Enforce consistent style and approaches

Everyone has their own preferred style of text editing and coding. With a PR we can say this is the style we want to use, and enforce it. The tidier you keep your configs and code, the more respect others will have for them.

It’s massively helpful to be told about an existing function that achieves what you’re trying to do, or there’s existing examples of approaches to solving a problem. This can help you learn better techniques, and avoid duplicating code.

Identify risky changes

With changes to fragile systems, you’re never confident about hitting deploy even if the change looks good. Until it’s in production and put through it’s paces there is risk. So this has allowed us to schedule deploying changes for times where the impact will be lower, or deploy the change to a subset of users.

It’s also stopped “push and run” scenarios - we avoid merges after 5pm, they can always wait for tomorrow morning!

Explain what you’re trying to do

Much of the review process is not about identifying problems with configs and code, but simply being aware of and understanding the changes that are taking place. This is invaluable to me as a reviewer.

So, when raising a Pull Request, it’s expected that each commit has a reference to the JIRA issue related to the change. The Pull Request can have a comment about what is changing and why, and that can be very helpful but the explanation of why the change is taking place must be in the JIRA.

This way, when looking back at the change history we can also reference the motivations for making that change and see the bigger picture beyond the commit message.

People want to have their work reviewed

Somewhere along the way, getting your work reviewed became desirable. You realise the earlier you put your work out there for review, the earlier you get feedback and the less likely you are to spend time doing the wrong thing.

We still have a number of repos that anyone can just change by pushing to master, there’s no controls because these changes are considered safe. But people choose to create branches and PRs for their changes to these repos, because they want to have their work reviewed.

Change takes time…

Going from no version control to making this cultural change in the way we approached our work took about 2.5 years.

Throughout this period there was no grand plan or defined scenario that we were trying to achieve. At each stage, we could only see a short way forward. We had some things we were looking to do better, so we experimented. When we found what worked, we made sure everyone kept doing it that way.

At the start, I had no idea that peer review of sysadmin work would be done via code review. As we’ve moved things into git and we saw the benefits, we want to manage everything the same way and that drives moving more things into git.

We moved slowly because there was other work going on and we needed to get people comfortable with the new way of working and tooling before we could ask more of them. Change takes time, and many of our team have benefitted from seeing that incremental process take place.

It’s been gratifying that the newest team members who have joined since we established PRs have said “I was uncertain about it at first, but now I get it. It really works.”

… and involves lots of painstaking work

To get to where we are has meant spending a lot of time setting up new repositories and isolating files that are managed by humans and automatically generated, migrating repositories to stash, creating deploy hooks, explaining how git works, and most of all making sure you’re providing useful reviews and that they happen regularly.

It’s one thing to start setting up a couple of repos like this – but to fully establish the change you need to do lots of boring work to make sure everything is migrated, even the more difficult cases. It’s important that everything is managed consistently, within one system.

Monitorama Roundup Part 2

Part one is available here

During the second day there were multiple tracks available - I mostly followed the workshop track, only catching one presentation on the speaker track.

Florian Forster, collectd - Collecting custom metrics

Slides

Described the collectd data model and how to use the Exec plugin to execute arbitray scripts for collecting custom metrics.

Also presented a statsd plugin - implementation of statsd network protocol inside collectd.

Abe Stanway - Kale (Skyline and Oculus)

Slides

Presentation of architecture for the kale suite of tools recently released by etsy.

You can find a better description on the etsy blog

Skyline - Analyse time series for anomalies in (almost) real time

The setup at etsy includes 250k metrics, which takes about 70sec for anomaly discovery, and requires 64GB memory to store 24 hours of metrics in memory.

  • carbon-relay is used to forward metrics to the horizon listener
  • metrics are stored in redis
  • data is stored in redis as messagepack (allows efficient binary array stream)
  • roomba runs intermittely to clean up old metrics from redis
  • amalyzer does it’s thing and writes info to disk in JSON for the web front end

To identify anomalies, skyline uses some of the techniques that Abe talked about in his presentation the previous day. It uses the consensus model, where a number of models are used and they vote - so if a majority of models detect an anomaly, then that is reported.

Oculus - analyse time series for correlation

Oculus figures out which time series are correlated using Euclidian Distance - the difference between time series values.

It also uses Dynamic Time Warping - to allow for phase shifts, if the change in one series occurs later than in the other. But this is slow, so it’s targeted on the time series to could be correlated by comparing shape descriptions.

  • data pulled from skyline redis and stored in elasticsearch
  • time series converted to shape description (limited number of keywords that describe the pattern)
  • phrase search done for shared shape description fingerprints
  • run dynamic time warping on that data

Pierre-Yves Ritschard - Riemann

Slides

Unfortunately this presentation was marred by a projector/slide deck failure which made it hard to follow. It was incredibly disappointing because there was a lot of positive discussion around Riemann and I was looking forward to a better exposition of the tool.

Riemann is a event stream processing tool.

  • All events sent to riemann have a key value data structure - for logs, metrics, etc.
  • You can use all your current collectors: collectd, logstash etc, and has an in app statsd replacement
  • Those events can be manipulated in many ways, and sent to many outputs
  • The query language, clojure, is data driven & from lisp family
  • There is storage available for event correlation but I didn’t really understand from his discussion

Devdas Bhagat - Big Graphite

Slides

This workshop covered how booking.com have scaled their graphite setup.

They have somewhere in the region of 5000 hosts, multiple terabytes of data stored in whisper.

Both IO and CPU have become bottlenecks and in each instance they have thrown hardware at the problem to run more agents and shard their data.

I/O probems:

  • Ran into IO wall, disks 100% writing. Lots of seeking
  • Have ended up using SSD drives in RAID0
  • Sharding becomes hard to maintain and balance
    • Don’t know in advance in which namespace metrics will be created
    • Rebalancing tricky when adding more backends - they have a script that replicates graphite hashing function and manually move things
  • Found SSDs not as reliable as spinning disk under high update conditions
    • Lots of drive failures, so replicate data to separate datacenters to provide availability
  • FusionIO performance no improvement from SSDs (I find this hard to believe)

CPU problems:

  • Relays start maxing out CPUs
    • Multiple relay hosts (per datacenter)
    • Multiple relays on each host (1 per core)
    • Use haproxy load balancer (prevent losing metrics)

Other Problems:

  • Software breaking on updates (whisper failure on upgrade)
  • People can use the software in surprising ways - not time series data, or one record a day

David Goodlad - Infrastructure is Secondary

Slides | Video

David presented that your primary metrics should be your business metrics, and that your infrastructure metrics are secondary. Which is not to say infrastructure metrics aren’t important, but the business metrics are measures of how your business is performing and this is the data which you should be alerting on.

  • Alerts should be informational and actionable - not just “cool story, bro”
  • Consider what matters to customers - i.e. instead of measuring queue size, use time to process
  • When a business metric alerts, then correlate against infrastructure monitoring
  • Keep the infrastructure thresholds, but don’t alert on that information - you can access it when necessary

He gave a great one liner to help decide what you should be measuring - “What would get your boss fired?” - Measure these things deeply.

Also pushed the idea of sharing your information outside the team - to be more transparent and visible to the rest of the business, particularly since the information you hold are business metrics. This will provide feedback about the metrics which are important to others.

Michael Gorsuch - Graph Automation

This workshop was a practical introduction to instrumentation and exploration - stepping through configuring StatsD, graphite and collectd to instrument an application and exploring graphs using descartes.

Monitorama Roundup Part 1

Over the past two days I’ve attended the Monitorama conference in Berlin. The conference covers a number of topics in Open Source monitoring, and reflects the cutting edge in techologies and approaches being taken to monitoring today. It also acts as a melting pot of ideas that get used by lots of people to take those ideas further, or act on them within their businesses.

There were a number of ideas that ran throughout many of the talks, which I considered the key take aways from day one:

  • Alert fatigue is a big problem - Alert less, remove meaningless alerts, alert only on business need and add context to alerts.

  • Everything is an event stream - metrics, logs, whatever. Treat it all the same (and store it in the same place).

  • The majority of our metrics don’t fit standard distribution, which makes automated anomaly detection hard. So people are looking for models that fit our data to do anomaly detection.

  • Everyone loves airplane stories. I heard at least five, three of which ended in crashes.

The first day had a single speaker track, here is my personal summary and interpretation of all the talks:

Dylan Richard - Keynote

Video

This was an experience talk about the Obama re-election campaign, what they did that worked, what they wished they had etc.

Essentially they used every tool going, and used whatever made sense for the particular area they were looking at. But they weren’t able to look carefully about their alerting and were emitting vast numbers of alerts. To deal with this, they leveraged the power users of their applications, who would give feedback about things not working and then they’d dig into the alerts to track down the causes. To improve on that feedback, they also created custom dashboards for those power users so they could report more context with the problem they were experiencing.

Alert fatigue was mentioned - “Be careful about crying wolf with alerts” because obviously people turn off if there’s too much alerting noise, and subsequently “Monitoring is useless without being watched”.

Danese Cooper - Open Source Past Future

Video

Gave a presentation about the history of open source, where it stands today, and advocated participation in open source and the institutions in that space.

Abe Stanway, Etsy - Anomaly Detection with Algorithms

Slides | Video

Abe gave a great talk on the history of Statistical Process Analysis and how it is used in quality control on production lines and similar industrial settings. This is all about anomaly detection by looking for events that fall outside three standard deviations from the mean. But unfortunately this detection process only works on normally distributed data, and almost none of the data we collect is normally distributed. He followed this with a number of ideas for approaches to take to describe our data in an automated fashion - and then called on the audience to get involved in helping develop these models.

He also gave an interesting comparison between current monitoring being able to provide us with either a noisy situational awareness, or more limited feedback using predefined directives.

Mark McGranaghan, Heroku - Fewer Better Systems

Slides | Video

Presented the argument that the best systems are the ones that are used constantly - for example failover secondaries often do not work because they have not been subjected to live running conditions and the associated maintenance. So he suggested a bunch of things that are done differently now, which could be done the same, so we get better at doing less things:

  • Metrics, Logs & Events - can all be considered events, so make them the same format as events
  • Metric Collection & Alerting - often collect the same data - collect once, and alert from your stored metrics
  • Integration testing and QOS monitoring - share a lot of same goals, so do them the same
  • Unlike results, errors get specific code for dealing with them - instead treat them as a result but send data that presents the error

Katherine Daniels, Gamechanger - Staring at Graphs as-a-Service

Slides | Video

Gave a great practical talk about how monitoring systems fail, and the series of small decisions that get made to improve individual situations but make the whole system, and the operator’s experience, much much worse. Essentially, everything that create monitoring systems with screens full of red alerts, that have been that way for a long time and with no sign of resolution.

As a way forward, she suggested the following:

  • Only monitor the key components of your business - remove all the crap
  • Find out what critical means to you - understand your priorities
  • Fix the infrastructure - start with a zero error baseline
  • Plan for monitoring earlier in the development cycle (aka devops ftw)

Lindsay Holmwood, Psychology of Alert Design

Slides | Video

Lindsay presented a number of stories, from air disasters and hospitals, to present two key ideas when designing alerts:

  • Don’t startle or overload the operator (reduce notifications)
  • Don’t suggest, expose (provide more context - give relevant situational data at the same time)

Theo Schlossnagle - Monitoring what the hell?

Slides | Video

This talk covered a lot of ground quite quickly, so these were the key things I took away:

  • Monitor for failure by reviewing problems that you’ve identified before, create detailed descriptions of those events, and only them alert on those descriptions - so that you have enough context to provide with an alert so that it’s meaningful and actionable to the receiver.
  • Alert on your business concerns
  • Store all your data in the same place - treat logs, events and metrics the same
  • Our data isn’t normally distributed and that makes shit hard. The next leaps in dealing with this data will come from outside Computer Science - more likely to be the hard science disciplines.

Michael Panchenko - Monitoring not just for numbers

Slides | Video

Most of this talk was about the problems of configuration drift, and how subtle differences of systems outside configuration management policy scope can yield big surprises.

Michael presented his dream of enhancing numerical monitoring with non-numerical and non-binary observations, with some suggestions:

  • Monitoring of infrastructure state
  • Providing an audit trail of categorical data
  • Bring able to compare states across nodes and time

He also suggested describing activity in a standardised context: who, what, when.

Jarkko Laine - Let your data tell a Story

Slides | Blog | Video

This was a really interesting talk about how humans process ideas and information, how these lead to biases in analysis - and how to exploit them.

The talk was an exposition of two main ideas:

  • Attention and Memory are limited
  • Tell stories to engage the audience

This was condensed to two key directives for dashboard design and using visualisations as a tool for communication:

  • Minimise eye fixations
  • Maximise data-ink ratio

Ryan Smith - Predictable Failure

Slides | Video

This talk featured the best airplane (near-) disaster stories as Ryan was extremely enthusiastic about the content. These provided pertinent links to understanding failure in IT environments.

The most important was learning about failure modes form other users - official documentation will lack in this respect, so you need to hear the war stories from your peers

He also described a bunch of other failure cases, particular related around redundancy and the complexity that ensues.

Daniele De Matteis & Harry Wincup, Server Density - Monitoring, graphs and visualisations

Video

This was a presentation from the designer’s perspective on the theory of how visualisations should be presented, based on the experience of creating the new Server Density interface.

They outlined various design principles they strived for:

  • Consistency
  • Context
  • Clarity (less is more)
  • Perspective (i.e. vertical alignment for context)
  • Appeal (Pleasant user experience)
    • Consistent graphical elements, white space, horizontal and vertical flows, contrast between elements
  • Control (Let user find next path, but make sure it’s only click away)

Testing Logstash Configs With Rspec

At work I’m supporting a rails app, developed by an external company. That app logs a lot of useful performance information, so I’m using logstash to grab that data and send it to statsd+graphite.

I’ve been nagging the developers for more debugging information in the log file, and they’ve now added “enhanced logging”.

Since the log format is changing, I’m taking the opportunity to clean up our logstash configuration. The result of this has been to create an automated testing framework for the logstash configs.

Managing config files

Logstash allows you to point to a directory containing many config files, so I’ve used this feature to split up the config into smaller parts. There is a config file per input and, a filter config for that input, and a related statsd output config. For outputs, I also have a config for elasticsearch.

Because I use grep to tag log messages, then run a grok based on those tags, it was necessary to put all filters for an input in a single file. Otherwise your filter ordering can get messed up as you can’t guarantee what order the files are read by logstash.

If you want to break out your filters in multiple files but need the filters to be loaded in a certain order, then prefix their names with numbers to explicitly specify the order.

100-filter-one.conf
101-filter-two.conf
...

Filter config

The application logfile has several different kinds of messages that I want to extract data from. There are rails controller logs, CRUD requests generated by javascript, SQL requests, passenger logs and memcached logs.

So, when I define the input, those log messages are defined with the type ‘rails’.

The first filter that gets applied is a grok filter which processes the common fields such as timestamp, server, log priority etc. If this grok is matched, the log message is tagged with ‘rails’.

Messages tagged ‘rails’ are subject to several grep filters that differentiate between types of log message. For example, a message could be tagged as ‘rails_controller’, ‘sql’, or ‘memcached’.

Then, each message type tag has a grok filter that extracts all the relevant data out of the log entry.

One of the key things I’m pulling out of the log is the response time, so there are some additional greps and tags for responses that take longer than we consider acceptable.

When constructing the grep filters, I debug the regexes with http://www.rubular.com/, and for the grok filters http://grokdebug.herokuapp.com/ is a massively useful tool.

However, each of these web tools only look at a single log message or regex - I want to test my whole filter configuration, how entries are directed through the filter logic, and know when I break some dependency for another part of the configuration.

rspec tests

Since logstash 1.1.5 it’s been possible to run rspec tests using the monolithic jar:

java -jar logstash-monolithic.jar rspec <filelist>

So, given I have a log message that looks like this:

2013-01-20T13:14:01+0000 INFO server rails[12345]: RailController.index: 123.1234ms db=railsdb request_id=fe3c217 method=GET status=200 uri=/page/123 user=johan

Then I would write a spec test that looks like:

spec/logstash.rb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
files = Dir['../configs/filter*.conf']
@@configuration = String.new
files.sort.each.do |file|
  @@configuration << File.read(file)
end

describe "my first logstash rspec test"
  extend LogStash::RSpec

  config(@@configuration)

  message = %(2013-01-20T13:14:01+0000 INFO server rails[12345]: RailsController.index: 123.1234ms db=railsdb request_id=fe3c217 method=GET status=200 uri=/page/123 user=johan)

  sample("@message" => message, "@type" => "rails")
    insist { subject.type } == "rails"
    insist { subject.tags }.include?("user")
    reject { subject.tags }.include?("_grokparsefailure")
    insist { subject["TIMESTAMP_ISO8601"] } == "2013-01-20T13:14:01+0000"
    insist { subject["logpriority"] } == "INFO"
    insist { subject["logsource"] } == "rails"
    insist { subject["railscontroller"] } = "RailsController"
    insist { subject["railscontrollerction"] } = "index"
    insist { subject["time"] } == "123.1234"
    insist { subject["database"] } == "railsdb"
    insist { subject["request_id"] } == "fe3c217"
    insist { subject["method"] } == "GET"
    insist { subject["status"] } == "200"
    insist { subject["uri"] } == "/page/123"
    insist { subject["user"] } == "johan"
  end
end

So, this is dynamically including in all my filter configurations from my logstash configuration directory. Then I define my known log message, and what I expect the outputs to be - the tags that should and shouldn’t be there, and the content of fields pulled out of the log message.

Develop - Verify workflow

Before writing any filter config, I take sample log messages and write up rspec tests of what I expect to pull out of those log entries. When I run those tests the first time, they fail.

Then I’ll use the grokdebug website to construct my grok statements. Once they’re working, I’ll update the logstash filter config files with the new grok statements, and run the rspec test suite.

If the tests are failing, often I’ll output subject.inspect within the sample block, to show how logstash has processed the log event. But these debug messages are removed once our tests are passing, so we have clean output for automated testing.

When all the tests are passing I’ll deploy them to the logstash server and restart logstash.

1
2
3
4
5
java -jar /usr/share/logstash/logstash-monolithic.jar rspec examples.rb
..........................

Finished in 0.23 seconds
26 examples, 0 failures

Automating with Jenkins

Now we have a base config in place, I want to automate testing and deploying new configurations. To do this I use my good friend Jenkins.

All my spec tests and configs are stored in a single git repository. Whenever I push my repo to the git server, a post-receive hook is executed that starts a Jenkins job.

This job will fetch the repository and run the logstash rspec tests on a build server. If these pass, then the configs are copied to the logstash server and the logstash service is restarted.

If the tests fail, then a human needs to look at the errors and fix the problem.

Integrating with Configuration Management

You’ll notice my logstash configs are stored in a git repo as files, rather than being generated by configuration management. That’s a choice I made in this situation as it was easier to manage within our environment.

If you manage your logstash configs via CM, then a possible approach would be to apply your CM definition to a test VM and then run your rspec tests within that VM.

Alternatively, the whole logstash conf.d directory could be synced by your CM tool. Then you could grab just that directory for testing, rather than having to do a full CM run.

Catching problems you haven’t written tests for

I send to statsd the number of _grokparsefailure tagged messages - this highlights any log message formats that I haven’t considered, or can show up if the log format changes on me and that I need to update my grok filters.

Testing Puppet Manifests With Toft-puppet

Toft is a ruby gem to manage testing of configuration management manifests with LXC Linux containers – it can manage nodes, run chef or puppet, run ssh commands and then run any testing framework against those nodes – such as rspec or cucumber.

Using containers for this purpose is incredibly useful as they can be created and destroyed very quickly, even compared to a virtual machine. So we can setup a fresh node with a base OS, run our manifests on that, and then run tests on the system we have created. Since we run the tests from the system that hosts the container, we can run tests both from within and outside the node easily.

I’m using toft from jenkins to run cucumber tests on our manifests. Anytime anyone checks into testing, all my cucumber tests (as well as other things like puppet-lint) are run and when they succeed, we merge into master.

Testing the behaviour of your deployed manifests as part of an automated QA – this is a massive result, and toft makes it just that much easier.

Installation

The QA system will need the following packages:

libvirt
lxc

Unfortunately there is no lxc package in EPEL, so I needed to take the source RPM from Fedora 16 and build it on Scientific Linux. This was a hassle free process.

The following gems are required:

toft-puppet
cucumber
rspec

These are most easily installed with gem install if you have access to rubygems.org or an internal repository.

Configuration

For libvirtd, disable multicast DNS. Also change host bridge network interface to br0 and set the bridge network as 192.168.20.0:

/etc/libvirt/libvirtd.conf
1
mdns_adv = 0
/etc/libvirt/qemu/networks/default.xml
1
2
3
4
5
6
7
8
9
10
11
12
<network>
<name>default</name>
<uuid>3b673ba9-be12-4299-95a3-2059be18f7b9</uuid>
<bridge name=”br0″ />
<mac address=’52:54:00:BE:D0:1D’/>
<forward/>
<ip address=”192.168.20.1″ netmask=”255.255.255.0″>
<dhcp>
<range start=”192.168.20.2″ end=”192.168.20.254″ />
</dhcp>
</ip>
</network>

Now we can start the libvirt service with

service libvirtd start

LXC requires cgroup support, so the cgroup filesystem needs mounting. Add the following to /etc/fstab

/etc/fstab
1
none    /cgroup cgroup  defaults        0       0

Then create the directory /cgroup and mount it

mkdir /cgroup
mount /cgroup 

Our templates and support files need to be put in place, and directories created. Toft supplies templates for natty, lucid and centos-6. I use Scientific Linux, so I needed to create my own template from the centos-6 template – you can find it here: https://github.com/johanek/toft/blob/master/scripts/lxc-templates/lxc-scientific-6

mkdir -p /usr/lib64/templates/files
cp lxc-scientific-6 /usr/lib64/lxc/templates/
chmod 0755 /usr/lib64/lxc/templates/lxc-scientific-6
cp /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/scripts/lxc-templates
/files/rc.local /usr/lib64/lxc/templates/files/

Base Image

Now we need to get a base image to run. You can grab one from the OpenVZ project, as the images are compatible. http://wiki.openvz.org/Download/template/precreated

Since the image is just a bunch of files on disk, you can extract that tarball and chroot into the resulting directory structure to modify the image to your needs. Once you’re happy, tar.gz the directory tree and copy that file to:

/var/cache/lxc/scientific-6-x86_64.tar.gz

Test LXC

To test lxc is working, create a quick node config file:

/tmp/n1.conf
1
2
3
4
5
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.name = eth0
lxc.network.ipv4 = 192.168.20.2/24

Then, create a node by running:

lxc-create -n n1 -f /tmp/n1.conf -t scientific-6

This should extract the image and configure it ready to be started. When you start the image, a a tty on the guest machine will take over your terminal, so it’s best to run the following in a screen session:

lxc-start -n n1

Once that’s booted you should be able to login via your terminal, and ssh to the machine on the IP we configured above. Once that’s working well, we can stop and destroy the guest with the following:

lxc-stop -n n1
lxc-destroy -n n1

Running Tests

The toft module comes with a lot of example cucumber tests, step definitions and puppet configs to start you off. They’re worth reading through to understand what can be done – I need to modify them for my own needs, so let’s make a copy of the examples as a baseline:

mkdir /root/cucumber/
cd /root/cucumber
rsync -av /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/features .
rsync -av /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/fixtures .

The supplied puppet.conf file has lots of localisations specific to the developers setup, so I created a very simple one designed just to specify the path to our puppet modules

fixtures/puppet/conf/puppet.conf
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
[main]
# The Puppet log directory.
# The default value is ‘$vardir/log’.
logdir = /var/log/puppet

# Where Puppet PID files are kept.
# The default value is ‘$vardir/run’.
rundir = /var/run/puppet

# Where SSL certificates are kept.
# The default value is ‘$confdir/ssl’.
ssldir = $vardir/ssl

modulepath = /tmp/toft-puppet-tmp/modules/

[agent]
# The file in which puppetd stores a list of the classes
# associated with the retrieved configuratiion. Can be loaded in
# the separate “puppet“ executable using the “–loadclasses“
# option.
# The default value is ‘$confdir/classes.txt’.
classfile = $vardir/classes.txt

# Where puppetd caches the local configuration. An
# extension indicating the cache format is added automatically.
# The default value is ‘$confdir/localconfig’.
localconfig = $vardir/localconfig

At this point I also removed the chef examples, to avoid any errors related to software I’m not using

rm -rf fixtures/chef

Now we need to copy our own modules to fixtures/puppet/modules/

Toft uses a routine located in the rc.local file we copied earlier to update DNS with the hostname of the new node. It does this via nsupdate. Since I’m only ever going to create one node with a known IP, and access it from the one machine we’re running the tests from, I can just add the hostname to /etc/hosts:

192.168.20.2    n1    n1.foo

Now we’re ready to write tests – again there are a lot of examples of cucumber tests in the features directory. I removed them all, apart from puppet.features, which I pared down and added my own tests:

features/puppet.feature
1
2
3
4
5
6
7
8
9
10
11
12
13

Feature: Puppet support

Scenario: Run Puppet manifest on nodes
Given I have a clean running node n1
When I run puppet manifest “manifests/test.pp” on node “n1″
Then Node “n1″ should have file or directory “/tmp/puppet_test”

Scenario: Apache module
Given I have a clean running node n1
When I run puppet manifest “manifests/apache.pp” with config file “puppet.conf” on node “n1″
Then Node “n1″ should have package “httpd” installed in the centos box
And Node “n1″ should have service “httpd” running in the centos box

Note the references to the “centos box” – that’s just the way the step definitions have been written in the toft module and can be modified easily. My Apache module test has an associated manifest:

fixtures/puppet/manifests/apache.pp
1
2

include apache

Now we can run our tests:

cd features
cucumber puppet.feature

Which gives a result like:

Creating Scientific Linux 6 node…
Checking image cache in /var/cache/lxc/rootfs-x86_64 …
Extracting rootfs image to /var/lib/lxc/n1/rootfs …
Set root password to ‘root’
‘scientific-6′ template installed
‘n1′ created
Feature: Puppet support
Scenario: Run Puppet manifest on nodes # puppet.feature:3
Starting host node…
Waiting for host ssdh ready………….
Waiting for host to be reachable.
SSH connection on ‘n1/192.168.20.2′ is ready.
Given I have a clean running node n1 # step_definitions/node.rb:1
notice: /Stage[main]/Test/File[/tmp/puppet_test]/ensure: created
notice: Finished catalog run in 0.04 seconds
When I run puppet manifest “manifests/test.pp” on node “n1″ # step_definitions/puppet.rb:1
Then Node “n1″ should have file or directory “/tmp/puppet_test” # step_definitions/checker.rb:5

Scenario: Apache module # puppet.feature:8
Starting host node…
Waiting for host ssdh ready.
Waiting for host to be reachable.
SSH connection on ‘n1/192.168.20.2′ is ready.
Given I have a clean running node n1 # step_definitions/node.rb:1
notice: /Stage[main]/Apache::Install/Package[httpd]/ensure: created
notice: /Stage[main]/Apache::Service/Service[httpd]/ensure: ensure changed ‘stopped’ to ‘running’
notice: Finished catalog run in 10.68 seconds
When I run puppet manifest “manifests/apache.pp” with config file “puppet.conf” on node “n1″ # step_definitions/puppet.rb:5
httpd-2.2.15-15.sl6.x86_64
Then Node “n1″ should have package “httpd” installed in the centos box # step_definitions/centos/checks.rb:1
9
And Node “n1″ should have service “httpd” running in the centos box # step_definitions/centos/checks.rb:6

2 scenarios (2 passed)
7 steps (7 passed)
0m26.193s

Job done – now you can write more tests and automate them with Jenkins.

Kanban Update

So, here’s what our Kanban board looked like:

In the first week or so, we made some subtle changes – first we renamed backlog to task pool, and made that area much bigger than I’d originally allowed.

Secondly, we also added a “not ready” section at the bottom of the task pool area. This is for tasks that are blocked but not yet started – for example, a task might be to install software on a server, but I’m still waiting for the server networking to be configured, so I can’t start that yet.

This worked fine for us, and we didn’t make any further changes. Initially, the board was used a lot, but after about a month or so everyone started falling back to their own spreadsheets, checklists, notebooks etc. and not really updating the board. That is very frustrating, I’ve constantly struggled with reporting compliance throughout my time in this role.

We started by doing daily standups, but I found not much was happening in them, so I stopped them after about a week & had ad-hoc discussions with individuals as required. I believe stopping these meetings hurt compliance. I was motivated by allowing everyone to push their status to the board, and my managers and myself could check over that as required – getting out of people’s way by reducing the number of meetings they required. That only works if the push happens! I’ve often tried to think – what does the reporter get out of using system X? What justification can I give as a good motivation for them? Unfortunately, the motivation for a lot of this is reporting and forecasting effort for headcount so the benefits are entirely for the consumer.

I expected to find a couple of bottlenecks – after all, this is what kanban is all about. But mostly, everyone kept their work in progress down. You can see on the board, someone has four tasks on the go. Given the kind of tasks we do, this can be a reasonable amount.

There are a couple of reasons why I believe we didn’t find any blockers. One is that the world changed for the two individuals who were the blockages before – one got very ill, so hasn’t been in much this year. The other got put on a major project, so we hired additional resources to assist with their responsibilities and this helped reduce the backlog.

Additionally, it’s become clear to me that we are in no way a team who can pick up each others tasks. The technology components we work with are too different – Windows, Linux and Storage. Perhaps the only crossover is that we all have to co-ordinate activities for system provisioning – hardware installation, networking etc. But in general we have a bunch of individuals with individual streams of work. So what we’re looking at on the board is a number of todo lists for a component of a system – it’s in no way giving a view of the system as a whole.

My time in this role is finishing in a few weeks. Since the company wants this information to be stored in electronic tools, I wrapped up the board last week and asked everyone to move their tasks back into JIRA. The experienment is over.

Of course, I did this first to lead the way – and instantly realised how much I disliked JIRA. I found it very hard to look at my tasks and answer “what am i doing now?” “whats in my task pool?” – so I immediately had to configure Greenhopper, Altassian’s Kanban tool, to create a new personal Kanban board! As soon as I had that configured, a bunch of people were over my shoulder saying “I like that, can I have that too?” – including members of other teams.

So my experience is that I really like displaying a backlog and work in progress queue, and physically moving tasks between statuses. However, as a workflow analysis tool and to improve status reporting, the experiment wasn’t particularly successful.

Kanban for Project Teams

For the past 9 months I’ve been managing the day to day activities of a team. Before that I really had no experience of this sort of task. All the team members are very experienced and are considered subject matter experts within the business, so my role is not mentoring as with past senior sysadmin roles. During this time I’d been struggling to find a way of keeping track of the tasks and projects the team were working on. I ended up with boring weekly meetings, and taking notes of live projects where the notes just got longer, and longer each week.

Then late last year, I started considering doing Kanban after reading this article: http://agilesysadmin.net/kanban_sysadmin - So I researched Kanban further, read David Anderson’s book about Kanban, and bought a whiteboard and a bunch of cards and magnets.

This week, I introduced the concept to the team. We’re a project delivery team, rather than your regular operational sysadmin team. But we’re all operations sysadmins at heart, we understand those problems and have worked in that environment - but we only deliver projects and no one pages us.

So we spend a lot more of our time tracking larger projects - but these projects involve a lot of smaller activities, which involve people coming to your desk and asking for something quickly. Everyone works on multiple projects at once and is quite busy - things develop and need to be delivered fast, with as little resource as possible - so balancing the time requirements and due dates of multiple projects is an art of itself.

The goals of this experiment are twofold: make the intra-team communication less of a burden, and to try and improve our agility. I don’t have to ask everyone what they’re doing all the time, rather I can look at the board and get the highlights at a glance.

Here’s a really stupid paint diagram of what the board looks like to start:

I’ll try to get a picture of what the real board looks like at some point.

We went for this model, as it reflects the individuals workload within the team, which is what we are interested in. Cards move left to right, are in a big pool when they’re still TODOs, but then they jump into an individuals swimlane when in flight.

We also discussed the possibility of having a more project focused structure - with an extra project column at the start, holding project cards which each have their own swimlane. The system we have doesn’t really cater for understanding the current tasks from a project perspective. Let’s leave that for the Project Managers…

I kept the rules simple for now - cards only move to the right, and respect the board - only work on things that are on the board. One “rule” that I set as a guideline to be discussed was sizing of the tasks. I suggested that the size of the tasks on each card should allow someone else to perform the task. (We also suggested a task of roughly one day’s effort is roughly the right size).

A problem that I see with our team is that everyone has been tightly fit into roles - Windows, Linux/Application Environments and Storage. We’ve had a reasonable turnover of staff recently, and we’ve grown in size from 4 to 6. This helps provide the opportunity for transformation of working practise, so I want to promote more sharing of tasks - at the moment we’re too reliant on individual availability, and this is affecting project delivery.

For now, I’ve also neglected to impose any work in progress limits. The board is an experiment that I hope will help us identify workflow problems. I suspect that too much work in progress is a problem, and we would see the benefits of reducing it - but until that happens I’m not going to apply it. Visualising the bottlenecks is the key, and they have to be clearly seen for everyone in the team to understand the reasoning behind work in progress limits.