Testing puppet manifests with toft-puppet

Toft is a ruby gem to manage testing of configuration management manifests with LXC Linux containers – it can manage nodes, run chef or puppet, run ssh commands and then run any testing framework against those nodes – such as rspec or cucumber.

Using containers for this purpose is incredibly useful as they can be created and destroyed very quickly, even compared to a virtual machine. So we can set up a fresh node with a base OS, run our manifests on that, and then run tests on the system we have created. Since we run the tests from the system that hosts the container, we can easily run tests both from within and outside the node.

I’m using toft from jenkins to run cucumber tests on our manifests. Anytime anyone checks into testing, all my cucumber tests (as well as other things like puppet-lint) are run and when they succeed, we merge into master.

Testing the behaviour of your deployed manifests as part of an automated QA – this is a massive result, and toft makes it just that much easier.

Installation

The QA system will need the following packages:

libvirt
lxc

Unfortunately there is no lxc package in EPEL, so I needed to take the source RPM from Fedora 16 and build it on Scientific Linux. This was a hassle-free process.

The following gems are required:

toft-puppet
cucumber
rspec

These are most easily installed with gem install if you have access to rubygems.org or an internal repository.
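As a rough sketch of that install step, the helper below lists which of the required gems are not yet installed, so only those need to be passed to gem install. It assumes the gem command is on the path; the name missing_gems is my own invention and not part of toft or rubygems.

```shell
# Print each named gem that is not yet installed, one per line.
# missing_gems is a hypothetical helper name, not part of toft or rubygems.
missing_gems() {
    for name in "$@"; do
        # "gem list -i" exits non-zero when no installed gem matches the pattern.
        gem list -i "^${name}\$" >/dev/null 2>&1 || echo "$name"
    done
}

# Usage (not run here):
#   missing=$(missing_gems toft-puppet cucumber rspec)
#   [ -n "$missing" ] && gem install $missing
```

Checking first keeps the step safe to re-run on a QA box that already has some of the gems.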

Configuration

For libvirtd, disable multicast DNS. Also change host bridge network interface to br0 and set the bridge network as 192.168.20.0:

File: /etc/libvirt/libvirtd.conf
mdns_adv = 0
File: /etc/libvirt/qemu/networks/default.xml
<network>
<name>default</name>
<uuid>3b673ba9-be12-4299-95a3-2059be18f7b9</uuid>
<bridge name="br0" />
<mac address='52:54:00:BE:D0:1D'/>
<forward/>
<ip address="192.168.20.1" netmask="255.255.255.0">
<dhcp>
<range start="192.168.20.2" end="192.168.20.254" />
</dhcp>
</ip>
</network>

Now we can start the libvirt service with

service libvirtd start

LXC requires cgroup support, so the cgroup filesystem needs mounting. Add the following to /etc/fstab

File: /etc/fstab
none    /cgroup cgroup  defaults        0       0

Then create the directory /cgroup and mount it

mkdir /cgroup
mount /cgroup 
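Before going further it is worth confirming the mount took effect. This is a minimal sketch that reads /proc/mounts; the function name cgroup_mounted and its optional second argument are assumptions of mine, added so the check can be pointed at another mounts table.

```shell
# Succeed if a cgroup filesystem is mounted on the given mount point.
# Reads /proc/mounts by default; the second argument lets a different
# mounts table be checked. cgroup_mounted is a hypothetical helper name.
cgroup_mounted() {
    mount_point="${1:-/cgroup}"
    mounts_file="${2:-/proc/mounts}"
    # /proc/mounts columns: device, mount point, filesystem type, options, ...
    awk -v mp="$mount_point" \
        '$2 == mp && $3 == "cgroup" { found = 1 } END { exit !found }' \
        "$mounts_file"
}

# Usage (not run here):
#   cgroup_mounted /cgroup || echo "cgroup filesystem is not mounted"
```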

Our templates and support files need to be put in place, and directories created. Toft supplies templates for natty, lucid and centos-6. I use Scientific Linux, so I needed to create my own template from the centos-6 template – you can find it here: https://github.com/johanek/toft/blob/master/scripts/lxc-templates/lxc-scientific-6

mkdir -p /usr/lib64/lxc/templates/files
cp lxc-scientific-6 /usr/lib64/lxc/templates/
chmod 0755 /usr/lib64/lxc/templates/lxc-scientific-6
cp /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/scripts/lxc-templates/files/rc.local /usr/lib64/lxc/templates/files/

Base Image

Now we need to get a base image to run. You can grab one from the OpenVZ project, as the images are compatible. http://wiki.openvz.org/Download/template/precreated

Since the image is just a bunch of files on disk, you can extract that tarball and chroot into the resulting directory structure to modify the image to your needs. Once you’re happy, tar.gz the directory tree and copy that file to:

/var/cache/lxc/scientific-6-x86_64.tar.gz
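The repacking step might look something like the sketch below. I'm assuming the template expects the archive to be rooted at the filesystem root (./etc, ./usr and so on); pack_rootfs is a made-up helper name, and the /tmp/rootfs working directory in the usage comments is only an example.

```shell
# Pack a modified root filesystem directory into a gzipped tarball.
# pack_rootfs is a hypothetical helper; both arguments are caller-supplied paths.
pack_rootfs() {
    rootfs_dir="$1"
    tarball="$2"
    # -C makes paths in the archive relative to the rootfs, so it
    # unpacks as ./etc, ./usr, ... rather than under a parent directory.
    tar -czf "$tarball" -C "$rootfs_dir" .
}

# Usage (as root, not run here; /tmp/rootfs is an example path):
#   chroot /tmp/rootfs /bin/bash      # make changes, then exit
#   pack_rootfs /tmp/rootfs /var/cache/lxc/scientific-6-x86_64.tar.gz
```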

Test LXC

To test lxc is working, create a quick node config file:

File: /tmp/n1.conf
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.name = eth0
lxc.network.ipv4 = 192.168.20.2/24
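If you end up creating this file often, it can be generated from a script. This is only a sketch: write_node_conf is a hypothetical helper name, and it writes exactly the five keys shown above with the address and bridge passed in as arguments.

```shell
# Write a minimal LXC network config for one container.
# write_node_conf is a hypothetical helper; the keys are the ones shown above.
write_node_conf() {
    conf_file="$1"      # e.g. /tmp/n1.conf
    ipv4_cidr="$2"      # e.g. 192.168.20.2/24
    bridge="${3:-br0}"  # host bridge, defaults to br0
    cat > "$conf_file" <<EOF
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = $bridge
lxc.network.name = eth0
lxc.network.ipv4 = $ipv4_cidr
EOF
}

# Usage (not run here):
#   write_node_conf /tmp/n1.conf 192.168.20.2/24
```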

Then, create a node by running:

lxc-create -n n1 -f /tmp/n1.conf -t scientific-6

This should extract the image and configure it ready to be started. When you start the image, a tty on the guest machine will take over your terminal, so it’s best to run the following in a screen session:

lxc-start -n n1

Once that’s booted you should be able to log in via your terminal, and ssh to the machine on the IP we configured above. Once that’s working well, we can stop and destroy the guest with the following:

lxc-stop -n n1
lxc-destroy -n n1

Running Tests

The toft module comes with a lot of example cucumber tests, step definitions and puppet configs to start you off. They’re worth reading through to understand what can be done – I need to modify them for my own needs, so let’s make a copy of the examples as a baseline:

mkdir /root/cucumber/
cd /root/cucumber
rsync -av /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/features .
rsync -av /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/fixtures .

The supplied puppet.conf file has lots of localisations specific to the developer’s setup, so I created a very simple one designed just to specify the path to our puppet modules:

File: fixtures/puppet/conf/puppet.conf
[main]
# The Puppet log directory.
# The default value is '$vardir/log'.
logdir = /var/log/puppet

# Where Puppet PID files are kept.
# The default value is '$vardir/run'.
rundir = /var/run/puppet

# Where SSL certificates are kept.
# The default value is '$confdir/ssl'.
ssldir = $vardir/ssl

modulepath = /tmp/toft-puppet-tmp/modules/

[agent]
# The file in which puppetd stores a list of the classes
# associated with the retrieved configuration. Can be loaded in
# the separate ``puppet`` executable using the ``--loadclasses``
# option.
# The default value is '$confdir/classes.txt'.
classfile = $vardir/classes.txt

# Where puppetd caches the local configuration. An
# extension indicating the cache format is added automatically.
# The default value is '$confdir/localconfig'.
localconfig = $vardir/localconfig

At this point I also removed the chef examples, to avoid any errors related to software I’m not using

rm -rf fixtures/chef

Now we need to copy our own modules to fixtures/puppet/modules/

Toft uses a routine located in the rc.local file we copied earlier to update DNS with the hostname of the new node. It does this via nsupdate. Since I’m only ever going to create one node with a known IP, and access it from the one machine we’re running the tests from, I can just add the hostname to /etc/hosts:

192.168.20.2    n1    n1.foo
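Since the test box gets rebuilt now and then, I prefer to add that entry from a script that is safe to run twice. A minimal sketch, where add_hosts_entry is a hypothetical helper name and the optional third argument exists only so it can be tried against a scratch file first:

```shell
# Append a hosts entry unless the IP address is already listed.
# add_hosts_entry is a hypothetical helper; the file defaults to /etc/hosts.
add_hosts_entry() {
    ip="$1"
    names="$2"
    hosts_file="${3:-/etc/hosts}"
    # Compare the first field exactly, so 192.168.20.2 does not
    # match an existing line for 192.168.20.20.
    if awk -v ip="$ip" '$1 == ip { found = 1 } END { exit !found }' "$hosts_file"; then
        return 0
    fi
    printf '%s\t%s\n' "$ip" "$names" >> "$hosts_file"
}

# Usage (as root, not run here):
#   add_hosts_entry 192.168.20.2 "n1 n1.foo"
```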

Now we’re ready to write tests – again there are a lot of examples of cucumber tests in the features directory. I removed them all apart from puppet.feature, which I pared down before adding my own tests:

File: features/puppet.feature
 
Feature: Puppet support
 
Scenario: Run Puppet manifest on nodes
Given I have a clean running node n1
When I run puppet manifest "manifests/test.pp" on node "n1"
Then Node "n1" should have file or directory "/tmp/puppet_test"

Scenario: Apache module
Given I have a clean running node n1
When I run puppet manifest "manifests/apache.pp" with config file "puppet.conf" on node "n1"
Then Node "n1" should have package "httpd" installed in the centos box
And Node "n1" should have service "httpd" running in the centos box

Note the references to the “centos box” – that’s just the way the step definitions have been written in the toft module and can be modified easily. My Apache module test has an associated manifest:

File: fixtures/puppet/manifests/apache.pp
 
include apache

Now we can run our tests:

cd features
cucumber puppet.feature

Which gives a result like:

Creating Scientific Linux 6 node…
Checking image cache in /var/cache/lxc/rootfs-x86_64 …
Extracting rootfs image to /var/lib/lxc/n1/rootfs …
Set root password to 'root'
'scientific-6' template installed
'n1' created
Feature: Puppet support
Scenario: Run Puppet manifest on nodes # puppet.feature:3
Starting host node…
Waiting for host ssdh ready………….
Waiting for host to be reachable.
SSH connection on 'n1/192.168.20.2' is ready.
Given I have a clean running node n1 # step_definitions/node.rb:1
notice: /Stage[main]/Test/File[/tmp/puppet_test]/ensure: created
notice: Finished catalog run in 0.04 seconds
When I run puppet manifest "manifests/test.pp" on node "n1" # step_definitions/puppet.rb:1
Then Node "n1" should have file or directory "/tmp/puppet_test" # step_definitions/checker.rb:5

Scenario: Apache module # puppet.feature:8
Starting host node…
Waiting for host ssdh ready.
Waiting for host to be reachable.
SSH connection on 'n1/192.168.20.2' is ready.
Given I have a clean running node n1 # step_definitions/node.rb:1
notice: /Stage[main]/Apache::Install/Package[httpd]/ensure: created
notice: /Stage[main]/Apache::Service/Service[httpd]/ensure: ensure changed 'stopped' to 'running'
notice: Finished catalog run in 10.68 seconds
When I run puppet manifest "manifests/apache.pp" with config file "puppet.conf" on node "n1" # step_definitions/puppet.rb:5
httpd-2.2.15-15.sl6.x86_64
Then Node "n1" should have package "httpd" installed in the centos box # step_definitions/centos/checks.rb:1
9
And Node "n1" should have service "httpd" running in the centos box # step_definitions/centos/checks.rb:6
 
2 scenarios (2 passed)
7 steps (7 passed)
0m26.193s

Job done – now you can write more tests and automate them with Jenkins.
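For the Jenkins side, a build step along these lines would do. It is a sketch rather than my exact job: run_manifest_checks is a made-up name, the paths assume the /root/cucumber layout used earlier, and it relies on puppet-lint and cucumber both returning a non-zero exit status on failure.

```shell
# Run lint first, then the cucumber features; stop at the first failure
# so Jenkins marks the build as failed. run_manifest_checks is a
# hypothetical name, and the default path assumes the /root/cucumber layout.
run_manifest_checks() {
    work_dir="${1:-/root/cucumber}"
    puppet-lint "$work_dir/fixtures/puppet/modules" &&
        ( cd "$work_dir/features" && cucumber puppet.feature )
}

# Usage in a Jenkins "Execute shell" step (not run here):
#   run_manifest_checks /root/cucumber
```

Running cucumber in a subshell keeps the cd from leaking into whatever the job does next.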

Kanban update

So, here’s what our Kanban board looked like:

image

In the first week or so, we made some subtle changes – first we renamed backlog to task pool, and made that area much bigger than I’d originally allowed.

Secondly, we also added a “not ready” section at the bottom of the task pool area. This is for tasks that are blocked but not yet started – for example, a task might be to install software on a server, but I’m still waiting for the server networking to be configured, so I can’t start that yet.

This worked fine for us, and we didn’t make any further changes. Initially, the board was used a lot, but after about a month or so everyone started falling back to their own spreadsheets, checklists, notebooks etc. and not really updating the board. That is very frustrating; I’ve constantly struggled with reporting compliance throughout my time in this role.

We started by doing daily standups, but I found not much was happening in them, so I stopped them after about a week and had ad-hoc discussions with individuals as required. I believe stopping these meetings hurt compliance. My motivation was to let everyone push their status to the board, so that my managers and I could check over it as required – getting out of people’s way by reducing the number of meetings they had to attend. That only works if the push happens! I’ve often tried to think – what does the reporter get out of using system X? What justification can I give as a good motivation for them? Unfortunately, the motivation for a lot of this is reporting and forecasting effort for headcount, so the benefits are entirely for the consumer.

I expected to find a couple of bottlenecks – after all, this is what kanban is all about. But mostly, everyone kept their work in progress down. You can see on the board, someone has four tasks on the go. Given the kind of tasks we do, this can be a reasonable amount.

There are a couple of reasons why I believe we didn’t find any blockers. One is that the world changed for the two individuals who were the blockages before – one got very ill, so hasn’t been in much this year. The other got put on a major project, so we hired additional resources to assist with their responsibilities and this helped reduce the backlog.

Additionally, it’s become clear to me that we are in no way a team who can pick up each other’s tasks. The technology components we work with are too different – Windows, Linux and Storage. Perhaps the only crossover is that we all have to co-ordinate activities for system provisioning – hardware installation, networking etc. But in general we have a bunch of individuals with individual streams of work. So what we’re looking at on the board is a number of todo lists for a component of a system – it’s in no way giving a view of the system as a whole.

My time in this role is finishing in a few weeks. Since the company wants this information to be stored in electronic tools, I wrapped up the board last week and asked everyone to move their tasks back into JIRA. The experiment is over.

Of course, I did this first to lead the way – and instantly realised how much I disliked JIRA. I found it very hard to look at my tasks and answer “what am I doing now?” or “what’s in my task pool?” – so I immediately had to configure GreenHopper, Atlassian’s Kanban tool, to create a new personal Kanban board! As soon as I had that configured, a bunch of people were over my shoulder saying “I like that, can I have that too?” – including members of other teams.

So my experience is that I really like displaying a backlog and work in progress queue, and physically moving tasks between statuses. However, as a workflow analysis tool and to improve status reporting, the experiment wasn’t particularly successful.

Kanban for Project Teams

For the past 9 months I’ve been managing the day-to-day activities of a team. Before that I really had no experience of this sort of task. All the team members are very experienced and are considered subject matter experts within the business, so my role is not mentoring as with past senior sysadmin roles. During this time I’d been struggling to find a way of keeping track of the tasks and projects the team were working on. I ended up with boring weekly meetings, and taking notes of live projects where the notes just got longer and longer each week.

Then late last year, I started considering Kanban after reading this article: http://agilesysadmin.net/kanban_sysadmin - so I researched Kanban further, read David Anderson’s book about Kanban, and bought a whiteboard and a bunch of cards and magnets.

This week, I introduced the concept to the team. We’re a project delivery team, rather than your regular operational sysadmin team. But we’re all operations sysadmins at heart, we understand those problems and have worked in that environment - but we only deliver projects and no one pages us.

So we spend a lot more of our time tracking larger projects - but these projects involve a lot of smaller activities, which involve people coming to your desk and asking for something quickly. Everyone works on multiple projects at once and is quite busy - things develop and need to be delivered fast, with as little resource as possible - so balancing the time requirements and due dates of multiple projects is an art in itself.

The goals of this experiment are twofold: to make intra-team communication less of a burden, and to improve our agility. I don’t have to ask everyone what they’re doing all the time; rather, I can look at the board and get the highlights at a glance.

Here’s a really stupid paint diagram of what the board looks like to start:

image

I’ll try to get a picture of what the real board looks like at some point.

We went for this model as it reflects each individual’s workload within the team, which is what we are interested in. Cards move left to right: they sit in a big pool while they’re still TODOs, then jump into an individual’s swimlane when in flight.

We also discussed the possibility of having a more project focused structure - with an extra project column at the start, holding project cards which each have their own swimlane. The system we have doesn’t really cater for understanding the current tasks from a project perspective. Let’s leave that for the Project Managers…

I kept the rules simple for now: cards only move to the right, and respect the board, meaning only work on things that are on the board. One “rule” that I set as a guideline to be discussed was the sizing of tasks. I suggested that each card should be sized so that someone else could perform the task. (We also suggested that roughly one day’s effort is about the right size.)

A problem that I see with our team is that everyone has been tightly fitted into roles - Windows, Linux/Application Environments and Storage. We’ve had a reasonable turnover of staff recently, and we’ve grown in size from 4 to 6. This provides an opportunity to transform our working practice, so I want to promote more sharing of tasks - at the moment we’re too reliant on individual availability, and this is affecting project delivery.

For now, I’ve also neglected to impose any work in progress limits. The board is an experiment that I hope will help us identify workflow problems. I suspect that too much work in progress is a problem, and we would see the benefits of reducing it - but until that happens I’m not going to apply it. Visualising the bottlenecks is the key, and they have to be clearly seen for everyone in the team to understand the reasoning behind work in progress limits.