#sysadminsucks

Monitorama Roundup Part 1

Over the past two days I’ve attended the Monitorama conference in Berlin. The conference covers a range of topics in open source monitoring, and reflects the cutting edge in technologies and approaches being taken to monitoring today. It also acts as a melting pot of ideas, which attendees take further or act on within their own businesses.

A number of ideas ran through many of the talks, and these were my key takeaways from day one:

  • Alert fatigue is a big problem - Alert less, remove meaningless alerts, alert only on business need and add context to alerts.

  • Everything is an event stream - metrics, logs, whatever. Treat it all the same (and store it in the same place).

  • The majority of our metrics don’t follow a normal distribution, which makes automated anomaly detection hard. People are looking for models that better fit our data so anomaly detection can be automated.

  • Everyone loves airplane stories. I heard at least five, three of which ended in crashes.

The first day had a single speaker track; here is my personal summary and interpretation of all the talks:

Dylan Richard - Keynote

Video

This was an experience talk about the Obama re-election campaign: what they did that worked, what they wish they’d had, and so on.

Essentially they used every tool going, picking whatever made sense for the particular area they were looking at. But they weren’t able to think carefully about their alerting and were emitting vast numbers of alerts. To deal with this, they leveraged the power users of their applications, who would give feedback about things not working, and then they’d dig into the alerts to track down the causes. To improve on that feedback, they also created custom dashboards for those power users, so they could report more context along with the problems they were experiencing.

Alert fatigue was mentioned - “Be careful about crying wolf with alerts” because obviously people turn off if there’s too much alerting noise, and subsequently “Monitoring is useless without being watched”.

Danese Cooper - Open Source Past Future

Video

Gave a presentation about the history of open source and where it stands today, and advocated participation in open source and the institutions in that space.

Abe Stanway, Etsy - Anomaly Detection with Algorithms

Slides | Video

Abe gave a great talk on the history of Statistical Process Analysis and how it is used for quality control on production lines and in similar industrial settings. This is all about anomaly detection: looking for events that fall outside three standard deviations from the mean. Unfortunately this only works on normally distributed data, and almost none of the data we collect is normally distributed. He followed this with a number of possible approaches for describing our data in an automated fashion, and then called on the audience to get involved in helping develop these models.

He also drew an interesting comparison: current monitoring can give us either noisy situational awareness, or more limited feedback based on predefined directives.

Mark McGranaghan, Heroku - Fewer Better Systems

Slides | Video

Presented the argument that the best systems are the ones that are used constantly - for example, failover secondaries often don’t work because they have not been subjected to live running conditions and the associated maintenance. So he suggested a bunch of things that are done differently now, which could be done the same, so we get better at doing fewer things:

  • Metrics, Logs & Events - can all be considered events, so make them the same format as events
  • Metric Collection & Alerting - often collect the same data - collect once, and alert from your stored metrics
  • Integration testing and QOS monitoring - share a lot of same goals, so do them the same
  • Errors, unlike results, usually get their own special handling code - instead, treat them as just another result whose data describes the error

Katherine Daniels, Gamechanger - Staring at Graphs as-a-Service

Slides | Video

Gave a great practical talk about how monitoring systems fail, and the series of small decisions that get made to improve individual situations but make the whole system, and the operator’s experience, much much worse. Essentially, everything that creates monitoring systems with screens full of red alerts that have been that way for a long time, with no sign of resolution.

As a way forward, she suggested the following:

  • Only monitor the key components of your business - remove all the crap
  • Find out what critical means to you - understand your priorities
  • Fix the infrastructure - start with a zero error baseline
  • Plan for monitoring earlier in the development cycle (aka devops ftw)

Lindsay Holmwood - Psychology of Alert Design

Slides | Video

Lindsay presented a number of stories, from air disasters and hospitals, to present two key ideas when designing alerts:

  • Don’t startle or overload the operator (reduce notifications)
  • Don’t suggest, expose (provide more context - give relevant situational data at the same time)

Theo Schlossnagle - Monitoring what the hell?

Slides | Video

This talk covered a lot of ground quite quickly, so these were the key things I took away:

  • Monitor for failure by reviewing problems you’ve identified before, creating detailed descriptions of those events, and only then alerting on those descriptions - that way each alert carries enough context to be meaningful and actionable to the receiver.
  • Alert on your business concerns
  • Store all your data in the same place - treat logs, events and metrics the same
  • Our data isn’t normally distributed and that makes shit hard. The next leaps in dealing with this data will come from outside Computer Science - more likely to be the hard science disciplines.

Michael Panchenko - Monitoring not just for numbers

Slides | Video

Most of this talk was about the problems of configuration drift, and how subtle differences of systems outside configuration management policy scope can yield big surprises.

Michael presented his dream of enhancing numerical monitoring with non-numerical and non-binary observations, with some suggestions:

  • Monitoring of infrastructure state
  • Providing an audit trail of categorical data
  • Being able to compare states across nodes and time

He also suggested describing activity in a standardised context: who, what, when.

Jarkko Laine - Let your data tell a Story

Slides | Blog | Video

This was a really interesting talk about how humans process ideas and information, how these lead to biases in analysis - and how to exploit them.

The talk was an exposition of two main ideas:

  • Attention and Memory are limited
  • Tell stories to engage the audience

This was condensed to two key directives for dashboard design and using visualisations as a tool for communication:

  • Minimise eye fixations
  • Maximise data-ink ratio

Ryan Smith - Predictable Failure

Slides | Video

This talk featured the best airplane (near-) disaster stories as Ryan was extremely enthusiastic about the content. These provided pertinent links to understanding failure in IT environments.

The most important point was learning about failure modes from other users - official documentation will be lacking in this respect, so you need to hear the war stories from your peers.

He also described a bunch of other failure cases, particularly related to redundancy and the complexity that ensues.

Daniele De Matteis & Harry Wincup, Server Density - Monitoring, graphs and visualisations

Video

This was a presentation from the designer’s perspective on the theory of how visualisations should be presented, based on the experience of creating the new Server Density interface.

They outlined various design principles they strived for:

  • Consistency
  • Context
  • Clarity (less is more)
  • Perspective (i.e. vertical alignment for context)
  • Appeal (Pleasant user experience)
    • Consistent graphical elements, white space, horizontal and vertical flows, contrast between elements
  • Control (Let the user find the next path, but make sure it’s only a click away)

Testing Logstash Configs With Rspec

At work I’m supporting a rails app, developed by an external company. That app logs a lot of useful performance information, so I’m using logstash to grab that data and send it to statsd+graphite.

I’ve been nagging the developers for more debugging information in the log file, and they’ve now added “enhanced logging”.

Since the log format is changing, I’m taking the opportunity to clean up our logstash configuration. The result of this has been to create an automated testing framework for the logstash configs.

Managing config files

Logstash allows you to point to a directory containing many config files, so I’ve used this feature to split up the config into smaller parts. There is a config file per input, a filter config for that input, and a related statsd output config. I also have an output config for elasticsearch.
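
For illustration, an input and the elasticsearch output might look something like this - the file names, paths and hostnames here are examples rather than our real config:

input-rails.conf
input {
  file {
    # tag everything from the rails app log with type "rails"
    type => "rails"
    path => "/var/log/railsapp/production.log"
  }
}

output-elasticsearch.conf
output {
  elasticsearch {
    host => "elasticsearch.example.com"
  }
}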

Because I use grep to tag log messages, then run a grok based on those tags, it was necessary to put all filters for an input in a single file. Otherwise your filter ordering can get messed up, as you can’t guarantee what order logstash reads the files in.

If you want to break out your filters in multiple files but need the filters to be loaded in a certain order, then prefix their names with numbers to explicitly specify the order.

100-filter-one.conf
101-filter-two.conf
...

Filter config

The application logfile has several different kinds of messages that I want to extract data from. There are rails controller logs, CRUD requests generated by javascript, SQL requests, passenger logs and memcached logs.

So, when I define the input, those log messages are defined with the type ‘rails’.

The first filter that gets applied is a grok filter which processes the common fields such as timestamp, server, log priority etc. If this grok is matched, the log message is tagged with ‘rails’.

Messages tagged ‘rails’ are subject to several grep filters that differentiate between types of log message. For example, a message could be tagged as ‘rails_controller’, ‘sql’, or ‘memcached’.

Then, each message type tag has a grok filter that extracts all the relevant data out of the log entry.

One of the key things I’m pulling out of the log is the response time, so there are some additional greps and tags for responses that take longer than we consider acceptable.
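
Put together, the filter chain ends up looking roughly like the sketch below. The patterns, field names and the 500ms threshold are simplified examples to show the shape of the config, not the real thing:

filter-rails.conf
filter {
  # Parse the common fields; successful matches get tagged 'rails'
  grok {
    type => "rails"
    pattern => "%{TIMESTAMP_ISO8601:timestamp} %{WORD:logpriority} %{HOSTNAME:server} %{WORD:logsource}\[%{POSINT:pid}\]: %{GREEDYDATA:railsmessage}"
    add_tag => [ "rails" ]
  }

  # Differentiate between types of log message
  grep {
    type => "rails"
    tags => [ "rails" ]
    match => [ "@message", "Controller\." ]
    drop => false
    add_tag => [ "rails_controller" ]
  }

  # Extract the interesting fields for this message type
  grok {
    type => "rails"
    tags => [ "rails_controller" ]
    match => [ "railsmessage", "%{WORD:railscontroller}\.%{WORD:railscontrolleraction}: %{NUMBER:time}ms db=%{WORD:database} request_id=%{WORD:request_id} method=%{WORD:method} status=%{INT:status} uri=%{URIPATH:uri} user=%{WORD:user}" ]
  }

  # Tag responses slower than we consider acceptable (500ms or more, purely as an example)
  grep {
    type => "rails"
    tags => [ "rails_controller" ]
    match => [ "time", "^([5-9][0-9]{2}|[0-9]{4,})" ]
    drop => false
    add_tag => [ "slow_response" ]
  }
}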

When constructing the grep filters, I debug the regexes with http://www.rubular.com/, and for the grok filters http://grokdebug.herokuapp.com/ is a massively useful tool.

However, each of these web tools only looks at a single log message or regex - I want to test my whole filter configuration, check how entries are directed through the filter logic, and know when I break a dependency for another part of the configuration.

rspec tests

Since logstash 1.1.5 it’s been possible to run rspec tests using the monolithic jar:

java -jar logstash-monolithic.jar rspec <filelist>

So, given I have a log message that looks like this:

2013-01-20T13:14:01+0000 INFO server rails[12345]: RailsController.index: 123.1234ms db=railsdb request_id=fe3c217 method=GET status=200 uri=/page/123 user=johan

Then I would write a spec test that looks like:

spec/logstash.rb
require "test_utils"  # provides the LogStash::RSpec helpers when run via the monolithic jar

files = Dir['../configs/filter*.conf']
@@configuration = String.new
files.sort.each do |file|
  @@configuration << File.read(file)
end

describe "my first logstash rspec test" do
  extend LogStash::RSpec

  config(@@configuration)

  message = %(2013-01-20T13:14:01+0000 INFO server rails[12345]: RailsController.index: 123.1234ms db=railsdb request_id=fe3c217 method=GET status=200 uri=/page/123 user=johan)

  sample("@message" => message, "@type" => "rails") do
    insist { subject.type } == "rails"
    insist { subject.tags }.include?("user")
    reject { subject.tags }.include?("_grokparsefailure")
    insist { subject["TIMESTAMP_ISO8601"] } == "2013-01-20T13:14:01+0000"
    insist { subject["logpriority"] } == "INFO"
    insist { subject["logsource"] } == "rails"
    insist { subject["railscontroller"] } == "RailsController"
    insist { subject["railscontrolleraction"] } == "index"
    insist { subject["time"] } == "123.1234"
    insist { subject["database"] } == "railsdb"
    insist { subject["request_id"] } == "fe3c217"
    insist { subject["method"] } == "GET"
    insist { subject["status"] } == "200"
    insist { subject["uri"] } == "/page/123"
    insist { subject["user"] } == "johan"
  end
end

So, this dynamically includes all my filter configurations from my logstash configuration directory. Then I define a known log message and what I expect the outputs to be - the tags that should and shouldn’t be there, and the content of the fields pulled out of the log message.

Develop - Verify workflow

Before writing any filter config, I take sample log messages and write up rspec tests of what I expect to pull out of those log entries. When I run those tests the first time, they fail.

Then I’ll use the grokdebug website to construct my grok statements. Once they’re working, I’ll update the logstash filter config files with the new grok statements, and run the rspec test suite.

If the tests are failing, often I’ll output subject.inspect within the sample block, to show how logstash has processed the log event. But these debug messages are removed once our tests are passing, so we have clean output for automated testing.
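
That debugging step is nothing fancy - something along these lines inside the sample block:

  sample("@message" => message, "@type" => "rails") do
    puts subject.inspect   # temporary: dump the processed event to see what logstash actually produced
    insist { subject.type } == "rails"
  end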

When all the tests are passing I’ll deploy them to the logstash server and restart logstash.

java -jar /usr/share/logstash/logstash-monolithic.jar rspec examples.rb
..........................

Finished in 0.23 seconds
26 examples, 0 failures

Automating with Jenkins

Now that we have a base config in place, I want to automate testing and deploying new configurations. To do this I use my good friend Jenkins.

All my spec tests and configs are stored in a single git repository. Whenever I push my repo to the git server, a post-receive hook is executed that starts a Jenkins job.
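
The post-receive hook itself can be as simple as poking Jenkins’ remote build trigger - something like this, where the Jenkins URL, job name and token are placeholders for your own setup:

hooks/post-receive
#!/bin/sh
# Trigger the Jenkins job that tests and deploys the logstash configs
curl -s -X POST "http://jenkins.example.com/job/logstash-configs/build?token=CHANGEME"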

This job will fetch the repository and run the logstash rspec tests on a build server. If these pass, then the configs are copied to the logstash server and the logstash service is restarted.

If the tests fail, then a human needs to look at the errors and fix the problem.

Integrating with Configuration Management

You’ll notice my logstash configs are stored in a git repo as files, rather than being generated by configuration management. That’s a choice I made in this situation as it was easier to manage within our environment.

If you manage your logstash configs via CM, then a possible approach would be to apply your CM definition to a test VM and then run your rspec tests within that VM.

Alternatively, the whole logstash conf.d directory could be synced by your CM tool. Then you could grab just that directory for testing, rather than having to do a full CM run.

Catching problems you haven’t written tests for

I send the number of _grokparsefailure-tagged messages to statsd - this highlights any log message formats I haven’t considered, and shows up when the log format changes on me and I need to update my grok filters.
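
That just needs a statsd output restricted to the failure tag - roughly something like this, with the hostname and metric name as examples:

output-statsd-failures.conf
output {
  statsd {
    # only count events that failed to match a grok pattern
    tags => [ "_grokparsefailure" ]
    host => "statsd.example.com"
    namespace => "logstash"
    increment => [ "rails.grokparsefailure" ]
  }
}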

Testing Puppet Manifests With Toft-puppet

Toft is a ruby gem for testing configuration management manifests with LXC Linux containers - it can manage nodes, run chef or puppet, run ssh commands, and then run any testing framework against those nodes, such as rspec or cucumber.

Using containers for this purpose is incredibly useful, as they can be created and destroyed very quickly, even compared to a virtual machine. So we can set up a fresh node with a base OS, run our manifests on that, and then run tests on the system we have created. Since we run the tests from the system that hosts the container, we can easily run tests both from within and outside the node.

I’m using toft from jenkins to run cucumber tests on our manifests. Any time anyone checks into the testing branch, all my cucumber tests (as well as other checks like puppet-lint) are run, and when they succeed we merge into master.

Testing the behaviour of your deployed manifests as part of automated QA is a massive win, and toft makes it that much easier.

Installation

The QA system will need the following packages:

libvirt
lxc

Unfortunately there is no lxc package in EPEL, so I needed to take the source RPM from Fedora 16 and build it on Scientific Linux. This was a hassle-free process.

The following gems are required:

toft-puppet
cucumber
rspec

These are most easily installed with gem install if you have access to rubygems.org or an internal repository.
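
For example, on a machine with rubygems access that’s just:

gem install toft-puppet cucumber rspec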

Configuration

For libvirtd, disable multicast DNS. Also change the host bridge network interface to br0 and set the bridge network to 192.168.20.0/24:

/etc/libvirt/libvirtd.conf
mdns_adv = 0
/etc/libvirt/qemu/networks/default.xml
<network>
  <name>default</name>
  <uuid>3b673ba9-be12-4299-95a3-2059be18f7b9</uuid>
  <bridge name="br0" />
  <mac address="52:54:00:BE:D0:1D"/>
  <forward/>
  <ip address="192.168.20.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.20.2" end="192.168.20.254" />
    </dhcp>
  </ip>
</network>

Now we can start the libvirt service with

service libvirtd start

LXC requires cgroup support, so the cgroup filesystem needs mounting. Add the following to /etc/fstab

/etc/fstab
none    /cgroup cgroup  defaults        0       0

Then create the directory /cgroup and mount it

mkdir /cgroup
mount /cgroup 

Our templates and support files need to be put in place, and directories created. Toft supplies templates for natty, lucid and centos-6. I use Scientific Linux, so I needed to create my own template from the centos-6 template – you can find it here: https://github.com/johanek/toft/blob/master/scripts/lxc-templates/lxc-scientific-6

mkdir -p /usr/lib64/lxc/templates/files
cp lxc-scientific-6 /usr/lib64/lxc/templates/
chmod 0755 /usr/lib64/lxc/templates/lxc-scientific-6
cp /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/scripts/lxc-templates/files/rc.local /usr/lib64/lxc/templates/files/

Base Image

Now we need to get a base image to run. You can grab one from the OpenVZ project, as the images are compatible. http://wiki.openvz.org/Download/template/precreated

Since the image is just a bunch of files on disk, you can extract that tarball and chroot into the resulting directory structure to modify the image to your needs. Once you’re happy, tar.gz the directory tree and copy that file to:

/var/cache/lxc/scientific-6-x86_64.tar.gz
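
The rough process is something like this - the OpenVZ tarball name here is just an example, and will depend on which template you download:

mkdir rootfs
tar -xzf centos-6-x86_64.tar.gz -C rootfs
chroot rootfs /bin/bash    # make your changes inside the image, then exit
tar -czf scientific-6-x86_64.tar.gz -C rootfs .
cp scientific-6-x86_64.tar.gz /var/cache/lxc/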

Test LXC

To test lxc is working, create a quick node config file:

/tmp/n1.conf
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.name = eth0
lxc.network.ipv4 = 192.168.20.2/24

Then, create a node by running:

lxc-create -n n1 -f /tmp/n1.conf -t scientific-6

This should extract the image and configure it ready to be started. When you start the image, a tty on the guest machine will take over your terminal, so it’s best to run the following in a screen session:

lxc-start -n n1

Once that’s booted you should be able to log in via your terminal, and ssh to the machine on the IP we configured above. Once that’s working well, we can stop and destroy the guest with the following:

lxc-stop -n n1
lxc-destroy -n n1

Running Tests

The toft module comes with a lot of example cucumber tests, step definitions and puppet configs to start you off. They’re worth reading through to understand what can be done – I need to modify them for my own needs, so let’s make a copy of the examples as a baseline:

mkdir /root/cucumber/
cd /root/cucumber
rsync -av /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/features .
rsync -av /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/fixtures .

The supplied puppet.conf file has lots of localisations specific to the developer’s setup, so I created a very simple one designed just to specify the path to our puppet modules:

fixtures/puppet/conf/puppet.conf
[main]
# The Puppet log directory.
# The default value is '$vardir/log'.
logdir = /var/log/puppet

# Where Puppet PID files are kept.
# The default value is '$vardir/run'.
rundir = /var/run/puppet

# Where SSL certificates are kept.
# The default value is '$confdir/ssl'.
ssldir = $vardir/ssl

modulepath = /tmp/toft-puppet-tmp/modules/

[agent]
# The file in which puppetd stores a list of the classes
# associated with the retrieved configuration. Can be loaded in
# the separate "puppet" executable using the "--loadclasses"
# option.
# The default value is '$confdir/classes.txt'.
classfile = $vardir/classes.txt

# Where puppetd caches the local configuration. An
# extension indicating the cache format is added automatically.
# The default value is '$confdir/localconfig'.
localconfig = $vardir/localconfig

At this point I also removed the chef examples, to avoid any errors related to software I’m not using:

rm -rf fixtures/chef

Now we need to copy our own modules to fixtures/puppet/modules/
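
Something like this does the job - the source path is wherever your puppet modules are checked out:

rsync -av /etc/puppet/modules/ fixtures/puppet/modules/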

Toft uses a routine located in the rc.local file we copied earlier to update DNS with the hostname of the new node. It does this via nsupdate. Since I’m only ever going to create one node with a known IP, and access it from the one machine we’re running the tests from, I can just add the hostname to /etc/hosts:

192.168.20.2    n1    n1.foo

Now we’re ready to write tests - again there are a lot of examples of cucumber tests in the features directory. I removed them all, apart from puppet.feature, which I pared down and added my own tests to:

features/puppet.feature

Feature: Puppet support

Scenario: Run Puppet manifest on nodes
Given I have a clean running node n1
When I run puppet manifest "manifests/test.pp" on node "n1"
Then Node "n1" should have file or directory "/tmp/puppet_test"

Scenario: Apache module
Given I have a clean running node n1
When I run puppet manifest "manifests/apache.pp" with config file "puppet.conf" on node "n1"
Then Node "n1" should have package "httpd" installed in the centos box
And Node "n1" should have service "httpd" running in the centos box

Note the references to the “centos box” – that’s just the way the step definitions have been written in the toft module and can be modified easily. My Apache module test has an associated manifest:

fixtures/puppet/manifests/apache.pp

include apache

Now we can run our tests:

cd features
cucumber puppet.feature

Which gives a result like:

Creating Scientific Linux 6 node…
Checking image cache in /var/cache/lxc/rootfs-x86_64 …
Extracting rootfs image to /var/lib/lxc/n1/rootfs …
Set root password to 'root'
'scientific-6' template installed
'n1' created
Feature: Puppet support
Scenario: Run Puppet manifest on nodes # puppet.feature:3
Starting host node…
Waiting for host ssdh ready………….
Waiting for host to be reachable.
SSH connection on 'n1/192.168.20.2' is ready.
Given I have a clean running node n1 # step_definitions/node.rb:1
notice: /Stage[main]/Test/File[/tmp/puppet_test]/ensure: created
notice: Finished catalog run in 0.04 seconds
When I run puppet manifest "manifests/test.pp" on node "n1" # step_definitions/puppet.rb:1
Then Node "n1" should have file or directory "/tmp/puppet_test" # step_definitions/checker.rb:5

Scenario: Apache module # puppet.feature:8
Starting host node…
Waiting for host ssdh ready.
Waiting for host to be reachable.
SSH connection on 'n1/192.168.20.2' is ready.
Given I have a clean running node n1 # step_definitions/node.rb:1
notice: /Stage[main]/Apache::Install/Package[httpd]/ensure: created
notice: /Stage[main]/Apache::Service/Service[httpd]/ensure: ensure changed 'stopped' to 'running'
notice: Finished catalog run in 10.68 seconds
When I run puppet manifest "manifests/apache.pp" with config file "puppet.conf" on node "n1" # step_definitions/puppet.rb:5
httpd-2.2.15-15.sl6.x86_64
Then Node "n1" should have package "httpd" installed in the centos box # step_definitions/centos/checks.rb:1
9
And Node "n1" should have service "httpd" running in the centos box # step_definitions/centos/checks.rb:6

2 scenarios (2 passed)
7 steps (7 passed)
0m26.193s

Job done – now you can write more tests and automate them with Jenkins.

Kanban Update

So, here’s what our Kanban board looked like:

In the first week or so, we made some subtle changes – first we renamed backlog to task pool, and made that area much bigger than I’d originally allowed.

Secondly, we also added a “not ready” section at the bottom of the task pool area. This is for tasks that are blocked but not yet started – for example, a task might be to install software on a server, but I’m still waiting for the server networking to be configured, so I can’t start that yet.

This worked fine for us, and we didn’t make any further changes. Initially, the board was used a lot, but after about a month or so everyone started falling back to their own spreadsheets, checklists, notebooks etc. and not really updating the board. That is very frustrating - I’ve constantly struggled with reporting compliance throughout my time in this role.

We started by doing daily standups, but I found not much was happening in them, so I stopped them after about a week and had ad-hoc discussions with individuals as required. I believe stopping these meetings hurt compliance. My motivation was to let everyone push their status to the board, which my managers and I could check over as required - getting out of people’s way by reducing the number of meetings they had to attend. That only works if the push happens! I’ve often tried to think: what does the reporter get out of using system X? What justification can I give as a good motivation for them? Unfortunately, the motivation for a lot of this is reporting and forecasting effort for headcount, so the benefits are entirely for the consumer.

I expected to find a couple of bottlenecks - after all, this is what kanban is all about. But mostly, everyone kept their work in progress down. You can see on the board that someone has four tasks on the go. Given the kind of tasks we do, this can be a reasonable amount.

There are a couple of reasons why I believe we didn’t find any blockers. One is that the world changed for the two individuals who were the blockages before – one got very ill, so hasn’t been in much this year. The other got put on a major project, so we hired additional resources to assist with their responsibilities and this helped reduce the backlog.

Additionally, it’s become clear to me that we are in no way a team who can pick up each other’s tasks. The technology components we work with are too different - Windows, Linux and Storage. Perhaps the only crossover is that we all have to co-ordinate activities for system provisioning - hardware installation, networking etc. But in general we have a bunch of individuals with individual streams of work. So what we’re looking at on the board is a number of todo lists, each for a component of a system - it’s in no way giving a view of the system as a whole.

My time in this role is finishing in a few weeks. Since the company wants this information to be stored in electronic tools, I wrapped up the board last week and asked everyone to move their tasks back into JIRA. The experiment is over.

Of course, I did this first to lead the way - and instantly realised how much I disliked JIRA. I found it very hard to look at my tasks and answer “What am I doing now?” and “What’s in my task pool?” - so I immediately had to configure Greenhopper, Atlassian’s Kanban tool, to create a new personal Kanban board! As soon as I had that configured, a bunch of people were over my shoulder saying “I like that, can I have that too?” - including members of other teams.

So my experience is that I really like displaying a backlog and work in progress queue, and physically moving tasks between statuses. However, as a workflow analysis tool and to improve status reporting, the experiment wasn’t particularly successful.

Kanban for Project Teams

For the past nine months I’ve been managing the day-to-day activities of a team. Before that I really had no experience of this sort of work. All the team members are very experienced and are considered subject matter experts within the business, so my role is not mentoring, as it was in past senior sysadmin roles. During this time I’d been struggling to find a way of keeping track of the tasks and projects the team were working on. I ended up with boring weekly meetings, and notes on live projects that just got longer and longer each week.

Then late last year, I started considering Kanban after reading this article: http://agilesysadmin.net/kanban_sysadmin. So I researched Kanban further, read David Anderson’s book on the subject, and bought a whiteboard and a bunch of cards and magnets.

This week, I introduced the concept to the team. We’re a project delivery team, rather than your regular operational sysadmin team. We’re all operations sysadmins at heart - we understand those problems and have worked in that environment - but now we only deliver projects and no one pages us.

So we spend a lot more of our time tracking larger projects - but these projects involve a lot of smaller activities, which involve people coming to your desk and asking for something quickly. Everyone works on multiple projects at once and is quite busy - things develop and need to be delivered fast, with as little resource as possible - so balancing the time requirements and due dates of multiple projects is an art in itself.

The goals of this experiment are twofold: to make intra-team communication less of a burden, and to try to improve our agility. I don’t have to ask everyone what they’re doing all the time; instead I can look at the board and get the highlights at a glance.

Here’s a really stupid paint diagram of what the board looks like to start:

I’ll try to get a picture of what the real board looks like at some point.

We went for this model as it reflects each individual’s workload within the team, which is what we are interested in. Cards move left to right: they sit in a big pool while they’re still TODOs, then jump into an individual’s swimlane when in flight.

We also discussed the possibility of having a more project-focused structure - with an extra project column at the start, holding project cards which each have their own swimlane. The system we have doesn’t really cater for understanding the current tasks from a project perspective. Let’s leave that for the Project Managers…

I kept the rules simple for now: cards only move to the right, and respect the board - only work on things that are on the board. One “rule” that I set as a guideline to be discussed was the sizing of tasks. I suggested that the task on each card should be sized so that someone else could perform it. (We also suggested that roughly one day’s effort is about the right size.)

A problem that I see with our team is that everyone has been tightly fitted into roles - Windows, Linux/Application Environments and Storage. We’ve had a reasonable turnover of staff recently, and we’ve grown in size from 4 to 6. This provides an opportunity to transform our working practice, so I want to promote more sharing of tasks - at the moment we’re too reliant on individual availability, and this is affecting project delivery.

For now, I’ve also chosen not to impose any work in progress limits. The board is an experiment that I hope will help us identify workflow problems. I suspect that too much work in progress is a problem, and that we would see benefits from reducing it - but until that’s visible I’m not going to apply limits. Visualising the bottlenecks is the key: they have to be clearly seen for everyone in the team to understand the reasoning behind work in progress limits.