I’ve seen a few talks by Andrew before, and they’re full of challenging, rapid-fire ideas that are only loosely tied together - more an expression of his world view than a talk on a specific subject. With that in mind, I’m still going to try to summarise it.
Andrew asked the question “what even is operable?”, and came to the conclusion that operability is the intersection of capability and usability.
There is an emergent architecture, which he calls cloud native: a set of patterns that emerged in organisations that deliver highly available applications continuously at scale - Amazon, Google, Twitter, Facebook etc. There are many associated labels, like devops, continuous delivery and microservices, and these are all inter-related as parts of that architecture.
“Do not seek to follow in the footsteps of the wise. Seek what they sought.” - Matsuo Basho
The human tendency is to fixate on the solution. We need to think about the problem more. Principles > Practices > Tools. You’re not going to be able to do the right things until you internalise the principles. Otherwise you are just imitating or cargo culting.
Equally, we tend to focus on automation tools and capabilities, rather than on what is being automated.
Other pertinent points:
if Tetris has taught me anything, it’s that errors pile up and accomplishments disappear
systems thinking teaches that we should minimise resistance rather than push harder
highlighted the Borg paper, and that every Borg task runs a built-in HTTP server that reports its health status and metrics
operations problems become easier when apps are aware of their own health
worse is better won: broken gets fixed but shitty lasts forever
Colin told a war story about what happens when you completely ignore operations. What’s the worst that could happen?
A game he worked on had 5 million users and “Launch issues”. It hardly worked, because they expected 20k users and ran initially with no budget for operations.
There was a big launch, but the infrastructure for the system was built the day before and completely untested. It was swamped within 10 seconds. 1.7 million people tried to play on the first day, but couldn’t. There were various media reports of the failure.
As a result, they managed to steal some funding from the marketing budget, scale 100x, add caching, and throw lots of hardware at the problem. To use the 64 DB servers, they used a sharded PHP ORM to distribute API server traffic.
He did get it working, but in time they found transactions weren’t working. This was supposed to be handled within the application, but wasn’t, because of differences between the development and production environments - and as a result cash was disappearing from the game. Everyone’s scores had to be reset.
Personally, Colin worked 54 hours solid to get the thing running. For the three months the site ran, he worked 100hr/weeks.
This is how bad a project can get. “I nearly died.” He must certainly have burned out. He took a personal sense of pride in the project, and was horrible to the people around him. Heroism != success.
Work as a team
Anthony Eden, DNSimple - How small teams accomplish big things
Anthony’s talk was about scaling a team, and how they had to scale their operational processes - as a result of various experiences, but none more so than losing one of the founders, someone who knew everything about everything.
Bridget’s talk compared distributed systems and distributed teams: both are complex, in many of the same ways.
Firstly, distributed != remote. Having a few people out of the office is not the same.
What’s important in teams is people > tools. Focusing on the people and what they’re communicating is more important than the tools and how.
She made various points which are important for distributed teams:
Durable communication encourages honesty and transparency, and helps future you - “durable communication exhibits the same characteristics as accidental convenient communication in a co-located space. The powerful difference is how inclusive and transparent it is.” - Casey West
Let your team know when you’ll be unavailable.
Tell the team what you’re doing.
Misunderstandings are easy, so you need to over-communicate - especially to express emotions, as it’s easy to misinterpret textual communication.
This talk was about the experience of building Dataloop as a startup, after working in other companies and speaking to 60 companies about monitoring, and it focused on dashboards.
Generally see the following kinds of dashboards:
Analytics dashboards, to diagnose performance issues. Low level, detailed info.
NOC dashboard, high level overview of services.
Team dashboards, overview of everything not just technical elements - includes business metrics.
Public dashboards, high level, simplified & sanitised marketing exercises.
Keeping people in touch with reality is the problem, as is knowing the right thing to work on. Discussions about what features to work on get opinionated. There is data within our applications that we can use to make decisions, and it can be represented on dashboards.
Stability dashboards: general performance, known trouble spots.
Feature dashboards: customer driven, features forum. “I suggest you…” & voting
Release dashboards: dashboards for monitoring the continuous delivery pipeline
Elik Eizenberg, BigPanda - Alert Correlation in modern production environments
My personal favourite talk of the first day.
Elik’s contention is that there is a lack of automation in regard to responding to alerts.
Incidents are composed of many distinct symptoms, but monitoring tools don’t correlate alerts on those symptoms for us into a single incident.
The number of alerts received might not be proportional to the number of incidents. The number of incidents experienced may be similar day to day, but the impact of those incidents can be very different - hence the number of alerts sometimes being higher.
compound metrics (i.e. aggregations, or a compound metric built from many hosts). This is relatively effective, but alerts are received late, and you can miss symptoms in the buildup to the alert triggering.
service hierarchy, i.e. a hierarchy of dependencies related to a service. The problem with this approach is that it’s hard to create & manage. Applications and their dependencies are generally not hierarchical.
“What I would like to advocate”: Stateful Alert Correlation
Alerts with a sense of time, aware of what happened before now.
Alerts can create a new incident, or link themselves to an existing incident if it’s determined they’re related.
How do you know that an alert belongs to an incident?
Topology - some tag that every alert has. (Service? Datacenter? Role?)
Time - alerts occurring close in time to another alert
Modeling - learning if multiple alerts tend to fire within a short timeframe
Training - Machine Learning. User feedback if correlation good or bad.
Even basic heuristics are effective. There is lots of value to be gained from just applying Topology and Time.
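The Topology and Time heuristics really can be sketched in a few lines. This is my own illustration of the idea, not BigPanda’s implementation - the alert fields, the choice of service as the topology tag, and the five-minute window are all assumptions:

```ruby
# Minimal stateful alert correlation using only Topology and Time.
Alert    = Struct.new(:time, :service, :check)
Incident = Struct.new(:alerts)

WINDOW = 300 # seconds: alerts this close together may be related

def correlate(alerts)
  incidents = []
  alerts.sort_by(&:time).each do |alert|
    # Topology + Time: link to an incident that shares the topology tag
    # (here: service) and whose latest alert is within the time window...
    match = incidents.find do |i|
      last = i.alerts.last
      last.service == alert.service && (alert.time - last.time) <= WINDOW
    end
    # ...otherwise the alert opens a new incident.
    match ? match.alerts << alert : incidents << Incident.new([alert])
  end
  incidents
end

alerts = [
  Alert.new(0,    "web", "high latency"),
  Alert.new(60,   "web", "5xx rate"),
  Alert.new(90,   "db",  "replication lag"),
  Alert.new(1000, "web", "high latency"),
]
incidents = correlate(alerts)
p incidents.map { |i| i.alerts.size }   # prints [2, 1, 1]
```

Four alerts collapse into three incidents: the two early web alerts correlate, while the late web alert falls outside the window and starts a fresh incident.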
There were two other talks after this, but I had to leave early. Sorry to the presenters for missing their talks!
As a sysadmin, I’ve never been part of a team that did peer review well. When you’re doing a piece of work and you want to check you’re doing the right things, how do you get feedback?
Do you send an email out saying “can you look at this?”
Do you have someone look over your shoulder at your monitor?
Do you have discussions about what you’re going to do?
Sometimes I’ve done these things, and they’ve been partly effective - but usually they just get a reply of “Looks good to me!”
Most of the sysadmins I’ve worked with will just make a change because they want to, or were asked to, and don’t tell anyone. Or they want to be heroes who surprise everyone by sweeping in with an amazing solution that they’ve been working on secretly.
I don’t want to work where people are trying to perform heroics. I hate surprises. Having done all those individualistic stupid things myself, I want to work in a team where we work together on problems out in the open. When you involve others, they are more engaged and feel part of the decision making process. And their feedback makes you produce better work.
My current role has the best culture of reviewing work that I’ve experienced. But we had to create it for ourselves, and this is what we did.
We put all our work in version control
When I started this role, the first project I worked on was moving our configs into git. Most of those configs are stored on shared NFS volumes which are available on all hosts. Previously, people made config changes by making a copy of the file they wanted to change and editing the copy. Once that was ready, they would take a backup of the production copy of the file and copy their new version into place.
Importing the files into repos was generally straightforward, but sometimes there was automatically generated content that needed excluding in .gitignore. To deploy the repo we added a post-update hook on the git server that would run git pull from the correct network path.
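The whole mechanism fits in a hook of a few lines. Here’s a self-contained sketch of the pattern - a bare “server” repo whose post-update hook pulls into a working copy standing in for the shared NFS path (all paths here are illustrative, not our real layout):

```shell
set -e
tmp=$(mktemp -d)

# A bare repo standing in for the central git server.
git init -q --bare "$tmp/server.git"

# A working copy standing in for the shared NFS path configs deploy to.
git clone -q "$tmp/server.git" "$tmp/deploy" 2>/dev/null

# The post-update hook: after every push, pull the changes into the
# deploy path. Hooks run with GIT_DIR set to the bare repo, so unset it.
cat > "$tmp/server.git/hooks/post-update" <<EOF
#!/bin/sh
unset GIT_DIR
cd "$tmp/deploy" && git pull -q origin master
EOF
chmod +x "$tmp/server.git/hooks/post-update"

# An admin's clone: commit a config change and push it.
git clone -q "$tmp/server.git" "$tmp/work" 2>/dev/null
cd "$tmp/work"
git config user.email admin@example.com
git config user.name admin
git checkout -qb master 2>/dev/null || true
echo "maxconn 100" > app.conf
git add app.conf
git commit -qm "tune maxconn"
git push -q origin master 2>/dev/null

# The hook has deployed the change to the shared path.
cat "$tmp/deploy/app.conf"
```

Because the hook runs during the push, any pull failure is reported straight back in the pusher’s terminal - which is the synchronous feedback described below.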
But we also wanted to catch when people would change things outside of git and do so without overwriting their changes. This would allow us to identify who was changing things and make sure they knew the new git based process. To do this we added a pre-receive hook on the server that would run git status in the destination path and look for changes.
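The check itself is just git status in the deployed working copy. A minimal illustration of the mechanism (not our actual hook - paths and file names are made up):

```shell
set -e
tmp=$(mktemp -d)

# A working copy standing in for the deployed config path.
git init -q "$tmp/deploy"
cd "$tmp/deploy"
git config user.email ops@example.com
git config user.name ops
echo "setting = 1" > app.conf
git add app.conf
git commit -qm "initial config"

# Clean working copy: nothing changed outside of git, a push may proceed.
test -z "$(git status --porcelain)" && echo "clean - push allowed"

# Simulate someone editing the file directly on the shared volume.
echo "setting = 2" > app.conf

# Dirty working copy: a pre-receive hook seeing this can flag the change
# (or refuse the push) so the out-of-band edit isn't silently overwritten.
git status --porcelain
```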
That was the first step in changing the way we worked. It wasn’t a big change, but it got everyone fairly comfortable with using git. The synchronous nature of the deployment was a big plus too, because all the feedback about success or failure would appear in your shell after running git push.
Then we generated notifications
This was great progress, and we got more and more of our configs into this system over time. The next thing we wanted to do was to produce notifications about what’s changing. This would allow us to catch mistakes, find bad changes that broke things, and to understand who is making changes to what systems.
We did two things to achieve this - we made all these repos send email changelogs and diffs whenever someone pushed a change, and we created an IM bot that would publish details of changes into a group chat room.
This produced a great audit trail of changes happening to the system. But whenever you saw a change break something, in hindsight you’d think: well, I knew that change would break - I could have stopped it before it was deployed.
Finally, we added Pull Requests
We knew we wanted to implement Pull Requests but we couldn’t do this with the git server software we were originally using.
Since we had started using Confluence and JIRA for our intranet and issue tracking, we moved all our git repos to Stash, which is Atlassian’s GitHub-like git server. This provided the functionality we wanted.
On a single repo, we enabled a Pull Request workflow - no one could commit to master any more and everyone had to create a new branch and raise a PR for the changes they wanted merged.
We wrote up some documentation on how to use git branching and raise PRs via the Stash web interface, and explained to everyone how it all worked.
We chose a repo that was relatively new, so the workflow was part of the process of learning to work with the new software. It was also one that everyone in the team would have to work with, so that everyone was exposed to the workflow as soon as possible.
To deploy the approved changes, we used the Jenkins plugin for Stash to notify Jenkins when there were changes to the repository. Jenkins then ran a job that did the same thing as our previous post-update hooks - ran git pull in the correct network location.
Running the deployment asynchronously like this felt like we were losing something - if there was a problem you were sent an email & IM from Jenkins, but this felt less urgent than a bunch of red text in your terminal. But for the benefit of review, this was a price worth paying.
In the interim we had moved the checks for changes being made outside of git into our alerting system, so we could catch them earlier than when the next person went to make a change. This meant we didn’t have to implement these checks as part of the git workflow - if there was still a problem we let the hook job fail, and it could be re-run from Jenkins once the cause was resolved.
Over time, we moved all the other repositories across to this workflow, starting with the repos with the highest number of risky changes. But for a number of repos we kept the old workflow with synchronous deployment hooks, because all the changes being made were low risk and well-established practices.
Initially, slowing down is hard
The hardest thing to adapt to was changing the perceived pace that people were working. Everyone was very used to being able to make the change they wanted immediately, and close that issue in JIRA straight away. That’s how we judged how long a piece of work takes, but that doesn’t take into account the time spent troubleshooting and doing unplanned work.
What we were doing was moving more of the work up front, where you can fix problems with less disruption. But making that adjustment to the way you work can be really hard because you perceive the process to be slower.
Everyone in our team is a good sport and willing to give things a go - but as much as we could try to explain the benefit of slowing down, you only realise how much better it is by doing it over and over, and that takes time.
Sometimes it’s necessary to move faster, and we manage that simply - if someone needs a review now, they ask someone and explain why it’s urgent. As a reviewer, if I’m busy I’ll get people to prioritise their requests by asking “I’m probably not going to be able to review everything in my queue today. Is there anything that needs looking at now?”
In time the benefits are demonstrated
By using Pull Requests we created an asynchronous feedback system. First you propose what you’re going to do in JIRA. Then you implement it how you think it should be done and create a PR. Then when a reviewer is available they’ll provide feedback or approve the change. You keep making updates to your PR and JIRA until the change is approved or declined.
With time, everyone experienced all of the following benefits of that feedback:
Catch mistakes before they are deployed
This was what we set out to do! Breaking changes were made before, and there are fewer of them now.
Learn about dependencies between components
A common type of feedback is “How is this change going to affect X?” - sometimes the requestor has already considered that and can explain the impact and any steps they took to deal with X. But if they haven’t considered it, they need to research it. That way they learn more about how things are connected and have a greater appreciation of how the system works as a whole.
Enforce consistent style and approaches
Everyone has their own preferred style of text editing and coding. With a PR we can say this is the style we want to use, and enforce it. The tidier you keep your configs and code, the more respect others will have for them.
It’s massively helpful to be told about an existing function that achieves what you’re trying to do, or about existing examples of approaches to solving a problem. This can help you learn better techniques and avoid duplicating code.
Identify risky changes
With changes to fragile systems, you’re never confident about hitting deploy even if the change looks good. Until it’s in production and put through its paces, there is risk. So this has allowed us to schedule deploying changes for times when the impact will be lower, or to deploy the change to a subset of users.
It’s also stopped “push and run” scenarios - we avoid merges after 5pm, they can always wait for tomorrow morning!
Explain what you’re trying to do
Much of the review process is not about identifying problems with configs and code, but simply being aware of and understanding the changes that are taking place. This is invaluable to me as a reviewer.
So, when raising a Pull Request, it’s expected that each commit references the JIRA issue related to the change. The Pull Request can have a comment about what is changing and why, and that can be very helpful, but the explanation of why the change is taking place must be in the JIRA issue.
This way, when looking back at the change history we can also reference the motivations for making that change and see the bigger picture beyond the commit message.
People want to have their work reviewed
Somewhere along the way, getting your work reviewed became desirable. You realise the earlier you put your work out there for review, the earlier you get feedback and the less likely you are to spend time doing the wrong thing.
We still have a number of repos that anyone can just change by pushing to master, there’s no controls because these changes are considered safe. But people choose to create branches and PRs for their changes to these repos, because they want to have their work reviewed.
Change takes time…
Going from no version control to making this cultural change in the way we approached our work took about 2.5 years.
Throughout this period there was no grand plan or defined scenario that we were trying to achieve. At each stage, we could only see a short way forward. We had some things we were looking to do better, so we experimented. When we found what worked, we made sure everyone kept doing it that way.
At the start, I had no idea that peer review of sysadmin work would end up being done via code review. As we moved things into git and saw the benefits, we wanted to manage everything the same way, and that drove moving more things into git.
We moved slowly because there was other work going on and we needed to get people comfortable with the new way of working and tooling before we could ask more of them. Change takes time, and many of our team have benefitted from seeing that incremental process take place.
It’s been gratifying that the newest team members who have joined since we established PRs have said “I was uncertain about it at first, but now I get it. It really works.”
… and involves lots of painstaking work
Getting to where we are has meant spending a lot of time setting up new repositories, separating files managed by humans from those that are automatically generated, migrating repositories to Stash, creating deploy hooks, explaining how git works, and most of all making sure you’re providing useful reviews and that they happen regularly.
It’s one thing to start setting up a couple of repos like this – but to fully establish the change you need to do lots of boring work to make sure everything is migrated, even the more difficult cases. It’s important that everything is managed consistently, within one system.
Skyline - Analyse time series for anomalies in (almost) real time
The setup at Etsy includes 250k metrics; anomaly discovery takes about 70 seconds, and storing 24 hours of metrics in memory requires 64GB.
carbon-relay is used to forward metrics to the horizon listener
metrics are stored in redis
data is stored in redis as messagepack (allows efficient binary array stream)
roomba runs intermittently to clean up old metrics from redis
the analyzer does its thing and writes info to disk as JSON for the web front end
To identify anomalies, skyline uses some of the techniques that Abe talked about in his presentation the previous day. It uses the consensus model, where a number of models are used and they vote - so if a majority of models detect an anomaly, then that is reported.
Oculus - analyse time series for correlation
Oculus figures out which time series are correlated using Euclidean distance - the difference between time series values.
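The Euclidean distance part is simple to sketch - for two equal-length series it’s just the root of the summed squared differences (my illustration, not Oculus’s code):

```ruby
# Euclidean distance between two equal-length time series: the smaller
# the distance, the more alike the two series are.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

flat  = [1, 1, 1, 1]
spike = [1, 1, 9, 1]

euclidean_distance(flat, flat)   # => 0.0
euclidean_distance(flat, spike)  # => 8.0
```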
It also uses Dynamic Time Warping, to allow for phase shifts where the change in one series occurs later than in the other. But this is slow, so it’s targeted at the time series that could be correlated, found by comparing shape descriptions.
data pulled from skyline redis and stored in elasticsearch
time series converted to shape description (limited number of keywords that describe the pattern)
phrase search done for shared shape description fingerprints
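A shape description can be as crude as classifying each step of the series. This sketch is my guess at the idea - the keywords and threshold are illustrative, not Oculus’s actual vocabulary:

```ruby
# Reduce a series to coarse movement keywords, so that similarly shaped
# series share a searchable "fingerprint" (e.g. in elasticsearch).
def shape_description(series, threshold = 1)
  series.each_cons(2).map do |a, b|
    diff = b - a
    if diff.abs < threshold then "flat"
    elsif diff > 0 then "up"
    else "down"
    end
  end.join(" ")
end

shape_description([1, 1, 5, 2])      # => "flat up down"
shape_description([10, 10, 14, 11])  # => "flat up down" (same shape, shifted)
```

Two series at very different absolute levels produce the same fingerprint, which is what makes a cheap phrase search a useful pre-filter before expensive comparisons.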
Unfortunately this presentation was marred by a projector/slide deck failure which made it hard to follow. It was incredibly disappointing because there was a lot of positive discussion around Riemann and I was looking forward to a better exposition of the tool.
Riemann is an event stream processing tool.
All events sent to Riemann have a key-value data structure - for logs, metrics, etc.
You can use all your current collectors - collectd, logstash etc. - and Riemann has an in-app statsd replacement
Those events can be manipulated in many ways, and sent to many outputs
The query language, Clojure, is data-driven and from the Lisp family
There is storage available for event correlation, but I didn’t really understand this from the discussion
David argued that your primary metrics should be your business metrics, and that your infrastructure metrics are secondary. That’s not to say infrastructure metrics aren’t important, but the business metrics are measures of how your business is performing, and this is the data you should be alerting on.
Alerts should be informational and actionable - not just “cool story, bro”
Consider what matters to customers - i.e. instead of measuring queue size, use time to process
When a business metric alerts, then correlate against infrastructure monitoring
Keep the infrastructure thresholds, but don’t alert on that information - you can access it when necessary
He gave a great one liner to help decide what you should be measuring - “What would get your boss fired?” - Measure these things deeply.
He also pushed the idea of sharing your information outside the team - being more transparent and visible to the rest of the business, particularly since the information you hold is business metrics. This will provide feedback about which metrics are important to others.
This workshop was a practical introduction to instrumentation and exploration - stepping through configuring StatsD, graphite and collectd to instrument an application and exploring graphs using descartes.
Over the past two days I’ve attended the Monitorama conference in Berlin. The conference covers a number of topics in open source monitoring, and reflects the cutting edge in technologies and approaches being taken to monitoring today. It also acts as a melting pot of ideas, which lots of people take further or act on within their businesses.
There were a number of ideas that ran through many of the talks, which I considered the key takeaways from day one:
Alert fatigue is a big problem - Alert less, remove meaningless alerts, alert only on business need and add context to alerts.
Everything is an event stream - metrics, logs, whatever. Treat it all the same (and store it in the same place).
The majority of our metrics don’t fit a normal distribution, which makes automated anomaly detection hard. So people are looking for models that fit our data, to do anomaly detection.
Everyone loves airplane stories. I heard at least five, three of which ended in crashes.
The first day had a single speaker track, here is my personal summary and interpretation of all the talks:
This was an experience talk about the Obama re-election campaign, what they did that worked, what they wished they had etc.
Essentially they used every tool going, using whatever made sense for the particular area they were looking at. But they weren’t able to look carefully at their alerting and were emitting vast numbers of alerts. To deal with this, they leveraged the power users of their applications, who would give feedback about things not working, and then they’d dig into the alerts to track down the causes. To improve on that feedback, they also created custom dashboards for those power users so they could report more context with the problems they were experiencing.
Alert fatigue was mentioned - “Be careful about crying wolf with alerts” because obviously people turn off if there’s too much alerting noise, and subsequently “Monitoring is useless without being watched”.
Abe gave a great talk on the history of Statistical Process Control and how it is used for quality control on production lines and in similar industrial settings. This is all about anomaly detection: looking for events that fall outside three standard deviations from the mean. But unfortunately this detection process only works on normally distributed data, and almost none of the data we collect is normally distributed. He followed this with a number of ideas for approaches to describing our data in an automated fashion - and then called on the audience to get involved in helping develop these models.
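The classic control-chart rule is easy to state in code - flag points more than three standard deviations from the mean. This is a generic sketch of the rule (and, per Abe’s caveat, it only behaves well on roughly normally distributed data):

```ruby
# Flag values further than three standard deviations from the mean.
def anomalies(series)
  mean = series.sum.to_f / series.size
  sd   = Math.sqrt(series.sum { |x| (x - mean)**2 } / series.size)
  series.select { |x| (x - mean).abs > 3 * sd }
end

steady = [9, 10, 11] * 33          # a series with normal-ish wobble
anomalies(steady + [100])          # => [100]
anomalies(steady + [12])           # => []
```

On a long-tailed or seasonal series the mean and standard deviation stop describing “normal”, and this rule either floods you with alerts or misses real problems - hence the search for better models.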
He also gave an interesting comparison between current monitoring being able to provide us with either a noisy situational awareness, or more limited feedback using predefined directives.
Presented the argument that the best systems are the ones that are used constantly - for example, failover secondaries often do not work because they have not been subjected to live running conditions and the associated maintenance. So he suggested a bunch of things that are done differently now which could be done the same, so we get better at doing fewer things:
Metrics, Logs & Events - can all be considered events, so make them the same format as events
Metric Collection & Alerting - often collect the same data - collect once, and alert from your stored metrics
Integration testing and QOS monitoring - share a lot of same goals, so do them the same
Unlike results, errors get specific code for dealing with them - instead, treat them as a result, but send data that represents the error
Gave a great practical talk about how monitoring systems fail, and the series of small decisions that are made to improve individual situations but make the whole system, and the operator’s experience, much much worse - essentially, everything that creates monitoring systems with screens full of red alerts that have been that way for a long time with no sign of resolution.
As a way forward, she suggested the following:
Only monitor the key components of your business - remove all the crap
Find out what critical means to you - understand your priorities
Fix the infrastructure - start with a zero error baseline
Plan for monitoring earlier in the development cycle (aka devops ftw)
This talk covered a lot of ground quite quickly, so these were the key things I took away:
Monitor for failure by reviewing problems that you’ve identified before, creating detailed descriptions of those events, and only then alerting on those descriptions - so that you have enough context to provide with an alert to make it meaningful and actionable to the receiver.
Alert on your business concerns
Store all your data in the same place - treat logs, events and metrics the same
Our data isn’t normally distributed and that makes shit hard. The next leaps in dealing with this data will come from outside Computer Science - more likely to be the hard science disciplines.
At work I’m supporting a rails app, developed by an external company. That app logs a lot of useful performance information, so I’m using logstash to grab that data and send it to statsd+graphite.
I’ve been nagging the developers for more debugging information in the log file, and they’ve now added “enhanced logging”.
Since the log format is changing, I’m taking the opportunity to clean up our logstash configuration. The result of this has been to create an automated testing framework for the logstash configs.
Managing config files
Logstash allows you to point to a directory containing many config files, so I’ve used this feature to split up the config into smaller parts. There is a config file per input, a filter config for that input, and a related statsd output config. For outputs, I also have a config for elasticsearch.
Because I use grep to tag log messages, then run a grok based on those tags, it was necessary to put all filters for an input in a single file. Otherwise your filter ordering can get messed up as you can’t guarantee what order the files are read by logstash.
If you want to break out your filters in multiple files but need the filters to be loaded in a certain order, then prefix their names with numbers to explicitly specify the order.
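For example, a conf.d layout along these lines (file names illustrative) makes the load order explicit:

```
/etc/logstash/conf.d/
├── 10-input-rails.conf
├── 20-filter-rails.conf
├── 30-output-statsd.conf
└── 31-output-elasticsearch.conf
```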
So, when I define the input, those log messages are defined with the type ‘rails’.
The first filter that gets applied is a grok filter which processes the common fields such as timestamp, server, log priority etc. If this grok is matched, the log message is tagged with ‘rails’.
Messages tagged ‘rails’ are subject to several grep filters that differentiate between types of log message. For example, a message could be tagged as ‘rails_controller’, ‘sql’, or ‘memcached’.
Then, each message type tag has a grok filter that extracts all the relevant data out of the log entry.
One of the key things I’m pulling out of the log is the response time, so there are some additional greps and tags for responses that take longer than we consider acceptable.
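As a rough sketch of that chain in logstash 1.1.x-era syntax - the patterns, field names and tags here are illustrative, not the real production config:

```
filter {
  # Common fields for every rails log line; tag the event on success.
  grok {
    type    => "rails"
    pattern => "%{TIMESTAMP_ISO8601:timestamp} %{WORD:priority} %{WORD:server} rails: %{GREEDYDATA:railsmessage}"
    add_tag => ["rails"]
  }

  # Differentiate message types with grep, without dropping anything.
  grep {
    type    => "rails"
    tags    => ["rails"]
    match   => ["railsmessage", "Controller\."]
    drop    => false
    add_tag => ["rails_controller"]
  }

  # Extract the interesting fields for that message type.
  grok {
    type  => "rails"
    tags  => ["rails_controller"]
    match => ["railsmessage", "%{WORD:controller}\.%{WORD:action}: %{NUMBER:response_time}ms"]
  }
}
```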
However, each of these web tools only looks at a single log message or regex - I want to test my whole filter configuration, how entries are directed through the filter logic, and to know when I break some dependency for another part of the configuration.
Since logstash 1.1.5 it’s been possible to run rspec tests using the monolithic jar:
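The invocation is along these lines (jar version and spec filename illustrative):

```
java -jar logstash-1.1.5-monolithic.jar rspec spec/rails_filters_spec.rb
```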
So, given I have a log message that looks like this:
2013-01-20T13:14:01+0000 INFO server rails: RailController.index: 123.1234ms db=railsdb request_id=fe3c217 method=GET status=200 uri=/page/123 user=johan
Then I would write a spec test that looks like:
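Reconstructing from the surrounding description, a spec using the logstash 1.1.x test_utils helpers would look roughly like this - the config path, field names and tag names are illustrative:

```ruby
require "test_utils"

describe "rails log filters" do
  extend LogStash::RSpec

  # Dynamically include every filter config from the logstash config
  # directory, so the whole filter chain is under test.
  config Dir.glob("/etc/logstash/conf.d/*filter*.conf").sort.map { |f| File.read(f) }.join("\n")

  line = "2013-01-20T13:14:01+0000 INFO server rails: RailController.index: " \
         "123.1234ms db=railsdb request_id=fe3c217 method=GET status=200 " \
         "uri=/page/123 user=johan"

  sample("@message" => line, "@type" => "rails") do
    # Tags that should and shouldn't be present...
    insist { subject.tags }.include?("rails")
    insist { subject.tags }.include?("rails_controller")
    reject { subject.tags }.include?("_grokparsefailure")

    # ...and the fields pulled out of the log message.
    insist { subject["status"] } == "200"
    insist { subject["response_time"] } == "123.1234"
  end
end
```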
This dynamically includes all my filter configurations from my logstash configuration directory. Then I define my known log message and what I expect the outputs to be - the tags that should and shouldn’t be there, and the content of the fields pulled out of the log message.
Develop - Verify workflow
Before writing any filter config, I take sample log messages and write up rspec tests of what I expect to pull out of those log entries. When I run those tests the first time, they fail.
Then I’ll use the grokdebug website to construct my grok statements. Once they’re working, I’ll update the logstash filter config files with the new grok statements, and run the rspec test suite.
If the tests are failing, I’ll often output subject.inspect within the sample block to show how logstash has processed the log event. These debug messages are removed once the tests are passing, so we have clean output for automated testing.
When all the tests are passing I’ll deploy them to the logstash server and restart logstash.
Automating with Jenkins
Now that we have a base config in place, I want to automate testing and deploying new configurations. To do this, I use my good friend Jenkins.
All my spec tests and configs are stored in a single git repository. Whenever I push my repo to the git server, a post-receive hook is executed that starts a Jenkins job.
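The hook can be as small as a notification to Jenkins. A sketch, with an illustrative Jenkins URL and job name:

```shell
#!/bin/sh
# post-receive hook (sketch): kick off the Jenkins job that runs the
# rspec suite and deploys on success. URL and job name are illustrative.
curl -s -X POST "http://jenkins.example.com/job/logstash-configs/build"
```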
This job will fetch the repository and run the logstash rspec tests on a build server. If these pass, then the configs are copied to the logstash server and the logstash service is restarted.
If the tests fail, then a human needs to look at the errors and fix the problem.
Integrating with Configuration Management
You’ll notice my logstash configs are stored in a git repo as files, rather than being generated by configuration management. That’s a choice I made in this situation as it was easier to manage within our environment.
If you manage your logstash configs via CM, then a possible approach would be to apply your CM definition to a test VM and then run your rspec tests within that VM.
Alternatively, the whole logstash conf.d directory could be synced by your CM tool. Then you could grab just that directory for testing, rather than having to do a full CM run.
Catching problems you haven’t written tests for
I send statsd the number of _grokparsefailure-tagged messages - this highlights any log message formats that I haven’t considered, or shows when the log format has changed on me and I need to update my grok filters.
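In logstash 1.1.x-era syntax, counting those failures can be a statsd output keyed on the tag - a sketch, with an illustrative host and metric name:

```
output {
  statsd {
    tags      => ["_grokparsefailure"]
    host      => "statsd.example.com"
    increment => ["logstash.rails.grokparsefailure"]
  }
}
```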