#sysadminsucks

Monitorama 2016 Portland Day 3

Justin Reynolds, Netflix - Intuition Engineering at Netflix

Video

Discussed the problem at Netflix that regions were siloed; they worked towards being able to serve users out of any region.

To fail a region, they need to scale up the other regions to serve all the traffic.

Dashboards are good at looking back, but you need to know about now. How do you provide intuition of the now?

Created vizceral - see the blog post for screenshots & video: http://techblog.netflix.com/2015/10/flux-new-approach-to-system-intuition.html

Brian Brazil, Robust Perception - Prometheus

Slides | Video

Prometheus is a TSDB offering ‘whitebox monitoring’ for looking inside applications. Supports labels; alerting and graphing are unified, using the same language.

Pull-based system, links into service discovery. HTTP API for graphing; supports persistent queries, which are used for alerting.

Provides instrumentation libraries; it’s incredibly simple to instrument functions and expose metrics to Prometheus. The client libraries don’t tie you into Prometheus - you can still use Graphite.
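
As a rough illustration (not from the talk), instrumenting a function with the official Python client might look like the sketch below; the metric names and port are made up.

    # A minimal sketch using the Python client library (prometheus_client).
    # Metric names and the port are illustrative, not from the talk.
    import random
    import time

    from prometheus_client import Counter, Summary, start_http_server

    REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing a request')
    REQUESTS = Counter('requests_total', 'Total requests handled')

    @REQUEST_TIME.time()          # observe how long each call takes
    def handle_request():
        REQUESTS.inc()            # count every call
        time.sleep(random.random() / 10)

    if __name__ == '__main__':
        start_http_server(8000)   # exposes /metrics for Prometheus to scrape
        while True:
            handle_request()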

Can use Prometheus as a clearing house to translate between different data formats.

Doesn’t use a notion of a machine. HA by duplicating servers, but alertmanager deduplicates alerts. Alertmanager can also group alerts.

Data is stored on disk as a file per database; it’s not round-robin - all data is stored without downsampling.

Torkel Ödegaard, Raintank - Grafana Master Class

Video

Gave a demo on how to use grafana, as well as recently added and future features.

Katherine Daniels, Etsy - How to Teach an Old Monitoring System New Tricks

Slides | Video

Old Monitoring System == Nagios

Adding new servers.

  • Use deployinator to deploy nagios configs. Uses chef to provide inventory to generate a current list of hosts and hostgroups.
  • Run validation via Jenkins by running nagios -v, as well as a tool they wrote for nagios validation.
  • New hosts are added with scheduled downtime so they don’t alert until the next day. Chat bots send reminders when downtime is going to finish.

Making Alerts (Marginally) Less Annoying

  • Created nagdash to provide federated view of multiple nagios instances.
  • Created nagios-herald to add context to nagios alerts. Also supports allowing people to sign up to alerts for things they’re interested in.

Tracking Sleep

  • Ops weekly tool. Provides on call reports, engineers flag what they had to do with alerts.
  • Sleep tracking and alert tracking for on call staff to understand how many alerts they’re facing and how it’s impacting their sleep.

An On Call bedtime story

  • Plenty of alerts because scheduled downtime expired for ongoing work.
  • Create daily reports of which downtimes will soon expire and which of those will raise alerts.

Joe Damato, packagecloud.io - All of Your Network Monitoring is (probably) Wrong

Slides | Video

There’s too much stuff to know about

  • ever copy paste config or tune settings you didn’t understand?
  • do you really understand every graph you’re generating?
  • what makes you think you can monitor this stuff?

Claim: the more complex the system is the harder it is to monitor.

What’s pretty complicated? The Linux networking stack! Lots of features, lots of bugs, and no docs!

  • /proc/net stats can be buggy
  • ethtool inconsistent, not always implemented
  • meaning of driver stats is not standardised
  • stats meanings for a driver/device can change over time
  • /proc/net/snmp has bugs: double counting, things not being counted correctly

Monitoring something requires a very deep understanding of what you’re monitoring.

Properly monitoring and setting alerts requires significant investment.

Megan Kanne, Justin Nguyen, and Dan Sotolongo, Twitter - Building Twitter’s Next-Gen Alerting System

Slides | Video

3.5B metrics per minute

Old Alerting System

  • 25k alerts/minute, 3m alert monitors
  • single config language, lots of existing examples, easy to write and add
  • all those points were good and bad!
  • lots of orphaned and unmaintained configs, no validation
  • alerts and dashboards were separate
  • reliability problems when zones suffered problems

Solution

  • combined alerts and dashboard configuration
  • dashboards defined in python, common libraries that can be included
  • python allows testing configs
  • created multi-zone alerting system
  • reduced time to detect from 2.5mins to 1.75mins

Helping Human Reasoning

  • bring together global, dependency and local context
  • including runbooks, contacts and escalations directly in the UI

Lessons Learned

  • distributed system, challenges about consistency, structural complexity and reasoning about time
  • sharding choices are hard, impossible to always avoid making mistakes
  • support and collaborate with users, try and reduce support burden with good information at interaction points (UI, CLI etc.), good user guides and docs
  • migrating - some happy to move, others not. some push back. had to accept schedule compromise and extra work.

Joey Parsons, Airbnb - Monitoring and Health at Airbnb

Video

Prefers to buy stuff:

  • New Relic
  • Datadog, instrument apps using dogstatsd
  • Alerting through metrics

Created open source tool, Interferon, to store alerts configuration as code.

Volunteer on call system. SREs make sure things in place so anyone can be on call.

  • Sysops training for volunteers, monitoring systems, how to be effective and learn from historical incidents
  • Shadow on call, learn from current primary/secondary
  • Promoted to on call

Weekly sysops meeting, go through incidents, hand offs, discuss scheduled maintenance.

On call health:

  • are alerting trends appropriate?
  • do we understand impact on engineers?
  • do we need to tune false positives?
  • are notifications and notification policies appropriate?

Dashboards for:

  • incident numbers over time
  • counts by service
  • total notifications per user and how many come at night
  • false positive incident counts

Heinrich Hartmann, Circonus - Statistics for Engineers

Slides | Video

Monitoring Goals

  • Measure user experience/ quality of service
  • Determine implications of service degradation
  • Define sensible SLA targets

External API Monitoring:

  • synthetic requests measure availability, but are bad for user experience
  • on long time ranges, rolled-up data is commonly displayed, erodes spikes

Log Analysis:

  • write request stats to log file
  • rich information but expensive and delay for indexing

Monitoring Latency Averages:

  • mean values: cheap to collect, store and analyse, but skewed by outliers/low volumes
  • percentiles: cheap to collect, store and analyse, robust to outliers, but require an up-front percentile choice and cannot be aggregated

Percentiles: keep all your data. Don’t take averages! Store percentiles for all reporting periods you are interested in - i.e. per min/hour/day. Store all the percentiles you’ll ever be interested in.

Monitoring with Histograms:

  • divide latency scale into bands
  • divide time scale into reporting periods
  • count the number of samples in each band x period

Can be aggregated across times. Can be visualised as heatmaps.
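
A minimal sketch of the band-counting idea (the band edges and the 60-second reporting period below are arbitrary choices, not from the talk):

    # Count latency samples into (reporting period, latency band) buckets.
    from bisect import bisect_left
    from collections import defaultdict

    BANDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]  # band edges, seconds
    PERIOD = 60  # seconds per reporting period

    counts = defaultdict(int)  # (period_start, band_index) -> sample count

    def record(timestamp, latency):
        period_start = int(timestamp) - int(timestamp) % PERIOD
        band = bisect_left(BANDS, latency)  # samples above the last edge land in an overflow band
        counts[(period_start, band)] += 1

Because the buckets are plain counts they can be summed across periods or hosts, rendered as a heatmap, and used to approximate any percentile after the fact.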

John Banning, Google - Monarch, Google’s Planet-Scale Monitoring Infrastructure

Video

Huge volume, global span, many teams - constant change

Previously borgmon. Each group had its own borgmon, which put a large load on anyone doing monitoring. Hazing ritual - the new engineer gets to do borgmon config maintenance.

Wanted:

  • can handle the scale
  • small/no load to get up and running
  • capable of handling the largest services

Monitor locally. Collect and store the data near where it’s generated. Each zone has a monarch.

  • Targets collect data with the streamz library. Metrics are multi-dimensional, and histograms are stored.
  • Metrics are sent to the monarch ingestion router, then on to a leaf, which is an in-memory data store that is also written to a recovery log. From the log, data goes to a long-term disk repository.
  • Streams stored in a table, basis for queries
  • Evaluator runs queries and stores new data for streams or sends notifications

Integrate Globally. Global Monarch - Distributed across zones, but a single place to configure/query all monarchs in all zones.

Provides both Web UI and Python interfaces.

Monarch is backend for Stackdriver

Monitoring as a service is the right idea. Make the service a platform to build monitoring solutions.

Monitorama 2016 Portland Day 2

Brian Overstreet, Pinterest - Scaling Pinterest’s Monitoring

Video

Started with Ganglia, Pingdom

Deployed Graphite, single box

Second Graphite architecture - Load Balancer, 2x relay servers, multiple cache/web boxes etc.

Suffered lots of UDP packet receive errors

Put statsd everywhere

  • fixed packet loss, unique metric names per host
  • latency only per host, too many metrics

Sharded statsd

  • metrics no longer unique per host
  • shard mapping lives in the client, so the client version needs to be the same everywhere (see the sketch below)
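
One plausible way the client-side mapping works (a sketch, not Pinterest’s actual implementation - the shard list is made up):

    # Every client must use the same shard list and the same hash function,
    # which is why the client version has to match everywhere.
    import zlib

    STATSD_SHARDS = ["statsd-01:8125", "statsd-02:8125", "statsd-03:8125"]

    def shard_for(metric_name):
        # Deterministic hash: the same metric always goes to the same statsd instance.
        return STATSD_SHARDS[zlib.crc32(metric_name.encode()) % len(STATSD_SHARDS)]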

Multiple graphite clusters - one per application (python/java)

More maintenance, more routing rules etc.

Problems with reads, multiple glob searches can be slow

Deployed OpenTSDB

Replace statsd

  • local metrics agent, kafka, storm - send to graphite/opentsdb

Metrics-agent:

  • interface for opentsdb and statsd
  • sends metrics to kafka
  • processed by storm

120k/sec graphite, 1.5m/sec opentsdb. no more graphite, move to opentsdb.

Created statsboard - integrates graphite and opentsdb for dashboards and alerts

Graphite User Education - underlying info about how metrics are collected, precision, aggregation etc.

Protect System from Clients

  • alert on unique metrics
  • block metrics using zookeeper and shared blacklist (created on fly)

Lessen Operational Overhead

  • more tools, more overhead
  • more monitoring systems, more monitoring of the monitoring system
  • removing a tool in prod is hard

Set expectations

  • data has a lifetime
  • not magical data warehouse tool that returns data instantly
  • not all metrics will be efficient

Summary:

  • match monitoring system to where the company is at
  • user education is key to scale tools organizationally
  • tools scale with the number of engineers, not users

Emily Nakashima, Bugsnag - What your javascript does when you’re not around

Video

Lots of app logic is moving to the frontend, so it’s running in the browser, not on backend servers

Toolkit:

  • capture load performance from browser, send to app server, use statsd + grafana & google analytics
  • capture uncaught exceptions in the browser, using their own product

Sorry, Javascript is just not that relevant in my line of work

Eron Nicholson and Noah Lorang, Basecamp - CHICKEN and WAFFLES: Identifying and Handling Malice

Slides | Video

Suffered DDoS and blackmail. 80 gigabits - DNS reflection, NTP reflection, SYN floods, ICMP flood

Defense and Mitigation:

  • DC partner filters for them
  • More 10G circuits and routers
  • Arrangements with vendors to provide emergency access and other mitigation tools

Experience got them serious about more subtle application level attacks:

  • vulnerability scanners
  • repeat slow page requests
  • brute force attempts
  • nefarious crawlers

What do we want from a defense system?

  1. Protection against application-level attacks
  2. Keep user access uninterrupted
  3. Take advantage of the data we have available
  4. Transparent in what gets blocked and why

Chicken: who is a real user and who is malicious?

Considered Machine Learning classification. Problems: really hard to get a good training set. Need to be able to explain why an IP was blocked.

Simpler approach:

  • Some behaviours are known to be from people up to no good: crawling for phpmyadmin, path traversal, repeated failed login attempts etc.
  • Request history gives a good idea of whether someone is a normal user, broken script or a malicious actor.
  • External indicators: geoip databases, badip dbs, facebook threat exchange

Removing simple things reduces noise. Every incoming request is scored, and a per-IP aggregate score is calculated based on return codes. An Exponentially Weighted Moving Average is created from that data. About 12% of IPs had a negative reputation.
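
A small sketch of per-IP reputation scoring with an EWMA (the status-code scores and the smoothing factor are illustrative, not Basecamp’s actual values):

    ALPHA = 0.1        # smoothing factor: higher reacts faster to recent requests
    reputation = {}    # ip -> EWMA of per-request scores

    def score(status_code):
        # Successful requests nudge reputation up, errors push it down.
        return 1.0 if status_code < 400 else -1.0

    def update(ip, status_code):
        previous = reputation.get(ip, 0.0)
        reputation[ip] = ALPHA * score(status_code) + (1 - ALPHA) * previous
        return reputation[ip]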

Scanning for blockable actions and scoring requests in near real-time using request byproducts.

Request logs, netflow data, threat exchange -> kafka -> request scoring, a scanner for known bad behaviour, and tools for manual evaluation.

Average IP reputation gives an early indicator to monitor for application level attack.

Provides list of good, bad, and dubious IPs.

Enforcement:

  • originally provided by iptables rules on haproxy hosts
  • then tried rule on loadbalancer
  • then tried null routing on routers
  • finally created waffles

Using BGP flowspec to send data from routers to waffles, which then decides what path to take: error, app or challenge. The waffles hosts live in a separate network with limited access.

Waffles is redis and nginx.

John Stanford, Solinea - Fake It Until You Make It

Video

Monitoring an openstack cluster, 1 controller and 6 compute nodes, collecting logs with heka and sending them to elasticsearch. Can I scale this up to a thousand nodes? How big can it get?

How do you go about figuring that out?

  • took 7 days of logs from lab, 25k messages/hr
  • number of logs coming from a node
  • number of logs coming from a component
  • 7 day message rate, look at histograms, identifies recurring outlier
  • message size, percentiles of payload size

What models look like what we’re doing for simulation? Add some random noise.

Flood process, monitoring everything, repeat until it breaks. System sustained 4k x 1k messages/sec, started to pause above that, but no messages were dropped.

Next steps:

  • find bottlenecks
  • improve the model

Tammy Butow, Dropbox - Database Monitoring at Dropbox

Slides | Video

Achieving any goal requires honest and regular monitoring of your progress.

Originally used nagios, thruk (web ui) and ganglia

Created own tool vortex in 2013

why create in house monitoring?

  • performance and reliability issues, number of metrics scaling fast

Created Vortex:

  • Time Series Database with dashboards, alerting, aggregation
  • Rich metric metadata, tag a metric with lots of attributes

Monthra: single way of scheduling and relaying metrics, discourage scheduling with cron

Service Metrics:

  • what are the durability and reliability goals? align monitoring to those goals
  • threads running / threads connected

Run a Monthly Metrics Review (great idea)

Dave Josephsen, Librato - 5 Lines I couldn’t draw

Video

  1. Making the cognitive leap to use monitoring tools to recognise system behaviour independent of alerting. There was a misapprehension about what monitoring was and whom it was for.

  2. Monitoring is not for alerting. Nobody owns monitoring. ‘Tape measure that I share with every engineer I work with’. Ops owns monitoring vs everyone owns monitoring. Monitoring is for asking questions.

  3. Complexity isolates. Effective monitoring gives you the things that allow you to ‘Cynefin’ - make things more familiar and knowable. Reduce complexity rather than embracing it. Monitoring can build bridges to help people understand things across boundaries.

  4. Effective monitoring can bring about cultural change, how people interact between each other.

  5. Repeated point 4

Jessie Frazelle, Google - Everything is broken

Video

Talked about problems with Software Engineering and Operations

Demonstrated how they monitored community and external maintainer PR statistics for Docker project.

James Fryman, Auth0 - Metrics are for Chumps - Understanding and overcoming the roadblocks to implementing instrumentation

Video

Story of implementation of instrumentation at Auth0

Wanted data-driven conversations. A metrics implementation had happened in the past, but was ripped out because it was not well understood and was thought to cause latency. That created aversions.

Make the case. Get buy-in.

To have decent conversations with someone you need to have metrics.

Pushback:

  • Not the most important feature - but it is!
  • Cannot start until we understand the data retention requirements - premature optimisation
  • We don’t run a SaaS - need to understand what your software is doing regardless

Make decisions based on knowledge, not intuition or luck.

Be opportunistic - success is 90% planning, 10% timing and luck. Find opportunities to accelerate efforts.

Needed to get something going fast - went for full service SaaS Datadog, but with common interfaces and shims to allow moving things in house later. Don’t delay, jump in and iterate.

Keep in sync with developers - change is difficult and there will be resistance, pay attention to feedback. Need to support interpretation of data.

Build out data flows, find potential choke points in the system, take a baseline measurement, check systems in isolation

Fix and Repair bottlenecks. Solved 3 major bottlenecks, went from 500 to 10k RPS.

Monitorama 2016 Portland Day 1

This year I was lucky enough to attend Monitorama in Portland. Thanks to Sohonet for sending me! I’d wanted to attend again since going to Berlin in 2013, because the quality of the talks is the highest I’ve seen in any conference that’s relevant to my interests. I wasn’t disappointed, it was awesome again.

Here are my notes from the conference:

Adrian Cockcroft, Battery Ventures - Monitoring Challenges

Slides | Video

This talk reflected on new trends and how things have changed since Adrian talked about monitoring “new rules” in 2014

What problems does monitoring address?

  • measuring business value (customer happiness, cost efficiency)

Why isn’t it solved?

  • Lots of change, each generation has different vendors and tools.
  • New vendors have new schemas, cost per node is much lower each generation so vendors get disrupted

Talked about serverless model - now monitorable entities only exist during execution. Leads to zipkin style distributed tracing inside the functions.

Current Monitoring Challenges:

  • There’s too much new stuff
  • Monitored entities are too ephemeral
  • Price disruption in compute resources - how can you make money from monitoring it?

 Greg Poirier, Opsee - Monitoring is Dead

References | Video

Greg gave a history and definition of monitoring, and argued that how we think about monitoring needs to change.

Historically monitoring is about taking a single thing in isolation and making assertions about it.

  • resource utilisation, process aliveness, system aliveness
  • thresholds
  • timeseries

Made a definition of monitoring:

Observability: A system is observable if you can determine the behaviour of the system based on its outputs

Behaviour: Manner in which a system acts

Outputs: Concrete results of its behaviours

Sensors: Emit data

Agents: Interpret data

Monitoring is the action of observing and checking the behaviour and outputs of a system and its components over time

Failures in distributed systems are now: responds too slowly, fails to respond.

Monitoring should now be about Service Level Objectives - can it respond in a certain time, handle a certain throughput, better health checks

We need to better Understand Problems (of distributed systems), and to Build better tools (event correlation particularly)

Nicole Forsgren, Chef - How Metrics Shape Your Culture

Slides | Video

Measurement is culture. Something to talk about, across silos/boundaries

Good ideas must be able to seek an objective test. Everyone must be able to experiment, learn and iterate. For innovation to flourish, measurement must rule. - Greg Linden

Data over opinions

You can’t improve what you don’t measure. Always measure things that matter. That which is measured gets managed. If you capture only one metric, you know what will be gamed.

Metrics inform incentives, shape behaviour:

  • Give meaningful names
  • Define well
  • Communicate them across boundaries

Cory Watson, Stripe - Building a Culture of Observability at Stripe

Video

To create a culture of observability, how can we get others to agree and work toward it?

Where to begin? Spend time with the tools, improve if possible, replace if not, leverage past knowledge of teams

Empathy - People are busy, doing best with what they have, help people be great at their jobs

Nemawashi - Move slowly. Lay a foundation and gather feedback. (Write down and attribute feedback). Ask how you can improve.

Identify Power Users - Find interested parties, give them what they need, empower them to help others

What are you improving? How do you measure it?

Get started. Be willing to do the work, shave the preposterous line of yaks. Strike when opportunities arise (incidents). Stigmergy - how uncoordinated systems work together.

Advertise - promote accomplishments, and accomplishment of others.

Alerts with context - link to info, runbooks etc. Get feedback on alerts, was it useful?

Start small, seek feedback, think about your value, measure effectiveness

Kelsey Hightower, Google - /healthz

Video

Kelsey gave a demo of the /healthz pattern, and how that can protect you from deploying non-functional software on a platform that can leverage internal health checks.

Stop reverse engineering apps and start monitoring from the inside

Move health/db checks and functional/smoke tests inside app, and expose over a HTTP endpoint
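
A minimal sketch of the pattern using only the Python standard library; check_database() stands in for whatever internal checks the app owns:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def check_database():
        # Hypothetical internal check; replace with a real connection test.
        return True

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != '/healthz':
                self.send_error(404)
                return
            healthy = check_database()
            body = json.dumps({'database': 'ok' if healthy else 'failing'}).encode()
            self.send_response(200 if healthy else 503)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(body)

    if __name__ == '__main__':
        HTTPServer(('', 8080), Handler).serve_forever()

The platform’s health checker (or a load balancer) can poll the endpoint and refuse to route traffic to instances that don’t return a 200.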

Ops need to move closer to the application.

 Brian Smith, Facebook - The Art of Performance Monitoring

Video

Gave an overview of some of the guiding ideas behind monitoring at facebook

Bad stuff:

  • High Cardinality - same notifications for 100x machines
  • Reactive Alarms - alarms which are no longer relevant
  • Tool Fatigue - too few/too many

It can be Mechanical, Simple and Obvious to do these things at the time. But the cumulative effect is something that’s hard to maintain.

Properties of Good Alarms:

  • Signal
  • Actionability
  • Relevancy

Your Dashboards are a debugger - metrics are debugger in production.

Caitie McCaffrey, Twitter - Tackling Alert Fatigue

Slides | References | Video

When alerts are more often false than true, people become desensitised to alerts.

Unhappy customers are the result, but the alerts are also unplanned work, and a distraction from focusing on your core business.

Same problem experienced by nurses responding to alarms in hospitals. What they have done:

  • Increase thresholds
  • Only crisis alarms would emit audible alerts
  • Nursing staff required to tune false positive alerts

What Caitie’s team did:

  • Runbook and alert audits - ensure there are runbooks for alerts, templated, a single page for all alerts; each alert has customer impact and remediation steps. Importantly, it also includes notification steps.

  • Empower oncall - tune alert thresholds, delete alerts or re-time them (only alert during business hours)

  • Weekly on-call retro - handoff ongoing issues, review alerts, schedule work to improve on-call

This resulted in fewer alerts, and improved visibility on systems that alert a lot.

To prevent alert fatigue:

  • Critical alerts need to be actionable
  • Do not alert on machine specific metrics
  • Tech lead or Eng manager should be on call

Mark Imbriaco, Operable - Human Scale Systems

Video

It’s common to say now that “Tools don’t matter” … but they do. We sweat the details of our tools because they matter. All software is horrible.

We operate in a complex Socio-Technical System. Human practitioners are the adaptable element of complex systems.

Make sure you think about the interface and interactions (human - software interactions)

  • Think about the intent, what problem are you likely to be solving (use cases)
  • Consistency is really important
  • Will it blend - how does it interact with other systems
  • Consider state of mind - high intensity situations/ tired operators

 Sarah Hagan, Redfin - Going for Brokerage: People analytics at Redfin

Video

Redfin is an online Estate Agency with agents on the ground

Monitoring hiring

  • Capture lots of data on the market
  • Where should we move?
  • How many staff should we have in each location?
  • Useful tooling for the audience
  • Hire employees rather than contractors, analyse sold house price data to make sure employees earn enough vs. commission agents

Monitoring employees

  • Customer reviews for agents
  • Agents paid based on rating
  • Let the customer monitor the business
  • Monitor loading capacity of agents

Monitoring culture

  • Internal forums for feedback on tooling.

Pete Cheslock, Threat Stack - Everything @obfuscurity Taught Me About Monitoring

Slides | Video

Told the story of his history of learning about monitoring, and how he has approached monitoring problems at his current startup.

A telemetry and alerting system is not a core competency.

  • Do simple things early when it makes sense (put metrics in logs).
  • When it’s necessary to get more data - just buy something.
  • Hosted TSDB is useful and just works, but there are faster, non-durable metrics which are important. So he used graphite for 10s-interval metrics, with 2 collectd processes writing to two outputs
  • Ended up with a full graphite deployment

Operability Day Two

Day one is available here

Charity Majors, Parse/Facebook - Building a world class ops team

Slides

This was a talk focusing on bootstrapping an ops team for startups.

Do you need an ops team?

Ops engineering at scale is a specialised skillset. It is not someone to do all the annoying parts of running systems. Or do you need software engineers to get better at ops?

You need an ops team if you have hard problems:

  • extreme reliability
  • extreme scalability (3x-10x year over year)
  • extreme security
  • solving operational problems for the whole internet

What makes a good startup ops hire?

It’s not possible to hire people who are good at everything - unicorns. What you can get are engineers who are good at some things and bad at others. People who can learn on the fly are valuable.

“A good operations engineer is broadly literate and can go deep on at least one or two areas”

Great ops engineers:

  • strong automation instincts
  • ownership over their systems
  • strong opinions, weakly held
  • simplify
  • excellent communication skills, calm in a crisis
  • value process (as that is what stops you making the same mistakes over again)
  • empathy

Things that aren’t good indicators:

  • whiteboarding code
  • any particular technology or language
  • any particular degree
  • big company pedigree

Succeeds at a big company:

  • structured roadmap
  • execute well on small coherent slices
  • classical cs backgrounds
  • value cleanliness & correctness
  • technical depth

Succeeds at startup:

  • comfortable with chaos
  • knows when to solve 80% and move on
  • total responsibility for outcomes
  • good judgement
  • highly reactive
  • technical breadth

How do you interview and sort for these qualities?

Don’t hire for lack of weaknesses. Figure out what strengths you really need and hire for those.

Good questions:

  • leading and broad, probe the candidates self reported strengths
  • related to your real problems
  • ask culture questions, screen for learned helplessness

Bad questions:

  • depend on a specific technology
  • designed to trip them up, looking for a reason to say no
  • deny candidates the resources they would use to solve something in the real world

You hired an ops engineer, now what?

How to spot a bad ops engineer:

  • tweaking indefinitely and pointlessly
  • walling off production from developers
  • adding complexity
  • won’t admit they don’t know things
  • disconnected from customer experience

How to lose good ops engineers:

  • all the responsibility, none of the authority
  • all the tedious shitwork
  • blameful culture
  • no interesting operational problems

David Mytton, Server Density - Human Ops - Scaling teams and handling incidents

Slides

This talk covered how incidents are handled at Server Density.

We should expect downtime - prepare, respond, postmortem.

Prepare

Things that need to be in place before an incident

  • on call schedule with primary & secondary
  • off call - 24hr recovery after overnight incident
  • docs, and must be located independently from primary infrastructure
  • key info must be available: team contacts, vendor contacts, credentials
  • plan for unexpected situations: loss of communication, loss of internet access
  • use war games to practise for incidents

Respond

Process to follow during an incident:

  • First responder
    • load incident response checklist
    • log into ops war room
    • log incident in jira
  • Follow Checklist(s)
    • due to complexity
    • easy to follow in times of stress and fatigue
    • take a beginner’s mind - ego can get in the way, don’t wing it
  • Key Principles
    • log everything (all commands run, by who and where and what the result was)
    • communicate frequently
    • gather the whole team for major incidents

Postmortem

  • Do within a few days
  • Tell the story of what happened - from your logs
  • Cover the appropriate technical detail
  • What failed, and why? How is it going to be fixed?

Emma Jane Hogbin Westby, Trillium Consultancy - Emphatically Empathetic

Emma talked about how she taught herself to be more empathetic.

It’s normal to have a lack of empathy; it’s a skill that can be practised and learned.

What is empathy: ability to understand the feelings of another

Level 1: Care just enough to learn about a person’s life

Doing this improves team cohesion, but requires a time investment.

Collect stories - learn about people by asking them questions. Shut up and listen. Respond in a way to encourage more info gathering.

Later, refer back to stories and follow up for more information.

Level 2: Strategies to structure interactions

Doing this you can engineer successful outcomes, and improve capacity for diverse thinking. But you risk being perceived as manipulative.

It’s a mistake to believe there is only one way to have a connection. Try to uncover motivation, why do people behave the way they do?

There are three types of thinking strategies, and you’ll see language patterns that match each of them.

creative thinking: ‘can we try…’, ‘what about…’

understanding thinking: ‘so what you’re saying is…’, ‘just to clarify…’

decision thinking: ‘I’m ready to move on to…’, ‘last time we tried this…’

How can you create outcome based interactions for these sort of people? Perhaps you can plan for specific types of discussions in meeting agendas.

Find a system to use with your team to make communications more explicit, and to take advantage of the thinking strategies they use.

Level 3, engage with work from another’s perspective

This can foster creative problem solving. The risks are it is potentially overwhelming, and can cause doubt for self worth.

Seek to understand - complain about yourself from the other’s perspective, or situation. Live your day through the other’s constraints.

The thinking process should be no more left to chance than the delivery practice of a skill.


There was an interesting question after Emma’s talk - “How do I make Bob care about Dave from another team”. Her suggestion was to create a situation where they can bond over a common enemy - i.e. say something you know to be untrue and that they would both respond to in a similar way.

Scott Klein, statuspage.io - Effective Incident Communication

Remember that there is someone on the other end of our incidents who is affected personally.

The talk covered what to do before, during and after an incident. You need a dedicated place to communicate system status to your users.

Before

Get a status page. It needs the following:

  • timestamps
  • to be very fast, very reliable
  • keep away from primary infrastructure, even DNS
  • contact info - give a way to get in touch

During

  • communicate early. say you are investigating - it means ‘we have no clue but at least we’re not asleep’
  • communicate often. always communicate when the next update is.
  • communicate precisely: be very declarative
    • don’t do ETAs, they will disappoint people
    • don’t speculate: ‘we’re still tracking down the cause’
    • ‘verification of the fix is underway’, not ‘we think we fixed it’
  • communicate together. have pre-written templates.
  • one person needs to be assigned as incident communicator

After

  • apologise first
  • don’t name names
  • be personal. “I’m very sorry”. Take responsibility.
  • details inspire confidence
  • close the loop - what we’re doing about it

Why do this?

  • gain trust with users/customers
  • turn bad experience into good experience
  • service recovery paradox - people think more highly of a company if they respond properly.
  • show that you do your job well

Rich Archbold, Intercom.io - Leading a Team with Values

Talk covered Rich’s experience of introducing core values to drive performance of the team. They reduced downtime and infrastructure costs, and number of ops pages.

Enabled autonomy, distributed decision making.

Problems they were facing:

  • roadmap randomisation: easily distracted from what they planned to do
  • projects took a long time and were delivered late
  • not feeling like a tribe

Criteria for values:

  • fit with the business
  • personal, specific
  • aspirational and inspirational
  • drive daily decision making
  • not dogma - needed to be flexible

These are the Values they came up with:

  1. Security, Availability, Performance, Scalability, Cost - prioritize for maximum impact
  2. Faster, Safer, Easier, Shipping
  3. Zero Touch Ops
  4. Run Less Software

Afterwards they gathered lots of metrics of unplanned work. From this they worked out that they need to multiply estimates by 2.7 to get accurate roadmap planning.

Matthew Skelton, Skelton Thatcher Consulting - Un-Broken Logging, the foundation of software operability

Slides

The way we use logging is broken, how to make it more awesome

What is logging for? It provides an execution trace.

How is logging usually broken? It’s often unloved, discontinuous, contains errors only, bolted on, doesn’t have aggregation and search, and severities aren’t useful because they need to be determined up front.

Also, logs aren’t free. You need to allocate budget and time to make them useful.

Why do we log? For verification, traceability, accountability, and charting the waters.

How to make logging awesome

Continuous Event IDs - use them to represent distinct states. Describe what’s useful for the team to know and describe that as a separate state. Use enums.

Transaction tracing - Create a unique-ish identifier for each request, and pass it through the layers.

Decouple Severity - allow configurable severity levels. Log level should not be fixed at compile or build time. Map Event IDs to a severity.
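
A sketch of how those three ideas might fit together in Python (the event names, IDs and severities are made up for illustration):

    import enum
    import logging
    import uuid

    class Event(enum.Enum):
        ORDER_RECEIVED = 1001
        PAYMENT_DECLINED = 1002
        ORDER_SHIPPED = 1003

    # Severity is looked up at runtime, not fixed at the call site,
    # so the mapping can be reconfigured without changing the code.
    SEVERITY = {
        Event.ORDER_RECEIVED: logging.INFO,
        Event.PAYMENT_DECLINED: logging.WARNING,
        Event.ORDER_SHIPPED: logging.INFO,
    }

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("shop")

    def log_event(event, transaction_id, **fields):
        # The transaction id ties together the entries for one request across layers.
        log.log(SEVERITY[event], "%s tx=%s %s", event.name, transaction_id, fields)

    tx = uuid.uuid4().hex          # unique-ish identifier created at the edge
    log_event(Event.ORDER_RECEIVED, tx, items=3)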

Log aggregation and search tools - As we move from monoliths to microservices, the debugger does not have the full view anymore. Need an aggregated view of logs across a system. Develop software using log aggregation as a first-class thing.

Design for logging - logging is another system component, and needs to be testable.

NTP - Time sync is crucially important for correlating log entries

Referenced the following video:

Evan Phoenix - Structured Logging

Gareth Rushgrove, Puppet Labs - Taking the Operating System out of operations

Slides

The age of the general purpose operating system is over. What does this mean for operators?

Lots of new OSes have appeared in the last year

New Breed:

  • Atomic (RedHat)
  • CoreOS
  • Snappy (Ubuntu, replaces dpkg with containers)
  • RancherOS (docker all the way down)
  • Nano - tiny alternative to Windows Server
  • VMWare Bonneville

Common themes:

  • Cluster native
  • RO file systems
  • Transactional updates
  • Integrated with containers

Why the interest in New OSes?

  • Lots of homogeneous workloads
  • Security is front page news
  • Size as a proxy for complexity
  • Utilisation matters at scale
  • Increasingly interacting with higher level abstractions anyway

Unikernels

Compile an application down to a kernel, there is no userspace. Only include the capabilities and libraries you need - everything is opt-in.

  • Hypervisor/hardware isolation
  • Smaller attack surface
  • Less running code
  • Enforced immutability
  • No default remote access

What happens to operators?

Hypervisor becomes the “platform”.

Everything else as an application. Firewalls. Network Switches. IDS. Remote access.

Everyone not running the hypervisor is an application developer. Standards required: Platforms, Containers, Monitoring. Publish more schemas than incompatible implementations in code.

Infrastructure is code.

Revolution not evolution. Distance between old infrastructure and new will be huge. Models of interaction and the skills required to operate.

Conclusions:

We have fundamental problems that date back more than 40 years. It might take a different evolutionary process to build better infrastructure. We may have to throw away things we care about, such as Linux. This is all driven by security concerns.

Ben Hughes, Etsy - Security for Non-Unicorns

Slides

Security is hard. Tiny little bugs turn into giant things.

You’re already being probed for security holes, do you want to know or not? Bug Bounties are a way of getting attackers working for you.

You need to prepare a lot for bug bounties. Try and get all the low hanging fruit yourself. The first few weeks will be hell.

With much of our infrastructure in the cloud, it’s easy to expose sensitive information, such as credentials, on places like github. Gitrob helps to analyse git repos for you.

People trust random files off the internet - like docker images, vagrant images, and curl|bash installs etc.

Operability Day One

Today I’ve been at the operability.io conference in London.

The conference is intended to be focused on the ops side of devops, and how to make software operable.

This is a single track conference, here is my personal summary and notes from the talks:

Andrew Clay Shafer, Pivotal - What even is operable?

Video

I’ve seen a few talks by Andrew before, and they’re full of challenging, rapid fire ideas and only loosely tied together - it’s more an expression of his world view rather than a talk on a specific subject. With that in mind, I’m still going to try and summarise it.

Andrew asks the question “what even is operable”, and in the end he came to the conclusion that Operability is the intersection of capability and usability.

There is an emergent architecture, which he calls cloud native, which are a set of patterns that emerged in organisations that deliver highly available applications continuously at scale - like Amazon, Google, Twitter, Facebook etc. There are many associated labels, like devops, continuous delivery and microservices, and these are all inter-related as part of that architecture.

“Do not seek to follow in the footsteps of the wise. Seek what they sought.” - Matsuo Basho

The human tendency is to fixate on the solution. We need to think about the problem more. Principles > Practices > Tools. You’re not going to be able to do the right things until you internalise the principles. Otherwise you are just imitating or cargo culting.

Equally, we can focus on automation tools and capabilities, not what is being automated.

Other pertinent points:

  • if Tetris has taught me anything, it’s that errors pile up and accomplishments disappear
  • systems thinking teaches that we should minimise resistance rather than push harder
  • highlighted the Borg paper, and that all tasks run in Borg run an HTTP server that reports health status and metrics
  • operations problems become easier when apps are aware of their own health
  • worse is better won: broken gets fixed but shitty lasts forever

Referenced the following for reading:

Colin Humphreys, CloudCredo - inoperability.io

Colin told a war story about what happens when you completely ignore operations - what’s the worst that could happen?

A game he worked on had 5 million users and “Launch issues”. It hardly worked, because they expected 20k users and ran initially with no budget for operations.

There was a big launch, but the infrastructure for the system was built the day before and was completely untested. It was swamped in 10 seconds. 1.7 million people tried to play on the first day, but couldn’t. There were various media reports of the failure.

As a result, they managed to steal some funding from the marketing budget and scale 100x, add caching, and throw lots of hardware at the problem. To use the 64 DB servers, they used a sharded PHP ORM to distribute API server traffic.

He did get it working, but in time they found transactions weren’t working; this was supposed to be addressed within the application, but it wasn’t because of differences between the development and production environments, and as a result cash was disappearing from the game. Everyone’s scores had to be reset.

Personally, Colin worked 54 hours solid to get the thing running. For the three months the site ran, he worked 100hr/weeks.

This is how bad a project can get. “I nearly died”. He must certainly have burned out. Took a personal sense of pride in the project. Horrible to other people around. Heroism != success.

Takeaways:

  • Work as a team
  • Communicate

Anthony Eden, DNSimple - How small teams accomplish big things

Slides | Video

Anthony’s talk was about scaling a team, and how they had to scale their operational processes - as a result of various experiences, but none more so than losing one of the founders, someone who knew everything about everything.

Specific Processes:

Incident Response Plan, consisting of 4 points: Assess, Communicate, Respond, Document.

Assess - think before you act. Determine impact. Have a threshold for requesting help. >10 mins results in everyone in a Google Hangout.

Communicate - 1 person responsible to communicate to customers. Update twitter, status page.

Respond - minimise impact. Triage first. Everyone asks questions and proposes solutions. Consider the available actions and act on the best available.

Document - post mortem. What happened? Why did it happen? How did we respond and recover? How might we prevent similar issues from occurring again?

Other processes were mentioned: On-call rotation. Security Policies. Security escalation policy. System security, RBAC. Track CVEs. Password rotation - especially on admin passwords. etc.

What makes a good process?

  • Born from Experience
  • Written down and available to all
  • Concise & Clear
  • Act as guidance

Once they’re in place you need to:

Execute it, to test it out. Prune where appropriate. Automate where that makes sense.

Bridget Kromhaut, Pivotal - distributed: of systems and teams

Slides | Video

Bridget’s talk compared how distributed systems are complex, but so are distributed teams in many of the same ways.

Firstly, distributed != remote. Having a few people out of the office is not the same.

What’s important in teams is people > tools. Focusing on the people and what they’re communicating is more important than the tools and how.

She made various points which are important for distributed teams:

Durable communication encourage honesty, transparency and helps future you - “durable communication exhibits the same characteristics as accidental convenient communication in a co-located space. The powerful difference is how inclusive and transparent it is.” - Casey West

  • Let your team know when you’ll be unavailable.
  • Tell the team what you’re doing.
  • Misunderstandings are easy, so you need to over-communicate. Especially to express emotions - it’s easy to misinterpret textual communication.
  • Be explicit about decisions you’re making.
  • Distribute decision making

Colin Hemmings, dataloop.io - In god we trust, all else bring data

This talk was about the experience of building dataloop as a startup, drawing on work at other companies and on what was learned speaking to 60 companies about monitoring, and it focused on dashboards.

Generally see the following kinds of dashboards:

  • Analytics dashboards, to diagnose performance issues. Low level, detailed info.
  • NOC dashboard, high level overview of services.
  • Team dashboards, overview of everything not just technical elements - includes business metrics.
  • Public dashboards, high level, simplified & sanitised marketing exercises.

Keeping people in touch of reality is the problem, and knowing what the right thing to work on. Discussions on what features to work on get opinionated. There is data within our applications that we can use to make decisions, and they can be represented on dashboards.

  • Stability dashboards: general performance, known trouble spots.
  • Feature dashboards: customer driven, features forum. “I suggest you…” & voting
  • Release dashboards: dashboards for monitoring the continuous delivery pipeline

Elik Eizenberg, BigPanda - Alert Correlation in modern production environments

My personal favourite talk of the first day.

Elik’s contention is that there is a lack of automation in regard to responding to alerts.

Incidents are composed of many distinct symptoms, but monitoring tools don’t correlate alerts on those symptoms for us into a single incident.

The number of alerts received might not be proportional to number of incidents. The number of incidents experienced may be similar day to day, but the impact of the incidents can be very different - hence the number of alerts being higher sometimes.

Existing Approaches:

  • compound metrics (i.e. aggregations, or a compound metric built from many hosts). This is relatively effective, but alerts are received late, and you can miss symptoms in the buildup to the alert triggering.
  • service hierarchy, i.e. hierarchy of dependencies related to a service. Problem with this approach is that it’s hard to create & manage. Applications and their dependencies are generally not hierarchical.

“What I would like to advocate”: Stateful Alert Correlation

Alerts with a sense of time, aware of what happened before now.

Alerts can create a new incident, or link themselves to an existing incident if it’s determined they’re related.

How do you know that an alert belongs to an incident?

  • Topology - some tag that every alert has. (Service? Datacenter? Role?)
  • Time - alerts occurring close in time to another alert
  • Modeling - learning if multiple alerts tend to fire within a short timeframe
  • Training - Machine Learning. User feedback if correlation good or bad.

Even basic heuristics are effective. There is lots of value to be gained from just applying Topology and Time.
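
A toy sketch of just the topology-plus-time heuristic (the ten-minute window and the tag name are arbitrary):

    import time

    WINDOW = 600  # seconds: how close in time alerts must be to be grouped

    incidents = []  # each incident: {'topology': ..., 'last_seen': ..., 'alerts': [...]}

    def correlate(alert):
        # Attach the alert to an open incident that shares a topology tag and was
        # last updated within the window; otherwise open a new incident.
        for incident in incidents:
            if (alert['topology'] == incident['topology']
                    and alert['time'] - incident['last_seen'] <= WINDOW):
                incident['alerts'].append(alert)
                incident['last_seen'] = alert['time']
                return incident
        incident = {'topology': alert['topology'],
                    'last_seen': alert['time'],
                    'alerts': [alert]}
        incidents.append(incident)
        return incident

    correlate({'topology': 'checkout-service', 'time': time.time(), 'msg': 'high latency'})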


There were two other talks after this, but I had to leave early. Sorry to the presenters for missing their talks!

The Path to Peer Review

As a sysadmin, I’ve never been part of a team that did peer review well. When you’re doing a piece of work and you want to check you’re doing the right things, how do you get feedback?

Do you send an email out saying “can you look at this?”

Do you have someone look over your shoulder at your monitor?

Do you have discussions about what you’re going to do?

Sometimes I’ve done these things, and they’ve been partly effective - but usually they just get a reply of “Looks good to me!”

Most of the sysadmins I’ve worked with will just make a change because they want to, or were asked to, and don’t tell anyone. Or they want to be heroes who surprise everyone by sweeping in with an amazing solution that they’ve been working on secretly.

I don’t want to work where people are trying to perform heroics. I hate surprises. Having done all those individualistic stupid things myself, I want to work in a team where we work together on problems out in the open. When you involve others, they are more engaged and feel part of the decision making process. And their feedback makes you produce better work.

My current role has the best culture of reviewing work that I’ve experienced. But we had to create it for ourselves, and this is what we did.

We put all our work in version control

When I started this role, the first project I worked on was to move our configs into git. Most of those configs are stored on shared NFS volumes which are available on all hosts. Previously people made config changes by making a copy of the file they wanted to change and then made the change to the copy. Once that is ready they would take a backup of the production copy of the file, and copy their new version into place.

Importing the files into repos was generally straightforward, but sometimes there was automatically generated content that needed excluding in .gitignore. To deploy the repo we added a post-update hook on the git server that would run git pull from the correct network path.

But we also wanted to catch when people would change things outside of git and do so without overwriting their changes. This would allow us to identify who was changing things and make sure they knew the new git based process. To do this we added a pre-receive hook on the server that would run git status in the destination path and look for changes.
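
Roughly, the idea behind those two hooks looks like the sketch below (not our actual scripts - the deploy path is a placeholder, and since git hooks can be any executable a Python version works for illustration):

    #!/usr/bin/env python3
    import subprocess
    import sys

    DEPLOY_PATH = "/mnt/nfs/configs/example-repo"  # hypothetical network path

    def pre_receive():
        # Catch files edited in the deploy path outside of git, so the deploy
        # doesn't silently overwrite someone's out-of-band change.
        status = subprocess.run(["git", "status", "--porcelain"],
                                cwd=DEPLOY_PATH, capture_output=True, text=True)
        if status.stdout.strip():
            print("Uncommitted changes found in %s - push rejected" % DEPLOY_PATH)
            sys.exit(1)

    def post_update():
        # Deploy by pulling the new commits into the shared NFS checkout.
        subprocess.run(["git", "pull"], cwd=DEPLOY_PATH, check=True)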

That was the first step in changing the way we worked. It wasn’t a big change, but it got everyone fairly comfortable with using git. The synchronous nature of the deployment was a big plus too, because all the feedback about success or failure would happen in your shell after running git push

Then we generated notifications

This was great progress, and we got more and more of our configs into this system over time. The next thing we wanted to do was to produce notifications about what’s changing. This would allow us to catch mistakes, find bad changes that broke things, and to understand who is making changes to what systems.

We did two things to achieve this - we made all these repos send email changelogs and diffs whenever someone pushed a change, and we created an IM bot that would publish details of changes into a group chat room.

This was great for producing an audit trail of changes happening to the system. But whenever you saw a change break something, in hindsight you thought - well, I knew that change would break things, I could have stopped that before it was deployed.

Finally, we added Pull Requests

We knew we wanted to implement Pull Requests but we couldn’t do this with the git server software we were originally using.

Since we had started using Confluence and JIRA for our intranet and issue tracking, we moved all our git repos to Stash, which is Atlassian’s github-like git server. This provided us with the functionality we wanted.

On a single repo, we enabled a Pull Request workflow - no one could commit to master any more and everyone had to create a new branch and raise a PR for the changes they wanted merged.

We wrote up some documentation on how to use git branching and raise PRs via the stash web interface, and explained to everyone how it all worked.

We chose a repo that was relatively new, so the workflow was part of the process of learning to work with the new software. It was also one that everyone in the team would have to work with, so that everyone was exposed to the workflow as soon as possible.

To deploy the approved changes, we used the Jenkins plugin for Stash to notify Jenkins when there were changes to the repository. Jenkins then ran a job that did the same thing as our previous post-update hooks - ran git pull in the correct network location.

Running the deployment asynchronously like this felt like we were losing something - if there was a problem you were sent an email & IM from Jenkins, but this felt less urgent than a bunch of red text in your terminal. But for the benefit of review, this was a price worth paying.

In the interim we had moved the checks for changes being made outside of git into our alerting system, so we could catch them earlier than when the next person went to make a change. This meant we didn’t have to implement these checks as part of the git workflow - if there was still a problem we let the hook job fail, and it could be re-run from Jenkins once the cause was resolved.

Over time, we moved all the other repositories across to this workflow, starting with the repos with the highest number of risky changes. But for a number of repos we kept the old workflow with synchronous deployment hooks, because all the changes being made were low risk and well-established practice.

Initially, slowing down is hard

The hardest thing to adapt to was changing the perceived pace that people were working. Everyone was very used to being able to make the change they wanted immediately, and close that issue in JIRA straight away. That’s how we judged how long a piece of work takes, but that doesn’t take into account the time spent troubleshooting and doing unplanned work.

What we were doing was moving more of the work up front, where you can fix problems with less disruption. But making that adjustment to the way you work can be really hard because you perceive the process to be slower.

Everyone in our team is a good sport and willing to give things a go - but as much as we could try to explain the benefit of slowing down, you only realise how much better it is by doing it over and over, and that takes time.

Sometimes it’s necessary to move faster, and we manage that simply - if someone needs a review now, they ask someone and explain why it’s urgent. As a reviewer, if I’m busy I’ll get people to prioritise their requests by asking “I’m probably not going to be able to review everything in my queue today. Is there anything that needs looking at now?”

In time the benefits are demonstrated

By using Pull Requests we created an asynchronous feedback system. First you propose what you’re going to do in JIRA. Then you implement it how you think it should be done and create a PR. Then when a reviewer is available they’ll provide feedback or approve the change. You keep making updates to your PR and JIRA until the change is approved or declined.

With time, everyone experienced all of the following benefits of that feedback:

Catch mistakes before they are deployed

This was what we set out to do! There were breaking changes made before, and there are less of them now.

Learn about dependencies between components

A common type of feedback is “How is this change going to affect X?” - sometimes the requestor has already considered that and can explain the impact and any steps they took to deal with X. But if they haven’t considered it, they need to research it. That way they learn more about how things are connected and have a greater appreciation of how the system works as a whole.

Enforce consistent style and approaches

Everyone has their own preferred style of text editing and coding. With a PR we can say this is the style we want to use, and enforce it. The tidier you keep your configs and code, the more respect others will have for them.

It’s massively helpful to be told about an existing function that achieves what you’re trying to do, or there’s existing examples of approaches to solving a problem. This can help you learn better techniques, and avoid duplicating code.

Identify risky changes

With changes to fragile systems, you’re never confident about hitting deploy even if the change looks good. Until it’s in production and put through its paces there is risk. So this has allowed us to schedule deploying changes for times when the impact will be lower, or to deploy the change to a subset of users.

It’s also stopped “push and run” scenarios - we avoid merges after 5pm, they can always wait for tomorrow morning!

Explain what you’re trying to do

Much of the review process is not about identifying problems with configs and code, but simply being aware of and understanding the changes that are taking place. This is invaluable to me as a reviewer.

So, when raising a Pull Request, it’s expected that each commit has a reference to the JIRA issue related to the change. The Pull Request can have a comment about what is changing and why, and that can be very helpful but the explanation of why the change is taking place must be in the JIRA.

This way, when looking back at the change history we can also reference the motivations for making that change and see the bigger picture beyond the commit message.

People want to have their work reviewed

Somewhere along the way, getting your work reviewed became desirable. You realise the earlier you put your work out there for review, the earlier you get feedback and the less likely you are to spend time doing the wrong thing.

We still have a number of repos that anyone can change just by pushing to master; there are no controls because these changes are considered safe. But people choose to create branches and PRs for their changes to these repos, because they want to have their work reviewed.

Change takes time…

Going from no version control to making this cultural change in the way we approached our work took about 2.5 years.

Throughout this period there was no grand plan or defined scenario that we were trying to achieve. At each stage, we could only see a short way forward. We had some things we were looking to do better, so we experimented. When we found what worked, we made sure everyone kept doing it that way.

At the start, I had no idea that peer review of sysadmin work would be done via code review. As we moved things into git and saw the benefits, we wanted to manage everything the same way, and that drives moving more things into git.

We moved slowly because there was other work going on and we needed to get people comfortable with the new way of working and tooling before we could ask more of them. Change takes time, and many of our team have benefitted from seeing that incremental process take place.

It’s been gratifying that the newest team members who have joined since we established PRs have said “I was uncertain about it at first, but now I get it. It really works.”

… and involves lots of painstaking work

To get to where we are has meant spending a lot of time setting up new repositories, separating files that are managed by humans from those that are automatically generated, migrating repositories to Stash, creating deploy hooks, explaining how git works, and, most of all, making sure reviews are useful and happen regularly.

It’s one thing to start setting up a couple of repos like this – but to fully establish the change you need to do lots of boring work to make sure everything is migrated, even the more difficult cases. It’s important that everything is managed consistently, within one system.

Monitorama Roundup Part 2

Part one is available here

During the second day there were multiple tracks available - I mostly followed the workshop track, only catching one presentation on the speaker track.

Florian Forster, collectd - Collecting custom metrics

Slides

Described the collectd data model and how to use the Exec plugin to execute arbitrary scripts for collecting custom metrics.

Also presented a statsd plugin - an implementation of the statsd network protocol inside collectd.

Abe Stanway - Kale (Skyline and Oculus)

Slides

Presentation of the architecture for the Kale suite of tools recently released by Etsy.

You can find a better description on the Etsy blog.

Skyline - Analyse time series for anomalies in (almost) real time

The setup at Etsy includes 250k metrics, which takes about 70 seconds for anomaly discovery, and requires 64GB of memory to store 24 hours of metrics in memory.

  • carbon-relay is used to forward metrics to the horizon listener
  • metrics are stored in redis
  • data is stored in redis as messagepack (allows efficient binary array stream)
  • roomba runs intermittently to clean up old metrics from redis
  • the analyzer does its thing and writes info to disk as JSON for the web front end

To identify anomalies, skyline uses some of the techniques that Abe talked about in his presentation the previous day. It uses the consensus model, where a number of models are used and they vote - so if a majority of models detect an anomaly, then that is reported.

Oculus - analyse time series for correlation

Oculus figures out which time series are correlated using Euclidean distance - the difference between time series values.

It also uses Dynamic Time Warping - to allow for phase shifts, if the change in one series occurs later than in the other. But this is slow, so it’s targeted at the time series that could be correlated, identified by comparing shape descriptions.

  • data pulled from skyline redis and stored in elasticsearch
  • time series converted to shape description (limited number of keywords that describe the pattern)
  • phrase search done for shared shape description fingerprints
  • run dynamic time warping on that data

Pierre-Yves Ritschard - Riemann

Slides

Unfortunately this presentation was marred by a projector/slide deck failure which made it hard to follow. It was incredibly disappointing because there was a lot of positive discussion around Riemann and I was looking forward to a better exposition of the tool.

Riemann is an event stream processing tool.

  • All events sent to Riemann have a key-value data structure - for logs, metrics, etc.
  • You can use all your current collectors - collectd, logstash etc. - and it has an in-app statsd replacement
  • Those events can be manipulated in many ways, and sent to many outputs
  • The query language is Clojure, which is data driven and from the Lisp family
  • There is storage available for event correlation, but I didn’t fully understand it from his discussion

Devdas Bhagat - Big Graphite

Slides

This workshop covered how booking.com have scaled their graphite setup.

They have somewhere in the region of 5000 hosts, multiple terabytes of data stored in whisper.

Both IO and CPU have become bottlenecks and in each instance they have thrown hardware at the problem to run more agents and shard their data.

I/O problems:

  • Ran into IO wall, disks 100% writing. Lots of seeking
  • Have ended up using SSD drives in RAID0
  • Sharding becomes hard to maintain and balance
    • Don’t know in advance in which namespace metrics will be created
    • Rebalancing is tricky when adding more backends - they have a script that replicates the graphite hashing function, and they move things manually
  • Found SSDs not as reliable as spinning disk under high update conditions
    • Lots of drive failures, so replicate data to separate datacenters to provide availability
  • FusionIO gave no performance improvement over SSDs (I find this hard to believe)

CPU problems:

  • Relays start maxing out CPUs
    • Multiple relay hosts (per datacenter)
    • Multiple relays on each host (1 per core)
    • Use haproxy load balancer (prevent losing metrics)

Other Problems:

  • Software breaking on updates (whisper failure on upgrade)
  • People can use the software in surprising ways - storing data that isn’t time series, or only one record a day

David Goodlad - Infrastructure is Secondary

Slides | Video

David presented that your primary metrics should be your business metrics, and that your infrastructure metrics are secondary. Which is not to say infrastructure metrics aren’t important, but the business metrics are measures of how your business is performing and this is the data which you should be alerting on.

  • Alerts should be informational and actionable - not just “cool story, bro”
  • Consider what matters to customers - i.e. instead of measuring queue size, use time to process
  • When a business metric alerts, then correlate against infrastructure monitoring
  • Keep the infrastructure thresholds, but don’t alert on that information - you can access it when necessary

He gave a great one liner to help decide what you should be measuring - “What would get your boss fired?” - Measure these things deeply.

Also pushed the idea of sharing your information outside the team - to be more transparent and visible to the rest of the business, particularly since the information you hold is business metrics. This will provide feedback about which metrics are important to others.

Michael Gorsuch - Graph Automation

This workshop was a practical introduction to instrumentation and exploration - stepping through configuring StatsD, graphite and collectd to instrument an application and exploring graphs using descartes.

Monitorama Roundup Part 1

Over the past two days I’ve attended the Monitorama conference in Berlin. The conference covers a number of topics in Open Source monitoring, and reflects the cutting edge in technologies and approaches being taken to monitoring today. It also acts as a melting pot of ideas that get used by lots of people to take those ideas further, or act on them within their businesses.

There were a number of ideas that ran throughout many of the talks, which I considered the key takeaways from day one:

  • Alert fatigue is a big problem - Alert less, remove meaningless alerts, alert only on business need and add context to alerts.

  • Everything is an event stream - metrics, logs, whatever. Treat it all the same (and store it in the same place).

  • The majority of our metrics don’t fit standard distribution, which makes automated anomaly detection hard. So people are looking for models that fit our data to do anomaly detection.

  • Everyone loves airplane stories. I heard at least five, three of which ended in crashes.

The first day had a single speaker track, here is my personal summary and interpretation of all the talks:

Dylan Richard - Keynote

Video

This was an experience talk about the Obama re-election campaign, what they did that worked, what they wished they had etc.

Essentially they used every tool going, and used whatever made sense for the particular area they were looking at. But they weren’t able to look carefully at their alerting and were emitting vast numbers of alerts. To deal with this, they leveraged the power users of their applications, who would give feedback about things not working and then they’d dig into the alerts to track down the causes. To improve on that feedback, they also created custom dashboards for those power users so they could report more context with the problem they were experiencing.

Alert fatigue was mentioned - “Be careful about crying wolf with alerts” because obviously people turn off if there’s too much alerting noise, and subsequently “Monitoring is useless without being watched”.

Danese Cooper - Open Source Past Future

Video

Gave a presentation about the history of open source, where it stands today, and advocated participation in open source and the institutions in that space.

Abe Stanway, Etsy - Anomaly Detection with Algorithms

Slides | Video

Abe gave a great talk on the history of Statistical Process Analysis and how it is used in quality control on production lines and similar industrial settings. This is all about anomaly detection by looking for events that fall outside three standard deviations from the mean. But unfortunately this detection process only works on normally distributed data, and almost none of the data we collect is normally distributed. He followed this with a number of ideas for approaches to take to describe our data in an automated fashion - and then called on the audience to get involved in helping develop these models.

He also made an interesting comparison: current monitoring can give us either noisy situational awareness, or more limited feedback using predefined directives.

Mark McGranaghan, Heroku - Fewer Better Systems

Slides | Video

Presented the argument that the best systems are the ones that are used constantly - for example, failover secondaries often do not work because they have not been subjected to live running conditions and the associated maintenance. So he suggested a bunch of things that are done differently now, which could be done the same, so we get better at doing fewer things:

  • Metrics, Logs & Events - can all be considered events, so make them the same format as events
  • Metric Collection & Alerting - often collect the same data - collect once, and alert from your stored metrics
  • Integration testing and QOS monitoring - share a lot of same goals, so do them the same
  • Unlike results, errors get specific code for dealing with them - instead treat them as a result, but send data that represents the error

Katherine Daniels, Gamechanger - Staring at Graphs as-a-Service

Slides | Video

Gave a great practical talk about how monitoring systems fail, and the series of small decisions that get made to improve individual situations but make the whole system, and the operator’s experience, much, much worse. Essentially, everything that creates monitoring systems with screens full of red alerts that have been that way for a long time, with no sign of resolution.

As a way forward, she suggested the following:

  • Only monitor the key components of your business - remove all the crap
  • Find out what critical means to you - understand your priorities
  • Fix the infrastructure - start with a zero error baseline
  • Plan for monitoring earlier in the development cycle (aka devops ftw)

Lindsay Holmwood - Psychology of Alert Design

Slides | Video

Lindsay used a number of stories, from air disasters and hospitals, to present two key ideas for designing alerts:

  • Don’t startle or overload the operator (reduce notifications)
  • Don’t suggest, expose (provide more context - give relevant situational data at the same time)

Theo Schlossnagle - Monitoring what the hell?

Slides | Video

This talk covered a lot of ground quite quickly, so these were the key things I took away:

  • Monitor for failure by reviewing problems that you’ve identified before, creating detailed descriptions of those events, and only then alerting on those descriptions - so that an alert carries enough context to be meaningful and actionable to the receiver.
  • Alert on your business concerns
  • Store all your data in the same place - treat logs, events and metrics the same
  • Our data isn’t normally distributed and that makes shit hard. The next leaps in dealing with this data will come from outside Computer Science - more likely to be the hard science disciplines.

Michael Panchenko - Monitoring not just for numbers

Slides | Video

Most of this talk was about the problems of configuration drift, and how subtle differences of systems outside configuration management policy scope can yield big surprises.

Michael presented his dream of enhancing numerical monitoring with non-numerical and non-binary observations, with some suggestions:

  • Monitoring of infrastructure state
  • Providing an audit trail of categorical data
  • Being able to compare states across nodes and time

He also suggested describing activity in a standardised context: who, what, when.

Jarkko Laine - Let your data tell a Story

Slides | Blog | Video

This was a really interesting talk about how humans process ideas and information, how these lead to biases in analysis - and how to exploit them.

The talk was an exposition of two main ideas:

  • Attention and Memory are limited
  • Tell stories to engage the audience

This was condensed to two key directives for dashboard design and using visualisations as a tool for communication:

  • Minimise eye fixations
  • Maximise data-ink ratio

Ryan Smith - Predictable Failure

Slides | Video

This talk featured the best airplane (near-) disaster stories as Ryan was extremely enthusiastic about the content. These provided pertinent links to understanding failure in IT environments.

The most important was learning about failure modes from other users - official documentation is lacking in this respect, so you need to hear the war stories from your peers.

He also described a bunch of other failure cases, particularly related to redundancy and the complexity that ensues.

Daniele De Matteis & Harry Wincup, Server Density - Monitoring, graphs and visualisations

Video

This was a presentation from the designer’s perspective on the theory of how visualisations should be presented, based on the experience of creating the new Server Density interface.

They outlined various design principles they strived for:

  • Consistency
  • Context
  • Clarity (less is more)
  • Perspective (i.e. vertical alignment for context)
  • Appeal (Pleasant user experience)
    • Consistent graphical elements, white space, horizontal and vertical flows, contrast between elements
  • Control (Let the user find the next path, but make sure it’s only a click away)

Testing Logstash Configs With Rspec

At work I’m supporting a rails app, developed by an external company. That app logs a lot of useful performance information, so I’m using logstash to grab that data and send it to statsd+graphite.

I’ve been nagging the developers for more debugging information in the log file, and they’ve now added “enhanced logging”.

Since the log format is changing, I’m taking the opportunity to clean up our logstash configuration. The result of this has been to create an automated testing framework for the logstash configs.

Managing config files

Logstash allows you to point to a directory containing many config files, so I’ve used this feature to split up the config into smaller parts. There is a config file per input, a filter config for that input, and a related statsd output config. For outputs, I also have a config for elasticsearch.

Because I use grep to tag log messages, then run a grok based on those tags, it was necessary to put all the filters for an input in a single file. Otherwise your filter ordering can get messed up, as you can’t guarantee the order in which logstash reads the files.

If you want to break your filters out into multiple files but need them to be loaded in a certain order, then prefix their names with numbers to explicitly specify the order.

100-filter-one.conf
101-filter-two.conf
...

Filter config

The application logfile has several different kinds of messages that I want to extract data from. There are rails controller logs, CRUD requests generated by javascript, SQL requests, passenger logs and memcached logs.

So, when I define the input, those log messages are defined with the type ‘rails’.
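For illustration, the simplest way to picture that input is a file input that sets the type - a rough sketch, assuming a file input and a made-up log path rather than our real one:

input {
  file {
    type => "rails"
    # illustrative path - point this at the real application log
    path => [ "/var/log/railsapp/production.log" ]
  }
}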

The first filter that gets applied is a grok filter which processes the common fields such as timestamp, server, log priority etc. If this grok is matched, the log message is tagged with ‘rails’.

Messages tagged ‘rails’ are subject to several grep filters that differentiate between types of log message. For example, a message could be tagged as ‘rails_controller’, ‘sql’, or ‘memcached’.

Then, each message type tag has a grok filter that extracts all the relevant data out of the log entry.

One of the key things I’m pulling out of the log is the response time, so there are some additional greps and tags for responses that take longer than we consider acceptable.
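To make that flow concrete, here’s a cut-down sketch of the sort of filter chain I’m describing. The patterns, tags and slow-response threshold are illustrative stand-ins written against the grep/grok syntax of the logstash version we were using at the time, not the real production filters:

filter {
  # Common fields present on every rails log line; tag matches as 'rails'
  grok {
    type => "rails"
    pattern => "%{TIMESTAMP_ISO8601} %{WORD:logpriority} %{HOSTNAME} %{WORD:logsource}\[%{POSINT}\]: %{GREEDYDATA:railsmessage}"
    add_tag => [ "rails" ]
  }

  # Differentiate message types - one grep per type, drop => false so
  # events that don't match carry on through the rest of the chain
  grep {
    type => "rails"
    tags => [ "rails" ]
    match => [ "@message", "Controller" ]
    drop => false
    add_tag => [ "rails_controller" ]
  }

  # Extract the interesting fields for that message type
  grok {
    type => "rails"
    tags => [ "rails_controller" ]
    pattern => "%{WORD:railscontroller}\.%{WORD:railscontrolleraction}: %{NUMBER:time}ms"
  }

  # Crude slow-response tagging: four or more digits before the decimal
  # point means the response took at least a second
  grep {
    type => "rails"
    tags => [ "rails_controller" ]
    match => [ "time", "^[0-9]{4,}" ]
    drop => false
    add_tag => [ "slow_response" ]
  }
}

Each tag added here then lines up with an assertion in the rspec tests further down.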

When constructing the grep filters, I debug the regexes with http://www.rubular.com/, and for the grok filters http://grokdebug.herokuapp.com/ is a massively useful tool.

However, each of these web tools only looks at a single log message or regex - I want to test my whole filter configuration, how entries are directed through the filter logic, and know when I break some dependency for another part of the configuration.

rspec tests

Since logstash 1.1.5 it’s been possible to run rspec tests using the monolithic jar:

java -jar logstash-monolithic.jar rspec <filelist>

So, given I have a log message that looks like this:

2013-01-20T13:14:01+0000 INFO server rails[12345]: RailsController.index: 123.1234ms db=railsdb request_id=fe3c217 method=GET status=200 uri=/page/123 user=johan

Then I would write a spec test that looks like:

spec/logstash.rb
files = Dir['../configs/filter*.conf']
@@configuration = String.new
files.sort.each do |file|
  @@configuration << File.read(file)
end

describe "my first logstash rspec test"
  extend LogStash::RSpec

  config(@@configuration)

  message = %(2013-01-20T13:14:01+0000 INFO server rails[12345]: RailsController.index: 123.1234ms db=railsdb request_id=fe3c217 method=GET status=200 uri=/page/123 user=johan)

  sample("@message" => message, "@type" => "rails")
    insist { subject.type } == "rails"
    insist { subject.tags }.include?("user")
    reject { subject.tags }.include?("_grokparsefailure")
    insist { subject["TIMESTAMP_ISO8601"] } == "2013-01-20T13:14:01+0000"
    insist { subject["logpriority"] } == "INFO"
    insist { subject["logsource"] } == "rails"
    insist { subject["railscontroller"] } = "RailsController"
    insist { subject["railscontrollerction"] } = "index"
    insist { subject["time"] } == "123.1234"
    insist { subject["database"] } == "railsdb"
    insist { subject["request_id"] } == "fe3c217"
    insist { subject["method"] } == "GET"
    insist { subject["status"] } == "200"
    insist { subject["uri"] } == "/page/123"
    insist { subject["user"] } == "johan"
  end
end

So, this dynamically includes all my filter configurations from my logstash configuration directory. Then I define my known log message, and what I expect the outputs to be - the tags that should and shouldn’t be there, and the content of the fields pulled out of the log message.

Develop - Verify workflow

Before writing any filter config, I take sample log messages and write up rspec tests of what I expect to pull out of those log entries. When I run those tests the first time, they fail.

Then I’ll use the grokdebug website to construct my grok statements. Once they’re working, I’ll update the logstash filter config files with the new grok statements, and run the rspec test suite.

If the tests are failing, often I’ll output subject.inspect within the sample block, to show how logstash has processed the log event. But these debug messages are removed once our tests are passing, so we have clean output for automated testing.

When all the tests are passing I’ll deploy them to the logstash server and restart logstash.

java -jar /usr/share/logstash/logstash-monolithic.jar rspec examples.rb
..........................

Finished in 0.23 seconds
26 examples, 0 failures

Automating with Jenkins

Now that we have a base config in place, I want to automate testing and deploying new configurations. To do this I use my good friend Jenkins.

All my spec tests and configs are stored in a single git repository. Whenever I push my repo to the git server, a post-receive hook is executed that starts a Jenkins job.

This job will fetch the repository and run the logstash rspec tests on a build server. If these pass, then the configs are copied to the logstash server and the logstash service is restarted.

If the tests fail, then a human needs to look at the errors and fix the problem.

Integrating with Configuration Management

You’ll notice my logstash configs are stored in a git repo as files, rather than being generated by configuration management. That’s a choice I made in this situation as it was easier to manage within our environment.

If you manage your logstash configs via CM, then a possible approach would be to apply your CM definition to a test VM and then run your rspec tests within that VM.

Alternatively, the whole logstash conf.d directory could be synced by your CM tool. Then you could grab just that directory for testing, rather than having to do a full CM run.

Catching problems you haven’t written tests for

I send the number of _grokparsefailure-tagged messages to statsd - this highlights any log message formats that I haven’t considered, and shows up when the log format changes on me and I need to update my grok filters.
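That output is just a statsd increment keyed off the tag - roughly like the sketch below, where the statsd host and metric name are made up, and the tags option on the output restricts it to failed events:

output {
  # Count events whose grok filters failed to match, broken down by input type
  statsd {
    tags => [ "_grokparsefailure" ]
    host => "statsd.example.com"   # illustrative statsd host
    namespace => "logstash"
    increment => [ "%{@type}.grokparsefailure" ]
  }
}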

Testing Puppet Manifests With Toft-puppet

Toft is a ruby gem to manage testing of configuration management manifests with LXC Linux containers – it can manage nodes, run chef or puppet, run ssh commands and then run any testing framework against those nodes – such as rspec or cucumber.

Using containers for this purpose is incredibly useful as they can be created and destroyed very quickly, even compared to a virtual machine. So we can set up a fresh node with a base OS, run our manifests on that, and then run tests on the system we have created. Since we run the tests from the system that hosts the container, we can run tests both from within and outside the node easily.

I’m using Toft from Jenkins to run cucumber tests on our manifests. Any time anyone checks code into the testing branch, all my cucumber tests (as well as other checks like puppet-lint) are run, and when they succeed we merge into master.

Testing the behaviour of your deployed manifests as part of automated QA is a massive result, and Toft makes it just that much easier.

Installation

The QA system will need the following packages:

libvirt
lxc

Unfortunately there is no lxc package in EPEL, so I needed to take the source RPM from Fedora 16 and build it on Scientific Linux. This was a hassle-free process.

The following gems are required:

toft-puppet
cucumber
rspec

These are most easily installed with gem install if you have access to rubygems.org or an internal repository.

Configuration

For libvirtd, disable multicast DNS. Also change the host bridge network interface to br0 and set the bridge network to 192.168.20.0/24:

/etc/libvirt/libvirtd.conf
mdns_adv = 0
/etc/libvirt/qemu/networks/default.xml
<network>
  <name>default</name>
  <uuid>3b673ba9-be12-4299-95a3-2059be18f7b9</uuid>
  <bridge name="br0" />
  <mac address='52:54:00:BE:D0:1D'/>
  <forward/>
  <ip address="192.168.20.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.20.2" end="192.168.20.254" />
    </dhcp>
  </ip>
</network>

Now we can start the libvirt service with

service libvirtd start

LXC requires cgroup support, so the cgroup filesystem needs mounting. Add the following to /etc/fstab

/etc/fstab
1
none    /cgroup cgroup  defaults        0       0

Then create the directory /cgroup and mount it

mkdir /cgroup
mount /cgroup 

Our templates and support files need to be put in place, and directories created. Toft supplies templates for natty, lucid and centos-6. I use Scientific Linux, so I needed to create my own template from the centos-6 template – you can find it here: https://github.com/johanek/toft/blob/master/scripts/lxc-templates/lxc-scientific-6

mkdir -p /usr/lib64/lxc/templates/files
cp lxc-scientific-6 /usr/lib64/lxc/templates/
chmod 0755 /usr/lib64/lxc/templates/lxc-scientific-6
cp /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/scripts/lxc-templates/files/rc.local /usr/lib64/lxc/templates/files/

Base Image

Now we need to get a base image to run. You can grab one from the OpenVZ project, as the images are compatible. http://wiki.openvz.org/Download/template/precreated

Since the image is just a bunch of files on disk, you can extract that tarball and chroot into the resulting directory structure to modify the image to your needs. Once you’re happy, tar.gz the directory tree and copy that file to:

/var/cache/lxc/scientific-6-x86_64.tar.gz

Test LXC

To test that LXC is working, create a quick node config file:

/tmp/n1.conf
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.name = eth0
lxc.network.ipv4 = 192.168.20.2/24

Then, create a node by running:

lxc-create -n n1 -f /tmp/n1.conf -t scientific-6

This should extract the image and configure it ready to be started. When you start the image, a tty on the guest machine will take over your terminal, so it’s best to run the following in a screen session:

lxc-start -n n1

Once that’s booted you should be able to log in via your terminal, and ssh to the machine on the IP we configured above. Once that’s working well, we can stop and destroy the guest with the following:

lxc-stop -n n1
lxc-destroy -n n1

Running Tests

The toft module comes with a lot of example cucumber tests, step definitions and puppet configs to start you off. They’re worth reading through to understand what can be done – I needed to modify them for my own needs, so let’s make a copy of the examples as a baseline:

mkdir /root/cucumber/
cd /root/cucumber
rsync -av /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/features .
rsync -av /usr/lib/ruby/gems/1.8/gems/toft-puppet-0.0.11/fixtures .

The supplied puppet.conf file has lots of localisations specific to the developers’ setup, so I created a very simple one that just specifies the path to our puppet modules:

fixtures/puppet/conf/puppet.conf
[main]
# The Puppet log directory.
# The default value is '$vardir/log'.
logdir = /var/log/puppet

# Where Puppet PID files are kept.
# The default value is '$vardir/run'.
rundir = /var/run/puppet

# Where SSL certificates are kept.
# The default value is '$confdir/ssl'.
ssldir = $vardir/ssl

modulepath = /tmp/toft-puppet-tmp/modules/

[agent]
# The file in which puppetd stores a list of the classes
# associated with the retrieved configuration. Can be loaded in
# the separate "puppet" executable using the "--loadclasses"
# option.
# The default value is '$confdir/classes.txt'.
classfile = $vardir/classes.txt

# Where puppetd caches the local configuration. An
# extension indicating the cache format is added automatically.
# The default value is '$confdir/localconfig'.
localconfig = $vardir/localconfig

At this point I also removed the chef examples, to avoid any errors related to software I’m not using:

rm -rf fixtures/chef

Now we need to copy our own modules to fixtures/puppet/modules/

Toft uses a routine located in the rc.local file we copied earlier to update DNS with the hostname of the new node. It does this via nsupdate. Since I’m only ever going to create one node with a known IP, and access it from the one machine we’re running the tests from, I can just add the hostname to /etc/hosts:

192.168.20.2    n1    n1.foo

Now we’re ready to write tests – again there are a lot of examples of cucumber tests in the features directory. I removed them all, apart from puppet.feature, which I pared down and added my own tests to:

features/puppet.feature

Feature: Puppet support

Scenario: Run Puppet manifest on nodes
Given I have a clean running node n1
When I run puppet manifest "manifests/test.pp" on node "n1"
Then Node "n1" should have file or directory "/tmp/puppet_test"

Scenario: Apache module
Given I have a clean running node n1
When I run puppet manifest "manifests/apache.pp" with config file "puppet.conf" on node "n1"
Then Node "n1" should have package "httpd" installed in the centos box
And Node "n1" should have service "httpd" running in the centos box

Note the references to the “centos box” – that’s just the way the step definitions have been written in the toft module and can be modified easily. My Apache module test has an associated manifest:

fixtures/puppet/manifests/apache.pp

include apache

Now we can run our tests:

cd features
cucumber puppet.feature

Which gives a result like:

Creating Scientific Linux 6 node…
Checking image cache in /var/cache/lxc/rootfs-x86_64 …
Extracting rootfs image to /var/lib/lxc/n1/rootfs …
Set root password to 'root'
'scientific-6' template installed
'n1' created
Feature: Puppet support
Scenario: Run Puppet manifest on nodes # puppet.feature:3
Starting host node…
Waiting for host ssdh ready………….
Waiting for host to be reachable.
SSH connection on 'n1/192.168.20.2' is ready.
Given I have a clean running node n1 # step_definitions/node.rb:1
notice: /Stage[main]/Test/File[/tmp/puppet_test]/ensure: created
notice: Finished catalog run in 0.04 seconds
When I run puppet manifest "manifests/test.pp" on node "n1" # step_definitions/puppet.rb:1
Then Node "n1" should have file or directory "/tmp/puppet_test" # step_definitions/checker.rb:5

Scenario: Apache module # puppet.feature:8
Starting host node…
Waiting for host ssdh ready.
Waiting for host to be reachable.
SSH connection on 'n1/192.168.20.2' is ready.
Given I have a clean running node n1 # step_definitions/node.rb:1
notice: /Stage[main]/Apache::Install/Package[httpd]/ensure: created
notice: /Stage[main]/Apache::Service/Service[httpd]/ensure: ensure changed 'stopped' to 'running'
notice: Finished catalog run in 10.68 seconds
When I run puppet manifest "manifests/apache.pp" with config file "puppet.conf" on node "n1" # step_definitions/puppet.rb:5
httpd-2.2.15-15.sl6.x86_64
Then Node "n1" should have package "httpd" installed in the centos box # step_definitions/centos/checks.rb:1
9
And Node "n1" should have service "httpd" running in the centos box # step_definitions/centos/checks.rb:6

2 scenarios (2 passed)
7 steps (7 passed)
0m26.193s

Job done – now you can write more tests and automate them with Jenkins.