Monitorama 2018 PDX Day 3

Achieving Google-levels of Observability into your Application with OpenCensus - Morgan McLean (Google)

OpenCensus - distributed traces, tags, metrics (+ logs)

Collection of libraries for multiple languages, instrumentation and support for multiple exporters for tracing and metrics

One thing to do all the things

The present and future of Serverless observability - Yan Cui (DAZN)

New challenges - no agents or daemons on the platform, no background processing, higher concurrency to telemetry systems, sending telemetry from functions adds user-facing latency, and async invocation makes tracing more difficult.

Write metrics in logs, as the platform provides logging. Extract that data in post processing.
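A minimal sketch of this pattern, assuming a JSON-lines log pipeline (the field names here are illustrative, not a standard):

```python
import json
import time

def log_metric(name, value, dimensions=None):
    """Emit a metric as a structured JSON log line; the platform's
    log pipeline picks it up with no extra agent."""
    record = {
        "type": "metric",
        "name": name,
        "value": value,
        "timestamp": time.time(),
        "dimensions": dimensions or {},
    }
    print(json.dumps(record))  # stdout is the log sink on most FaaS platforms
    return record

def extract_metrics(log_lines):
    """Post-processing step: pull metric records back out of raw logs."""
    metrics = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # ordinary, non-JSON log line
        if record.get("type") == "metric":
            metrics.append(record)
    return metrics
```

The function pays only the cost of a `print`; the expensive extraction happens asynchronously, off the request path.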

Async processing - just send metrics in function

Our tools need to do more to help understand the health of the system, rather than a single function

Putting billions of timeseries to work at Uber with autonomous monitoring - Prateek Rungta (Uber)

Built their own TSDB because existing solutions don’t work at their scale - M3DB: https://github.com/m3db/m3db

Building Open Source Monitoring Tools - Mercedes Coyle (Sensu)

Lessons learnt in the Sensu v2 rewrite

Autoscaling Containers… with Math - Allan Espinosa

Using control theory to help regulate an autoscaling system

  • Iterate on feedback of the system
  • Models that tell you if your feedback is effective
  • Linear models go a long way
  • Re-evaluate your models
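A linear model of the kind listed above can be sketched as a proportional controller (the setpoint, gain, and scaling formula are illustrative, not from the talk):

```python
def desired_replicas(current_replicas, measured_load, target_load_per_replica, gain=1.0):
    """Proportional controller: adjust replica count in proportion to
    the error between measured and target per-replica load."""
    if current_replicas <= 0:
        raise ValueError("need at least one replica")
    load_per_replica = measured_load / current_replicas
    error = load_per_replica - target_load_per_replica
    # Linear model: the error converts linearly into a replica adjustment.
    adjustment = gain * error * current_replicas / target_load_per_replica
    return max(1, round(current_replicas + adjustment))
```

Iterating means re-checking the gain against real feedback: if the system oscillates or lags, the model needs re-evaluation.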

Assisted Remediation: By trying to build an autoremediation system, we realized we never actually wanted one - Kale Stedman (Demonware)

Online gaming systems

Nagios constantly full of CRIT warnings. Hostile environment for first responders: hard to find new alerts amid the noise, hard to understand context. Cost of making an incorrect decision is high. Low morale, high turnover.

Plan: Rebrand, Auto-remediation, ???, Profit.

Lots of same tasks being run on the same systems.

System as planned: check results, alert processing engine (calls other APIs), remediation runner. Domains: Detection (Sensu), Decision (StackStorm), Remediation (Didn’t decide).

Review and process all alerts. Migrated many to Sensu. Started processing alerts more thoroughly - put all alerts in JIRA, and root causes started to be addressed.

Dropped the idea of remediation: as alerts decreased, they realised they already had what they needed - better info for responders, and work on fixing underlying causes.

Razz: Detection (Sensu), Decision (StackStorm alert workflow), Enrichment (StackStorm widget workflow), Escalation (StackStorm JIRA workflow)

Monitoring Maxims:

  • Humans are expensive (don’t want to waste their time)
  • Humans are expensive (make mistakes, unpredictable)
  • Keep pages holy (issue must be urgent and for the responder)
  • Monitor for success (ask if it is broken, rather than what is broken)
  • Complexity extends incidents
  • Restoration > investigation
  • LCARS: link dashboards, elasticsearch queries, other system statuses, last few alerts


  • Alert volume down
  • Data driven decisions up
  • Reliability and support quality up
  • Team morale up

Security through Observability - Dave Cadwallader (DNAnexus)

Improve relationship between security and ops

Security and compliance concerns handling medical data.

Compliance is about learning from mistakes and creating best practices. Designed with safety in mind, but doesn’t guarantee safety.

Security wants to do less compliance and more threat hunting. How do we automate compliance checks?

InSpec - compliance as code. Checks security settings in VMs and on cloud resources. The community is thriving; plenty of existing baselines available for use.

Run InSpec locally and write the results out to JSON. Prometheus pulls the data via a custom exporter.
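A sketch of what such a custom exporter's translation step could look like, assuming a simplified InSpec-style JSON report (the real reporter output is richer; field names here are illustrative):

```python
def inspec_to_prometheus(report):
    """Convert an InSpec-style JSON report into Prometheus text
    exposition format. The report structure is simplified."""
    lines = [
        "# HELP inspec_control_status 1 if the control passed, 0 otherwise",
        "# TYPE inspec_control_status gauge",
    ]
    for profile in report.get("profiles", []):
        for control in profile.get("controls", []):
            # A control passes only if every one of its results passed.
            passed = all(r.get("status") == "passed" for r in control.get("results", []))
            lines.append(
                'inspec_control_status{profile="%s",control="%s"} %d'
                % (profile.get("name", ""), control.get("id", ""), 1 if passed else 0)
            )
    return "\n".join(lines) + "\n"
```

Serving the returned string on a `/metrics` endpoint is all Prometheus needs to scrape it.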

Apply SLOs - actionable compliance alerts. Link back to the relevant compliance documentation. Friendly HTML report pushed to S3 and linked in the alert.

How to include Whistler, Kate Libby, and appreciate that our differences make our teams better. - Beth Cornils (Hashicorp)

Talk about inclusivity in tech, how we can improve

Monitorama 2018 PDX Day 2

Want to solve Over-Monitoring and Alert Fatigue? Create the right incentives! - Kishore Jalleda (Microsoft)

Lessons from healthcare - more monitoring is not always better

Problems of far too many alerts at Zynga. Vision of <2 alerts per shift. Dev on call, SRE for engineering infra and tooling. Zero TVs.

One day massive outage, multiple outages with same root cause.

Idea: Deny SRE coverage based on Alert budgets. Lot of push back - from SRE team and from supported teams.

Baby steps - focus on monitoring and prove we can do one thing well. Promise of performing higher value work.

Leverage outages - Postmortems, follow ups. Build credibility, show you care.

Find Allies - identify who is aligned with you. Need to go out and find them.

Call your Boss - ensure they are aligned. Get buy in from Senior Leaders.

Establish contracts. Targets for all teams. Give time to clean up before launch.

Reduce alert noise - aggregation, auto-remediation, deciding what should be a log, a ticket, or an alert.

Success - 90% drop in alerts, 5min SRE response, uptime improved

Next-Generation Observability for Next-Generation Data: Video, Sensors, Telemetry - Peter Bailis (Stanford CS)

Taking a systems engineering/thinking approach to speeding up the application of ML models to video analysis.

Interesting technology and family of tools being developed.

Coordination through community: A swarm of friendly slack bots to improve knowledge sharing - Aruna Sankaranarayanan (Mapbox)

Chatbots in Slack. Commands are AWS Lambda functions.

sumo logic queries. benchmarking. user lookup. get/set rate limits. platform-question: assign questions to an engineer, with escalation.

Automate Your Context - Andy Domeier (SPS Commerce)

Complexity is increasing to enable velocity

Context: the circumstances that form the setting for an event, statement, or idea, in terms of which it can be fully understood and assessed

As complexity increases, the amount of available context increases. The efficiency of an organisation is directly correlated to how effective you are with your available context.

Goal - make the right context available at the right time.

Context. People, monitoring, observability, obscure and hard to find things.

People: One consistent set of operational readiness values. Operational info, Performance (KPI/Cost), Agility (Deployment maturity), Security

Monitoring: taking action. Examples:

  • Put alerts onto SNS topic. Automate actions with lambda functions. Search the wiki for documentation, post back to alerts commenter.
  • JIRA incidents to SNS as well, automate incident communication. Search for recent changes from JIRA.
  • Change management in JIRA, dependency lookup & comment back to JIRA.

Slack in the Age of Prometheus - George Luong (Slack)

We replace monitoring systems because our needs have evolved

Ganglia & Librato, migrated to Graphite. Looked at migrating away.

User Problems: metrics difficult to discover. Query performance made for slow rendering. Other problems specific to their usage.

Operational Problems: Could not horizontally scale the cluster (lack of tooling, time and effort). Single node failures led to missing metrics. Susceptible to developers accidentally taking down the cluster.

Requirements. User: ease of discovery, fast response, custom retention, scales with us. Operational: remove single-node points of failure. Teams want to own their monitoring.

Prometheus ticked almost all of those boxes, except custom retention and ingestion (retention can only be set per box).


  • in region: duplicate prom servers, scraping same servers. primary and secondary.
  • between regions: single pair of federated servers scraping other prom servers.

Configuration Management: terraform & chef. Prom jobs and rules stored in chef; a ruby hash converts to yaml.

Webapp server about 70k metrics, x 500 servers (35m)

Job worker, 79k metrics x 300 servers (24m)

Sparky the fire dog: incident response as code - Tapasweni Pathak (Mapbox)

Sparky the firedog:

SNS topic - Lambda function - forwards to PagerDuty
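The forwarding step might look roughly like this, using the PagerDuty Events API v2 payload shape (the alarm field mapping and routing key are illustrative, not sparky's actual code):

```python
import json
import urllib.request

PAGERDUTY_URL = "https://events.pagerduty.com/v2/enqueue"  # Events API v2

def sns_to_pagerduty_event(sns_record, routing_key):
    """Build a PagerDuty Events API v2 payload from one SNS record.
    Field mapping is illustrative; adjust to your alarm format."""
    message = json.loads(sns_record["Sns"]["Message"])
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": message.get("AlarmName", "unknown alarm"),
            "source": message.get("Region", "unknown"),
            "severity": "error",
            "custom_details": message,  # enrichment goes here
        },
    }

def handler(event, context):
    """Lambda entry point: forward each SNS record to PagerDuty."""
    for record in event["Records"]:
        body = json.dumps(sns_to_pagerduty_event(record, "YOUR_ROUTING_KEY")).encode()
        req = urllib.request.Request(
            PAGERDUTY_URL, data=body, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(req)
```

Keeping the payload builder separate from the handler makes the enrichment logic testable without hitting PagerDuty.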

sparky is an npm module

documentation for alarms in github. sparky enriches the alarms.

  • reformat the alarm info
  • aggregate all your teams alarms into a single Pagerduty policy
  • links to targeted searches
  • get the root cause analysis?

Future: score an alarm. Does a human need to action this? Based on the triaging/context we have existing in documentation.

Sample questions that need answering on an alert:

  • How many errors around trigger time?
  • Are the errors ongoing?
  • What was the very first error?
  • Group and count the errors by time

Reclaiming your Time: Automating Canary Analysis - Megan Kanne (Twitter)

Is my build healthy? Want to catch errors before causing problems.

Canary - partial deployment. Use canary cluster for canary deploys. Needs production level traffic. Control cluster to compare against canary.

Use statistical tools to compare.

Median Absolute Deviation - maxPercentile

DBSCAN - density-based spatial clustering of applications with noise - toleranceFactor

HDBSCAN - hierarchical DBSCAN - minSimilarShardsPercent (minimum cluster size)

Mann-Whitney U Test - tolerance, confidenceLevel, direction
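As an illustration of the first of these tests, a MAD-based canary check might look like this (the threshold name and data shapes are assumptions, not Twitter's implementation):

```python
import statistics

def mad(values):
    """Median absolute deviation: a robust measure of spread,
    unaffected by a few outlier hosts."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def canary_deviates(control_values, canary_value, max_deviations=3.0):
    """Flag the canary if its metric sits more than `max_deviations`
    MADs away from the control cluster's median."""
    spread = mad(control_values)
    if spread == 0:
        return canary_value != statistics.median(control_values)
    score = abs(canary_value - statistics.median(control_values)) / spread
    return score > max_deviations
```

Because MAD uses medians rather than means, a single misbehaving control host doesn't inflate the tolerance band.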

Simplify configuration (of your statistical tests). Sensible defaults speed up adoption.

Choosing metrics - SLOs, existing alerts.

User Trust - Build it.

Monitorama 2018 PDX Day 1

Optimising for learning - Logan McDonald

expert intuition is achievable. improve memory.

  • prep

hierarchy of learning. important one is rule learning.

  • gaining knowledge

reading docs/textbooks is not an effective way to embed knowledge in long-term memory. do low-stakes testing. the best way of learning is to retrieve data.

  • mental models

observability and reflection

move from events to patterns to structure

engaging in incident reviews gave opportunity to quiz engineers on their actions during incident

  • learning together

symmathesy - systems of understanding

embrace cultural memory

Serverless and CatOps: Balancing trade-offs in operations and instrumentation - Pam Selle

I sadly missed most of this because I was debugging a problem

Mentoring Metrics Engineers: How to Grow and Empower Your Own Monitoring Experts - Zach Musgrave, Angelo Licastro (Yelp)

growth in teams, and managing growth in individuals within a team

mandate growth - more skills needed, more knowledge gaps, more planning (meetings), trade offs.

knowledge silos - reaction is mentoring

breadth: build confidence. defined mentor relationship.

depth: make an advanced contribution to one system.

next steps: end of mentor relationship, can start on-call. hold a retrospective - provide feedback for improvement

impostor syndrome - accept compliments for your work and acknowledge other’s contributions

consulting: monitoring is a specialised field. there’s a lot of nuance. make people aware of what tools can solve their problems. talk with other teams, ask insightful questions: what problem are you trying to solve? listen, and assume you know nothing about their problem.

The Power of Storytelling - Dawn Parzych (Catchpoint)

Cognitive bias

  • too much info
  • not enough meaning
  • not enough time
  • not enough memory

To tell better stories, when presenting data view it from the perspective of other parts of the organisation

Persuasive storytelling doesn’t need to be complicated - keep it simple

Use simpler language - it doesn’t make it less powerful

Present the most important information earlier

  • Use visuals when telling a story, keep them simple to understand
  • Don’t overcomplicate things
  • Remember the power of 3

Principia SLOdica - A Treatise on the Metrology of Service Level Objectives - Jamie Wilkinson

Overloaded team, tasked with reducing load to build a sustainable system they could manage

Symptom Based Alerting - focus on very few expectations about your service, so as system grows you are still focusing on the same things

Symptom: a user is having a bad time. Causes: internal views on the service.

What’s your tolerance for failure? Error budget. Set expectations: SLO

A symptom is anything that can be measured by the SLO. A symptom-based alert fires when the SLO is in danger of being missed.
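The error-budget arithmetic behind a symptom-based alert can be sketched as follows (numbers and field names are illustrative):

```python
def error_budget(slo, total_requests, failed_requests):
    """Given an availability SLO (e.g. 0.999), report how much of the
    error budget a window has consumed."""
    allowed_failures = total_requests * (1 - slo)
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,   # 1.0 means the budget is exhausted
        "alert": consumed >= 1.0,      # symptom-based alert: SLO in danger
    }
```

The point of the talk: this one expectation scales as the system grows, where per-cause alerts multiply.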

Debugging tools - metrics, tracing, logs - replace cause based alerts

On-call Simulator! - Building an interactive game for teaching incident response - Franka Schmidt (Mapbox)

On call onboarding

A safe and low-stakes place to practice handling on-call incidents


  • Buddy System: Observing
  • Bucket list: a checklist of how far along you are with the experiences of being on call
  • Alarm scrum: review last days alarms

Simulator - “choose your own adventure” style text adventure game. Tool: Twine.

Craft a story - use Incident Reviews, past notes and enrich with detail. Or just make it up.

Iterate and get feedback.

Observability: the Hard Parts - Peter Bourgon (Fastly)

Level up monitoring story at fastly

Senior engineering org, 120 engineers. Heterogeneous solutions throughout the organisation. Organic growth is good - up to a point.

Goals: are production systems healthy? if not why not?

Non goals: long term storage, replacing logging. (Observed metric usage VERY different from self-reported usage)

Strategy: prometheus, add instrumentation, curate a new set of dashboards and alerts, run in parallel to build confidence, decommission old systems.

Rollout: Inventory all services. Rank by criticality & friendliness.

The grind…

Embedded expert model. Pair with responsible engineer to get migration done. Hour or two of pair programming. Let documentation emerge naturally. Deploy prometheus, plumb service into prometheus (time consuming).

Technical autonomy is a form of technical debt. Focus on local vs. global optimization.

Dashboards and alerts. Import from other systems. Make sure to only build things in service to original goals. Reduce, keep high level views.

Some services, ownership can be indistinct. Senior leadership needed to wield stick, need their buy in.

Warning: This Talk Contains Content Known to the State of California to Reduce Alert Fatigue - Aditya Mukerjee (Stripe)


We can learn from clinical healthcare

Alert Fatigue - frequency/severity - causes responder to ignore or make mistakes

Decision Fatigue - frequency/complexity of decision points, causes person to avoid devices or make mistakes

Certain patterns of alerts and decisions contribute disproportionately to fatigue. Multiple false positives for an individual patient, impacts that alert over all patients.

Reduce alert fatigue: STAT. Supported, Trustworthy, Actionable, Triaged.

Supported: who owns this alert? Responders should own, or feel ownership over end result.

Trustworthy: do I trust this alert to notify me when a problem happens? To stay silent when all is well? To give sufficient information to diagnose problems?

Anomaly detection: if you don’t understand why an alert triggered, you don’t understand if it’s real.

Actionable: One decision required to respond. Alerts that are difficult to action are ignored. Make alerts specific, add decision trees. Alerts must have a specific owner.

Triaged: Triage alerts. Type should reflect urgency. Urgency can change. Commonly understood tiers. Regular re-evaluation process.

Monitory Report: I Have Seen Your Observability Future. You Can Choose a Better One. - Ian Bennett (Twitter)

Twitter has a big monitoring system and migrating was hard

Monitorama 2016 Portland Day 3

Justin Reynolds, Netflix - Intuition Engineering at Netflix


Discussed the problem at Netflix that regions were siloed; they worked on serving users out of any region.

To fail regions, they need to scale up other regions to serve all traffic

Dashboards, good at looking back, but need to know now. How to provide intuition of the now?

Created vizceral - see the blog post for screenshots & video: http://techblog.netflix.com/2015/10/flux-new-approach-to-system-intuition.html

Brian Brazil, Robust Perceiver - Prometheus

Slides | Video

Prometheus is a TSDB offering ‘whitebox monitoring’ for looking inside applications. Supports labels; alerting and graphing are unified, using the same language.

Pull based system, links into service discovery. HTTP api for graphing, supports persistent queries which are used for alerting.

Provides an instrumentation library; incredibly simple to instrument functions and expose metrics to Prometheus. Client libraries don’t tie you into Prometheus - can use Graphite.
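The instrumentation pattern looks roughly like this; a stdlib-only sketch rather than the real prometheus_client API (which provides Counter, Histogram, and a one-line exposition server):

```python
import time
from collections import defaultdict

# In the real prometheus_client you'd use Counter and Histogram objects;
# these module-level dicts just demonstrate the pattern.
CALL_COUNT = defaultdict(int)
CALL_SECONDS = defaultdict(float)

def instrumented(fn):
    """Decorator that counts calls and accumulates wall-clock time,
    ready to be exposed on a /metrics endpoint."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            CALL_COUNT[fn.__name__] += 1
            CALL_SECONDS[fn.__name__] += time.perf_counter() - start
    return wrapper

@instrumented
def handle_request(n):
    return n * 2
```

The decorator adds one line per function, which is what makes adoption cheap.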

Can use Prometheus as a clearing house to translate between different data formats.

Doesn’t use a notion of a machine. HA by duplicating servers, but alertmanager deduplicates alerts. Alertmanager can also group alerts.

Data stored as file per database on disk, not round-robin - stores all data without downsampling.

Torkel Ödegaard, Raintank - Grafana Master Class


Gave a demo on how to use grafana, as well as recently added and future features.

Katherine Daniels, Etsy - How to Teach an Old Monitoring System New Tricks

Slides | Video

Old Monitoring System == Nagios

Adding new servers.

  • Use deployinator to deploy nagios configs. Uses chef to provide inventory to generate a current list of hosts and hostgroups.
  • Run validation via Jenkins by running nagios -v, as well as a custom-written nagios validation tool.
  • New hosts are added with scheduled downtime so they don’t alert until the next day. Chat bots send reminders when downtime is going to finish.

Making Alerts (Marginally) Less Annoying

  • Created nagdash to provide federated view of multiple nagios instances.
  • Created nagios-herald to add context to nagios alerts. Also supports allowing people to sign up to alerts for things they’re interested in.

Tracking Sleep

  • Ops weekly tool. Provides on call reports, engineers flag what they had to do with alerts.
  • Sleep tracking and alert tracking for on call staff to understand how many alerts they’re facing and how it’s impacting their sleep.

An On Call bedtime story

  • Plenty of alerts because scheduled downtime expired for ongoing work.
  • Created daily reports of which scheduled downtimes will soon expire and which will raise alerts.

Joe Damato, packagecloud.io - All of Your Network Monitoring is (probably) Wrong

Slides | Video

There’s too much stuff to know about

  • ever copy paste config or tune settings you didn’t understand?
  • do you really understand every graph you’re generating?
  • what makes you think you can monitor this stuff?

Claim: the more complex the system is the harder it is to monitor.

what’s pretty complicated? the linux networking stack! lots of features, lots of bugs, and no docs!

  • /proc/net stats can be buggy
  • ethtool inconsistent, not always implemented
  • meaning of driver stats are not standardised
  • stats meaning for a driver/device can change over time
  • /proc/net/snmp has bugs: double counting, fields not counted correctly

Monitoring something requires a very deep understanding of what you’re monitoring.

Properly monitoring and setting alerts requires significant investment.

Megan Kanne, Justin Nguyen, and Dan Sotolongo, Twitter - Building Twitter’s Next-Gen Alerting System

Slides | Video

3.5B metrics per minute

Old Alerting System

  • 25k alerts/minute, 3m alert monitors
  • single config language, lot of existing example, easy to write and add
  • all those points were good and bad!
  • lots of orphaned and unmaintained configs, no validation
  • alerts and dashboards were separate
  • problems with reliability when zones suffer problems


  • combined alerts and dashboard configuration
  • dashboards defined in python, common libraries that can be included
  • python allows testing configs
  • created multi-zone alerting system
  • reduced time to detect from 2.5mins to 1.75mins

Helping Human Reasoning

  • bring together global, dependencies and local context
  • including runbooks, contacts and escalations directly in the UI

Lessons Learned

  • distributed system, challenges about consistency, structural complexity and reasoning about time
  • sharding choices are hard, impossible to always avoid making mistakes
  • support and collaborate with users, try and reduce support burden with good information at interaction points (UI, CLI etc.), good user guides and docs
  • migrating - some happy to move, others not. some push back. had to accept schedule compromise and extra work.

Joey Parsons, Airbnb - Monitoring and Health at Airbnb


Prefers to buy stuff:

  • New Relic
  • Datadog, instrument apps using dogstatsd
  • Alerting through metrics

Created open source tool, Interferon, to store alerts configuration as code.

Volunteer on call system. SREs make sure things in place so anyone can be on call.

  • Sysops training for volunteers, monitoring systems, how to be effective and learn from historical incidents
  • Shadow on call, learn from current primary/secondary
  • Promoted to on call

Weekly sysops meeting, go through incidents, hand offs, discuss scheduled maintenance.

On call health:

  • are alerting trends appropriate?
  • do we understand impact on engineers?
  • do we need to tune false positives?
  • are notifications and notification policies appropriate?

Dashboards for:

  • incident numbers over time
  • counts by service
  • total notifications per user and how many come at night
  • false positive incident counts

Heinrich Hartmann, Circonus - Statistics for Engineers

Slides | Video

Monitoring Goals

  • Measure user experience/ quality of service
  • Determine implications of service degradation
  • Define sensible SLA targets

External API Monitoring:

  • synthetic requests measure availability, but are a poor proxy for user experience
  • on long time ranges, rolled-up data is commonly displayed, which erodes spikes

Log Analysis:

  • write request stats to log file
  • rich information but expensive and delay for indexing

Monitoring Latency Averages:

  • mean values, cheap to collect, store and analyse, but skewed by outliers/low volumes
  • percentiles, cheap to collect, store and analyse, robust to outliers, but require an up-front percentile choice and cannot be aggregated

percentiles: keep all your data. don’t take averages! store percentiles for all reporting periods you are interested in - i.e. per min/hour/day. store all percentiles you’ll ever be interested in.

Monitoring with Histograms:

  • divide latency scale into bands
  • divide time scale into reporting periods
  • count the number of samples in each band × period

Can be aggregated across times. Can be visualised as heatmaps.
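A sketch of the band-counting and cross-period aggregation described above (the band boundaries are illustrative):

```python
import bisect

BANDS_MS = [1, 5, 10, 50, 100, 500, 1000]  # band upper bounds, illustrative

def histogram(latencies_ms):
    """Count samples per latency band for one reporting period."""
    counts = [0] * (len(BANDS_MS) + 1)  # last slot catches the overflow band
    for v in latencies_ms:
        counts[bisect.bisect_left(BANDS_MS, v)] += 1
    return counts

def merge(histograms):
    """Aggregate histograms across periods - the operation that raw
    percentiles cannot support."""
    return [sum(col) for col in zip(*histograms)]
```

Once merged, any percentile can be approximated from the band counts, and the bands map directly onto heatmap rows.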

John Banning, Google - Monarch, Google’s Planet-Scale Monitoring Infrastructure


Huge volume, global span, many teams - constant change

Previously borgmon. Each group had its own borgmon. Large load on anyone doing monitoring. Hazing ritual - new engineer gets to do borgmon config maintenance.


  • can handle the scale
  • small/no load to get up and running
  • capable of handling the largest services

Monitor locally. Collect and store the data near where it’s generated. Each zone has a monarch.

  • Targets collect data with streamz library. Metrics are multi dimensional information, stores histogram.
  • Metrics sent to monarch ingestion router, then to a leaf which is an in-memory data store, also written to a recovery log. From log to long-term disk repository.
  • Streams stored in a table, basis for queries
  • Evaluator runs queries and stores new data for streams or sends notifications

Integrate Globally. Global Monarch - Distributed across zones, but a single place to configure/query all monarchs in all zones.

Provides both Web UI and Python interfaces.

Monarch is backend for Stackdriver

Monitoring as a service is the right idea. Make the service a platform to build monitoring solutions.

Monitorama 2016 Portland Day 2

Brian Overstreet, Pinterest - Scaling Pinterest’s Monitoring


Started with Ganglia, Pingdom

Deployed Graphite, single box

Second Graphite architecture - Load Balancer, 2x relay servers, multiple cache/web boxes etc.

Suffered lots of UDP packet receive errors

Put statsd everywhere

  • fixed packet loss, unique metric names per host
  • latency only per host, too many metrics

Sharded statsd

  • metric names no longer unique per host
  • shard mapping lives in the client, so the client version needs to be the same everywhere
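Client-side shard mapping of this kind could be sketched as follows (hostnames and hash choice are illustrative, not Pinterest's code):

```python
import zlib

STATSD_SHARDS = [
    ("statsd-1.internal", 8125),
    ("statsd-2.internal", 8125),
    ("statsd-3.internal", 8125),
]  # hostnames are hypothetical

def shard_for(metric_name, shards=STATSD_SHARDS):
    """Deterministically map a metric name to one statsd shard.
    Every client must use the same hash and shard list - which is
    exactly the client-version coupling the notes mention."""
    index = zlib.crc32(metric_name.encode("utf-8")) % len(shards)
    return shards[index]
```

Changing the shard list reshuffles the mapping, so every emitter has to roll forward together.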

Multiple graphite clusters - one per application (python/java)

More maintenance, more routing rules etc.

Problems with reads, multiple glob searches can be slow

Deployed OpenTSDB

Replace statsd

  • local metrics agent, kafka, storm - send to graphite/opentsdb


  • interface for opentsdb and statsd
  • sends metrics to kafka
  • processed by storm

120k/sec graphite, 1.5m/sec opentsdb. no more graphite, move to opentsdb.

Create statsboard - integrates graphite and opentsdb for dashboards and alerts

Graphite User Education - underlying info about how metrics are collected, precision, aggregation etc.

Protect System from Clients

  • alert on unique metrics
  • block metrics using zookeeper and a shared blacklist (created on the fly)

Lessen Operational Overhead

  • more tools, more overhead
  • more monitoring systems, more monitoring of the monitoring system
  • removing a tool in prod is hard

Set expectations

  • data has a lifetime
  • not magical data warehouse tool that returns data instantly
  • not all metrics will be efficient


  • match monitoring system to where the company is at
  • user education is key to scaling tools organizationally
  • tools scale with the number of engineers, not users

Emily Nakashima, Bugsnag - What your javascript does when you’re not around


Lots of app moving to frontend, so running in browser not on backend servers


  • capture load performance from browser, send to app server, use statsd + grafana & google analytics
  • capture uncaught exceptions in the browser, using their own product

Sorry, Javascript is just not that relevant in my line of work

Eron Nicholson and Noah Lorang, Basecamp - CHICKEN and WAFFLES: Identifying and Handling Malice

Slides | Video

Suffered DDoS and blackmail. 80 gigabits - DNS reflection, NTP reflection, SYN floods, ICMP flood

Defense and Mitigation:

  • DC partner filters for them
  • More 10G circuits and routers
  • Arrangements with vendors to provide emergency access and other mitigation tools

Experience got them serious about more subtle application level attacks:

  • vulnerability scanners
  • repeat slow page requests
  • brute force attempts
  • nefarious crawlers

What do we want from a defense system?

  1. Protection against application-level attacks
  2. Keep user access uninterrupted
  3. Take advantage of the data we have available
  4. Transparent in what gets blocked and why

Chicken: who is a real user and who is malicious?

Considered Machine Learning classification. Problems: really hard to get a good training set. Need to be able to explain why an IP was blocked.

Simpler approach:

  • Some behaviours are known to be from people up to no good: crawling phpmyadmin, path traversal, repeated failed login attempts etc.
  • Request history gives a good idea of whether someone is a normal user, broken script or a malicious actor.
  • External indicators: geoip databases, badip dbs, facebook threat exchange

Removing simple things reduces noise. Every incoming request is scored, and a per-IP aggregate score is calculated based on return codes. An Exponentially Weighted Moving Average is built from that data. About 12% of IPs had a negative reputation.
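The per-IP EWMA scoring could be sketched like this (alpha and score conventions are assumptions, not Basecamp's actual values):

```python
def ewma_update(previous, observation, alpha=0.3):
    """One step of an exponentially weighted moving average: recent
    request scores count more, old behaviour decays away."""
    return alpha * observation + (1 - alpha) * previous

def reputation(scores, alpha=0.3):
    """Fold a stream of per-request scores (positive = good,
    negative = bad) into a single per-IP reputation."""
    value = 0.0
    for s in scores:
        value = ewma_update(value, s, alpha)
    return value
```

Because old observations decay, an IP that cleans up its behaviour eventually recovers a neutral reputation instead of staying blocked forever.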

Scanning for blockable actions and scoring requests in near real-time using request byproducts.

Request logs, netflow data, threat exchange -> kafka -> request scoring, scanner for known bad behaviour, tools for manual evaluation.

Average IP reputation gives an early indicator to monitor for application level attack.

Provides list of good, bad, and dubious IPs.


  • originally provided by iptables rules on haproxy hosts
  • then tried rule on loadbalancer
  • then tried null routing on routers
  • finally created waffles

Using BGP flowspec to send data from routers to waffles, which then decides what path to take: error, app or challenge. Waffles hosts live in a separate network with limited access.

Waffles is redis and nginx.

John Stanford, Solinea - Fake It Until You Make It


Monitoring an openstack cluster, 1 controller and 6 compute nodes, collecting logs with Heka and sending them to elasticsearch. Can I scale this up to a thousand nodes? How big can it get?

How do you go about figuring that out?

  • took 7 days of logs from lab, 25k messages/hr
  • number of logs coming from a node
  • number of logs coming from a component
  • 7 day message rate, look at histograms, identifies recurring outlier
  • message size, percentiles of payload size

What models look like what we’re doing for simulation? Add some random noise.

Flood process, monitoring everything, repeat until it breaks. System sustained 4k x 1k messages/sec, started to pause above that, but no messages were dropped.

Next steps:

  • find bottlenecks
  • improve the model

Tammy Butow, Dropbox - Database Monitoring at Dropbox

Slides | Video

Achieving any goal requires honest and regular monitoring of your progress.

Originally used nagios, thruk (web ui) and ganglia

Created own tool vortex in 2013

why create in house monitoring?

  • performance and reliability issues, number of metrics scaling fast

Create Vortex:

  • Time Series Database with dashboards, alerting, aggregation
  • Rich metric metadata, tag a metric with lots of attributes

Monthra: single way of scheduling and relaying metrics, discourage scheduling with cron

Service Metrics:

  • what durability and reliability goals? align monitoring to goals
  • threads running / threads connected

Run a Monthly Metrics Review (great idea)

Dave Josephsen, Librato - 5 Lines I couldn’t draw


  1. Making the cognitive leap to use monitoring tools to recognise system behaviour, independent of alerting. Misapprehension about what monitoring was and whom it was for.

  2. Monitoring is not for alerting. Nobody owns monitoring. ‘Tape measure that I share with every engineer I work with’. Ops owns monitoring vs everyone owns monitoring. Monitoring is for asking questions.

  3. Complexity isolates. Effective monitoring gives you the things that allow you to ‘Cynefin’ - make things more familiar and knowable. Reduce complexity rather than embracing it. Monitoring can build bridges to help people understand things across boundaries.

  4. Effective monitoring can bring about cultural change, how people interact between each other.

  5. Repeated point 4

Jessie Frazelle, Google - Everything is broken


Talked about problems with Software Engineering and Operations

Demonstrated how they monitored community and external maintainer PR statistics for Docker project.

James Fryman, Auth0 - Metrics are for Chumps - Understanding and overcoming the roadblocks to implementing instrumentation


Story of implementation of instrumentation at Auth0

Wanted data-driven conversations. A metrics implementation had happened in the past, but was ripped out because it was not well understood and thought to cause latency. This created aversions.

Make the case. Get buy-in.

To have decent conversations with someone, you need to have metrics.


  • Not the most important feature - but it is!
  • Cannot start until we understand the data retention requirements - premature optimisations
  • We don’t run a SaaS - need to understand what your software is doing regardless

Make decisions based on knowledge, not intuition or luck.

Be opportunistic - success is 90% planning, 10% timing and luck. Find opportunities to accelerate efforts.

Needed to get something going fast - went for full service SaaS Datadog, but with common interfaces and shims to allow moving things in house later. Don’t delay, jump in and iterate.
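The common-interface-plus-shim idea can be sketched as follows (class and method names are hypothetical, not Auth0's code):

```python
class MetricsBackend:
    """Common interface: application code instruments against this,
    never against a vendor SDK directly."""
    def increment(self, name, value=1, tags=None):
        raise NotImplementedError

class InMemoryBackend(MetricsBackend):
    """Stand-in backend - useful for tests, or as the seed of a
    later in-house system."""
    def __init__(self):
        self.counters = {}
    def increment(self, name, value=1, tags=None):
        self.counters[name] = self.counters.get(name, 0) + value

class Metrics:
    """The shim: swap the backend (e.g. a Datadog-backed one)
    without touching any call sites."""
    def __init__(self, backend):
        self._backend = backend
    def increment(self, name, value=1, tags=None):
        self._backend.increment(name, value, tags)
```

A Datadog-backed class implementing the same interface can be dropped in today and replaced in-house later, which is the point of the shim.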

Keep in sync with developers - change is difficult and there will be resistance, pay attention to feedback. Need to support interpretation of data.

Build out data flows, find potential choke points in the system, take a baseline measurement, check systems in isolation

Fix and repair bottlenecks. They solved 3 major bottlenecks and went from 500 to 10k RPS.

Monitorama 2016 Portland Day 1

This year I was lucky enough to attend Monitorama in Portland. Thanks to Sohonet for sending me! I’d wanted to attend again since going to Berlin in 2013, because the quality of the talks is the highest I’ve seen in any conference that’s relevant to my interests. I wasn’t disappointed, it was awesome again.

Here are my notes from the conference:

Adrian Cockcroft, Battery Ventures - Monitoring Challenges

Slides | Video

This talk reflected on new trends and how things have changed since Adrian talked about monitoring “new rules” in 2014

What problems does monitoring address?

  • measuring business value (customer happiness, cost efficiency)

Why isn’t it solved?

  • Lots of change, each generation has different vendors and tools.
  • New vendors have new schemas, cost per node is much lower each generation so vendors get disrupted

Talked about serverless model - now monitorable entities only exist during execution. Leads to zipkin style distributed tracing inside the functions.

Current Monitoring Challenges:

  • There’s too much new stuff
  • Monitored entities are too ephemeral
  • Price disruption in compute resources - how can you make money from monitoring it?

Greg Poirier, Opsee - Monitoring is Dead

References | Video

Greg gave a history and definition of monitoring, and argued that how we think about monitoring needs to change.

Historically monitoring is about taking a single thing in isolation and making assertions about it.

  • resource utilisation, process aliveness, system aliveness
  • thresholds
  • timeseries

Made a definition of monitoring:

Observability: A system is observable if you can determine the behaviour of the system based on its outputs

Behaviour: Manner in which a system acts

Outputs: Concrete results of its behaviours

Sensors: Emit data

Agents: Interpret data

Monitoring is the action of observing and checking the behaviour and outputs of a system and its components over time

Failures in distributed systems are now: responds too slowly, fails to respond.

Monitoring should now be about Service Level Objectives - can it respond in a certain time, handle a certain throughput, better health checks

We need to better Understand Problems (of distributed systems), and to Build better tools (event correlation particularly)

Nicole Forsgren, Chef - How Metrics Shape Your Culture

Slides | Video

Measurement is culture. Something to talk about, across silos/boundaries

Good ideas must be able to seek an objective test. Everyone must be able to experiment, learn and iterate. For innovation to flourish, measurement must rule. - Greg Linden

Data over opinions

You can’t improve what you don’t measure. Always measure things that matter. That which is measured gets managed. If you capture only one metric you know what will be gamed.

Metrics inform incentives, shape behaviour:

  • Give meaningful names
  • Define well
  • Communicate them across boundaries

Cory Watson, Stripe - Building a Culture of Observability at Stripe


To create a culture of observability, how can we get others to agree and work toward it?

Where to begin? Spend time with the tools, improve if possible, replace if not, leverage past knowledge of teams

Empathy - People are busy, doing best with what they have, help people be great at their jobs

Nemawashi - Move slowly. Lay a foundation and gather feedback. (Write down and attribute feedback). Ask how you can improve.

Identify Power Users - Find interested parties, give them what they need, empower them to help others

What are you improving? How do you measure it?

Get started. Be willing to do the work, shave the preposterous line of yaks. Strike when opportunities arise (incidents). Stigmergy - how uncoordinated systems work together.

Advertise - promote accomplishments, and accomplishment of others.

Alerts with context - link to info, runbooks etc. Get feedback on alerts, was it useful?

Start small, seek feedback, think about your value, measure effectiveness

Kelsey Hightower, Google - /healthz


Kelsey gave a demo of the /healthz pattern, and how that can protect you from deploying non-functional software on a platform that can leverage internal health checks.

Stop reverse engineering apps and start monitoring from the inside

Move health/db checks and functional/smoke tests inside the app, and expose them over an HTTP endpoint

Ops need to move closer to the application.

Brian Smith, Facebook - The Art of Performance Monitoring


Gave an overview of some of the guiding ideas behind monitoring at facebook

Bad stuff:

  • High Cardinality - same notifications for 100x machines
  • Reactive Alarms - alarms which are no longer relevant
  • Tool Fatigue - too few/too many tools

It can be Mechanical, Simple and Obvious to do these things at the time, but the cumulative effect is a system that’s hard to maintain.

Properties of Good Alarms:

  • Signal
  • Actionability
  • Relevancy

Your dashboards are a debugger - metrics are a debugger in production.

Caitie McCaffrey, Twitter - Tackling Alert Fatigue

Slides | References | Video

When alerts are more often false than true, people become desensitised to alerts.

Unhappy customers are the result, but false alerts are also unplanned work, and a distraction from focusing on your core business.

Same problem experienced by nurses responding to alarms in hospitals. What they have done:

  • Increase thresholds
  • Only crisis alarms would emit audible alerts
  • Nursing staff required to tune false positive alerts

What Caitie’s team did:

  • Runbook and alert audits - ensure there are runbooks for alerts, templated, a single page for all alerts; each alert has customer impact and remediation steps. Importantly, this also includes notification steps.

  • Empower oncall - tune alert thresholds, delete alerts or re-time them (only alert during business hours)

  • Weekly on-call retro - handoff ongoing issues, review alerts, schedule work to improve on-call

This resulted in fewer alerts, and improved visibility on systems that alert a lot.

To prevent alert fatigue:

  • Critical alerts need to be actionable
  • Do not alert on machine specific metrics
  • Tech lead or Eng manager should be on call

Mark Imbriaco, Operable - Human Scale Systems


It’s common to say now that “Tools don’t matter” … but they do. We sweat the details of our tools because they matter. All software is horrible.

We operate in a complex Socio-Technical System. Human practitioners are the adaptable element of complex systems.

Make sure you think about the interface and interactions (human - software interactions)

  • Think about the intent, what problem are you likely to be solving (use cases)
  • Consistency is really important
  • Will it blend - how does it interact with other systems
  • Consider state of mind - high intensity situations/ tired operators

Sarah Hagan, Redfin - Going for Brokerage: People analytics at Redfin


Redfin is an online Estate Agency with agents on the ground

Monitoring hiring

  • Capture lots of data on the market
  • Where should we move?
  • How many staff should we have in each location?
  • Useful tooling for the audience
  • Hire employees rather than contractors, analyse sold house price data to make sure employees earn enough vs. commission agents

Monitoring employees

  • Customer reviews for agents
  • Agents paid based on rating
  • Let the customer monitor the business
  • Monitor loading capacity of agents

Monitoring culture

  • Internal forums for feedback on tooling.

Pete Cheslock, Threat Stack - Everything @obfuscurity Taught Me About Monitoring

Slides | Video

Told the story of his history of learning about monitoring, and how he has approached monitoring problems at his current startup.

A telemetry and alerting system is not a core competency.

  • Do simple things early when it makes sense (put metrics in logs).
  • When it’s necessary to get more data - just buy something.
  • Hosted TSDB is useful and just works, but there are faster, non-durable metrics which are important. So he used Graphite for 10s-interval metrics, with 2 collectd processes writing to two outputs
  • Ended up with a full graphite deployment

Operability Day Two

Day one is available here

Charity Majors, Parse/Facebook - Building a world class ops team


This was a talk focusing on bootstrapping an ops team for startups.

Do you need an ops team?

Ops engineering at scale is a specialised skillset. It is not someone to do all the annoying parts of running systems. Or do you need software engineers to get better at ops?

You need an ops team if you have hard problems:

  • extreme reliability
  • extreme scalability (3x-10x year over year)
  • extreme security
  • solving operational problems for the whole internet

What makes a good startup ops hire?

It’s not possible to hire people who are good at everything - unicorns. What you can get are engineers who are good at some things, bad at others. People who can learn on the fly are valuable.

“A good operations engineer is broadly literate and can go deep on at least one or two areas”

Great ops engineers:

  • strong automation instincts
  • ownership over their systems
  • strong opinions, weakly held
  • simplify
  • excellent communication skills, calm in a crisis
  • value process (as that is what stops you making the same mistakes over again)
  • empathy

Things that aren’t good indicators:

  • whiteboarding code
  • any particular technology or language
  • any particular degree
  • big company pedigree

Succeeds at a big company:

  • structured roadmap
  • execute well on small coherent slices
  • classical cs backgrounds
  • value cleanliness & correctness
  • technical depth

Succeeds at startup:

  • comfortable with chaos
  • knows when to solve 80% and move on
  • total responsibility for outcomes
  • good judgement
  • highly reactive
  • technical breadth

How do you interview and sort for these qualities?

Don’t hire for lack of weaknesses. Figure out what strengths you really need and hire for those.

Good questions:

  • leading and broad, probing the candidate’s self-reported strengths
  • related to your real problems
  • ask culture questions, screen for learned helplessness

Bad questions:

  • depend on a specific technology
  • designed to trip them up, looking for a reason to say no
  • deny candidates the resources they would use to solve something in the real world

You hired an ops engineer, now what?

How to spot a bad ops engineer:

  • tweaking indefinitely and pointlessly
  • walling off production from developers
  • adding complexity
  • won’t admit they don’t know things
  • disconnected from customer experience

How to lose good ops engineers:

  • all the responsibility, none of the authority
  • all the tedious shitwork
  • blameful culture
  • no interesting operational problems

David Mytton, Server Density - Human Ops - Scaling teams and handling incidents


This talk covered how incidents are handled at Server Density.

We should expect downtime - prepare, respond, postmortem.


Things that need to be in place before an incident

  • on-call schedule with primary & secondary
  • off call - 24hr recovery after overnight incident
  • docs, and must be located independently from primary infrastructure
  • key info must be available: team contacts, vendor contacts, credentials
  • plan for unexpected situations: loss of communication, loss of internet access
  • use war games to practise for incidents


Process to follow during an incident:

  • First responder
    • load incident response checklist
    • log into ops war room
    • log incident in jira
  • Follow Checklist(s)
    • due to complexity
    • easy to follow in times of stress and fatigue
    • take a beginner’s mind - ego can get in the way, don’t wing it
  • Key Principles
    • log everything (all commands run, by who and where and what the result was)
    • communicate frequently
    • gather the whole team for major incidents


Post-mortem, to do within a few days:

  • Tell the story of what happened - from your logs
  • Cover the appropriate technical detail
  • What failed, and why? How is it going to be fixed?

Emma Jane Hogbin Westby, Trillium Consultancy - Emphatically Empathetic

Emma talked about how she taught herself to be more empathetic.

Normal people lack empathy; it’s a skill that can be practised and learned.

What is empathy: ability to understand the feelings of another

Level 1: Care just enough to learn about a person’s life

Doing this improves team cohesion, but requires a time investment.

Collect stories - learn about people by asking them questions. Shut up and listen. Respond in a way to encourage more info gathering.

Later, refer back to stories and follow up for more information.

Level 2: Strategies to structure interactions

Doing this you can engineer successful outcomes, and improve capacity for diverse thinking. But you risk being perceived as manipulative.

It’s a mistake to believe there is only one way to have a connection. Try to uncover motivation, why do people behave the way they do?

There are three types of thinking strategies, and you’ll see language patterns that match each of them.

creative thinking: ‘can we try…’ ‘what about…’

understanding thinking: ‘so what you’re saying is…’ ‘just to clarify…’

decision thinking: ‘I’m ready to move on to…’ ‘last time we tried this…’

How can you create outcome based interactions for these sort of people? Perhaps you can plan for specific types of discussions in meeting agendas.

Find a system to use with your team to make communications more explicit, and to take advantage of the thinking strategies they use.

Level 3, engage with work from another’s perspective

This can foster creative problem solving. The risks are that it is potentially overwhelming, and can cause doubts about self-worth.

Seek to understand - complain about yourself from the other’s perspective, or situation. Live your day through the other’s constraints.

The thinking process should no more be left to chance than the practised delivery of a skill.

There was an interesting question after Emma’s talk - “How do I make Bob care about Dave from another team”. Her suggestion was to create a situation where they can bond over a common enemy - i.e. say something you know to be untrue and that they would both respond to in a similar way.

Scott Klein, statuspage.io - Effective Incident Communication

Remember that there is someone on the other end of our incidents who is affected personally.

The talk covered what to do before, during and after an incident. You need a dedicated place to communicate system status to your users.


Get a status page. It needs the following:

  • timestamps
  • to be very fast, very reliable
  • keep away from primary infrastructure, even DNS
  • contact info - give a way to get in touch


During the incident:

  • communicate early. Say you are investigating - it means ‘we have no clue but at least we’re not asleep’
  • communicate often. always communicate when the next update is.
  • communicate precisely: be very declarative
    • don’t do ETAs, they will disappoint people
    • don’t speculate; ‘we’re still tracking down the cause’
    • ‘verification of the fix is underway’ not ‘we think we fixed it’
  • communicate together. have pre-written templates.
  • one person needs to be assigned as incident communicator


After the incident:

  • apologise first
  • don’t name names
  • be personal. “I’m very sorry”. Take responsibility.
  • details inspire confidence
  • close the loop - what we’re doing about it

Why do this?

  • gain trust with users/customers
  • turn bad experience into good experience
  • service recovery paradox - people think more highly of a company if they respond properly.
  • show that you do your job well

Rich Archbold, Intercom.io - Leading a Team with Values

Talk covered Rich’s experience of introducing core values to drive the performance of the team. They reduced downtime, infrastructure costs, and the number of ops pages.

Enabled autonomy, distributed decision making.

Problems they were facing:

  • roadmap randomisation - easily distracted from what they planned to do
  • projects took a long time and were delivered late
  • not feeling like a tribe

Criteria for values:

  • fit with the business
  • personal, specific
  • aspirational and inspirational
  • drive daily decision making
  • not dogma - needed to be flexible

These are the Values they came up with:

  1. Security, Availability, Performance, Scalability, Cost - prioritize for maximum impact
  2. Faster, Safer, Easier Shipping
  3. Zero Touch Ops
  4. Run Less Software

Afterwards they gathered lots of metrics on unplanned work. From this they worked out that they needed to multiply estimates by 2.7 to get accurate roadmap planning.

Matthew Skelton, Skelton Thatcher Consulting - Un-Broken Logging, the foundation of software operability


The way we use logging is broken, how to make it more awesome

What is logging for? It provides an execution trace.

How is logging usually broken? It’s often unloved, discontinuous, contains errors only, bolted on, doesn’t have aggregation and search, and severities aren’t useful because they need to be determined up front.

Also, logs aren’t free. You need to allocate budget and time to make them useful.

Why do we log? For verification, traceability, accountability, and charting the waters.

How to make logging awesome

Continuous Event IDs - use them to represent distinct states. Describe what’s useful for the team to know and describe that as a separate state. Use enums.

Transaction tracing - create a unique-ish identifier for each request, and pass it through the layers.

Decouple Severity - allow configurable severity levels. Log level should not be fixed at compile or build time. Map Event IDs to a severity.

Log aggregation and search tools - as we move from monoliths to microservices, the debugger does not have the full view anymore. We need an aggregated view of logs across a system. Develop software using log aggregation as a first-class thing.

Design for logging - logging is another system component, and needs to be testable.

NTP - Time sync is crucially important for correlating log entries

Referenced the following video:

Evan Phoenix - Structured Logging

Gareth Rushgrove, Puppet Labs - Taking the Operating System out of operations


The age of the general purpose operating system is over. What does this mean for operators?

Lots of new OSes have appeared in the last year

New Breed:

  • Atomic (RedHat)
  • CoreOS
  • Snappy (Ubuntu, replaces dpkg with containers)
  • RancherOS (docker all the way down)
  • Nano - tiny alternative to Windows Server
  • VMWare Bonneville

Common themes:

  • Cluster native
  • RO file systems
  • Transactional updates
  • Integrated with containers

Why the interest in New OSes?

  • Lots of homogeneous workloads
  • Security is front page news
  • Size as a proxy for complexity
  • Utilisation matters at scale
  • Increasingly interacting with higher level abstractions anyway


Unikernels: compile an application down to a kernel; there is no userspace. Only include the capabilities and libraries you need - everything is opt-in.

  • Hypervisor/hardware isolation
  • Smaller attack surface
  • Less running code
  • Enforced immutability
  • No default remote access

What happens to operators?

Hypervisor becomes the “platform”.

Everything else as an application. Firewalls. Network Switches. IDS. Remote access.

Everyone not running the hypervisor is an application developer. Standards are required: platforms, containers, monitoring. Publish schemas rather than incompatible implementations in code.

Infrastructure is code.

Revolution not evolution. The distance between old infrastructure and new will be huge, in both the models of interaction and the skills required to operate them.


We have fundamental problems that date back more than 40 years. It might take a different evolutionary process to build better infrastructure. We may have to throw away things we care about, such as Linux. This is all driven by security concerns.

Ben Hughes, Etsy - Security for Non-Unicorns


Security is hard. Tiny little bugs turn into giant things.

You’re already being probed for security holes, do you want to know or not? Bug Bounties are a way of getting attackers working for you.

You need to prepare a lot for bug bounties. Try and get all the low hanging fruit yourself. The first few weeks will be hell.

With much of our infrastructure in the cloud, it’s easy to expose sensitive information, such as credentials, in places like GitHub. Gitrob helps to analyse git repos for you.
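Not Gitrob itself, but a toy illustration of the idea it automates - scanning a repo's whole history for strings that look like credentials. The repo, the "secret", and the grep patterns are all fabricated for the demo; a real scanner has many more patterns:

```shell
#!/bin/sh
# Toy history scan - the idea behind tools like Gitrob, not the tool
# itself. The repo and the "secret" below are made up for the demo.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git config user.email you@example.com && git config user.name you

# simulate accidentally committing a credential...
echo "aws_key=AKIAIOSFODNN7EXAMPLE" > settings.env
git add settings.env && git commit -qm "add settings"

# ...then "fixing" it by deleting the file in a later commit
git rm -q settings.env && git commit -qm "remove settings"

# the secret is still in history, and a scan of all diffs finds it
git log -p --all | grep -E 'AKIA[0-9A-Z]{16}|BEGIN (RSA )?PRIVATE KEY'
```

Note that deleting the file in a later commit doesn't help: the secret remains reachable in history, which is exactly why history-scanning tools exist.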

People trust random files off the internet - like docker images, vagrant images, and curl|bash installs etc.

Operability Day One

Today I’ve been at the operability.io conference in London.

The conference is intended to be focused on the ops side of devops, and how to make software operable.

This is a single track conference, here is my personal summary and notes from the talks:

Andrew Clay Shafer, Pivotal - What even is operable?


I’ve seen a few talks by Andrew before, and they’re full of challenging, rapid fire ideas and only loosely tied together - it’s more an expression of his world view rather than a talk on a specific subject. With that in mind, I’m still going to try and summarise it.

Andrew asks the question “what even is operable”, and in the end he came to the conclusion that Operability is the intersection of capability and usability.

There is an emergent architecture, which he calls cloud native: a set of patterns that emerged in organisations that deliver highly available applications continuously at scale - like Amazon, Google, Twitter, Facebook etc. There are many associated labels, like devops, continuous delivery and microservices, and these are all inter-related as part of that architecture.

“Do not seek to follow in the footsteps of the wise. Seek what they sought.” - Matsuo Basho

The human tendency is to fixate on the solution. We need to think about the problem more. Principles > Practices > Tools. You’re not going to be able to do the right things until you internalise the principles. Otherwise you are just imitating or cargo culting.

Equally, we can end up focusing on automation tools and capabilities rather than on what is being automated.

Other pertinent points:

  • if Tetris has taught me anything, it’s that errors pile up and accomplishments disappear
  • systems thinking teaches us to minimise resistance rather than push harder
  • highlighted the Borg paper, and that all tasks run in Borg run an HTTP server that reports health status and metrics
  • operations problems become easier when apps are aware of their own health
  • worse is better won: broken gets fixed but shitty lasts forever

Referenced the following for reading:

Colin Humphreys, CloudCredo - inoperability.io

Colin told a war story about what happens when you completely ignore operations - what’s the worst that could happen?

A game he worked on had 5 million users and “Launch issues”. It hardly worked, because they expected 20k users and ran initially with no budget for operations.

There was a big launch, but the infrastructure for the system was built the day before and completely untested. It was swamped in 10 seconds. 1.7 million people tried to play on the first day, but couldn’t. There were various media reports of the failure.

As a result, they managed to steal some funding from the marketing budget and scale 100x, add caching, and throw lots of hardware at the problem. To use the 64 DB servers, they used a sharded PHP ORM to distribute API server traffic.

He did get it working, but in time they found transactions weren’t working. This was supposed to be addressed within the application, but wasn’t because of differences between the development and production environments, and as a result cash was disappearing from the game. In the end, everyone’s scores had to be reset.

Personally, Colin worked 54 hours solid to get the thing running. For the three months the site ran, he worked 100-hour weeks.

This is how bad a project can get. “I nearly died”. He almost certainly burned out. He took a personal sense of pride in the project, and was horrible to other people around him. Heroism != success.


  • Work as a team
  • Communicate

Anthony Eden, DNSimple - How small teams accomplish big things

Slides | Video

Anthony’s talk was about scaling a team, and how they had to scale their operational processes - as a result of various experiences, but none more so than losing one of the founders, someone who knew everything about everything.

Specific Processes:

Incident Response Plan, consisting of 4 points: Assess, Communicate, Respond, Document.

Assess - think before you act. Determine impact. Have a threshold for requesting help. >10 mins results in everyone in a Google Hangout.

Communicate - 1 person responsible for communicating to customers. Update Twitter and the status page.

Respond - minimise impact. Triage first. Everyone asks questions and proposes solutions. Consider available actions and act on the best available.

Document - post mortem. What happened? Why did it happen? How did we respond and recover? How might we prevent similar issues from occurring again?

Other processes were mentioned: On-call rotation. Security Policies. Security escalation policy. System security, RBAC. Track CVEs. Password rotation - especially on admin passwords. etc.

What makes a good process?

  • Born from Experience
  • Written down and available to all
  • Concise & Clear
  • Act as guidance

Once they’re in place you need to:

Execute it, to test it out. Prune where appropriate. Automate where that makes sense.

Bridget Kromhout, Pivotal - distributed: of systems and teams

Slides | Video

Bridget’s talk compared how distributed systems are complex, but so are distributed teams in many of the same ways.

Firstly, distributed != remote. Having a few people out of the office is not the same.

What’s important in teams is people > tools. Focusing on the people and how they communicate is more important than the tools they use.

She made various points which are important for distributed teams:

Durable communication encourage honesty, transparency and helps future you - “durable communication exhibits the same characteristics as accidental convenient communication in a co-located space. The powerful difference is how inclusive and transparent it is.” - Casey West

  • Let your team know when you’ll be unavailable.
  • Tell the team what you’re doing.
  • Misunderstandings are easy; you need to over-communicate. Especially when expressing emotions, it’s easy to misinterpret textual communication.
  • Be explicit about decisions you’re making.
  • Distribute decision making

Colin Hemmings, dataloop.io - In god we trust, all else bring data

This talk was about the experience of building dataloop as a startup, from working in other companies, and what was learned speaking to 60 companies about monitoring; it focused on dashboards.

Generally see the following kinds of dashboards:

  • Analytics dashboards, to diagnose performance issues. Low level, detailed info.
  • NOC dashboard, high level overview of services.
  • Team dashboards, overview of everything not just technical elements - includes business metrics.
  • Public dashboards, high level, simplified & sanitised marketing exercises.

Keeping people in touch with reality is the problem, along with knowing the right thing to work on. Discussions on what features to work on get opinionated. There is data within our applications that we can use to make decisions, and it can be represented on dashboards.

  • Stability dashboards: general performance, known trouble spots.
  • Feature dashboards: customer driven, features forum. “I suggest you…” & voting
  • Release dashboards: dashboards for monitoring the continuous delivery pipeline

Elik Eizenberg, BigPanda - Alert Correlation in modern production environments

My personal favourite talk of the first day.

Elik’s contention is that there is a lack of automation in regard to responding to alerts.

Incidents are composed of many distinct symptoms, but monitoring tools don’t correlate alerts on those symptoms for us into a single incident.

The number of alerts received might not be proportional to number of incidents. The number of incidents experienced may be similar day to day, but the impact of the incidents can be very different - hence the number of alerts being higher sometimes.

Existing Approaches:

  • compound metrics (i.e. aggregations, or a compound metric build from many hosts). This is relatively effective, but alerts are received late, and you can miss symptoms in buildup to the alert triggering.
  • service hierarchy, i.e. hierarchy of dependencies related to a service. Problem with this approach is that it’s hard to create & manage. Applications and their dependencies are generally not hierarchical.

“What I would like to advocate”: Stateful Alert Correlation

Alerts with a sense of time, aware of what happened before now.

Alerts can create a new incident, or link themselves to an existing incident if it’s determined they’re related.

How do you know that an alert belongs to an incident?

  • Topology - some tag that every alert has. (Service? Datacenter? Role?)
  • Time - alerts occurring close in time to another alert
  • Modeling - learning if multiple alerts tend to fire within a short timeframe
  • Training - Machine Learning. User feedback if correlation good or bad.

Even basic heuristics are effective. There is lots of value to be gained from just applying Topology and Time.
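A toy sketch of the Topology + Time idea. The input format, the service tags, and the 120-second window are all invented for illustration - this is the shape of the heuristic, not BigPanda's implementation:

```shell
#!/bin/sh
# Toy correlation pass: alerts sharing a topology tag (here, the
# service name) within 120s of an incident's latest alert join that
# incident; otherwise a new incident is opened.
set -e
tmp=$(mktemp -d)
cat > "$tmp/alerts.txt" <<'EOF'
100 api high-latency
130 api error-rate
150 db replication-lag
700 api high-latency
EOF

sort -n "$tmp/alerts.txt" | awk '
{
    t = $1; tag = $2
    # open a new incident unless this tag fired recently
    if (!(tag in last) || t - last[tag] > 120) {
        incidents++
        id[tag] = incidents
    }
    last[tag] = t
    printf "incident %d: %s %s at t=%d\n", id[tag], tag, $3, t
}
END { printf "%d alerts -> %d incidents\n", NR, incidents }
'
```

On this input it folds four alerts into three incidents: the two api alerts 30 seconds apart merge into one, while the api alert at t=700 is far enough away to open a new incident.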

There were two other talks after this, but I had to leave early. Sorry to the presenters for missing their talks!

The Path to Peer Review

As a sysadmin, I’ve never been part of a team that did peer review well. When you’re doing a piece of work and you want to check you’re doing the right things, how do you get feedback?

Do you send an email out saying “can you look at this?”

Do you have someone look over your shoulder at your monitor?

Do you have discussions about what you’re going to do?

Sometimes I’ve done these things, and they’ve been partly effective - but usually they just get a reply of “Looks good to me!”

Most of the sysadmins I’ve worked with will just make a change because they want to, or were asked to, and don’t tell anyone. Or they want to be heroes who surprise everyone by sweeping in with an amazing solution that they’ve been working on secretly.

I don’t want to work where people are trying to perform heroics. I hate surprises. Having done all those individualistic stupid things myself, I want to work in a team where we work together on problems out in the open. When you involve others, they are more engaged and feel part of the decision making process. And their feedback makes you produce better work.

My current role has the best culture of reviewing work that I’ve experienced. But we had to create it for ourselves, and this is what we did.

We put all our work in version control

When I started this role, the first project I worked on was to move our configs into git. Most of those configs are stored on shared NFS volumes which are available on all hosts. Previously people made config changes by making a copy of the file they wanted to change and then made the change to the copy. Once that is ready they would take a backup of the production copy of the file, and copy their new version into place.

Importing the files into repos was generally straightforward, but sometimes there was automatically generated content that needed excluding in .gitignore. To deploy the repo we added a post-update hook on the git server that would run git pull from the correct network path.
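A sketch of how such a deploy hook could look, demonstrated end-to-end in a throwaway directory. The repo name, paths and branch are illustrative, not our real setup; the real hook held the actual NFS path:

```shell
#!/bin/sh
# End-to-end sketch of deploy-on-push. All names are illustrative.
set -e
tmp=$(mktemp -d)

# "server": the bare repo people push to
git init -q --bare "$tmp/configs.git"
(cd "$tmp/configs.git" && git symbolic-ref HEAD refs/heads/master)

# "NFS path": the deployed working copy that every host reads
git clone -q "$tmp/configs.git" "$tmp/deploy" 2>/dev/null

# the hook: after every push, update the deployed checkout
cat > "$tmp/configs.git/hooks/post-update" <<EOF
#!/bin/sh
cd "$tmp/deploy" || exit 1
unset GIT_DIR   # hooks run with GIT_DIR pointing at the bare repo
git pull -q origin master
EOF
chmod +x "$tmp/configs.git/hooks/post-update"

# an engineer clones, commits a config change, and pushes...
git clone -q "$tmp/configs.git" "$tmp/work" 2>/dev/null
cd "$tmp/work"
git config user.email you@example.com && git config user.name you
echo "max_connections = 100" > app.conf
git add app.conf && git commit -qm "tune app.conf"
git push -q origin master 2>/dev/null

# ...and the hook has already deployed it to the shared path
cat "$tmp/deploy/app.conf"
```

The `unset GIT_DIR` is the important detail: server-side hooks run inside the bare repository with `GIT_DIR` set, so without it the pull would operate on the wrong repository instead of the deployed checkout.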

But we also wanted to catch when people would change things outside of git and do so without overwriting their changes. This would allow us to identify who was changing things and make sure they knew the new git based process. To do this we added a pre-receive hook on the server that would run git status in the destination path and look for changes.
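A minimal sketch of that pre-receive check. One detail here is an assumption about policy: this version refuses the push while the deployed checkout is dirty, rather than just reporting the change. Paths are again illustrative:

```shell
#!/bin/sh
# End-to-end sketch of catching out-of-band edits with pre-receive.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/configs.git"
(cd "$tmp/configs.git" && git symbolic-ref HEAD refs/heads/master)
git clone -q "$tmp/configs.git" "$tmp/deploy" 2>/dev/null

# the hook: run `git status` in the deployed path, flag any edits
cat > "$tmp/configs.git/hooks/pre-receive" <<EOF
#!/bin/sh
cd "$tmp/deploy" || exit 1
unset GIT_DIR
if [ -n "\$(git status --porcelain)" ]; then
    echo "deployed checkout has uncommitted changes, refusing push" >&2
    git status --short >&2
    exit 1
fi
EOF
chmod +x "$tmp/configs.git/hooks/pre-receive"

git clone -q "$tmp/configs.git" "$tmp/work" 2>/dev/null
cd "$tmp/work"
git config user.email you@example.com && git config user.name you
echo "listen 8080" > app.conf
git add app.conf && git commit -qm "add app.conf"
git push -q origin master 2>/dev/null          # clean checkout: accepted
(cd "$tmp/deploy" && git pull -q origin master 2>/dev/null)

echo "listen 9090" > "$tmp/deploy/app.conf"    # an edit made outside git

echo "listen 8081" > app.conf
git add app.conf && git commit -qm "move listener"
if git push -q origin master 2>/dev/null; then
    echo "push unexpectedly accepted"
else
    echo "push refused until the out-of-band change is reconciled"
fi
```

Because pre-receive runs before any refs are updated, a non-zero exit rejects the whole push, and its stderr is shown to the person pushing - which is how they learn about the new git-based process.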

That was the first step in changing the way we worked. It wasn’t a big change, but it got everyone fairly comfortable with using git. The synchronous nature of the deployment was a big plus too, because all the feedback about success or failure would appear in your shell after running git push.

Then we generated notifications

This was great progress, and we got more and more of our configs into this system over time. The next thing we wanted to do was to produce notifications about what’s changing. This would allow us to catch mistakes, find bad changes that broke things, and to understand who is making changes to what systems.

We did two things to achieve this - we made all these repos send email changelogs and diffs whenever someone pushed a change, and we created an IM bot that would publish details of changes into a group chat room.

This produced a great audit trail of changes happening to the system. But whenever you saw a change break something, in hindsight you thought: I knew that change would break, and I could have stopped it before it was deployed.

Finally, we added Pull Requests

We knew we wanted to implement Pull Requests but we couldn’t do this with the git server software we were originally using.

Since we had started using Confluence and JIRA for our intranet and issue tracking, we moved all our git repos to Stash, which is Atlassian’s github-like git server. This provided us with the functionality we wanted.

On a single repo, we enabled a Pull Request workflow - no one could commit to master any more and everyone had to create a new branch and raise a PR for the changes they wanted merged.

We wrote up some documentation on how to use git branching and raise PRs via the stash web interface, and explained to everyone how it all worked.

We chose a repo that was relatively new, so the workflow was part of the process of learning to work with the new software. It was also one that everyone in the team would have to work with, so that everyone was exposed to the workflow as soon as possible.

To deploy the approved changes, we used the Jenkins plugin for Stash to notify Jenkins when there were changes to the repository. Jenkins then ran a job that did the same thing as our previous post-update hooks - ran git pull in the correct network location.

Running the deployment asynchronously like this felt like we were losing something - if there was a problem you were sent an email & IM from Jenkins, but this felt less urgent than a bunch of red text in your terminal. But for the benefit of review, this was a price worth paying.

In the interim we had moved the checks for changes being made outside of git into our alerting system, so we could catch them earlier than when the next person went to make a change. This meant we didn’t have to implement these checks as part of the git workflow - if there was still a problem we let the hook job fail, and it could be re-run from Jenkins once the cause was resolved.

Over time, we moved all the other repositories across to this workflow, starting with the repos with the highest number of risky changes. But for a number of repos, we kept the old workflow with synchronous deployment hooks, because all the changes being made were low risk and well-established practice.

Initially, slowing down is hard

The hardest thing to adapt to was changing the perceived pace that people were working. Everyone was very used to being able to make the change they wanted immediately, and close that issue in JIRA straight away. That’s how we judged how long a piece of work takes, but that doesn’t take into account the time spent troubleshooting and doing unplanned work.

What we were doing was moving more of the work up front, where you can fix problems with less disruption. But making that adjustment to the way you work can be really hard because you perceive the process to be slower.

Everyone in our team is a good sport and willing to give things a go - but as much as we could try to explain the benefit of slowing down, you only realise how much better it is by doing it over and over, and that takes time.

Sometimes it’s necessary to move faster, and we manage that simply - if someone needs a review now, they ask someone and explain why it’s urgent. As a reviewer, if I’m busy I’ll get people to prioritise their requests by asking “I’m probably not going to be able to review everything in my queue today. Is there anything that needs looking at now?”

In time the benefits are demonstrated

By using Pull Requests we created an asynchronous feedback system. First you propose what you’re going to do in JIRA. Then you implement it how you think it should be done and create a PR. Then when a reviewer is available they’ll provide feedback or approve the change. You keep making updates to your PR and JIRA until the change is approved or declined.

With time, everyone experienced all of the following benefits of that feedback:

Catch mistakes before they are deployed

This was what we set out to do! There were breaking changes made before, and there are fewer of them now.

Learn about dependencies between components

A common type of feedback is “How is this change going to affect X?” - sometimes the requestor has already considered that and can explain the impact and any steps they took to deal with X. But if they haven’t considered it, they need to research it. That way they learn more about how things are connected and have a greater appreciation of how the system works as a whole.

Enforce consistent style and approaches

Everyone has their own preferred style of text editing and coding. With a PR we can say this is the style we want to use, and enforce it. The tidier you keep your configs and code, the more respect others will have for them.

It’s massively helpful to be told about an existing function that achieves what you’re trying to do, or that there are existing examples of approaches to solving a problem. This can help you learn better techniques and avoid duplicating code.

Identify risky changes

With changes to fragile systems, you’re never confident about hitting deploy even if the change looks good. Until it’s in production and put through its paces there is risk. So this has allowed us to schedule deploying changes for times when the impact will be lower, or deploy the change to a subset of users.

It’s also stopped “push and run” scenarios - we avoid merges after 5pm, they can always wait for tomorrow morning!

Explain what you’re trying to do

Much of the review process is not about identifying problems with configs and code, but simply being aware of and understanding the changes that are taking place. This is invaluable to me as a reviewer.

So, when raising a Pull Request, it’s expected that each commit has a reference to the JIRA issue related to the change. The Pull Request can have a comment about what is changing and why, and that can be very helpful, but the explanation of why the change is taking place must live in the JIRA issue.

This way, when looking back at the change history we can also reference the motivations for making that change and see the bigger picture beyond the commit message.

People want to have their work reviewed

Somewhere along the way, getting your work reviewed became desirable. You realise the earlier you put your work out there for review, the earlier you get feedback and the less likely you are to spend time doing the wrong thing.

We still have a number of repos that anyone can change by pushing straight to master; there are no controls because these changes are considered safe. But people choose to create branches and PRs for their changes to these repos, because they want to have their work reviewed.

Change takes time…

Going from no version control to making this cultural change in the way we approached our work took about 2.5 years.

Throughout this period there was no grand plan or defined scenario that we were trying to achieve. At each stage, we could only see a short way forward. We had some things we were looking to do better, so we experimented. When we found what worked, we made sure everyone kept doing it that way.

At the start, I had no idea that peer review of sysadmin work would be done via code review. As we moved things into git and saw the benefits, we wanted to manage everything the same way, and that drove moving more things into git.

We moved slowly because there was other work going on and we needed to get people comfortable with the new way of working and tooling before we could ask more of them. Change takes time, and many of our team have benefitted from seeing that incremental process take place.

It’s been gratifying that the newest team members who have joined since we established PRs have said “I was uncertain about it at first, but now I get it. It really works.”

… and involves lots of painstaking work

To get to where we are has meant spending a lot of time setting up new repositories, separating files that are managed by humans from those that are automatically generated, migrating repositories to Stash, creating deploy hooks, explaining how git works, and most of all making sure you’re providing useful reviews and that they happen regularly.

It’s one thing to start setting up a couple of repos like this – but to fully establish the change you need to do lots of boring work to make sure everything is migrated, even the more difficult cases. It’s important that everything is managed consistently, within one system.

Monitorama Roundup Part 2

Part one is available here

During the second day there were multiple tracks available - I mostly followed the workshop track, only catching one presentation on the speaker track.

Florian Forster, collectd - Collecting custom metrics


Described the collectd data model and how to use the Exec plugin to execute arbitrary scripts for collecting custom metrics.
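Exec plugin scripts emit metrics by printing PUTVAL lines to stdout, which collectd parses into its data model. A rough sketch of such a script (the plugin and metric names here are made up):

```python
#!/usr/bin/env python3
"""Sketch of a custom-metrics script for collectd's Exec plugin.
The hostname/metric names are hypothetical; collectd passes the real
hostname and interval in via environment variables."""
import os
import time

def putval_line(host, plugin, type_, value, interval=10):
    # collectd identifier format: host/plugin[-instance]/type[-instance]
    # "N:" means "now" for the timestamp.
    return 'PUTVAL "%s/%s/%s" interval=%d N:%s' % (
        host, plugin, type_, interval, value)

def main():
    host = os.environ.get("COLLECTD_HOSTNAME", "example.host")
    interval = int(float(os.environ.get("COLLECTD_INTERVAL", 10)))
    while True:
        # A stand-in custom metric: 1-minute load average.
        load1 = os.getloadavg()[0]
        print(putval_line(host, "exec-myscript", "gauge-load1",
                          load1, interval), flush=True)
        time.sleep(interval)

# As the actual Exec script, this would end by calling: main()
```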

Also presented a statsd plugin - implementation of statsd network protocol inside collectd.

Abe Stanway - Kale (Skyline and Oculus)


Presentation of the architecture of the Kale suite of tools recently released by Etsy.

You can find a better description on the etsy blog

Skyline - Analyse time series for anomalies in (almost) real time

The setup at Etsy includes 250k metrics; anomaly discovery takes about 70 seconds, and storing 24 hours of metrics in memory requires 64GB.

  • carbon-relay is used to forward metrics to the horizon listener
  • metrics are stored in redis
  • data is stored in redis as messagepack (allows efficient binary array stream)
  • roomba runs intermittently to clean up old metrics from redis
  • the analyzer does its thing and writes info to disk as JSON for the web front end

To identify anomalies, skyline uses some of the techniques that Abe talked about in his presentation the previous day. It uses the consensus model, where a number of models are used and they vote - so if a majority of models detect an anomaly, then that is reported.
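The voting idea can be sketched in a few lines. This is a toy illustration, not Skyline's code: the three simplified detectors below stand in for Skyline's real algorithms, and a point is only reported when a majority agree, which damps the false positives of any single model.

```python
"""Toy consensus-based anomaly voting (not Skyline's actual code)."""
import statistics

def stddev_from_mean(series, threshold=3.0):
    # Flag the last point if it sits more than `threshold` standard
    # deviations from the mean of the whole series.
    mean = statistics.mean(series)
    stdev = statistics.pstdev(series)
    return stdev > 0 and abs(series[-1] - mean) > threshold * stdev

def deviation_from_median(series, factor=6.0):
    # Flag the last point if it is far from the median, scaled by the
    # median absolute deviation (robust to existing outliers).
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    return mad > 0 and abs(series[-1] - med) > factor * mad

def last_vs_recent_window(series, window=10, factor=3.0):
    # Flag the last point if it deviates sharply from the few points
    # immediately before it.
    recent = series[:-1][-window:]
    mean = statistics.mean(recent)
    stdev = statistics.pstdev(recent)
    return stdev > 0 and abs(series[-1] - mean) > factor * stdev

DETECTORS = [stddev_from_mean, deviation_from_median, last_vs_recent_window]

def is_anomalous(series, consensus=2):
    # Report an anomaly only when at least `consensus` detectors agree.
    votes = sum(1 for detect in DETECTORS if detect(series))
    return votes >= consensus
```

A sudden spike on a steady baseline gets all three votes; a point that only looks odd to one model is ignored.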

Oculus - analyse time series for correlation

Oculus figures out which time series are correlated using Euclidean distance - the difference between time series values.

It also uses Dynamic Time Warping to allow for phase shifts, where the change in one series occurs later than in the other. But this is slow, so it’s targeted at the time series that could be correlated, found by comparing shape descriptions.

  • data pulled from skyline redis and stored in elasticsearch
  • time series converted to shape description (limited number of keywords that describe the pattern)
  • phrase search done for shared shape description fingerprints
  • run dynamic time warping on that data
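Toy versions of the two distance measures make the difference concrete. This is not Oculus's code, just the textbook definitions: plain Euclidean distance, and the classic dynamic-programming DTW cost that tolerates phase shifts.

```python
"""Euclidean distance vs. dynamic time warping (textbook sketches,
not Oculus's implementation)."""
import math

def euclidean(a, b):
    # Pointwise distance between two equal-length series.
    assert len(a) == len(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    # O(n*m) dynamic programming: cost of the cheapest monotonic
    # alignment of the two series, paying the pointwise difference
    # at each step.
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match both
                                 cost[i - 1][j],      # advance in a
                                 cost[i][j - 1])      # advance in b
    return cost[n][m]
```

A spike shifted by one sample has a non-zero Euclidean distance but a DTW cost of zero, which is exactly the phase tolerance described above.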

Pierre-Yves Ritschard - Riemann


Unfortunately this presentation was marred by a projector/slide deck failure which made it hard to follow. It was incredibly disappointing because there was a lot of positive discussion around Riemann and I was looking forward to a better exposition of the tool.

Riemann is an event stream processing tool.

  • All events sent to riemann have a key value data structure - for logs, metrics, etc.
  • You can use all your current collectors (collectd, logstash, etc.), and it has an in-app statsd replacement
  • Those events can be manipulated in many ways, and sent to many outputs
  • The query language is Clojure, which is data-driven and from the Lisp family
  • There is storage available for event correlation, but I didn’t really understand it from his discussion

Devdas Bhagat - Big Graphite


This workshop covered how booking.com have scaled their graphite setup.

They have somewhere in the region of 5000 hosts and multiple terabytes of data stored in whisper.

Both IO and CPU have become bottlenecks and in each instance they have thrown hardware at the problem to run more agents and shard their data.

I/O problems:

  • Ran into IO wall, disks 100% writing. Lots of seeking
  • Have ended up using SSD drives in RAID0
  • Sharding becomes hard to maintain and balance
    • Don’t know in advance in which namespace metrics will be created
    • Rebalancing is tricky when adding more backends - they have a script that replicates the graphite hashing function, and they move things manually
  • Found SSDs not as reliable as spinning disk under high update conditions
    • Lots of drive failures, so replicate data to separate datacenters to provide availability
  • FusionIO performance was no improvement over SSDs (I find this hard to believe)
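The rebalancing pain comes from how hash-based sharding assigns metrics to backends. The sketch below is a generic consistent-hash ring in the spirit of carbon's sharding, not carbon's exact implementation: each backend takes many positions on a ring, a metric goes to the first backend at or after its hash, and a rebalancing script can replay the same function to work out exactly which metrics need to move.

```python
"""Generic consistent-hash ring sketch (not carbon's exact code)."""
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def _position(self, key):
        # First 8 hex chars of the md5 digest as a 32-bit ring position.
        return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

    def add_node(self, node):
        # Many positions per node smooths out the key distribution.
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._position("%s:%d" % (node, i)), node))

    def get_node(self, key):
        # A key maps to the first node position at or after its hash,
        # wrapping around the end of the ring.
        pos = self._position(key)
        index = bisect.bisect(self.ring, (pos,)) % len(self.ring)
        return self.ring[index][1]
```

Adding a backend only claims the positions it inserts, so only the keys landing in those arcs move - which is why a script that replicates the hashing function can enumerate precisely what to shuffle.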

CPU problems:

  • Relays start maxing out CPUs
    • Multiple relay hosts (per datacenter)
    • Multiple relays on each host (1 per core)
    • Use haproxy load balancer (prevent losing metrics)

Other Problems:

  • Software breaking on updates (whisper failure on upgrade)
  • People can use the software in surprising ways - storing non-time-series data, or one record a day

David Goodlad - Infrastructure is Secondary

Slides | Video

David presented that your primary metrics should be your business metrics, and that your infrastructure metrics are secondary. Which is not to say infrastructure metrics aren’t important, but the business metrics are measures of how your business is performing and this is the data which you should be alerting on.

  • Alerts should be informational and actionable - not just “cool story, bro”
  • Consider what matters to customers - i.e. instead of measuring queue size, use time to process
  • When a business metric alerts, then correlate against infrastructure monitoring
  • Keep the infrastructure thresholds, but don’t alert on that information - you can access it when necessary

He gave a great one liner to help decide what you should be measuring - “What would get your boss fired?” - Measure these things deeply.

Also pushed the idea of sharing your information outside the team - to be more transparent and visible to the rest of the business, particularly since the information you hold is business metrics. This will provide feedback about the metrics which are important to others.

Michael Gorsuch - Graph Automation

This workshop was a practical introduction to instrumentation and exploration - stepping through configuring StatsD, graphite and collectd to instrument an application and exploring graphs using descartes.