#sysadminsucks

Monitorama 2018 AMS Day 2

Prometheus for Practitioners: Migrating to Prometheus at Fastly - Marcus Barczak, Fastly

First monitoring system: Ganglia + Icinga. Tried to move to SaaS, ended up supporting two systems, growing pains doubled.

Company infrastructure was scaling horizontally, requiring the monitoring to scale vertically.

Project to review monitoring and replace existing systems: scale with infrastructure growth, easy to deploy and operate, engineer friendly instrumentation, first class API support.

Get started:

  • build proof of concept
  • pair with pilot team to instrument their services
  • iterate through the rest

Infrastructure:

  • Pair of instances in each POP, both scraping same devices in the POP.
  • Frontend in GCP. Grafana, nginx router, federation and alert manager.

Prometheus server:

  • ghostunnel for TLS termination and auth
  • service discovery sidecar
  • rules loader

Created their own service discovery and proxy for prometheus.
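
The talk doesn't detail the sidecar, but the general pattern can be approximated with Prometheus's file-based service discovery: a small script periodically writes a file_sd JSON target list that the server picks up on its next refresh. A minimal sketch in Python, where the inventory source, hostnames and file path are illustrative assumptions rather than Fastly's actual tooling:

```python
import json
import os
import tempfile

def discover_pop_targets(pop):
    """Hypothetical inventory lookup; in practice this would query whatever
    source of truth holds the devices in a POP."""
    return [
        {"host": "switch-1.%s.example.net" % pop, "port": 9100, "role": "switch"},
        {"host": "cache-1.%s.example.net" % pop, "port": 9100, "role": "cache"},
    ]

def write_file_sd(pop, path_template="/etc/prometheus/targets/%s.json"):
    """Write a Prometheus file_sd target list atomically."""
    targets = [
        {
            "targets": ["%s:%d" % (t["host"], t["port"])],
            "labels": {"pop": pop, "role": t["role"]},
        }
        for t in discover_pop_targets(pop)
    ]
    out = path_template % pop
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(out))
    with os.fdopen(fd, "w") as f:
        json.dump(targets, f, indent=2)
    os.rename(tmp, out)  # atomic swap so Prometheus never reads a partial file

if __name__ == "__main__":
    write_file_sd("ams")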

Rough Edges:

  • Metrics exploration without prior knowledge
  • Alertmanager too flexible
  • Federation and global views
  • Long term storage still an open question

Rethinking UX for AI-driven Monitoring Tools - Stephen Boak, Datadog

This was a really great talk and my notes could never do it justice.

https://bit.ly/2MNDlZ9

Building a monitoring strategy and gaining consensus - Gregory Parker, Trevor Morgan, Standard Chartered Bank

No monitoring strategy for company

Back to Basics: What do your users want? Where are their pain points? Put the building blocks in place. Forget what you have and focus on what you need.

  • Talk to business and tech influencers
  • Listen to feedback and understand constraints
  • Publish clear and detailed first principles based strategy
  • Get feedback from stakeholders, revise early and often

Get teams to plug in. Avoid mandates, generate awareness, encourage modern solutions, enlist champions.

Build a data lake with common event structure

What went well:

  • Engaged with customers throughout
  • Inclusive approach, extensible data model
  • Roadshows, Show and Tell, Awareness sessions

What could be changed:

  • Spend more time gaining consensus early stage
  • Work out how to pay for it
  • Focus less on legacy technologies

Self-hosted & open-source time series analysis for your infrastructure - Alexey Velikiy

Analytic unit - should be isolated and customisable

Hastic - python app for processing time series data & plugin for grafana with a UI for labelling and rendering predictions

Label patterns in grafana, and then it will generate detections. Pattern is a “shape” - peaks, jumps, troughs etc.

Custom models in python.

https://github.com/hastic/hastic-server

Monitoring Serverless Things - Mandy Waite, Google

Functions as a Service

By definition, the servers are not accessible. More pieces to monitor and operate.

Retrying function executions is the simplest way to handle failures with serverless functions. Making a function idempotent is required to make it behave correctly on retried execution attempts.

Two distinct variants of functions: HTTP functions (at-most-once execution) and background functions (at-least-once execution).
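
A minimal sketch of the idempotency idea for at-least-once background functions: record a per-event key before performing the side effect so that a retried delivery becomes a no-op. The in-memory set and event shape below are assumptions for illustration; a real function would use a durable store.

```python
processed = set()  # stand-in for a durable store such as a database table

def charge_customer(customer, amount):
    print("charging %s %.2f" % (customer, amount))

def handle_event(event):
    """Background function that tolerates at-least-once delivery."""
    key = event["id"]          # assumes the platform gives each event a stable id
    if key in processed:       # a retry of an event we already handled
        return "duplicate, skipping"
    charge_customer(event["customer"], event["amount"])  # the side effect
    # in a real system the marker and the side effect should commit together,
    # otherwise a crash between the two still allows a duplicate
    processed.add(key)
    return "processed"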

How AI Helps Observe Decentralised Systems - Dominic Wellington, Moogsoft

We are living in a different world from the one our systems and processes were designed for

Old world: Static environment, manageable alert volumes. New world: fast growing, fast changing environment, massive alert volumes.

What happens on network edge is now more important, and the edge is far away.

Single faults no longer cause impacts: fault tolerance does not mean zero incidents.

So how do we fix monitoring?

Existing monitoring is incident driven. Assumption that faults are easy to detect, and all failure conditions are knowable.

Dashboards are artefacts of a past failure. Internal health is irrelevant, user requests are what they care about. (Symptom based monitoring)

Observability: continuous stream, high cardinality, build into infra and apps, insight driven. “Actionable Understanding”

Monitoring as it should be: everything passes over a message bus. Then add AI.

AI in IT ops: bring interesting information to the attention of human operators - without having to define it beforehand.

Where to use AI in IT ops:

  • Ingestion: reduce noise and false alarms.
  • Correlation: identify related events across domains, avoid duplication of effort and missed signals.
  • Collaboration: intelligent teaming, root cause analysis, knowledge capture.

Teaching the machine. Inputs matter, choose the right feature vectors. Regression problems: continuous distribution. Classification vs clustering: set of categories.

In Practice:

  • Speed matters, work in real time
  • You don’t know what you need to know
  • AI is a tool, not magic
  • Process is how you make sure it works for users

Incremental-decremental Methods For Real-Time Monitoring - Joe Ross, SignalFx

Statistically grounded approaches: dynamic environments, complexity, volume.

Demands streaming algorithms for time series.

Simple calculation: 3 sigma. Implementation: rolling windows. Framework: prime and stream.

Rolling windows are natural foundation for streaming algorithms.

Comparing distributions is not enough. Methods for modeling: double exponential smoothing, and detecting non-stationary series.

Forecasting:

level change: current window differs from forecast. mean is a poor forecast.

exponential weighted moving average: more recent data more relevant.

double exponential smoothing: models two quantities - level of the series and trend of the series.

incremental decremental solution. how to remove influence of oldest values in sliding window. store additional state: track influence of the initial terms on the terminal values.
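
A hedged sketch of the incremental-decremental idea for the simple 3-sigma case: keep running sums so that adding a new point and evicting the oldest are both O(1) instead of recomputing over the whole window. This is the textbook sum-of-squares version, not SignalFx's implementation, and it skips the numerical-stability care a production version needs.

```python
from collections import deque
import math

class RollingStats:
    """O(1) add/remove mean and standard deviation over a sliding window."""
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0
        self.total_sq = 0.0

    def push(self, x):
        self.window.append(x)
        self.total += x            # incremental update
        self.total_sq += x * x
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.total -= old      # decremental update: remove the oldest point
            self.total_sq -= old * old

    def mean(self):
        return self.total / len(self.window)

    def std(self):
        n = len(self.window)
        var = max(self.total_sq / n - self.mean() ** 2, 0.0)
        return math.sqrt(var)

    def is_anomaly(self, x, sigmas=3.0):
        """Flag x if it falls outside mean +/- 3 sigma of the current window."""
        return abs(x - self.mean()) > sigmas * self.std()
```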

Time series classification:

A time series is stationary if its distribution does not change over time. level stationary statistic: tests that a series is stationary around its mean

incremental decremental rolling window implementation: algebraic machinations

Monitorama 2018 AMS Day 1

You don’t look like an Engineer - Monica Sarbu, Elastic

Issues facing females in technology careers, including work/life balance issues and strategies for managing teams to enhance happiness.

Nagios, Sensu, Prometheus - Why a change of the framework does not change the mindset - Rick Rackow, Ebay

NOC team & SRE team

Started with Nagios

  • high noise level
  • high false positives/negatives
  • culture of downtiming

Migrated to Sensu

  • dropped/refactored checks
  • 20% unrefactored (mistake)
  • attempted knowledge transfer
  • more downtiming than ever
  • less false positives
  • high noise level
  • no dynamic checks created outside of monitoring task force

“Sensu is shit, let’s get something new”

Migration to Prometheus. “It’s going to solve all our problems”

  • event based to metric based
  • need to refactor 100%
  • less noise IF we follow our principles
  • less downtimes IF we follow our principles

What we learned the hard way:

  • Involve everyone
  • Reconsider everything
    • what you monitor and how you check
    • reconsider your alerting

Is observability good for our brain? How about post-mortems? - Radu Gheorghe, Rafał Kuć, Sematext

Interactions with tools, people. Learning and daily structure.

Observability/Tools:

  • Monitor everything/Alert on everything/Automate everything
  • want to see patterns, get stressed when we can't understand why something is happening
  • humans evolved to be easily distracted, context shifting is expensive. alert fatigue related.
  • new technology, status threat - threat to self esteem. barrier to learning

Learning:

  • Schools don’t teach us how to efficiently learn
  • Apprentice Method: mentoring/tutoring. 30-50% faster. Experienced person also gains knowledge
  • Crash Test Method: exploratory testing. encoded in brain as experience. Need basic knowledge as starting point.
  • Dancer Method: 5-30% data retention from reading. needs to be reinforced with practice.
  • Place Switch Method: Switch places/roles in organisation. Encourages you to learn and extend knowledge.
  • Parrot Method: To share knowledge, rehearse. Combine reading with emotions, imagine crowd - visualisation.
  • Mastering Method: Taking notes. Writing reinforces knowledge. Writing is most effective, doesn’t work with typing.
  • Box King Method: Take a step at a time.
  • Tutor Method: explain to yourself

People:

  • Meetings - Meeting in person first matters
  • Feedback: naturally wired to seeing what's wrong. constructive feedback doesn't work and micromanagement is bad. need to own your own concerns. share concerns and feelings rather than criticising/demanding changes. this lowers status threat.

Structure:

  • Eat/Sleep/Exercise. Brain eats lots of sugar
  • Morning emails, timebox and use for planning
  • Plan all the things

Unquantified Serendipity: Diversity in Development - Quintessence Anx, logzio

Evaluating skills in hires for what diversity of skills they bring to the team.

Mindset Matters. Coach, don’t rescue. Offer information rather than assistance. Provide challenges.

Provide explicit onboarding. Last new hire can onboard new hire, reiterate knowledge and improve process.

How to build observability into a serverless application - Yan Cui, DAZN

survivor bias in monitoring - only focus on failure modes that we identified through experience

could we monitor and alert on the absence of success? something is wrong, but not what is wrong.

otherwise, this talk was a repeat of his talk at Monitorama PDX and didn't answer the questions he posed

Monitoring what you don’t own - Stephen Strowes, Ripe NCC

Easy to monitor things you have control over

Once you put something in the wild you’re at the mercy of other people’s networks - as well as datacenters, cloud providers, CDNs etc.

Service with anycast - traffic easily goes to the wrong place

Which of your users can reach which of your locations? Can you isolate a network problem?

You need vantage points to measure, debug or monitor the public network.

RIPE Atlas: global network of measurement probes.

  • Allows measurements, you can instruct platform to run measurements and collect results
  • Need to run probes to earn credits to spend them on measurements

DNS

  • where do ISPs mess with your DNS records? Where are TTLs not honoured?
  • querying dns from problem networks can reveal ISP resolver behaviour

Traceroute

  • routing is not totally within your control, often asymmetric, difficult to debug without path traces
  • traceroutes from many locations give you info in aggregate about congestion/loss along the path

Atlas for Monitoring

  • Expose data, allow people to write tools on top of this
  • streaming API
  • ping: status-checks
  • results API

What the NTSB teaches us about incident management & postmortems - Michael Kehoe, Nina Mushiana, Linkedin

Production SREs

NTSB - US National Transportation Safety Board. Aviation, Land Transportation, Marine and Pipeline accidents. Determines probable cause of accidents and reports on findings.

Investigation Process:

  • preinvestigation preparation
    • go team, ready for assignments: on call schedule.
    • investigator in charge (IIC)
    • subject matter experts
  • notification and initial response
    • regional response, notify national headquarters
    • initial PR and stakedown
    • establish severity and assemble go team if necessary
  • on scene
    • establish command rooms.
    • first: organisation meeting, bring up to speed, establish tasks, roles and lines of authority
    • IIC most senior person on scene
    • daily on site progress meetings at end of day, plan next day
    • briefing headquarters
  • post on scene
    • formal report structure. facts, analysis, conclusions, recommendations, appendices.
    • first: work planning and build timelines.
    • fact report based on field notes created
    • technical review: all parties review all factual information
    • then dedicated department to write report

NTSB advocates for the most important recommendations from their reports each year.

Applying to operations:

  • incident commander pre-assigned
  • test pagers, ensure access works
  • NOC notifies incident commander
  • Once verified launch full response for major incident
  • mitigate ASAP
  • private + public slack channels
  • IC establishes war room. assigns roles and responsibilities, and prioritisation. communication at regular intervals to execs. gather data and update timeline doc

post incident activities:

  • post mortem
    • dedicated team
    • template
    • blameless
  • postmortem rollup
    • action items are prioritized
    • weekly reporting on status of action-items

The more you invest in the process the more you will get out of it

Vital that there is accountability on improvements and action items. Most action items resolved within one month.

A thousand and one postmortems: lessons learned from running complex systems at scale - Alexis Lê-Quôc, Datadog

Talking openly about failures

Learn from past mistakes, feel kinship in the face of system failure, share what you learned with others

Datadog: lots of data, all day long, every day under tight processing deadlines.

Complex systems almost always are in stable but degraded mode. Each incident is recording noteworthy deviation from what we expect. Most incidents deserve a post mortem - got good at writing them.

Looking back - two extremes: detailed narrative on why and fixes, and also high level aggregates and trends

Looking at all in aggregate, formalise anti-patterns into a living document. Invaluable tool for new and old hands alike.

Facing and acknowledging failure in front of others is difficult. It’s a learned skill that needs a safe environment.

Talking about failure in public is even harder. Forget about the specifics and focus on patterns. Recurring themes in why section. Simplify, but keep just enough context to make sense.

Anthology of anti-patterns:

  • Configuration: config not immediately picked up, delayed incident. expose config identifier at runtime, assert what runs is what is intended.
  • Dependencies: timeouts - loadbalancer timeout of 10s but downstream 60s. expensive requests dominate runtime. downstream services should have timeouts shorter than upstream.
  • Deployment: uptime means victory. past results are not an indication of future performance. need explicit sanity checks.
  • Development: time, dealing with DST changes. Difficult to test thoroughly; default to UTC everywhere (except the UI).
  • Observability: tail of distribution. aggregates and percentiles mask the reality of user experience - the slowest requests could be the most important. Pay special attention to the tail, check the worst requests individually.
  • Operations: Expensive roll back or forward. Bad deploy - debate what to do is time consuming. Pick one strategy and invest heavily in it and make it really good.
  • Performance: arbitrary resource limits. hard to pick limits for connections, file handles etc. Set reasonable soft limits and ideally no hard limits and let the system hit actual resource limits OR rigorously model and validate limits.
  • Routing: retry too often. Simplistic retry approaches multiply load on backend systems. Use adaptive retries such as CoDel (a hedged sketch follows this list).
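
For contrast with the naive approach, a sketch of capped backoff with jitter plus a shared retry budget; this is a generic illustration, not the CoDel-style adaptive retry the talk recommends.

```python
import random
import time

def call_with_retries(fn, budget, max_attempts=4, base_delay=0.1, cap=2.0):
    """Retry fn with capped exponential backoff and full jitter, limited by a
    shared retry budget so a struggling backend isn't hit with multiplied load."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1 or budget["tokens"] <= 0:
                raise
            budget["tokens"] -= 1                      # spend from the shared budget
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))       # full jitter

# one budget shared by all callers of a given backend
retry_budget = {"tokens": 100}
```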

Monitorama 2018 PDX Day 3

Achieving Google-levels of Observability into your Application with OpenCensus - Morgan McLean (Google)

OpenCensus - distributed traces, tags, metrics (+ logs)

Collection of libraries for multiple languages, instrumentation and support for multiple exporters for tracing and metrics

One thing to do all the things

The present and future of Serverless observability - Yan Cui (DAZN)

New challenges - no agents or daemons on the platform, no background processing, higher concurrency to telemetry systems, added user facing latency when sending telemetry in functions, async invocation makes tracing more difficult.

Write metrics in logs, as the platform provides logging. Extract that data in post processing.

Async processing - just send metrics in function

Our tools need to do more to help understand the health of the system, rather than a single function

Putting billions of timeseries to work at Uber with autonomous monitoring - Prateek Rungta (Uber)

Built their own TSDB because existing solutions don’t work at their scale - M3DB: https://github.com/m3db/m3db

Building Open Source Monitoring Tools - Mercedes Coyle (Sensu)

Lessons learnt in the Sensu v2 rewrite

Autoscaling Containers… with Math - Allan Espinosa

Using control theory to help regulate an autoscaling system

  • Iterate on feedback of the system
  • Models that tell you if your feedback is effective
  • Linear models go a long way
  • Re-evaluate your models

Assisted Remediation: By trying to build an autoremediation system, we realized we never actually wanted one - Kale Stedman (Demonware)

Online gaming systems

Nagios constantly full of CRIT warnings. Hostile environment for 1st responders: hard to find alerts in the floor, hard to understand context. Cost of making an incorrect decision is high. Low morale, high turnover.

Plan: Rebrand, Auto-remediation, ???, Profit.

Lots of same tasks being run on the same systems.

System as planned: check results, alert processing engine (calls other APIs), remediation runner. Domains: Detection (Sensu), Decision (StackStorm), Remediation (Didn’t decide).

Review and process all alerts. Migrated many to sensu. Started processing alerts more thoroughly - put all alerts in JIRA and root causes started to be addressed.

Dropped the idea of remediation; as alerts decreased they realised they already had what they needed - better info for responders, and work on fixing underlying causes

Razz: Detection (sensu), Decision (Stackstorm alert workflow), Enrichment (stackstorm widget workflow), Escalation (Stackstorm JIRA workflow)

Monitoring Maxims:

  • Humans are expensive (don’t want to waste their time)
  • Humans are expensive (make mistakes, unpredictable)
  • Keep pages holy (issue must be urgent and for the responder)
  • Monitor for success (ask if it is broken, rather than what is broken)
  • Complexity extends incidents
  • Restoration > investigation
  • LCARS: link dashboards, elasticsearch queries, other system statuses, last few alerts

Today:

  • Alert volume down
  • Data driven decisions up
  • Reliability and support quality up
  • Team morale up

Security through Observability - Dave Cadwallader (DNAnexus)

Improve relationship between security and ops

Security and compliance concerns handling medical data.

Compliance is about learning from mistakes and creating best practices. Designed with safety in mind, but doesn’t guarantee safety.

Security want to do less compliance, and more time threat hunting. How do we automate compliance checks?

Inspec - compliance as code. Check security in VMs, settings on cloud resources. Community is thriving, plenty of existing baselines available for use.

Run inspec locally, and write out to json. Prometheus to pull the data, custom exporter.
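
A rough sketch of what such a custom exporter could look like with the Python prometheus_client. The JSON field names are assumptions about the report shape (a list of controls, each with an id and status), not the exact InSpec schema.

```python
import json
import time
from prometheus_client import Gauge, start_http_server

# one gauge per control, labelled by control id
PASSED = Gauge("inspec_control_passed", "1 if the control passed", ["control"])

def load_report(path="/var/tmp/inspec_report.json"):
    """Read the locally generated InSpec JSON report.
    The 'controls'/'id'/'status' keys here are illustrative assumptions."""
    with open(path) as f:
        report = json.load(f)
    for control in report.get("controls", []):
        PASSED.labels(control=control["id"]).set(
            1 if control.get("status") == "passed" else 0
        )

if __name__ == "__main__":
    start_http_server(9123)   # Prometheus scrapes this port
    while True:
        load_report()
        time.sleep(300)       # refresh as often as the compliance scan runs
```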

Apply SLOs - actionable compliance alerts. Link back to the relevant compliance documentation. Friendly HTML report pushed to S3 and linked in the alert.

How to include Whistler, Kate Libby, and appreciate that our differences make our teams better. - Beth Cornils (Hashicorp)

Talk about inclusivity in tech, how we can improve

Monitorama 2018 PDX Day 2

Want to solve Over-Monitoring and Alert Fatigue? Create the right incentives! - Kishore Jalleda (Microsoft)

Lessons from healthcare - more monitoring is not always better

Problems of far too many alerts at Zynga. Vision of <2 alerts per shift. Dev on call, SRE for engineering infra and tooling. Zero TVs.

One day massive outage, multiple outages with same root cause.

Idea: Deny SRE coverage based on Alert budgets. Lot of push back - from SRE team and from supported teams.

Baby steps - focus on monitoring and prove we can do one thing well. Promise of performing higher value work.

Leverage outages - Postmortems, follow ups. Build credibility, show you care.

Find Allies - identify who is aligned with you. Need to go out and find them.

Call your Boss - ensure they are aligned. Get buy in from Senior Leaders.

Establish contracts. Targets for all teams. Give time to clean up before launch.

Reduce alert noise - aggregation, auto remediation, deciding what should be a log, a ticket or an alert.

Success - 90% drop in alerts, 5min SRE response, uptime improved

Next-Generation Observability for Next-Generation Data: Video, Sensors, Telemetry - Peter Bailis (Stanford CS)

Taking a systems engineering approach to speeding up the application of ML models to video analysis.

Interesting technology and family of tools being developed.

Coordination through community: A swarm of friendly slack bots to improve knowledge sharing - Aruna Sankaranarayanan (Mapbox)

Chatbots in slack. Commands are AWS lambda functions.

sumo logic queries. benchmarking. user lookup. get/set rate limits. platform-question: assign questions to an engineer, with escalation.

Automate Your Context - Andy Domeier (SPS Commerce)

Complexity is increasing to enable velocity

Context: the circumstances that form the setting for an event, statement or idea, and in terms of which it can be fully understood and assessed

As complexity increases, the amount of available context increases. The efficiency of an organisation is directly correlated to how effective you are with your available context.

Goal - make the right context available at the right time.

Context. People, monitoring, observability, obscure and hard to find things.

People: One consistent set of operational readiness values. Operational info, Performance (KPI/Cost), Agility (Deployment maturity), Security

Monitoring: taking action. Examples:

  • Put alerts onto an SNS topic. Automate actions with lambda functions. Search the wiki for documentation, post a comment back to the alert (a minimal sketch follows this list).
  • JIRA incidents to SNS as well, automate incident communication. Search for recent changes from JIRA.
  • Change management in JIRA, dependency lookup & comment back to JIRA.
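
A bare-bones sketch of the first example above: a Lambda handler that pulls the alarm out of the SNS event and attaches related documentation. It assumes a CloudWatch-style JSON alarm payload, and the wiki search and comment functions are placeholders for whatever internal APIs would really be called.

```python
import json

def search_wiki(query):
    # stand-in for a real documentation search API
    return ["https://wiki.example.com/runbooks/%s" % query]

def post_comment(alert, links):
    # stand-in for writing the context back onto the alert
    print("would attach %s to alert %s" % (links, alert.get("AlarmName")))

def handler(event, context):
    """Triggered by the alert SNS topic; enriches each alarm with wiki links."""
    for record in event["Records"]:
        alert = json.loads(record["Sns"]["Message"])     # standard SNS -> Lambda shape
        links = search_wiki(alert.get("AlarmName", ""))  # placeholder lookup
        post_comment(alert, links)                       # placeholder write-back
```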

Slack in the Age of Prometheus - George Luong (Slack)

We replace monitoring systems because our needs have evolved

Ganglia & Librato, migrated to Graphite. Looked at migrating away.

User Problems: metrics difficult to discover. Query performance made for slow rendering. Other problems specific to their usage.

Operation Problems: Could not horizontally scale the cluster (lack of tooling, time and effort). Single node fails led to missing metrics. Susceptible to developers accidentally taking down cluster.

Requirements. User: Ease of discovery, fast response, custom retention, scales with us. Operational: Remove single-node point of failure. Teams want to own their monitoring.

Prometheus ticked almost all those boxes, except custom retention and ingestion. (can only be set per box)

Architecture:

  • in region: duplicate prom servers, scraping same servers. primary and secondary.
  • between regions: single pair of federated servers scraping other prom servers.

Configuration Management: terraform & chef. Prom jobs and rules stored in chef, a ruby hash converts to yaml.

Webapp server about 70k metrics, x 500 servers (35m)

Job worker, 79k metrics x 300 servers (24m)

Sparky the fire dog: incident response as code - Tapasweni Pathak (Mapbox)

Sparky the firedog:

SNS topic - lambda function - forwards to Pagerduty

sparky is an npm module

documentation for alarms in github. sparky enriches the alarms.

  • reformat the alarm info
  • aggregate all your teams alarms into a single Pagerduty policy
  • links to targeted searches
  • get the root cause analysis?

Future: score an alarm. Does a human need to action this? Based on the triaging/context we have existing in documentation.

Sample questions that need answering on an alert:

  • How many errors around trigger time?
  • Are the errors ongoing?
  • What was the very first error?
  • Group and count the errors by time

Reclaiming your Time: Automating Canary Analysis - Megan Kanne (Twitter)

Is my build healthy? Want to catch errors before causing problems.

Canary - partial deployment. Use canary cluster for canary deploys. Needs production level traffic. Control cluster to compare against canary.

Use statistical tools to compare.

Median Absolute Deviation - maxPercentile

DBSCAN - density based spatial clustering of applications with noise - toleranceFactor

HDBSCAN - hierarchical DBSCAN - minSimilarShardsPercent (minimum cluster size)

Mann-Whitney U Test - tolerance, confidenceLevel, direction
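
For the Mann-Whitney U test, a small sketch of how canary and control samples might be compared with SciPy; the confidence parameter and sample data are illustrative, not Twitter's configuration.

```python
from scipy.stats import mannwhitneyu

def canary_ok(canary_samples, control_samples, confidence_level=0.99):
    """Return True if the canary's metric distribution is not significantly
    worse (larger) than the control's."""
    # one-sided test: is the canary distribution shifted above the control?
    _, p_value = mannwhitneyu(canary_samples, control_samples,
                              alternative="greater")
    return p_value > (1 - confidence_level)

canary = [102, 98, 110, 105, 99, 101]    # e.g. request latencies in ms
control = [100, 97, 103, 99, 101, 98]
print(canary_ok(canary, control))
```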

Simplify configuration (of your statistical tests). Sensible defaults speed up adoption.

Choosing metrics - SLOs, existing alerts.

User Trust - Build it.

Monitorama 2018 PDX Day 1

Optimising for learning - Logan McDonald

expert intuition is achievable. improve memory.

  • prep

hierarchy of learning. important one is rule learning.

  • gaining knowledge

reading docs/textbooks is not an effective way to embed knowledge in long term memory. do low stakes testing. the best way of learning is to retrieve data.

  • mental models

observability and reflection

move from events to patterns to structure

engaging in incident reviews gave opportunity to quiz engineers on their actions during incident

  • learning together

symmathesy - systems of understanding

embrace cultural memory

Serverless and CatOps: Balancing trade-offs in operations and instrumentation - Pam Selle

I sadly missed most of this because I was debugging a problem

Mentoring Metrics Engineers: How to Grow and Empower Your Own Monitoring Experts - Zach Musgrave, Angelo Licastro (Yelp)

growth in teams, and managing growth in individuals within a team

mandate growth - more skills needed, more knowledge gaps, more planning (meetings), trade offs.

knowledge silos - reaction is mentoring

breadth: build confidence. defined mentor relationship.

depth: make an advanced contribution to one system.

next steps: end of mentor relationship, can start on-call. hold a retrospective - provide feedback for improvement

impostor syndrome - accept compliments for your work and acknowledge other’s contributions

consulting: monitoring is a specialised field. there's a lot of nuance. make people aware of what tools can solve their problems. talk with other teams, ask insightful questions: what problem are you trying to solve? listen, and assume you know nothing about their problem.

The Power of Storytelling - Dawn Parzych (Catchpoint)

Cognitive bias

  • too much info
  • not enough meaning
  • not enough time
  • not enough memory

To tell better stories, when presenting data view it from the perspective of other parts of the organisation

Persuasive storytelling doesn’t need to be complicated - keep it simple

Use simpler language - it doesn’t make it less powerful

Present the most important information earlier

  • Use visuals when telling a story, keep them simple to understand
  • Don’t overcomplicate things
  • Remember the power of 3

Principia SLOdica - A Treatise on the Metrology of Service Level Objectives - Jamie Wilkinson

Overloaded team, tasked with reducing load to build a sustainable system they could manage

Symptom Based Alerting - focus on very few expectations about your service, so as system grows you are still focusing on the same things

Symptom: a user is having a bad time. Causes: internal views on the service.

What’s your tolerance for failure? Error budget. Set expectations: SLO

Symptom: anything that can be measured by the SLO. A symptom-based alert fires when the SLO is in danger of being missed.

Debugging tools - metrics, tracing, logs - replace cause based alerts

On-call Simulator! - Building an interactive game for teaching incident response - Franka Schmidt (Mapbox)

On call onboarding

Safe and low stakes place to practice handling on-call incidents

Onboarding:

  • Buddy System: Observing
  • Bucket list: checklist, how far along you are with experiences of being on call
  • Alarm scrum: review the last day's alarms

Simulator - “choose your own adventure” style text adventure game. Tool: Twine.

Craft a story - use Incident Reviews, past notes and enrich with detail. Or just make it up.

Iterate and get feedback.

Observability: the Hard Parts - Peter Bourgon (Fastly)

Level up monitoring story at fastly

Senior engineering org, 120 engineers. Heterogeneous solutions throughout the organisation. Organic growth is good - up to a point.

Goals: are production systems healthy? if not why not?

Non goals: long term storage, replacing logging. (Observed metric usage VERY different from self-reported usage)

Strategy: prometheus, add instrumentation, curate new set of dashboards and alerts, run in parallel to build confidence, decom old systems.

Rollout: Inventory all services. Rank by criticality & friendliness.

The grind…

Embedded expert model. Pair with responsible engineer to get migration done. Hour or two of pair programming. Let documentation emerge naturally. Deploy prometheus, plumb service into prometheus (time consuming).

Technical autonomy is a form of technical debt. Focus on local vs. global optimization.

Dashboards and alerts. Import from other systems. Make sure to only build things in service to original goals. Reduce, keep high level views.

Some services, ownership can be indistinct. Senior leadership needed to wield stick, need their buy in.

Warning: This Talk Contains Content Known to the State of California to Reduce Alert Fatigue - Aditya Mukerjee (Stripe)

Slides

We can learn from clinical healthcare

Alert Fatigue - frequency/severity - causes responder to ignore or make mistakes

Decision Fatigue - frequency/complexity of decision points, causes person to avoid devices or make mistakes

Certain patterns of alerts and decisions contribute disproportionately to fatigue. Multiple false positives for an individual patient, impacts that alert over all patients.

Reduce alert fatigue: STAT. Supported, Trustworthy, Actionable, Triaged.

Supported: who owns this alert? Responders should own, or feel ownership over end result.

Trustworthy: do I trust this alert to notify me when a problem happens? to stay silent when all is well? to give sufficient information to diagnose problems?

Anomaly detection: if you don’t understand why an alert triggered, you don’t understand if it’s real.

Actionable: One decision required to respond. Alerts difficult to action are ignored. Make alerts specific, add decision trees. Alerts must have a specific owner.

Triaged: Triage alerts. Type should reflect urgency. Urgency can change. Commonly understood tiers. Regular re-evaluation process.

Monitory Report: I Have Seen Your Observability Future. You Can Choose a Better One. - Ian Bennett (Twitter)

Twitter has a big monitoring system and migrating was hard

Monitorama 2016 Portland Day 3

Justin Reynolds, Netflix - Intuition Engineering at Netflix

Video

Discussed problems at Netflix that regions were siloed, they worked on serving users out of any region.

To fail out of a region, you need to scale up other regions to serve all traffic

Dashboards, good at looking back, but need to know now. How to provide intuition of the now?

Created vizceral - see the blog post for screenshots & video: http://techblog.netflix.com/2015/10/flux-new-approach-to-system-intuition.html

Brian Brazil, Robust Perception - Prometheus

Slides | Video

Prometheus is a TSDB offering ‘whitebox monitoring’ for looking inside applications. Supports labels; alerting and graphing are unified, using the same language.

Pull based system, links into service discovery. HTTP api for graphing, supports persistent queries which are used for alerting.

Provides an instrumentation library, incredibly simple to instrument functions and expose metrics to prometheus. Client libraries don’t tie you into prometheus - can use graphite.
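
For example, with the Python client library, instrumenting a function and exposing metrics takes only a few lines (a generic illustration, not code from the talk):

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("myapp_requests_total", "Requests handled", ["status"])
LATENCY = Histogram("myapp_request_seconds", "Request latency in seconds")

@LATENCY.time()                      # observe how long each call takes
def handle_request():
    time.sleep(random.random() / 10)
    REQUESTS.labels(status="ok").inc()

if __name__ == "__main__":
    start_http_server(8000)          # metrics exposed at :8000/metrics for scraping
    while True:
        handle_request()
```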

Can use prometheus as a clearing house to translate between different data formats.

Doesn’t use a notion of a machine. HA by duplicating servers, but alertmanager deduplicates alerts. Alertmanager can also group alerts.

Data stored as file per database on disk, not round-robin - stores all data without downsampling.

Torkel Ödegaard, Raintank - Grafana Master Class

Video

Gave a demo on how to use grafana, as well as recently added and future features.

Katherine Daniels, Etsy - How to Teach an Old Monitoring System New Tricks

Slides | Video

Old Monitoring System == Nagios

Adding new servers.

  • Use deployinator to deploy nagios configs. Uses chef to provide inventory to generate a current list of hosts and hostgroups.
  • Run validation via Jenkins by running nagios -v, as well as writing a tool for nagios validation.
  • New hosts are added with scheduled downtime so they don’t alert until the next day. Chat bots send reminders when downtime is going to finish.

Making Alerts (Marginally) Less Annoying

  • Created nagdash to provide federated view of multiple nagios instances.
  • Created nagios-herald to add context to nagios alerts. Also supports allowing people to sign up to alerts for things they’re interested in.

Tracking Sleep

  • Ops weekly tool. Provides on call reports, engineers flag what they had to do with alerts.
  • Sleep tracking and alert tracking for on call staff to understand how many alerts they’re facing and how it’s impacting their sleep.

An On Call bedtime story

  • Plenty of alerts because scheduled downtime expired for ongoing work.
  • Create daily reports of what downtime will soon expire and which will raise alerts.

Joe Damato, packagecloud.io - All of Your Network Monitoring is (probably) Wrong

Slides | Video

There’s too much stuff to know about

  • ever copy paste config or tune settings you didn’t understand?
  • do you really understand every graph you’re generating?
  • what makes you think you can monitor this stuff?

Claim: the more complex the system is the harder it is to monitor.

what's pretty complicated? the linux networking stack! lots of features, lots of bugs. with no docs!

  • /proc/net stats can be buggy
  • ethtool inconsistent, not always implemented
  • meaning of driver stats are not standardised
  • stats meaning for a driver/device can change over time
  • /proc/net/snmp has bugs: double counting, not being counted correctly

Monitoring something requires a very deep understanding of what you’re monitoring.

Properly monitoring and setting alerts requires significant investment.

Megan Kanne, Justin Nguyen, and Dan Sotolongo, Twitter - Building Twitter’s Next-Gen Alerting System

Slides | Video

3.5B metrics per minute

Old Alerting System

  • 25k alerts/minute, 3m alert monitors
  • single config language, lot of existing example, easy to write and add
  • all those points were good and bad!
  • lots of orphaned and unmaintained configs, no validation
  • alerts and dashboards were separate
  • problems with reliability when zones suffer problems

Solution

  • combined alerts and dashboard configuration
  • dashboards defined in python, common libraries that can be included
  • python allows testing configs
  • created multi-zone alerting system
  • reduced time to detect from 2.5mins to 1.75mins

Helping Human Reasoning

  • bring together global, dependencies and local context
  • including runbooks, contacts and escalations directly in the UI

Lessons Learned

  • distributed system, challenges about consistency, structural complexity and reasoning about time
  • sharding choices are hard, impossible to always avoid making mistakes
  • support and collaborate with users, try and reduce support burden with good information at interaction points (UI, CLI etc.), good user guides and docs
  • migrating - some happy to move, others not. some push back. had to accept schedule compromise and extra work.

Joey Parsons, Airbnb - Monitoring and Health at Airbnb

Video

Prefers to buy stuff:

  • New Relic
  • Datadog, instrument apps using dogstatsd
  • Alerting through metrics

Created open source tool, Interferon, to store alerts configuration as code.

Volunteer on call system. SREs make sure things in place so anyone can be on call.

  • Sysops training for volunteers, monitoring systems, how to be effective and learn from historical incidents
  • Shadow on call, learn from current primary/secondary
  • Promoted to on call

Weekly sysops meeting, go through incidents, hand offs, discuss scheduled maintenance.

On call health:

  • are alerting trends appropriate?
  • do we understand impact on engineers?
  • do we need to tune false positives?
  • are notifications and notification policies appropriate?

Dashboards for:

  • incident numbers over time
  • counts by service
  • total notifications per user and how many come at night
  • false positive incident counts

Heinrich Hartmann, Circonus - Statistics for Engineers

Slides | Video

Monitoring Goals

  • Measure user experience/ quality of service
  • Determine implications of service degradation
  • Define sensible SLA targets

External API Monitoring:

  • synthetic requests measure availability, but are a bad proxy for user experience
  • on long time ranges, rolled-up data is commonly displayed, erodes spikes

Log Analysis:

  • write request stats to log file
  • rich information but expensive and delay for indexing

Monitoring Latency Averages:

  • mean values, cheap to collect, store and analyse, but skewed by outliers/low volumes
  • percentiles, cheap to collect, store and analyse, robust to outliers, but an up front percentile choice is required and they cannot be aggregated

percentiles: keep all your data. don’t take averages! store percentiles for all reporting periods you are interested in - i.e. per min/hour/day. store all percentiles you’ll ever be interested in.

Monitoring with Histograms:

  • divide latency scale into bands
  • divide time scale into reporting periods
  • count the number of samples in each band x period

Can be aggregated across times. Can be visualised as heatmaps.
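
A toy version of the banding idea: count samples per (reporting period, latency band). Unlike percentiles, these counts can be summed across periods or hosts and reduced to approximate percentiles later. Band edges and period length below are arbitrary choices.

```python
from collections import defaultdict

BANDS = [0.01, 0.05, 0.1, 0.5, 1.0, float("inf")]  # latency band upper edges, seconds

def band_for(latency):
    """Return the upper edge of the band this latency falls into."""
    return next(edge for edge in BANDS if latency <= edge)

# counts[(period, band)] -> number of samples
counts = defaultdict(int)

def record(latency, timestamp, period_seconds=60):
    period = int(timestamp // period_seconds)
    counts[(period, band_for(latency))] += 1

def merge(a, b):
    """Histograms aggregate by simple addition - unlike percentiles."""
    out = defaultdict(int, a)
    for key, n in b.items():
        out[key] += n
    return out
```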

John Banning, Google - Monarch, Google’s Planet-Scale Monitoring Infrastructure

Video

Huge volume, global span, many teams - constant change

Previously borgmon. Each group had its own borgmon. Large load on anyone doing monitoring. Hazing ritual - the new engineer gets to do borgmon config maintenance.

Wanted:

  • can handle the scale
  • small/no load to get up and running
  • capable of handling the largest services

Monitor locally. Collect and store the data near where it’s generated. Each zone has a monarch.

  • Targets collect data with streamz library. Metrics are multi dimensional information, stores histogram.
  • Metrics sent to the monarch ingestion router, then to a leaf which is an in-memory data store and also written to a recovery log. From log to long term disk repository.
  • Streams stored in a table, basis for queries
  • Evaluator runs queries and stores new data for streams or sends notifications

Integrate Globally. Global Monarch - Distributed across zones, but a single place to configure/query all monarchs in all zones.

Provides both Web UI and Python interfaces.

Monarch is backend for Stackdriver

Monitoring as a service is the right idea. Make the service a platform to build monitoring solutions.

Monitorama 2016 Portland Day 2

Brian Overstreet, Pinterest - Scaling Pinterest’s Monitoring

Video

Started with Ganglia, Pingdom

Deployed Graphite, single box

Second Graphite architecture - Load Balancer, 2x relay servers, multiple cache/web boxes etc.

Suffered lots of UDP packet receive errors

Put statsd everywhere

  • fixed packet loss, unique metric names per host
  • latency only per host, too many metrics

Sharded statsd

  • not unique per host now,
  • shard mapping in client, client version needs to be same everywhere

Multiple graphite clusters - one per application (python/java)

More maintenance, more routing rules etc.

Problems with reads, multiple glob searches can be slow

Deployed OpenTSDB

Replace statsd

  • local metrics agent, kafka, storm - send to graphite/opentsdb

Metrics-agent:

  • interface for opentsdb and statsd
  • sends metrics to kafka
  • processed by storm

120k/sec graphite, 1.5m/sec opentsdb. no more graphite, move to opentsdb.

Create statsboard - integrates graphite and opentsdb for dashboards and alerts

Graphite User Education - underlying info about how metrics are collected, precision, aggregation etc.

Protect System from Clients

  • alert on unique metrics
  • block metrics using zookeeper and shared blacklist (created on fly)

Lessen Operational Overhead

  • more tools, more overhead
  • more monitoring systems, more monitoring of the monitoring system
  • removing a tool in prod is hard

Set expectations

  • data has a lifetime
  • not magical data warehouse tool that returns data instantly
  • not all metrics will be efficient

Summary:

  • match monitoring system to where the company is at
  • user education is key to scale tools organizationally
  • tools scale with the number of engineers, not users

Emily Nakashima, Bugsnag - What your javascript does when you’re not around

Video

Lots of app moving to frontend, so running in browser not on backend servers

Toolkit:

  • capture load performance from browser, send to app server, use statsd + grafana & google analytics
  • capture uncaught exceptions in the browser, using their own product

Sorry, Javascript is just not that relevant in my line of work

Eron Nicholson and Noah Lorang, Basecamp - CHICKEN and WAFFLES: Identifying and Handling Malice

Slides | Video

Suffered DDoS and blackmail. 80 gigabits - DNS reflection, NTP reflection, SYN floods, ICMP flood

Defense and Mitigation:

  • DC partner filters for them
  • More 10G circuits and routers
  • Arrangements with vendors to provide emergency access and other mitigation tools

Experience got them serious about more subtle application level attacks:

  • vulnerability scanners
  • repeat slow page requests
  • brute force attempts
  • nefarious crawlers

What do we want from a defense system?

  1. Protection against application-level attacks
  2. Keep user access uninterrupted
  3. Take advantage of the data we have available
  4. Transparent in what gets blocked and why

Chicken: who is a real user and who is malicious?

Considered Machine Learning classification. Problems: really hard to get a good training set. Need to be able to explain why an IP was blocked.

Simpler approach:

  • Some behaviours are known to be from people up to no good. crawling phpmyadmin, path reversal, repeated failed login attempts etc.
  • Request history gives a good idea of whether someone is a normal user, broken script or a malicious actor.
  • External indicators: geoip databases, badip dbs, facebook threat exchange

Removing simple things reduces noise. Every incoming request is scored and per-IP aggregate score calculated based on return code. Create Exponentially Weighted Moving Average from that data. About 12% had negative reputation.
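
A simplified sketch of that scoring idea: give each response a score by status code and fold it into a per-IP exponentially weighted moving average. The scores and decay factor are made up for illustration, not Basecamp's actual values.

```python
from collections import defaultdict

ALPHA = 0.1                       # weight of the newest request
SCORES = {200: 1.0, 404: -1.0, 403: -2.0, 401: -2.0, 500: 0.0}  # illustrative

reputation = defaultdict(float)   # per-IP EWMA, positive = looks like a real user

def record_request(ip, status_code):
    score = SCORES.get(status_code, -0.5)   # unknown codes count mildly against
    reputation[ip] = ALPHA * score + (1 - ALPHA) * reputation[ip]

def dubious_ips(threshold=-0.5):
    return [ip for ip, rep in reputation.items() if rep < threshold]
```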

Scanning for blockable actions and scoring requests in near real-time using request byproducts.

Request logs, netflow data, threat exchange -> kafka -> request scoring, scanner for known bad behaviour, tools for manual evaluation.

Average IP reputation gives an early indicator to monitor for application level attack.

Provides list of good, bad, and dubious IPs.

Enforcement:

  • originally provided by iptables rules on haproxy hosts
  • then tried rule on loadbalancer
  • then tried null routing on routers
  • finally created waffles

Using BGP flowspec to send data from routers to waffles, which then decides what path to take: error, app or challenge. Waffles hosts live in a separate network with limited access.

Waffles is redis and nginx.

John Stanford, Solinea - Fake It Until You Make It

Video

Monitoring an openstack cluster, 1 controller and 6 compute nodes, taking logs with Heka and sending them to elasticsearch. Can I scale this up to a thousand nodes? How big can it get?

How do you go about figuring that out?

  • took 7 days of logs from lab, 25k messages/hr
  • number of logs coming from a node
  • number of logs coming from a component
  • 7 day message rate, look at histograms, identifies recurring outlier
  • message size, percentiles of payload size

What models look like what we’re doing for simulation? Add some random noise.

Flood process, monitoring everything, repeat until it breaks. System sustained 4k x 1k messages/sec, started to pause above that, but no messages were dropped.

Next steps: - find bottlenecks - improve the model

Tammy Butow, Dropbox - Database Monitoring at Dropbox

Slides | Video

Achieving any goal requires honest and regular monitoring of your progress.

Originally used nagios, thruk (web ui) and ganglia

Created own tool vortex in 2013

why create in house monitoring?

  • performance and reliability issues, scaling the number of metrics fast

Create Vortex:

  • Time Series Database with dashboards, alerting, aggregation
  • Rich metric metadata, tag a metric with lots of attributes

Monthra: single way of scheduling and relaying metrics, discourage scheduling with cron

Service Metrics: what durability and reliability goals? align monitoring to those goals. threads running / threads connected

Run a Monthly Metrics Review (great idea)

Dave Josephsen, Librato - 5 Lines I couldn’t draw

Video

  1. Making the cognitive leap to use monitoring tools to recognise system behaviour independent of alerting. misapprehension about what monitoring was and whom it was for.

  2. Monitoring is not for alerting. Nobody owns monitoring. ‘Tape measure that I share with every engineer I work with’. Ops owns monitoring vs everyone owns monitoring. Monitoring is for asking questions.

  3. Complexity isolates. Effective monitoring gives you the things that allow you to ‘Cynefin’ - make things more familiar and knowable. Reduce complexity rather than embracing it. Monitoring can build bridges to help people understand things across boundaries.

  4. Effective monitoring can bring about cultural change, how people interact between each other.

  5. Repeated point 4

Jessie Frazelle, Google - Everything is broken

Video

Talked about problems with Software Engineering and Operations

Demonstrated how they monitored community and external maintainer PR statistics for Docker project.

James Fryman, Auth0 - Metrics are for Chumps - Understanding and overcoming the roadblocks to implementing instrumentation

Video

Story of implementation of instrumentation at Auth0

Wanted data driven conversations. A metrics implementation happened in the past, but was ripped out because it was not well understood and thought to cause latency. Created aversions.

Make the case. Get buy-in.

To have good decent conversations with someone you need to have metrics.

Pushback:

  • Not the most important feature - but it is!
  • Cannot start until we understand the data retention requirements - premature optimisations
  • We don’t run a SaaS - need to understand what your software is doing regardless

Make decisions based on knowledge, not intuition or luck.

Be opportunistic - success is 90% planning, 10% timing and luck. Find opportunities to accelerate efforts.

Needed to get something going fast - went for full service SaaS Datadog, but with common interfaces and shims to allow moving things in house later. Don’t delay, jump in and iterate.

Keep in sync with developers - change is difficult and there will be resistance, pay attention to feedback. Need to support interpretation of data.

Build out data flows, find potential choke points in the system, take a baseline measurement, check systems in isolation

Fix and Repair bottlenecks. Solved 3 major bottlenecks, went from 500 to 10k RPS.

Monitorama 2016 Portland Day 1

This year I was lucky enough to attend Monitorama in Portland. Thanks to Sohonet for sending me! I’d wanted to attend again since going to Berlin in 2013, because the quality of the talks is the highest I’ve seen in any conference that’s relevant to my interests. I wasn’t disappointed, it was awesome again.

Here are my notes from the conference:

Adrian Cockcroft, Battery Ventures - Monitoring Challenges

Slides | Video

This talk reflected on new trends and how things have changed since Adrian talked about monitoring “new rules” in 2014

What problems does monitoring address?

  • measuring business value (customer happiness, cost efficiency)

Why isn’t it solved?

  • Lots of change, each generation has different vendors and tools.
  • New vendors have new schemas, cost per node is much lower each generation so vendors get disrupted

Talked about serverless model - now monitorable entities only exist during execution. Leads to zipkin style distributed tracing inside the functions.

Current Monitoring Challenges:

  • There’s too much new stuff
  • Monitored entities are too ephemeral
  • Price disruption in compute resources - how can you make money from monitoring it?

 Greg Poirier, Opsee - Monitoring is Dead

References | Video

Greg gave a history and definition of monitoring, and argued that how we think about monitoring needs to change.

Historically monitoring is about taking a single thing in isolation and making assertions about it.

  • resource utilisation, process aliveness, system aliveness
  • thresholds
  • timeseries

Made a definition of monitoring:

Observability: A system is observable if you can determine the behaviour of the system based on its outputs

Behaviour: Manner in which a system acts

Outputs: Concrete results of its behaviours

Sensors: Emit data

Agents: Interpret data

Monitoring is the action of observing and checking the behaviour and outputs of a system and its components over time

Failures in distributed systems are now: responds too slowly, fails to respond.

Monitoring should now be about Service Level Objectives - can it respond in a certain time, handle a certain throughput, better health checks

We need to better Understand Problems (of distributed systems), and to Build better tools (event correlation particularly)

Nicole Forsgren, Chef - How Metrics Shape Your Culture

Slides | Video

Measurement is culture. Something to talk about, across silos/boundaries

Good ideas must be able to seek an objective test. Everyone must be able to experiment, learn and iterate. For innovation to flourish, measurement must rule. - Greg Linden

Data over opinions

You can’t improve what you don’t measure. Always measure things that matter. That which is measured gets managed. If you capture only one metric you know what will be gamed.

Metrics inform incentives, shape behaviour:

  • Give meaningful names
  • Define well
  • Communicate them across boundaries

Cory Watson, Stripe - Building a Culture of Observability at Stripe

Video

To create a culture of observability, how can we get others to agree and work toward it?

Where to begin? Spend time with the tools, improve if possible, replace if not, leverage past knowledge of teams

Empathy - People are busy, doing best with what they have, help people be great at their jobs

Nemawashi - Move slowly. Lay a foundation and gather feedback. (Write down and attribute feedback). Ask how you can improve.

Identify Power Users - Find interested parties, give them what they need, empower them to help others

What are you improving? How do you measure it?

Get started. Be willing to do the work, shave the preposterous line of yaks. Strike when opportunities arise (incidents). Stigmergy - how uncoordinated systems work together.

Advertise - promote accomplishments, and accomplishment of others.

Alerts with context - link to info, runbooks etc. Get feedback on alerts, was it useful?

Start small, seek feedback, think about your value, measure effectiveness

Kelsey Hightower, Google - /healthz

Video

Kelsey gave a demo of the /healthz pattern, and how that can protect you from deploying non-functional software on a platform that can leverage internal health checks.

Stop reverse engineering apps and start monitoring from the inside

Move health/db checks and functional/smoke tests inside app, and expose over a HTTP endpoint
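
A minimal sketch of the pattern with Flask; the database check is a placeholder for whatever dependency and smoke checks the application really owns.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_database():
    """Placeholder for a real dependency check, e.g. SELECT 1 against the DB."""
    return True

@app.route("/healthz")
def healthz():
    checks = {"database": check_database()}
    healthy = all(checks.values())
    # 200 lets the platform keep routing traffic; 503 takes the instance out
    status_code = 200 if healthy else 503
    return jsonify(status="ok" if healthy else "unhealthy", checks=checks), status_code

if __name__ == "__main__":
    app.run(port=8080)
```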

Ops need to move closer to the application.

 Brian Smith, Facebook - The Art of Performance Monitoring

Video

Gave an overview of some of the guiding ideas behind monitoring at facebook

Bad stuff:

  • High Cardinality - same notifications for 100x machines
  • Reactive Alarms - alarms which are no longer relevant
  • Tool Fatigue - too few/too many tools

It can be Mechanical, Simple and Obvious to do these things at the time. But the cumulative effect is a thing that's hard to maintain.

Properties of Good Alarms:

  • Signal
  • Actionability
  • Relevancy

Your Dashboards are a debugger - metrics are debugger in production.

Caitie McCaffrey, Twitter - Tackling Alert Fatigue

Slides | References | Video

When alerts are more often false than true, people are desensitised to alerts.

Unhappy customers is the result, but they are also unplanned work, and a distraction from focusing on your core business.

Same problem experienced by nurses responding to alarms in hospitals. What they have done:

  • Increase thresholds
  • Only crisis alarms would emit audible alerts
  • Nursing staff required to tune false positive alerts

What Caitie’s team did:

  • Runbook and alert audits - ensure there are runbooks for alerts, templated, a single page for all alerts, each alert has customer impact and remediation steps. Importantly, also includes notification steps.

  • Empower oncall - tune alert thresholds, delete alerts or re-time them (only alert during business hours)

  • Weekly on-call retro - handoff ongoing issues, review alerts, schedule work to improve on-call

This resulted in fewer alerts, and improved visibility on systems that alert a lot.

To prevent alert fatigue:

  • Critical alerts need to be actionable
  • Do not alert on machine specific metrics
  • Tech lead or Eng manager should be on call

Mark Imbriaco, Operable - Human Scale Systems

Video

It’s common to say now that “Tools don’t matter” … but they do. We sweat the details of our tools because they matter. All software is horrible.

We operate in a complex Socio-Technical System. Human practitioners are the adaptable element of complex systems.

Make sure you think about the interface and interactions (human - software interactions)

  • Think about the intent, what problem are you likely to be solving (use cases)
  • Consistency is really important
  • Will it blend - how does it interact with other systems
  • Consider state of mind - high intensity situations/ tired operators

 Sarah Hagan, Redfin - Going for Brokerage: People analytics at Redfin

Video

Redfin is an online Estate Agency with agents on ground

Monitoring hiring

  • Capture lots of data on the market
  • Where should we move?
  • How many staff should we have in each location?
  • Useful tooling for the audience
  • Hire employees rather than contractors, analyse sold house price data to make sure employees earn enough vs. commission agents

Monitoring employees

  • Customer reviews for agents
  • Agents paid based on rating
  • Let the customer monitor the business
  • Monitor loading capacity of agents

Monitoring culture

  • Internal forums for feedback on tooling.

Pete Cheslock, Threat Stack - Everything @obfuscurity Taught Me About Monitoring

Slides | Video

Told the story of his history of learning about monitoring, and how he has approached monitoring problems at his current startup.

A telemetry and alerting system is not a core competency.

  • Do simple things early when it makes sense (put metrics in logs).
  • When it’s necessary to get more data - just buy something.
  • Hosted TSDB is useful and just works, but there are faster non-durable metrics which are important. So he used graphite for 10s interval metrics, with 2 collectd processes writing to two outputs
  • Ended up with a full graphite deployment

Operability Day Two

Day one is available here

Charity Majors, Parse/Facebook - Building a world class ops team

Slides

This was a talk focusing on bootstrapping an ops team for startups.

Do you need an ops team?

Ops engineering at scale is a specialised skillset. It is not someone to do all the annoying parts of running the systems. Or do you need software engineers to get better at ops?

You need an ops team if you have hard problems:

  • extreme reliability
  • extreme scalability (3x-10x year over year)
  • extreme security
  • solving operational problems for the whole internet

What makes a good startup ops hire?

It's not possible to hire people who are good at everything - unicorns. What you can get are engineers who are good at some things, bad at others. People who can learn on the fly are valuable.

“A good operations engineer is broadly literate and can go deep on at least one or two areas”

Great ops engineers:

  • strong automation instincts
  • ownership over their systems
  • strong opinions, weakly held
  • simplify
  • excellent communication skills, calm in a crisis
  • value process (as that is what stops you making the same mistakes over again)
  • empathy

Things that aren’t good indicators:

  • whiteboarding code
  • any particular technology or language
  • any particular degree
  • big company pedigree

Succeeds at a big company:

  • structured roadmap
  • execute well on small coherent slices
  • classical cs backgrounds
  • value cleanliness & correctness
  • technical depth

Succeeds at startup:

  • comfortable with chaos
  • knows when to solve 80% and move on
  • total responsibility for outcomes
  • good judgement
  • highly reactive
  • technical breadth

How do you interview and sort for these qualities?

Don’t hire for lack of weaknesses. Figure out what strengths you really need and hire for those.

Good questions:

  • leading and broad, probe the candidate’s self-reported strengths
  • related to your real problems
  • ask culture questions, screen for learned helplessness

Bad questions:

  • depend on a specific technology
  • designed to trip them up, looking for a reason to say no
  • deny candidates the resources they would use to solve something in the real world

You hired an ops engineer, now what?

How to spot a bad ops engineer:

  • tweaking indefinitely and pointlessly
  • walling off production from developers
  • adding complexity
  • won’t admit they don’t know things
  • disconnected from customer experience

How to lose good ops engineers:

  • all the responsibility, none of the authority
  • all the tedious shitwork
  • blameful culture
  • no interesting operational problems

David Mytton, Server Density - Human Ops - Scaling teams and handling incidents

Slides

This talk covered how incidents are handled at Server Density.

We should expect downtime - prepare, respond, postmortem.

Prepare

Things that need to be in place before an incident

  • on-call schedule with primary & secondary
  • off call - 24hr recovery after an overnight incident
  • docs must be located independently from primary infrastructure
  • key info must be available: team contacts, vendor contacts, credentials
  • plan for unexpected situations: loss of communication, loss of internet access
  • use war games to practise for incidents

Respond

Process to follow during an incident:

  • First responder
    • load incident response checklist
    • log into ops war room
    • log incident in jira
  • Follow Checklist(s)
    • due to complexity
    • easy to follow in times of stress and fatigue
    • take a beginner’s mind - ego can get in the way, don’t wing it
  • Key Principles
    • log everything (all commands run, by who and where and what the result was)
    • communicate frequently
    • gather the whole team for major incidents

Postmortem

  • Do within a few days
  • Tell the story of what happened - from your logs
  • Cover the appropriate technical detail
  • What failed, and why? How is it going to be fixed?

Emma Jane Hogbin Westby, Trillium Consultancy - Emphatically Empathetic

Emma talked about how she taught herself to be more empathetic.

It’s normal to lack empathy; it’s a skill that can be practised and learned.

What is empathy: ability to understand the feelings of another

Level 1: Care just enough to learn about a person’s life

Doing this improves team cohesion, but requires a time investment.

Collect stories - learn about people by asking them questions. Shut up and listen. Respond in a way to encourage more info gathering.

Later, refer back to stories and follow up for more information.

Level 2: Strategies to structure interactions

By doing this you can engineer successful outcomes and improve capacity for diverse thinking, but you risk being perceived as manipulative.

It’s a mistake to believe there is only one way to have a connection. Try to uncover motivation, why do people behave the way they do?

There are three types of thinking strategies, and you’ll see language patterns that match each of them.

creative thinking: ‘can we try…’, ‘what about…’

understanding thinking: ‘so what you’re saying is…’, ‘just to clarify…’

decision thinking: ‘I’m ready to move on to…’, ‘last time we tried this…’

How can you create outcome-based interactions for these sorts of thinkers? Perhaps you can plan for specific types of discussion in meeting agendas.

Find a system to use with your team to make communications more explicit, and to take advantage of the thinking strategies they use.

Level 3: Engage with work from another’s perspective

This can foster creative problem solving. The risks are that it is potentially overwhelming, and can cause doubts about self-worth.

Seek to understand - complain about yourself from the other’s perspective, or situation. Live your day through the other’s constraints.

The thinking process should be no more left to chance than the practised delivery of a skill.


There was an interesting question after Emma’s talk - “How do I make Bob care about Dave from another team”. Her suggestion was to create a situation where they can bond over a common enemy - i.e. say something you know to be untrue and that they would both respond to in a similar way.

Scott Klein, statuspage.io - Effective Incident Communication

Remember that there is someone on the other end of our incidents who is affected personally.

The talk covered what to do before, during and after an incident. You need a dedicated place to communicate system status to your users.

Before

Get a status page. It needs the following:

  • timestamps
  • to be very fast, very reliable
  • keep away from primary infrastructure, even DNS
  • contact info - give a way to get in touch

During

  • communicate early. say you are investigating - it means ‘we have no clue but at least we’re not asleep’
  • communicate often. always communicate when the next update is.
  • communicate precisely: be very declarative
    • don’t give ETAs, they will disappoint people
    • don’t speculate: ‘we’re still tracking down the cause’
    • ‘verification of the fix is underway’ not ‘we think we fixed it’
  • communicate together. have pre-written templates.
  • one person needs to be assigned as incident communicator

After

  • apologise first
  • don’t name names
  • be personal. “I’m very sorry”. Take responsibility.
  • details inspire confidence
  • close the loop - what we’re doing about it

Why do this?

  • gain trust with users/customers
  • turn bad experience into good experience
  • service recovery paradox - people think more highly of a company if it responds to a failure properly.
  • show that you do your job well

Rich Archbold, Intercom.io - Leading a Team with Values

The talk covered Rich’s experience of introducing core values to drive the performance of the team. They reduced downtime, infrastructure costs, and the number of ops pages.

Enabled autonomy, distributed decision making.

Problems they were facing:

  • roadmap randomisation - easily distracted from what they planned to do
  • projects took a long time and were delivered late
  • not feeling like a tribe

Criteria for values:

  • fit with the business
  • personal, specific
  • aspirational and inspirational
  • drive daily decision making
  • not dogma - needed to be flexible

These are the Values they came up with:

  1. Security, Availability, Performance, Scalability, Cost - prioritize for maximum impact
  2. Faster, Safer, Easier Shipping
  3. Zero Touch Ops
  4. Run Less Software

Afterwards they gathered lots of metrics on unplanned work. From this they worked out that they needed to multiply estimates by 2.7 to get accurate roadmap planning.

Matthew Skelton, Skelton Thatcher Consulting - Un-Broken Logging, the foundation of software operability

Slides

The way we use logging is broken; this talk covered how to make it more awesome.

What is logging for? It provides an execution trace.

How is logging usually broken? It’s often unloved, discontinuous, contains errors only, is bolted on, lacks aggregation and search, and severities aren’t useful because they need to be determined up front.

Also, logs aren’t free. You need to allocate budget and time to make them useful.

Why do we log? For verification, traceability, accountability, and charting the waters.

How to make logging awesome

Continuous Event IDs - use them to represent distinct states. Describe what’s useful for the team to know and capture each of those as a separate state. Use enums.
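
To illustrate the idea (this is my sketch, not code from the talk), here is one way to do it in Python; the event names and ID values are made up:

```python
import logging
from enum import Enum

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("orders")

class EventID(Enum):
    """Distinct, stable application states worth reporting (hypothetical examples)."""
    ORDER_RECEIVED = 1001
    PAYMENT_AUTHORISED = 1002
    PAYMENT_DECLINED = 1003
    ORDER_SHIPPED = 1004

def log_event(event: EventID, **context) -> None:
    # Key log entries off the stable enum value rather than the message text,
    # so searches and alerts survive copy edits to the wording.
    logger.info("event_id=%d event=%s context=%s", event.value, event.name, context)

log_event(EventID.PAYMENT_DECLINED, order_id="A123", reason="card_expired")
```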

Transaction tracing - create a unique-ish identifier for each request, and pass it through the layers.
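
A minimal Python sketch of the same idea, assuming a single process (in a real system the ID would also be propagated in request headers); the field names are hypothetical:

```python
import logging
import uuid
from contextvars import ContextVar

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("api")

# Correlation ID for the current request, visible to every layer it calls.
request_id: ContextVar[str] = ContextVar("request_id", default="-")

def charge_card(amount: str) -> None:
    # Lower layers log the same ID, so the whole transaction can be
    # stitched back together in the log aggregator.
    logger.info("request_id=%s event=charging amount=%s", request_id.get(), amount)

def handle_request(payload: dict) -> None:
    # The entry point mints a unique-ish ID once per request...
    request_id.set(uuid.uuid4().hex)
    logger.info("request_id=%s event=request_received", request_id.get())
    charge_card(payload["amount"])

handle_request({"amount": "9.99"})
```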

Decouple Severity - allow configurable severity levels. Log level should not be fixed at compile or build time. Map Event IDs to a severity.
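
One possible way to express this in Python (my sketch, with made-up event IDs): the severity mapping lives in data, so it could be loaded from a config file or settings service at runtime rather than being fixed at build time:

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
logger = logging.getLogger("orders")

# Severity is data, not code: this mapping could be loaded from configuration
# and changed without rebuilding or redeploying the application.
SEVERITY_BY_EVENT = {
    1002: logging.INFO,     # payment authorised
    1003: logging.WARNING,  # payment declined
    1005: logging.ERROR,    # payment provider unreachable
}

def log_event(event_id: int, message: str) -> None:
    level = SEVERITY_BY_EVENT.get(event_id, logging.INFO)
    logger.log(level, "event_id=%d %s", event_id, message)

log_event(1003, "payment declined for order A123")
log_event(1005, "provider timeout after 3 retries")
```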

Log aggregation and search tools - as we move from monoliths to microservices, the debugger no longer has the full view. You need an aggregated view of logs across a system. Develop software using log aggregation as a first-class thing.
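
As a small illustration (not from the talk), logs are easiest to aggregate when each line is a structured record; a hedged Python sketch that emits one JSON object per log line:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so an aggregator can index fields directly."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment declined for order %s", "A123")
```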

Design for logging - logging is another system component, and needs to be testable.

NTP - Time sync is crucially important for correlating log entries

Referenced the following video:

Evan Phoenix - Structured Logging

Gareth Rushgrove, Puppet Labs - Taking the Operating System out of operations

Slides

The age of the general purpose operating system is over. What does this mean for operators?

Lots of new OSes have appeared in the last year

New Breed:

  • Atomic (RedHat)
  • CoreOS
  • Snappy (Ubuntu, replaces dpkg with containers)
  • RancherOS (docker all the way down)
  • Nano - tiny alternative to Windows Server
  • VMWare Bonneville

Common themes:

  • Cluster native
  • RO file systems
  • Transactional updates
  • Integrated with containers

Why the interest in New OSes?

  • Lots of homogeneous workloads
  • Security is front page news
  • Size as a proxy for complexity
  • Utilisation matters at scale
  • Increasingly interacting with higher level abstractions anyway

Unikernels

Compile an application down to a kernel, there is no userspace. Only include the capabilities and libraries you need - everything is opt-in.

  • Hypervisor/hardware isolation
  • Smaller attack surface
  • Less running code
  • Enforced immutability
  • No default remote access

What happens to operators?

Hypervisor becomes the “platform”.

Everything else as an application. Firewalls. Network Switches. IDS. Remote access.

Everyone not running the hypervisor is an application developer. Standards required: Platforms, Containers, Monitoring. Publish more schemas and fewer incompatible implementations in code.

Infrastructure is code.

Revolution, not evolution. The distance between old infrastructure and new will be huge, as will the models of interaction and the skills required to operate it.

Conclusions:

We have fundamental problems that date back more than 40 years. It might take a different evolutionary process to build better infrastructure. We may have to throw away things we care about, such as Linux. This is all driven by security concerns.

Ben Hughes, Etsy - Security for Non-Unicorns

Slides

Security is hard. Tiny little bugs turn into giant things.

You’re already being probed for security holes, do you want to know or not? Bug Bounties are a way of getting attackers working for you.

You need to prepare a lot for bug bounties. Try and get all the low hanging fruit yourself. The first few weeks will be hell.

With much of our infrastructure in the cloud, it’s easy to expose sensitive information, such as credentials, on places like github. Gitrob helps to analyse git repos for you.

People trust random files off the internet - like docker images, vagrant images, and curl|bash installs etc.

Operability Day One

Today I’ve been at the operability.io conference in London.

The conference is intended to focus on the ops side of devops, and how to make software operable.

This is a single track conference, here is my personal summary and notes from the talks:

Andrew Clay Shafer, Pivotal - What even is operable?

Video

I’ve seen a few talks by Andrew before, and they’re full of challenging, rapid-fire ideas only loosely tied together - it’s more an expression of his world view than a talk on a specific subject. With that in mind, I’m still going to try and summarise it.

Andrew asked the question “what even is operable”, and in the end he came to the conclusion that operability is the intersection of capability and usability.

There is an emergent architecture, which he calls cloud native, which are a set of patterns that emerged in organisations that deliver highly available applications continuously at scale - like Amazon, Google, Twitter, Facebook etc. There are many associated labels, like devops, continuous delivery and microservices, and these are all inter-related as part of that architecture.

“Do not seek to follow in the footsteps of the wise. Seek what they sought.” - Matsuo Basho

The human tendency is to fixate on the solution. We need to think about the problem more. Principles > Practices > Tools. You’re not going to be able to do the right things until you internalise the principles. Otherwise you are just imitating or cargo culting.

Equally, we can focus on automation tools and capabilities, not what is being automated.

Other pertinent points:

  • if Tetris has taught me anything, it’s that errors pile up and accomplishments disappear
  • systems thinking teaches that we should minimise resistance rather than push harder
  • highlighted the Borg paper, and that all tasks running in Borg run an HTTP server that reports health status and metrics
  • operations problems become easier when apps are aware of their own health
  • worse is better won: broken gets fixed but shitty lasts forever

Referenced the following for reading:

Colin Humphreys, CloudCredo - inoperability.io

Colin told a war story about what happens when you completely ignore operations - what’s the worst that could happen?

A game he worked on had 5 million users and “Launch issues”. It hardly worked, because they expected 20k users and ran initially with no budget for operations.

There was a big launch, but the infrastructure for the system was built the day before and completely untested. It was swamped within 10 seconds. 1.7 million people tried to play on the first day, but couldn’t. There were various media reports of the failure.

As a result, he managed to steal some funding from the marketing budget and scale 100x, add caching, and throw lots of hardware at the problem. To use the 64 DB servers, they used a sharded PHP ORM to distribute API server traffic.

He did get it working, but in time they found transactions weren’t working. This was supposed to be handled within the application, but wasn’t, because of differences between the development and production environments; as a result, cash was disappearing from the game and everyone’s scores had to be reset.

Personally, Colin worked 54 hours solid to get the thing running. For the three months the site ran, he worked 100-hour weeks.

This is how bad a project can get. “I nearly died”. He must certainly have burned out. Took a personal sense of pride in the project. Horrible to other people around. Heroism != success.

Takeaways:

  • Work as a team
  • Communicate

Anthony Eden, DNSimple - How small teams accomplish big things

Slides | Video

Anthony’s talk was about scaling a team, and how they had to scale their operational processes - as a result of various experiences, but none more so than losing one of the founders, someone who knew everything about everything.

Specific Processes:

Incident Response Plan, consisting of 4 points: Assess, Communicate, Respond, Document.

Assess - think before you act. Determine impact. Have a threshold for requesting help. >10 mins results in everyone in a Google Hangout.

Communicate - 1 person responsible to communicate to customers. Update twitter, status page.

Respond - minimise impact. Triage first. Everyone asks questions and proposes solutions. Consider available actions and act on the best available.

Document - post mortem. What happened? Why did it happen? How did we respond and recover? How might we prevent similar issues from occurring again?

Other processes were mentioned: On-call rotation. Security Policies. Security escalation policy. System security, RBAC. Track CVEs. Password rotation - especially on admin passwords. etc.

What makes a good process?

  • Born from Experience
  • Written down and available to all
  • Concise & Clear
  • Act as guidance

Once they’re in place you need to:

Execute it, to test it out. Prune where appropriate. Automate where that makes sense.

Bridget Kromhout, Pivotal - distributed: of systems and teams

Slides | Video

Bridget’s talk compared how distributed systems are complex, but so are distributed teams in many of the same ways.

Firstly, distributed != remote. Having a few people out of the office is not the same.

What’s important in teams is people > tools. Focusing on people and what they’re communicating is more important than the tools and how.

She made various points which are important for distributed teams:

Durable communication encourages honesty and transparency, and helps future you - “durable communication exhibits the same characteristics as accidental convenient communication in a co-located space. The powerful difference is how inclusive and transparent it is.” - Casey West

  • Let your team know when you’ll be unavailable.
  • Tell the team what you’re doing.
  • Misunderstandings are easy, so you need to over-communicate. Especially when expressing emotions - it’s easy to misinterpret textual communication.
  • Be explicit about decisions you’re making.
  • Distribute decision making

Colin Hemmings, dataloop.io - In god we trust, all else bring data

This talk was about the experience of building dataloop as a startup, coming from working in other companies, and what was learned from speaking to 60 companies about monitoring; it focused on dashboards.

Generally see the following kinds of dashboards:

  • Analytics dashboards, to diagnose performance issues. Low level, detailed info.
  • NOC dashboard, high level overview of services.
  • Team dashboards, overview of everything not just technical elements - includes business metrics.
  • Public dashboards, high level, simplified & sanitised marketing exercises.

Keeping people in touch with reality is the problem, along with knowing the right thing to work on. Discussions on what features to work on get opinionated. There is data within our applications that we can use to make decisions, and it can be represented on dashboards.

  • Stability dashboards: general performance, known trouble spots.
  • Feature dashboards: customer driven, features forum. “I suggest you…” & voting
  • Release dashboards: dashboards for monitoring the continuous delivery pipeline

Elik Eizenberg, BigPanda - Alert Correlation in modern production environments

My personal favourite talk of the first day.

Elik’s contention is that there is a lack of automation in regard to responding to alerts.

Incidents are composed of many distinct symptoms, but monitoring tools don’t correlate alerts on those symptoms for us into a single incident.

The number of alerts received might not be proportional to the number of incidents. The number of incidents experienced may be similar day to day, but the impact of the incidents can be very different - hence the number of alerts being higher sometimes.

Existing Approaches:

  • compound metrics (i.e. aggregations, or a compound metric built from many hosts). This is relatively effective, but alerts are received late, and you can miss symptoms in the buildup to the alert triggering.
  • service hierarchy, i.e. hierarchy of dependencies related to a service. Problem with this approach is that it’s hard to create & manage. Applications and their dependencies are generally not hierarchical.

“What I would like to advocate”: Stateful Alert Correlation

Alerts with a sense of time, aware of what happened before now.

Alerts can create a new incident, or link themselves to an existing incident if it’s determined they’re related.

How do you know that an alert belongs to an incident?

  • Topology - some tag that every alert has. (Service? Datacenter? Role?)
  • Time - alerts occurring close in time to another alert
  • Modeling - learning if multiple alerts tend to fire within a short timeframe
  • Training - Machine Learning. User feedback if correlation good or bad.

Even basic heuristics are effective. There is lots of value to be gained from just applying Topology and Time.
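
To make that concrete, here is a toy Python sketch of just those two heuristics: an alert joins an open incident if it shares a topology tag (the service, in this sketch) and arrives within a time window, otherwise it opens a new incident. The field names and the 15-minute window are my assumptions, not BigPanda’s implementation:

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 15 * 60  # assumed correlation window

@dataclass
class Alert:
    service: str    # topology tag shared by related alerts
    check: str
    timestamp: float

@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self) -> float:
        return max(a.timestamp for a in self.alerts)

def correlate(alerts: list) -> list:
    """Group alerts into incidents by topology (service) and time proximity."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for incident in incidents:
            if (incident.service == alert.service
                    and alert.timestamp - incident.last_seen <= WINDOW_SECONDS):
                incident.alerts.append(alert)
                break
        else:
            incidents.append(Incident(service=alert.service, alerts=[alert]))
    return incidents

alerts = [
    Alert("api", "high_latency", 0),
    Alert("api", "error_rate", 120),
    Alert("db", "disk_full", 200),
    Alert("api", "high_latency", 4000),  # outside the window: starts a new incident
]
for incident in correlate(alerts):
    print(incident.service, [a.check for a in incident.alerts])
```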


There were two other talks after this, but I had to leave early. Sorry to the presenters for missing their talks!