Optimising for learning - Logan McDonald
expert intuition is achievable. improve memory.
hierarchy of learning - an important level is rule learning.
reading docs/textbooks is not an effective way to embed knowledge in long-term memory. do low-stakes testing. the best way of learning is to retrieve the information.
observability and reflection
move from events to patterns to structure
engaging in incident reviews gave opportunity to quiz engineers on their actions during incident
symmathesy - systems of understanding
embrace cultural memory
Serverless and CatOps: Balancing trade-offs in operations and instrumentation - Pam Selle
I sadly missed most of this because I was debugging a problem
Mentoring Metrics Engineers: How to Grow and Empower Your Own Monitoring Experts - Zach Musgrave, Angelo Licastro (Yelp)
growth in teams, and managing growth in individuals within a team
mandate growth - more skills needed, more knowledge gaps, more planning (meetings), trade offs.
knowledge silos - reaction is mentoring
breadth: build confidence. defined mentor relationship.
depth: make an advanced contribution to one system.
next steps: end of mentor relationship, can start on-call. hold a retrospective - provide feedback for improvement
impostor syndrome - accept compliments for your work and acknowledge others’ contributions
consulting: monitoring is a specialised field. there’s a lot of nuance.
make people aware of what tools can solve their problems. talk with other teams, ask insightful questions: what problem are you trying to solve? listen, and assume you know nothing about their problem.
The Power of Storytelling - Dawn Parzych (Catchpoint)
Why presenting raw data often fails:
- too much info
- not enough meaning
- not enough time
- not enough memory
To tell better stories, when presenting data view it from the perspective of other parts of the organisation
Persuasive storytelling doesn’t need to be complicated - keep it simple
Use simpler language - it doesn’t make it less powerful
Present the most important information earlier
- Use visuals when telling a story, keep them simple to understand
- Don’t overcomplicate things
- Remember the power of 3
Principia SLOdica - A Treatise on the Metrology of Service Level Objectives - Jamie Wilkinson
Overloaded team, tasked with reducing load to build a sustainable system they could manage
Symptom Based Alerting - focus on very few expectations about your service, so that as the system grows you are still focusing on the same things
Symptom: a user is having a bad time. Causes: internal views on the service.
What’s your tolerance for failure? Error budget. Set expectations: SLO
A symptom is anything that can be measured by the SLO. A symptom-based alert fires when the SLO is in danger of being missed.
Debugging tools - metrics, tracing, logs - replace cause based alerts
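As a rough illustration (not from the talk; the SLO target and paging threshold here are assumptions), an error-budget burn check might look like this:

```python
# Hypothetical sketch: is the SLO in danger of being missed?
# Assumes a 99.9% availability SLO over a rolling window.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail

def burn_rate(failed: int, total: int) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    if total == 0:
        return 0.0
    return (failed / total) / ERROR_BUDGET

# Symptom-based alert: page only when the budget is burning much faster
# than is sustainable, regardless of which internal cause is responsible.
if burn_rate(failed=42, total=10_000) > 2.0:
    print("page: SLO at risk")
```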
On-call Simulator! - Building an interactive game for teaching incident response - Franka Schmidt (Mapbox)
On call onboarding
A safe and low-stakes place to practice handling on-call incidents
- Buddy System: Observing
- Bucket list: a checklist of how far along you are with the experiences of being on call
- Alarm scrum: review the last day’s alarms
Simulator - “choose your own adventure” type text adventure game. Tool: Twine.
Craft a story - use Incident Reviews, past notes and enrich with detail. Or just make it up.
Iterate and get feedback.
Observability: the Hard Parts - Peter Bourgon (Fastly)
Levelling up the monitoring story at Fastly
Senior engineering org, 120 engineers. Heterogeneous solutions throughout the organisation. Organic growth is good - up to a point.
Goals: are production systems healthy? if not why not?
Non goals: long term storage, replacing logging. (Observed metric usage VERY different from self-reported usage)
Strategy: prometheus, add instrumentation, curate a new set of dashboards and alerts, run in parallel to build confidence, decommission old systems.
Rollout: Inventory all services. Rank by criticality & friendliness.
Embedded expert model. Pair with responsible engineer to get migration done. Hour or two of pair programming. Let documentation emerge naturally. Deploy prometheus, plumb service into prometheus (time consuming).
Technical autonomy is a form of technical debt. Focus on local vs. global optimization.
Dashboards and alerts. Import from other systems. Make sure to only build things in service to original goals. Reduce, keep high level views.
For some services ownership can be indistinct. Senior leadership is needed to wield the stick - you need their buy-in.
Warning: This Talk Contains Content Known to the State of California to Reduce Alert Fatigue - Aditya Mukerjee (Stripe)
We can learn from clinical healthcare
Alert Fatigue - frequency/severity - causes responder to ignore or make mistakes
Decision Fatigue - frequency/complexity of decision points causes a person to avoid devices or make mistakes
Certain patterns of alerts and decisions contribute disproportionately to fatigue. Multiple false positives for an individual patient impact trust in that alert across all patients.
Reduce alert fatigue: STAT. Supported, Trustworthy, Actionable, Triaged.
Supported: who owns this alert? Responders should own, or feel ownership over end result.
Trustworthy: do I trust this alert to notify me when a problem happens? To stay silent when all is well? To give sufficient information to diagnose problems?
Anomaly detection: if you don’t understand why an alert triggered, you don’t understand if it’s real.
Actionable: one decision required to respond. Alerts that are difficult to action get ignored. Make alerts specific, add decision trees. Alerts must have a specific owner.
Triaged: Triage alerts. Type should reflect urgency. Urgency can change. Commonly understood tiers. Regular re-evaluation process.
Justin Reynolds, Netflix - Intuition Engineering at Netflix
Discussed the problem that Netflix regions were siloed; they worked on serving users out of any region.
To fail a region, they need to scale up the other regions to serve all traffic
Dashboards are good at looking back, but you need to know about now. How do you provide intuition of the now?
Created vizceral - see the blog post for screenshots & video: http://techblog.netflix.com/2015/10/flux-new-approach-to-system-intuition.html
Brian Brazil, Robust Perception - Prometheus
Prometheus is a TSDB offering ‘whitebox monitoring’ for looking inside applications. It supports labels; alerting and graphing are unified, using the same language.
Pull based system, links into service discovery. HTTP api for graphing, supports persistent queries which are used for alerting.
Provides an instrumentation library - incredibly simple to instrument functions and expose metrics to prometheus. Client libraries don’t tie you into prometheus - can use graphite.
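For a flavour of how simple, here is a minimal sketch using the official Python client, prometheus_client (the metric names are invented):

```python
# Instrument a function and expose metrics for Prometheus to scrape.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["method"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency")

@LATENCY.time()          # observe how long each call takes
def handle(method: str) -> None:
    REQUESTS.labels(method=method).inc()

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics over HTTP
    while True:
        handle("GET")
        time.sleep(1)
```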
Can use prometheus as a clearing house to translate between different data formats.
Doesn’t use a notion of a machine. HA by duplicating servers, but alertmanager deduplicates alerts. Alertmanager can also group alerts.
Data stored as a file per time series on disk, not round-robin - stores all data without downsampling.
Torkel Ödegaard, Raintank - Grafana Master Class
Gave a demo on how to use grafana, as well as recently added and future features.
Katherine Daniels, Etsy - How to Teach an Old Monitoring System New Tricks
Old Monitoring System == Nagios
Adding new servers.
- Use deployinator to deploy nagios configs. Uses chef to provide inventory to generate a current list of hosts and hostgroups.
- Run validation via Jenkins by running nagios -v, as well as a purpose-written nagios validation tool (see the sketch after this list).
- New hosts are added with scheduled downtime so they don’t alert until the next day. Chat bots send reminders when downtime is going to finish.
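The validation step might look something like this sketch (the config path is an assumption; nagios -v is nagios’s own verify flag):

```python
# Hypothetical CI check: nagios exits non-zero when the generated
# config is invalid, so Jenkins can fail the build before deploy.
import subprocess
import sys

result = subprocess.run(
    ["nagios", "-v", "/etc/nagios/nagios.cfg"],  # path is an assumption
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    print(result.stdout)
    sys.exit("nagios config validation failed")
print("nagios config OK")
```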
Making Alerts (Marginally) Less Annoying
- Created nagdash to provide federated view of multiple nagios instances.
- Created nagios-herald to add context to nagios alerts. Also supports allowing people to sign up to alerts for things they’re interested in.
- Ops weekly tool. Provides on call reports, engineers flag what they had to do with alerts.
- Sleep tracking and alert tracking for on call staff to understand how many alerts they’re facing and how it’s impacting their sleep.
An On Call bedtime story
- Plenty of alerts because scheduled downtime expired for ongoing work.
- Created daily reports of which downtimes will soon expire and which will raise alerts.
Joe Damato, packagecloud.io - All of Your Network Monitoring is (probably) Wrong
There’s too much stuff to know about
- ever copy paste config or tune settings you didn’t understand?
- do you really understand every graph you’re generating?
- what makes you think you can monitor this stuff?
Claim: the more complex the system is the harder it is to monitor.
what’s pretty complicated? the linux networking stack! lots of features, lots of bugs, and no docs!
- /proc/net stats can be buggy
- ethtool inconsistent, not always implemented
- meaning of driver stats are not standardised
- stats meaning for a driver/device can change over time
- /proc/net/snmp has bugs: double counting, not being counted correctly
Monitoring something requires a very deep understanding of what you’re monitoring.
Properly monitoring and setting alerts requires significant investment.
Twitter has a big monitoring system and migrating was hard.
3.5B metrics per minute
Old Alerting System
- 25k alerts/minute, 3m alert monitors
- single config language, lots of existing examples, easy to write and add
- all those points were good and bad!
- lots of orphaned and unmaintained configs, no validation
- alerts and dashboards were separate
- problems with reliability when zones suffer problems
- combined alerts and dashboard configuration
- dashboards defined in python, common libraries that can be included
- python allows testing configs (see the sketch after this list)
- created multi-zone alerting system
- reduced time to detect from 2.5mins to 1.75mins
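Presumably the payoff looks something like this sketch (the names are invented): configs become ordinary Python objects built from shared libraries, so they can be tested before rollout.

```python
# Hypothetical flavour of alerts/dashboards defined in Python.
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    threshold: float
    duration_minutes: int

def high_latency(service: str) -> Alert:
    # A shared library function any team can reuse in their config.
    return Alert(f"{service}.p99_latency_ms", threshold=500, duration_minutes=5)

def test_high_latency_is_sane():
    # Because config is code, it can be validated in CI.
    assert high_latency("web").threshold <= 1000
```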
Helping Human Reasoning
- bring together global, dependencies and local context
- including runbooks, contacts and escalations directly in the UI
- distributed system, challenges about consistency, structural complexity and reasoning about time
- sharding choices are hard, impossible to always avoid making mistakes
- support and collaborate with users, try and reduce support burden with good information at interaction points (UI, CLI etc.), good user guides and docs
- migrating - some happy to move, others not. some push back. had to accept schedule compromise and extra work.
Joey Parsons, Airbnb - Monitoring and Health at Airbnb
Prefers to buy stuff:
- New Relic
- Datadog, instrument apps using dogstatsd
- Alerting through metrics
Created open source tool, Interferon, to store alerts configuration as code.
Volunteer on call system. SREs make sure things in place so anyone can be on call.
- Sysops training for volunteers, monitoring systems, how to be effective and learn from historical incidents
- Shadow on call, learn from current primary/secondary
- Promoted to on call
Weekly sysops meeting, go through incidents, hand offs, discuss scheduled maintenance.
On call health:
- are alerting trends appropriate?
- do we understand impact on engineers?
- do we need to tune false positives?
- are notifications and notification policies appropriate?
- incident numbers over time
- counts by service
- total notifications per user and how many come at night
- false positive incident counts
Heinrich Hartmann, Circonus - Statistics for Engineers
- Measure user experience/ quality of service
- Determine implications of service degradation
- Define sensible SLA targets
External API Monitoring:
- synthetic requests measure availability, but are a poor guide to user experience
- on long time ranges, rolled-up data is commonly displayed, erodes spikes
- write request stats to log file
- rich information but expensive and delay for indexing
Monitoring Latency Averages:
- mean values, cheap to collect, store and analyse, but skewed by outliers/low volumes
- percentiles, cheap to collect, store and analyse, robust to outliers, but an up-front percentile choice is required and they cannot be aggregated
percentiles: keep all your data. don’t take averages! store percentiles for all reporting periods you are interested in - i.e. per min/hour/day. store all percentiles you’ll ever be interested in.
Monitoring with Histograms:
- divide latency scale into bands
- divide time scale into reporting periods
- count the number of samples in each band x period
Can be aggregated across times. Can be visualised as heatmaps.
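A sketch of the idea (the band edges and period length are arbitrary choices): unlike stored percentiles, band counts can be summed across periods or hosts, then rendered as a heatmap.

```python
# Count samples per (reporting period x latency band).
import bisect
from collections import Counter

BAND_UPPER_MS = [1, 5, 10, 50, 100, 500, 1000]  # arbitrary band edges

def band(latency_ms: float) -> int:
    return bisect.bisect_left(BAND_UPPER_MS, latency_ms)

counts = Counter()  # (minute, band) -> number of samples

def record(minute: int, latency_ms: float) -> None:
    counts[(minute, band(latency_ms))] += 1

record(0, 3.2)
record(0, 120.0)
record(61, 7.5)

# Aggregating across time is just addition - something percentiles can't do.
hourly = Counter()
for (minute, b), n in counts.items():
    hourly[(minute // 60, b)] += n
```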
John Banning, Google - Monarch, Google’s Planet-Scale Monitoring Infrastructure
Huge volume, global span, many teams - constant change
Previously borgmon. Each group had its own borgmon, putting a large load on anyone doing monitoring. Hazing ritual - the new engineer gets to do borgmon config maintenance.
Goals for the new system:
- can handle the scale
- small/no load to get up and running
- capable of handling the largest services
Monitor locally. Collect and store the data near where it’s generated. Each zone has a monarch.
- Targets collect data with streamz library. Metrics are multi dimensional information, stores histogram.
- Metrics are sent to the monarch ingestion router, then on to a leaf, which is an in-memory data store; data is also written to a recovery log, and from the log to a long-term disk repository.
- Streams stored in a table, basis for queries
- Evaluator runs queries and stores new data for streams or sends notifications
Integrate Globally. Global Monarch - Distributed across zones, but a single place to configure/query all monarchs in all zones.
Provides both Web UI and Python interfaces.
Monarch is backend for Stackdriver
Monitoring as a service is the right idea. Make the service a platform to build monitoring solutions.
Brian Overstreet, Pinterest - Scaling Pinterest’s Monitoring
Started with Ganglia, Pingdom
Deployed Graphite, single box
Second Graphite architecture - Load Balancer, 2x relay servers, multiple cache/web boxes etc.
Suffered lots of UDP packet receive errors
Put statsd everywhere
- fixed packet loss, unique metric names per host
- latency only per host, too many metrics
- then made metric names not unique per host
- shard mapping in the client; the client version needs to be the same everywhere (see the sketch below)
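Presumably the client-side mapping worked something like this (the hosts and hash choice are assumptions), which is why every client had to agree:

```python
# Hypothetical client-side shard mapping: each client hashes the metric
# name to pick a statsd host. If clients disagree (e.g. different
# versions), the same metric lands on different hosts and gets split.
import hashlib

STATSD_HOSTS = ["statsd-1:8125", "statsd-2:8125", "statsd-3:8125"]

def shard_for(metric: str) -> str:
    digest = hashlib.md5(metric.encode()).hexdigest()
    return STATSD_HOSTS[int(digest, 16) % len(STATSD_HOSTS)]

print(shard_for("web.requests"))  # stable for a given host list
```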
Multiple graphite clusters - one per application (python/java)
More maintenance, more routing rules etc.
Problems with reads, multiple glob searches can be slow
- local metrics agent, kafka, storm - send to graphite/opentsdb
- interface for opentsdb and statsd
- sends metrics to kafka
- processed by storm
120k/sec graphite, 1.5m/sec opentsdb. no more graphite, move to opentsdb.
Created statsboard - integrates graphite and opentsdb for dashboards and alerts
Graphite User Education - underlying info about how metrics are collected, precision, aggregation etc.
Protect System from Clients
- alert on unique metrics
- block metrics using zookeeper and a shared blacklist, created on the fly (see the sketch below)
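A rough sketch of that mechanism, assuming the kazoo ZooKeeper client and a made-up znode path:

```python
# Hypothetical: every ingest host watches a shared blacklist znode, so a
# noisy metric can be blocked everywhere as soon as the node is updated.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")  # assumed ensemble
zk.start()

blacklist = set()

@zk.DataWatch("/metrics/blacklist")  # assumed znode path
def on_update(data, stat):
    # Re-read the shared blacklist whenever the node changes.
    global blacklist
    blacklist = set(json.loads(data)) if data else set()

def accept(metric_name: str) -> bool:
    return metric_name not in blacklist
```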
Lessen Operational Overhead
- more tools, more overhead
- more monitoring systems, more monitoring of the monitoring system
- removing a tool in prod is hard
- data has a lifetime
- it’s not a magical data warehouse tool that returns data instantly
- not all metrics will be efficient
- match monitoring system to where the company is at
- user education is key to scaling tools organizationally
- tools scale with the number of engineers, not users
Lots of app logic is moving to the frontend, so it runs in the browser, not on backend servers
- capture load performance from browser, send to app server, use statsd + grafana & google analytics
- capture uncaught exceptions in the browser, using their own product
Eron Nicholson and Noah Lorang, Basecamp - CHICKEN and WAFFLES: Identifying and Handling Malice
Suffered DDoS and blackmail. 80 gigabits - DNS reflection, NTP reflection, SYN floods, ICMP flood
Defense and Mitigation:
- DC partner filters for them
- More 10G circuits and routers
- Arrangements with vendors to provide emergency access and other mitigation tools
Experience got them serious about more subtle application level attacks:
- vulnerability scanners
- repeat slow page requests
- brute force attempts
- nefarious crawlers
What do we want from a defense system?
- Protection against application-level attacks
- Keep user access uninterrupted
- Take advantage of the data we have available
- Transparent in what gets blocked and why
Chicken: who is a real user and who is malicious?
Considered Machine Learning classification. Problems: really hard to get a good training set. Need to be able to explain why an IP was blocked.
- Some behaviours are known to be from people up to no good: crawling for phpmyadmin, path traversal, repeated failed login attempts etc.
- Request history gives a good idea of whether someone is a normal user, broken script or a malicious actor.
- External indicators: geoip databases, badip dbs, facebook threat exchange
Removing simple things reduces noise. Every incoming request is scored, and a per-IP aggregate score is calculated based on return code. An Exponentially Weighted Moving Average is created from that data. About 12% of IPs had a negative reputation.
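A sketch of that scoring loop (the smoothing factor, scores and threshold are assumptions):

```python
# Per-IP reputation as an exponentially weighted moving average of
# request scores (negative for bad return codes / known-bad paths).
from collections import defaultdict

ALPHA = 0.2  # assumed smoothing factor
reputation = defaultdict(float)  # ip -> EWMA score, 0.0 = neutral

def observe(ip: str, request_score: float) -> None:
    reputation[ip] = ALPHA * request_score + (1 - ALPHA) * reputation[ip]

observe("203.0.113.7", -1.0)  # e.g. a failed login
observe("203.0.113.7", -1.0)  # e.g. probing for /phpmyadmin
dubious = [ip for ip, score in reputation.items() if score < -0.3]
```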
Scanning for blockable actions and scoring requests in near real-time using request byproducts.
Request logs, netflow data, threat exchange -> kafka -> request scoring, a scanner for known bad behaviour, and tools for manual evaluation.
Average IP reputation gives an early indicator to monitor for application level attack.
Provides list of good, bad, and dubious IPs.
- originally provided by iptables rules on haproxy hosts
- then tried rule on loadbalancer
- then tried null routing on routers
- finally created waffles
Using BGP flowspec to send data from the routers to waffles, which then decides what path a request takes: error, app or challenge. The waffles hosts live in a separate network with limited access.
Waffles is redis and nginx.
John Stanford, Solinea - Fake It Until You Make It
Monitoring an openstack cluster - 1 controller and 6 compute nodes - collecting logs with heka and sending them to elasticsearch. Can I scale this up to a thousand nodes? How big can it get?
How do you go about figuring that out?
- took 7 days of logs from lab, 25k messages/hr
- number of logs coming from a node
- number of logs coming from a component
- 7 day message rate, look at histograms, identifies recurring outlier
- message size, percentiles of payload size
What models look like what we’re doing for simulation? Add some random noise.
Flood process, monitoring everything, repeat until it breaks. System sustained 4k x 1k messages/sec, started to pause above that, but no messages were dropped.
- find bottlenecks
- improve the model
Tammy Butow, Dropbox - Database Monitoring at Dropbox
Achieving any goal requires honest and regular monitoring of your progress.
Originally used nagios, thruk (web ui) and ganglia
Created own tool vortex in 2013
why create in house monitoring?
- performance and reliability issues; the number of metrics was scaling fast
- Time Series Database with dashboards, alerting, aggregation
- Rich metric metadata, tag a metric with lots of attributes
Monthra: single way of scheduling and relaying metrics, discourage scheduling with cron
- what are your durability and reliability goals? align monitoring to those goals
- threads running/ threads connected
Run a Monthly Metrics Review (great idea)
Dave Josephsen, Librato - 5 Lines I couldn’t draw
Making the cognitive leap to use monitoring tools to recognise system behaviour independent of alerting. There is a misapprehension about what monitoring is and whom it is for.
Monitoring is not for alerting. Nobody owns monitoring - it’s a ‘tape measure that I share with every engineer I work with’. Ops owns monitoring vs everyone owns monitoring. Monitoring is for asking questions.
Complexity isolates. Effective monitoring gives you the things that allow you to ‘Cynefin’ - make things more familiar and knowable. Reduce complexity rather than embracing it. Monitoring can build bridges to help people understand things across boundaries.
Effective monitoring can bring about cultural change, how people interact between each other.
The fifth line repeated point 4.
Jessie Frazelle, Google - Everything is broken
Talked about problems with Software Engineering and Operations
Demonstrated how they monitored community and external maintainer PR statistics for Docker project.
James Fryman, Auth0 - Metrics are for Chumps - Understanding and overcoming the roadblocks to implementing instrumentation
Story of implementation of instrumentation at Auth0
Wanted data-driven conversations. A metrics implementation had happened in the past, but was ripped out because it was not well understood and was thought to cause latency. This created aversions.
Make the case. Get buy-in.
To have decent conversations with someone you need to have metrics. Common objections, and rebuttals:
- Not the most important feature - but it is!
- Cannot start until we understand the data retention requirements - premature optimisation
- We don’t run a SaaS - need to understand what your software is doing regardless
Make decisions based on knowledge, not intuition or luck.
Be opportunistic - success is 90% planning, 10% timing and luck. Find opportunities to accelerate efforts.
Needed to get something going fast - went for the full-service SaaS Datadog, but with common interfaces and shims to allow moving things in-house later (see the sketch below). Don’t delay, jump in and iterate.
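One way to read “common interfaces and shims” (this shape is my assumption, not Auth0’s code):

```python
# Metrics facade: application code depends on the interface, so the
# Datadog backend can later be swapped for an in-house one.
from abc import ABC, abstractmethod

class MetricsBackend(ABC):
    @abstractmethod
    def increment(self, name: str, value: int = 1) -> None: ...

class DatadogBackend(MetricsBackend):
    def increment(self, name: str, value: int = 1) -> None:
        pass  # would forward to the datadog/statsd client

class InHouseBackend(MetricsBackend):
    def increment(self, name: str, value: int = 1) -> None:
        pass  # future replacement behind the same interface

metrics: MetricsBackend = DatadogBackend()  # the single place to swap
metrics.increment("logins")
```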
Keep in sync with developers - change is difficult and there will be resistance, pay attention to feedback. Need to support interpretation of data.
Build out data flows, find potential choke points in the system, take a baseline measurement, check systems in isolation
Fix and Repair bottlenecks. Solved 3 major bottlenecks, went from 500 to 10k RPS.
This year I was lucky enough to attend Monitorama in Portland. Thanks to Sohonet for sending me! I’d wanted to attend again since going to Berlin in 2013, because the quality of the talks is the highest I’ve seen in any conference that’s relevant to my interests. I wasn’t disappointed, it was awesome again.
Here are my notes from the conference:
Adrian Cockcroft, Battery Ventures - Monitoring Challenges
This talk reflected on new trends, and how things have changed since Adrian talked about monitoring “new rules” in 2014.
What problems does monitoring address?
- measuring business value (customer happiness, cost efficiency)
Why isn’t it solved?
- Lots of change, each generation has different vendors and tools.
- New vendors have new schemas, cost per node is much lower each generation so vendors get disrupted
Talked about serverless model - now monitorable entities only exist during execution. Leads to zipkin style distributed tracing inside the functions.
Current Monitoring Challenges:
- There’s too much new stuff
- Monitored entities are too ephemeral
- Price disruption in compute resources - how can you make money from monitoring it?
## Greg Poirier, Opsee - Monitoring is Dead
Greg gave a history and definition of monitoring, and argued that how we think about monitoring needs to change.
Historically monitoring is about taking a single thing in isolation and making assertions about it.
- resource utilisation, process aliveness, system aliveness
Made a definition of monitoring:
Observability: A system is observable if you can determine the behaviour of the system based on its outputs
Behaviour: Manner in which a system acts
Outputs: Concrete results of its behaviours
Sensors: Emit data
Agents: Interpret data
Monitoring is the action of observing and checking the behaviour and outputs of a system and its components over time
Failures in distributed systems are now: responds too slowly, fails to respond.
Monitoring should now be about Service Level Objectives - can it respond in a certain time, handle a certain throughput, better health checks
We need to better Understand Problems (of distributed systems), and to Build better tools (event correlation particularly)
Nicole Forsgren, Chef - How Metrics Shape Your Culture
Measurement is culture. Something to talk about, across silos/boundaries
Good ideas must be able to seek an objective test. Everyone must be able to experiment, learn and iterate. For innovation to flourish, measurement must rule. - Greg Linden
Data over opinions
You can’t improve what you don’t measure. Always measure things that matter. That which is measured gets managed. If you capture only one metric, you know which one will be gamed.
Metrics inform incentives, shape behaviour:
- Give meaningful names
- Define well
- Communicate them across boundaries
Cory Watson, Stripe - Building a Culture of Observability at Stripe
To create a culture of observability, how can we get others to agree and work toward it?
Where to begin? Spend time with the tools, improve if possible, replace if not, leverage past knowledge of teams
Empathy - People are busy, doing best with what they have, help people be great at their jobs
Nemawashi - Move slowly. Lay a foundation and gather feedback. (Write down and attribute feedback). Ask how you can improve.
Identify Power Users - Find interested parties, give them what they need, empower them to help others
What are you improving? How do you measure it?
Get started. Be willing to do the work, shave the preposterous line of yaks. Strike when opportunities arise (incidents). Stigmergy - how uncoordinated systems work together.
Advertise - promote your accomplishments, and the accomplishments of others.
Alerts with context - link to info, runbooks etc. Get feedback on alerts, was it useful?
Start small, seek feedback, think about your value, measure effectiveness
Kelsey Hightower, Google - healthz
Kelsey gave a demo of the /healthz pattern, and how that can protect you from deploying non-functional software on a platform that can leverage internal health checks.
Stop reverse engineering apps and start monitoring from the inside
Move health/db checks and functional/smoke tests inside app, and expose over a HTTP endpoint
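A stdlib-only sketch of the pattern (the database check is a placeholder):

```python
# Minimal /healthz: the app runs its own checks and reports health over
# HTTP, so the platform doesn't have to reverse engineer it.
from http.server import BaseHTTPRequestHandler, HTTPServer

def database_ok() -> bool:
    return True  # placeholder for a real connectivity check

class Healthz(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        ok = database_ok()
        self.send_response(200 if ok else 503)
        self.end_headers()
        self.wfile.write(b"ok" if ok else b"db check failed")

HTTPServer(("", 8080), Healthz).serve_forever()
```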
Ops need to move closer to the application.
## Brian Smith, Facebook - The Art of Performance Monitoring
Gave an overview of some of the guiding ideas behind monitoring at facebook
- High Cardinality - same notifications for 100x machines
- Reactive Alarms - alarms which are no longer relevant
- Tool Fatigue - too few/too many tools
It can be Mechanical, Simple and Obvious to do these things at the time, but the cumulative effect is a thing that’s hard to maintain.
Properties of Good Alarms:
Your Dashboards are a debugger - metrics are debugger in production.
When alerts are more often false than true, people become desensitised to alerts.
Unhappy customers are the result, but false alerts are also unplanned work, and a distraction from focusing on your core business.
Same problem experienced by nurses responding to alarms in hospitals. What they have done:
- Increase thresholds
- Only crisis alarms would emit audible aleters
- Nursing staff required to tune false positive alerts
What Caitie’s team did:
Runbook and alert audits - ensure there are runbooks for alerts, templated, a single page for all alerts; each alert has customer impact and remediation steps. Importantly, this also includes notification steps.
Empower oncall - tune alert thresholds, delete alerts or re-time them (only alert during business hours)
Weekly on-call retro - handoff ongoing issues, review alerts, schedule work to improve on-call
This resulted in fewer alerts, and improved visibility on systems that alert a lot.
To prevent alert fatigue:
- Critical alerts need to be actionable
- Do not alert on machine specific metrics
- Tech lead or Eng manager should be on call
Mark Imbriaco, Operable - Human Scale Systems
It’s common to say now that “Tools don’t matter” … but they do. We sweat the details of our tools because they matter. All software is horrible.
We operate in a complex Socio-Technical System. Human practitioners are the adaptable element of complex systems.
Make sure you think about the interface and interactions (human - software interactions)
- Think about the intent, what problem are you likely to be solving (use cases)
- Consistency is really important
- Will it blend - how does it interact with other systems
- Consider state of mind - high intensity situations/ tired operators
## Sarah Hagan, Redfin - Going for Brokerage: People analytics at Redfin
Redfin is an online Estate Agency with agents on ground
- Capture lots of data on the market
- Where should we move?
- How many staff should we have in each location?
- Useful tooling for the audience
- Hire employees rather than contractors, analyse sold house price data to make sure employees earn enough vs. commission agents
- Customer reviews for agents
- Agents paid based on rating
- Let the customer monitor the business
- Monitor loading capacity of agents
- Internal forums for feedback on tooling.
Pete Cheslock, Threat Stack - Everything @obfuscurity Taught Me About Monitoring
Told the story of his history of learning about monitoring, and how he has approached monitoring problems at his current startup.
A telemetry and alerting system is not a core competency.
- Do simple things early when it makes sense (put metrics in logs).
- When it’s necessary to get more data - just buy something.
- A hosted TSDB is useful and just works, but there are faster, non-durable metrics which are important. So he used graphite for 10s-interval metrics, with 2 collectd processes writing to two outputs
- Ended up with a full graphite deployment
Day one is available here
Charity Majors, Parse/Facebook - Building a world class ops team
This was a talk focusing on bootstrapping an ops team for startups.
Do you need an ops team?
Ops engineering at scale is a specialised skillset. It is not someone to do all the annoying parts of running systems. Or do you need software engineers to get better at ops?
You need an ops team if you have hard problems:
- extreme reliability
- extreme scalability (3x-10x year over year)
- extreme security
- solving operational problems for the whole internet
What makes a good startup ops hire?
It’s not possible to hire people who are good at everything - unicorns. What you can get are engineers who are good at some things, bad at others. People who can learn on the fly are valuable.
“A good operations engineer is broadly literate and can go deep on at least one or two areas”
Great ops engineers:
- strong automation instincts
- ownership over their systems
- strong opinions, weakly held
- excellent communication skills, calm in a crisis
- value process (as that is what stops you making the same mistakes over again)
Things that aren’t good indicators:
- whiteboarding code
- any particular technology or language
- any particular degree
- big company pedigree
Succeeds at a big company:
- structured roadmap
- execute well on small coherent slices
- classical cs backgrounds
- value cleanliness & correctness
- technical depth
Succeeds at startup:
- comfortable with chaos
- knows when to solve 80% and move on
- total responsibility for outcomes
- good judgement
- highly reactive
- technical breadth
How do you interview and sort for these qualities?
Don’t hire for lack of weaknesses. Figure out what strengths you really need and hire for those.
Good questions are:
- leading and broad, probing the candidate’s self-reported strengths
- related to your real problems
- culture questions, screening for learned helplessness
Avoid questions that:
- depend on a specific technology
- are designed to trip them up, looking for a reason to say no
- deny candidates the resources they would use to solve something in the real world
You hired an ops engineer, now what?
How to spot a bad ops engineer:
- tweaking indefinitely and pointlessly
- walling off production from developers
- adding complexity
- won’t admit they don’t know things
- disconnected from customer experience
How to lose good ops engineers:
- all the responsibility, none of the authority
- all the tedious shitwork
- blameful culture
- no interesting operational problems
David Mytton, Server Density - Human Ops - Scaling teams and handling incidents
This talk covered how incidents are handled at Server Density.
We should expect downtime - prepare, respond, postmortem.
Things that need to be in place before an incident
- on call schedule with primary & secondary
- off call - 24hr recovery after overnight incident
- docs, which must be located independently from the primary infrastructure
- key info must be available: team contacts, vendor contacts, credentials
- plan for unexpected situations: loss of communication, loss of internet access
- use war games to practise for incidents
Process to follow during an incident:
- First responder
- load incident response checklist
- log into ops war room
- log incident in jira
- Follow Checklist(s)
- due to complexity
- easy to follow in times of stress and fatigue
- take a beginner’s mind - ego can get in the way, don’t wing it
- Key Principles
- log everything (all commands run, by who and where and what the result was)
- communicate frequently
- gather the whole team for major incidents
- Do within a few days
- Tell the story of what happened - from your logs
- Cover the appropriate technical detail
- What failed, and why? How is it going to be fixed?
Emma Jane Hogbin Westby, Trillium Consultancy - Emphatically Empathetic
Emma talked about how she taught herself to be more empathetic.
It’s normal to have a lack of empathy; it’s a skill that can be practised and learned.
What is empathy: ability to understand the feelings of another
Level 1: Care just enough to learn about a person’s life
Doing this improves team cohesion, but requires a time investment.
Collect stories - learn about people by asking them questions. Shut up and listen. Respond in a way to encourage more info gathering.
Later, refer back to stories and follow up for more information.
Level 2: Strategies to structure interactions
Doing this you can engineer successful outcomes, and improve capacity for diverse thinking. But you risk being perceived as manipulative.
It’s a mistake to believe there is only one way to have a connection. Try to uncover motivation, why do people behave the way they do?
There are three types of thinking strategies, and you’ll see language patterns that match each of them.
creative thinking: ‘can we try…’ ‘what about…’
understanding thinking: ‘so what you’re saying is…’ ‘just to clarify…’
decision thinking: ‘I’m ready to move on to…’ ‘last time we tried this…’
How can you create outcome based interactions for these sort of people? Perhaps you can plan for specific types of discussions in meeting agendas.
Find a system to use with your team to make communications more explicit, and to take advantage of the thinking strategies they use.
Level 3, engage with work from another’s perspective
This can foster creative problem solving. The risks are it is potentially overwhelming, and can cause doubt for self worth.
Seek to understand - complain about yourself from the other’s perspective, or situation. Live your day through the other’s constraints.
Thinking process should be no more left to chance than the delivery practise of a skill.
There was an interesting question after Emma’s talk - “How do I make Bob care about Dave from another team”. Her suggestion was to create a situation where they can bond over a common enemy - i.e. say something you know to be untrue and that they would both respond to in a similar way.
Scott Klein, statuspage.io - Effective Incident Communication
Remember that there is someone on the other end of our incidents who is affected personally.
The talk covered what to do before, during and after an incident. You need a dedicated place to communicate system status to your users.
Get a status page. It needs the following:
- to be very fast, very reliable
- keep away from primary infrastructure, even DNS
- contact info - give a way to get in touch
- communicate early. say you are investigating - it means ‘we have no clue but at least we’re not asleep’
- communicate often. always communicate when the next update is.
- communicate precisely: be very declarative
- don’t do ETAs, they will disappoint people
- don’t speculate: ‘we’re still tracking down the cause’
- ‘verification of the fix is underway’, not ‘we think we fixed it’
- communicate together. have pre-written templates.
- one person needs to be assigned as incident communicator
- apologise first
- don’t name names
- be personal. “I’m very sorry”. Take responsibility.
- details inspire confidence
- close the loop - what we’re doing about it
Why do this?
- gain trust with users/customers
- turn bad experience into good experience
- service recovery paradox - people think more highly of a company if they respond properly.
- show that you do your job well
Rich Archbold, Intercom.io - Leading a Team with Values
Talk covered Rich’s experience of introducing core values to drive performance of the team. They reduced downtime and infrastructure costs, and number of ops pages.
Enabled autonomy, distributed decision making.
Problems they were facing:
- roadmap randomisation: easily distracted from what they planned to do
- projects took a long time and were delivered late
- not feeling like a tribe
Criteria for values:
- fit with the business
- personal, specific
- aspirational and inspirational
- drive daily decision making
- not dogma - needed to be flexible
These are the Values they came up with:
- Security, Availability, Performance, Scalability, Cost - prioritize for maximum impact
- Faster, Safer, Easier Shipping
- Zero Touch Ops
- Run Less Software
Afterwards they gathered lots of metrics of unplanned work. From this they worked out that they need to multiply estimates by 2.7 to get accurate roadmap planning.
Matthew Skelton, Skeleton Thatcher Consulting - Un-Broken Logging, the foundation of software operability
The way we use logging is broken; how do we make it more awesome?
What is logging for? It provides an execution trace.
How is logging usually broken? It’s often unloved, discontinuous, contains errors only, bolted on, doesn’t have aggregation and search, and severities aren’t useful because they need to be determined up front.
Also, logs aren’t free. You need to allocate budget and time to make them useful.
Why do we log? For verification, traceability, accountability, and charting the waters.
How to make logging awesome
Continuous Event IDs - use them to represent distinct states. Describe what’s useful for the team to know and describe that as a separate state. Use enums.
Transaction tracing - Create a unique-ish identifier for each request, and pass it through the layers.
Decouple Severity - allow configurable severity levels. Log level should not be fixed at compile or build time. Map Event IDs to a severity (see the sketch below).
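A sketch pulling those three ideas together (the event names and severity mapping are invented for illustration):

```python
# Enum event IDs + a request-scoped transaction ID + severity mapped at
# runtime rather than fixed at build time.
import logging
import uuid
from enum import Enum

class Event(Enum):
    ORDER_RECEIVED = 1001
    PAYMENT_ACCEPTED = 1002
    PAYMENT_DECLINED = 1003

SEVERITY = {Event.PAYMENT_DECLINED: logging.WARNING}  # configurable mapping

logging.basicConfig(format="%(message)s", level=logging.INFO)
log = logging.getLogger("shop")

def emit(event: Event, txn_id: str, **fields) -> None:
    log.log(SEVERITY.get(event, logging.INFO),
            "event=%s txn=%s %s", event.name, txn_id, fields)

txn = uuid.uuid4().hex  # passed through every layer of one request
emit(Event.ORDER_RECEIVED, txn, items=3)
emit(Event.PAYMENT_DECLINED, txn, reason="expired card")
```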
Log aggregation and search tools - As we move from monolith to microservices, the debugger does not have the full view anymore.
Need an aggregated view of logs across a system. Develop software using log aggregation as a first class thing.
Design for logging - logging is another system component, and needs to be testable.
NTP - Time sync is crucially important for correlating log entries
Referenced the following video:
Evan Phoenix - Structured Logging
Gareth Rushgrove, Puppet Labs - Taking the Operating System out of operations
The age of the general purpose operating system is over. What does this mean for operators?
Lots of new OSes have appeared in the last year
- Atomic (RedHat)
- Snappy (ubuntu, replaces dpkg with containers)
- RancherOS (docker all the way down)
- Nano - tiny alternative to Windows Server
- VMWare Bonneville
- Cluster native
- RO file systems
- Transactional updates
- Integrated with containers
Why the interest in New OSes?
- Lots of homogeneous workloads
- Security is front page news
- Size as a proxy for complexity
- Utilisation matters at scale
- Increasingly interacting with higher level abstractions anyway
Compile an application down to a kernel, there is no userspace. Only include the capabilities and libraries you need - everything is opt-in.
- Hypervisor/hardware isolation
- Smaller attack surface
- Less running code
- Enforced immutability
- No default remote access
What happens to operators?
Hypervisor becomes the “platform”.
Everything else as an application. Firewalls. Network Switches. IDS. Remote access.
Everyone not running the hypervisor is an application developer.
Standards required: Platforms, Containers, Monitoring. Publish more schemas than incompatible implementations in code.
Infrastructure is code.
Revolution not evolution. Distance between old infrastructure and new will be huge. Models of interaction and the skills required to operate.
We have fundamental problems that date back more than 40 years. It might take a different evolutionary process to build better infrastructure. We may have to throw away things we care about, such as Linux. This is all driven by security concerns.
Ben Hughes, Etsy - Security for Non-Unicorns
Security is hard. Tiny little bugs turn into giant things.
You’re already being probed for security holes, do you want to know or not? Bug Bounties are a way of getting attackers working for you.
You need to prepare a lot for bug bounties. Try and get all the low hanging fruit yourself. The first few weeks will be hell.
With much of our infrastructure in the cloud, it’s easy to expose sensitive information, such as credentials, on places like github. Gitrob helps to analyse git repos for you.
People trust random files off the internet - like docker images, vagrant images, and curl | bash installs etc.