Optimising for learning - Logan McDonald
expert intuition is achievable. improve memory.
hierarchy of learning - an important level is rule learning.
reading docs/textbooks is not an effective way to embed knowledge in long-term memory. do low-stakes testing. the best way of learning is to retrieve the information.
observability and reflection
move from events to patterns to structure
engaging in incident reviews gave opportunity to quiz engineers on their actions during incident
symmathesy - systems of understanding
embrace cultural memory
Serverless and CatOps: Balancing trade-offs in operations and instrumentation - Pam Selle
I sadly missed most of this because I was debugging a problem
Mentoring Metrics Engineers: How to Grow and Empower Your Own Monitoring Experts - Zach Musgrave, Angelo Licastro (Yelp)
growth in teams, and managing growth in individuals within a team
mandate growth - more skills needed, more knowledge gaps, more planning (meetings), trade offs.
knowledge silos - reaction is mentoring
breadth: build confidence. defined mentor relationship.
depth: make an advanced contribution to one system.
next steps: end of mentor relationship, can start on-call. hold a retrospective - provide feedback for improvement
impostor syndrome - accept compliments for your work and acknowledge others’ contributions
consulting: monitoring is a specialised field. there’s a lot of nuance.
make people aware of what tools can solve their problems. talk with other teams, ask insightful questions: what problem are you trying to solve? listen, and assume you know nothing about their problem.
The Power of Storytelling - Dawn Parzych (Catchpoint)
Why presenting raw data often fails:
- too much info
- not enough meaning
- not enough time
- not enough memory
To tell better stories, when presenting data view it from the perspective of other parts of the organisation
Persuasive storytelling doesn’t need to be complicated - keep it simple
Use simpler language - it doesn’t make it less powerful
Present the most important information earlier
- Use visuals when telling a story, keep them simple to understand
- Don’t overcomplicate things
- Remember the power of 3
Principia SLOdica - A Treatise on the Metrology of Service Level Objectives - Jamie Wilkinson
Overloaded team, tasked with reducing load to build a sustainable system they could manage
Symptom Based Alerting - focus on very few expectations about your service, so that as the system grows you are still focusing on the same things
Symptom: a user is having a bad time. Causes: internal views on the service.
What’s your tolerance for failure? Error budget. Set expectations: SLO
A symptom is anything that can be measured by the SLO. A symptom-based alert fires when the SLO is in danger of being missed.
Debugging tools - metrics, tracing, logs - replace cause based alerts
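As a rough illustration (not from the talk; the SLO target and paging threshold here are assumptions), an error-budget burn check might look like this:

```python
# Hypothetical sketch: is the SLO in danger of being missed?
# Assumes a 99.9% availability SLO over a rolling window.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail

def burn_rate(failed: int, total: int) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    if total == 0:
        return 0.0
    return (failed / total) / ERROR_BUDGET

# Symptom-based alert: page only when the budget is burning much faster
# than is sustainable, regardless of which internal cause is responsible.
if burn_rate(failed=42, total=10_000) > 2.0:
    print("page: SLO at risk")
```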
On-call Simulator! - Building an interactive game for teaching incident response - Franka Schmidt (Mapbox)
On call onboarding
A safe and low-stakes place to practice handling on-call incidents
- Buddy System: Observing
- Bucket list: a checklist of how far along you are with the experiences of being on call
- Alarm scrum: review the last day’s alarms
Simulator - “choose your own adventure” type text adventure game. Tool: Twine.
Craft a story - use Incident Reviews, past notes and enrich with detail. Or just make it up.
Iterate and get feedback.
Observability: the Hard Parts - Peter Bourgon (Fastly)
Levelling up the monitoring story at Fastly
Senior engineering org, 120 engineers. Heterogeneous solutions throughout the organisation. Organic growth is good - up to a point.
Goals: are production systems healthy? if not why not?
Non goals: long term storage, replacing logging. (Observed metric usage VERY different from self-reported usage)
Strategy: prometheus, add instrumentation, curate a new set of dashboards and alerts, run in parallel to build confidence, decommission old systems.
Rollout: Inventory all services. Rank by criticality & friendliness.
Embedded expert model. Pair with responsible engineer to get migration done. Hour or two of pair programming. Let documentation emerge naturally. Deploy prometheus, plumb service into prometheus (time consuming).
Technical autonomy is a form of technical debt. Focus on local vs. global optimization.
Dashboards and alerts. Import from other systems. Make sure to only build things in service to original goals. Reduce, keep high level views.
For some services ownership can be indistinct. Senior leadership is needed to wield the stick - you need their buy-in.
Warning: This Talk Contains Content Known to the State of California to Reduce Alert Fatigue - Aditya Mukerjee (Stripe)
We can learn from clinical healthcare
Alert Fatigue - frequency/severity - causes responder to ignore or make mistakes
Decision Fatigue - frequency/complexity of decision points causes a person to avoid devices or make mistakes
Certain patterns of alerts and decisions contribute disproportionately to fatigue. Multiple false positives for an individual patient impact trust in that alert across all patients.
Reduce alert fatigue: STAT. Supported, Trustworthy, Actionable, Triaged.
Supported: who owns this alert? Responders should own, or feel ownership over end result.
Trustworthy: do I trust this alert to notify me when a problem happens? To stay silent when all is well? To give sufficient information to diagnose problems?
Anomaly detection: if you don’t understand why an alert triggered, you don’t understand if it’s real.
Actionable: one decision required to respond. Alerts that are difficult to action get ignored. Make alerts specific, add decision trees. Alerts must have a specific owner.
Triaged: Triage alerts. Type should reflect urgency. Urgency can change. Commonly understood tiers. Regular re-evaluation process.
Justin Reynolds, Netflix - Intuition Engineering at Netflix
Discussed the problem that Netflix regions were siloed; they worked on serving users out of any region.
To fail a region, they need to scale up the other regions to serve all traffic
Dashboards are good at looking back, but you need to know about now. How do you provide intuition of the now?
Created vizceral - see the blog post for screenshots & video: http://techblog.netflix.com/2015/10/flux-new-approach-to-system-intuition.html
Brian Brazil, Robust Perception - Prometheus
Prometheus is a TSDB offering ‘whitebox monitoring’ for looking inside applications. It supports labels; alerting and graphing are unified, using the same language.
Pull based system, links into service discovery. HTTP api for graphing, supports persistent queries which are used for alerting.
Provides an instrumentation library - incredibly simple to instrument functions and expose metrics to prometheus. Client libraries don’t tie you into prometheus - can use graphite.
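For a flavour of how simple, here is a minimal sketch using the official Python client, prometheus_client (the metric names are invented):

```python
# Instrument a function and expose metrics for Prometheus to scrape.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["method"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency")

@LATENCY.time()          # observe how long each call takes
def handle(method: str) -> None:
    REQUESTS.labels(method=method).inc()

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics over HTTP
    while True:
        handle("GET")
        time.sleep(1)
```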
Can use prometheus as a clearing house to translate between different data formats.
Doesn’t use a notion of a machine. HA by duplicating servers, but alertmanager deduplicates alerts. Alertmanager can also group alerts.
Data stored as a file per time series on disk, not round-robin - stores all data without downsampling.
Torkel Ödegaard, Raintank - Grafana Master Class
Gave a demo on how to use grafana, as well as recently added and future features.
Katherine Daniels, Etsy - How to Teach an Old Monitoring System New Tricks
Old Monitoring System == Nagios
Adding new servers.
- Use deployinator to deploy nagios configs. Uses chef to provide inventory to generate a current list of hosts and hostgroups.
- Run validation via Jenkins by running nagios -v, as well as a purpose-written nagios validation tool (see the sketch after this list).
- New hosts are added with scheduled downtime so they don’t alert until the next day. Chat bots send reminders when downtime is going to finish.
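The validation step might look something like this sketch (the config path is an assumption; nagios -v is nagios’s own verify flag):

```python
# Hypothetical CI check: nagios exits non-zero when the generated
# config is invalid, so Jenkins can fail the build before deploy.
import subprocess
import sys

result = subprocess.run(
    ["nagios", "-v", "/etc/nagios/nagios.cfg"],  # path is an assumption
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    print(result.stdout)
    sys.exit("nagios config validation failed")
print("nagios config OK")
```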
Making Alerts (Marginally) Less Annoying
- Created nagdash to provide federated view of multiple nagios instances.
- Created nagios-herald to add context to nagios alerts. Also supports allowing people to sign up to alerts for things they’re interested in.
- Ops weekly tool. Provides on call reports, engineers flag what they had to do with alerts.
- Sleep tracking and alert tracking for on call staff to understand how many alerts they’re facing and how it’s impacting their sleep.
An On Call bedtime story
- Plenty of alerts because scheduled downtime expired for ongoing work.
- Created daily reports of which downtimes will soon expire and which will raise alerts.
Joe Damato, packagecloud.io - All of Your Network Monitoring is (probably) Wrong
There’s too much stuff to know about
- ever copy paste config or tune settings you didn’t understand?
- do you really understand every graph you’re generating?
- what makes you think you can monitor this stuff?
Claim: the more complex the system is the harder it is to monitor.
what’s pretty complicated? the linux networking stack! lots of features, lots of bugs, and no docs!
- /proc/net stats can be buggy
- ethtool inconsistent, not always implemented
- meaning of driver stats are not standardised
- stats meaning for a driver/device can change over time
- /proc/net/snmp has bugs: double counting, not being counted correctly
Monitoring something requires a very deep understanding of what you’re monitoring.
Properly monitoring and setting alerts requires significant investment.
Twitter has a big monitoring system and migrating was hard.
3.5B metrics per minute
Old Alerting System
- 25k alerts/minute, 3m alert monitors
- single config language, lots of existing examples, easy to write and add
- all those points were good and bad!
- lots of orphaned and unmaintained configs, no validation
- alerts and dashboards were separate
- problems with reliability when zones suffer problems
- combined alerts and dashboard configuration
- dashboards defined in python, common libraries that can be included
- python allows testing configs (see the sketch after this list)
- created multi-zone alerting system
- reduced time to detect from 2.5mins to 1.75mins
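Presumably the payoff looks something like this sketch (the names are invented): configs become ordinary Python objects built from shared libraries, so they can be tested before rollout.

```python
# Hypothetical flavour of alerts/dashboards defined in Python.
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    threshold: float
    duration_minutes: int

def high_latency(service: str) -> Alert:
    # A shared library function any team can reuse in their config.
    return Alert(f"{service}.p99_latency_ms", threshold=500, duration_minutes=5)

def test_high_latency_is_sane():
    # Because config is code, it can be validated in CI.
    assert high_latency("web").threshold <= 1000
```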
Helping Human Reasoning
- bring together global, dependencies and local context
- including runbooks, contacts and escalations directly in the UI
- distributed system, challenges about consistency, structural complexity and reasoning about time
- sharding choices are hard, impossible to always avoid making mistakes
- support and collaborate with users, try and reduce support burden with good information at interaction points (UI, CLI etc.), good user guides and docs
- migrating - some happy to move, others not. some push back. had to accept schedule compromise and extra work.
Joey Parsons, Airbnb - Monitoring and Health at Airbnb
Prefers to buy stuff:
- New Relic
- Datadog, instrument apps using dogstatsd
- Alerting through metrics
Created open source tool, Interferon, to store alerts configuration as code.
Volunteer on call system. SREs make sure things in place so anyone can be on call.
- Sysops training for volunteers, monitoring systems, how to be effective and learn from historical incidents
- Shadow on call, learn from current primary/secondary
- Promoted to on call
Weekly sysops meeting, go through incidents, hand offs, discuss scheduled maintenance.
On call health:
- are alerting trends appropriate?
- do we understand impact on engineers?
- do we need to tune false positives?
- are notifications and notification policies appropriate?
- incident numbers over time
- counts by service
- total notifications per user and how many come at night
- false positive incident counts
Heinrich Hartmann, Circonus - Statistics for Engineers
- Measure user experience/ quality of service
- Determine implications of service degradation
- Define sensible SLA targets
External API Monitoring:
- synthetic requests measure availability, but are a poor guide to user experience
- on long time ranges, rolled-up data is commonly displayed, erodes spikes
- write request stats to log file
- rich information but expensive and delay for indexing
Monitoring Latency Averages:
- mean values, cheap to collect, store and analyse, but skewed by outliers/low volumes
- percentiles, cheap to collect, store and analyse, robust to outliers, but an up-front percentile choice is required and they cannot be aggregated
percentiles: keep all your data. don’t take averages! store percentiles for all reporting periods you are interested in - i.e. per min/hour/day. store all percentiles you’ll ever be interested in.
Monitoring with Histograms:
- divide latency scale into bands
- divide time scale into reporting periods
- count the number of samples in each band x period
Can be aggregated across times. Can be visualised as heatmaps.
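A sketch of the idea (the band edges and period length are arbitrary choices): unlike stored percentiles, band counts can be summed across periods or hosts, then rendered as a heatmap.

```python
# Count samples per (reporting period x latency band).
import bisect
from collections import Counter

BAND_UPPER_MS = [1, 5, 10, 50, 100, 500, 1000]  # arbitrary band edges

def band(latency_ms: float) -> int:
    return bisect.bisect_left(BAND_UPPER_MS, latency_ms)

counts = Counter()  # (minute, band) -> number of samples

def record(minute: int, latency_ms: float) -> None:
    counts[(minute, band(latency_ms))] += 1

record(0, 3.2)
record(0, 120.0)
record(61, 7.5)

# Aggregating across time is just addition - something percentiles can't do.
hourly = Counter()
for (minute, b), n in counts.items():
    hourly[(minute // 60, b)] += n
```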
John Banning, Google - Monarch, Google’s Planet-Scale Monitoring Infrastructure
Huge volume, global span, many teams - constant change
Previously borgmon. Each group had its own borgmon, putting a large load on anyone doing monitoring. Hazing ritual - the new engineer gets to do borgmon config maintenance.
Goals for the new system:
- can handle the scale
- small/no load to get up and running
- capable of handling the largest services
Monitor locally. Collect and store the data near where it’s generated. Each zone has a monarch.
- Targets collect data with streamz library. Metrics are multi dimensional information, stores histogram.
- Metrics are sent to the monarch ingestion router, then on to a leaf, which is an in-memory data store; data is also written to a recovery log, and from the log to a long-term disk repository.
- Streams stored in a table, basis for queries
- Evaluator runs queries and stores new data for streams or sends notifications
Integrate Globally. Global Monarch - Distributed across zones, but a single place to configure/query all monarchs in all zones.
Provides both Web UI and Python interfaces.
Monarch is backend for Stackdriver
Monitoring as a service is the right idea. Make the service a platform to build monitoring solutions.
Brian Overstreet, Pinterest - Scaling Pinterest’s Monitoring
Started with Ganglia, Pingdom
Deployed Graphite, single box
Second Graphite architecture - Load Balancer, 2x relay servers, multiple cache/web boxes etc.
Suffered lots of UDP packet receive errors
Put statsd everywhere
- fixed packet loss, unique metric names per host
- latency only per host, too many metrics
- then made metric names not unique per host
- shard mapping in the client; the client version needs to be the same everywhere (see the sketch below)
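Presumably the client-side mapping worked something like this (the hosts and hash choice are assumptions), which is why every client had to agree:

```python
# Hypothetical client-side shard mapping: each client hashes the metric
# name to pick a statsd host. If clients disagree (e.g. different
# versions), the same metric lands on different hosts and gets split.
import hashlib

STATSD_HOSTS = ["statsd-1:8125", "statsd-2:8125", "statsd-3:8125"]

def shard_for(metric: str) -> str:
    digest = hashlib.md5(metric.encode()).hexdigest()
    return STATSD_HOSTS[int(digest, 16) % len(STATSD_HOSTS)]

print(shard_for("web.requests"))  # stable for a given host list
```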
Multiple graphite clusters - one per application (python/java)
More maintenance, more routing rules etc.
Problems with reads, multiple glob searches can be slow
- local metrics agent, kafka, storm - send to graphite/opentsdb
- interface for opentsdb and statsd
- sends metrics to kafka
- processed by storm
120k/sec graphite, 1.5m/sec opentsdb. no more graphite, move to opentsdb.
Created statsboard - integrates graphite and opentsdb for dashboards and alerts
Graphite User Education - underlying info about how metrics are collected, precision, aggregation etc.
Protect System from Clients
- alert on unique metrics
- block metrics using zookeeper and a shared blacklist, created on the fly (see the sketch below)
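A rough sketch of that mechanism, assuming the kazoo ZooKeeper client and a made-up znode path:

```python
# Hypothetical: every ingest host watches a shared blacklist znode, so a
# noisy metric can be blocked everywhere as soon as the node is updated.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")  # assumed ensemble
zk.start()

blacklist = set()

@zk.DataWatch("/metrics/blacklist")  # assumed znode path
def on_update(data, stat):
    # Re-read the shared blacklist whenever the node changes.
    global blacklist
    blacklist = set(json.loads(data)) if data else set()

def accept(metric_name: str) -> bool:
    return metric_name not in blacklist
```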
Lessen Operational Overhead
- more tools, more overhead
- more monitoring systems, more monitoring of the monitoring system
- removing a tool in prod is hard
- data has a lifetime
- it’s not a magical data warehouse tool that returns data instantly
- not all metrics will be efficient
- match monitoring system to where the company is at
- user education is key to scaling tools organizationally
- tools scale with the number of engineers, not users
Lots of app logic is moving to the frontend, so it runs in the browser, not on backend servers
- capture load performance from browser, send to app server, use statsd + grafana & google analytics
- capture uncaught exceptions in the browser, using their own product
Eron Nicholson and Noah Lorang, Basecamp - CHICKEN and WAFFLES: Identifying and Handling Malice
Suffered DDoS and blackmail. 80 gigabits - DNS reflection, NTP reflection, SYN floods, ICMP flood
Defense and Mitigation:
- DC partner filters for them
- More 10G circuits and routers
- Arrangements with vendors to provide emergency access and other mitigation tools
Experience got them serious about more subtle application level attacks:
- vulnerability scanners
- repeat slow page requests
- brute force attempts
- nefarious crawlers
What do we want from a defense system?
- Protection against application-level attacks
- Keep user access uninterrupted
- Take advantage of the data we have available
- Transparent in what gets blocked and why
Chicken: who is a real user and who is malicious?
Considered Machine Learning classification. Problems: really hard to get a good training set. Need to be able to explain why an IP was blocked.
- Some behaviours are known to be from people up to no good: crawling for phpmyadmin, path traversal, repeated failed login attempts etc.
- Request history gives a good idea of whether someone is a normal user, broken script or a malicious actor.
- External indicators: geoip databases, badip dbs, facebook threat exchange
Removing simple things reduces noise. Every incoming request is scored, and a per-IP aggregate score is calculated based on return code. An Exponentially Weighted Moving Average is created from that data. About 12% of IPs had a negative reputation.
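A sketch of that scoring loop (the smoothing factor, scores and threshold are assumptions):

```python
# Per-IP reputation as an exponentially weighted moving average of
# request scores (negative for bad return codes / known-bad paths).
from collections import defaultdict

ALPHA = 0.2  # assumed smoothing factor
reputation = defaultdict(float)  # ip -> EWMA score, 0.0 = neutral

def observe(ip: str, request_score: float) -> None:
    reputation[ip] = ALPHA * request_score + (1 - ALPHA) * reputation[ip]

observe("203.0.113.7", -1.0)  # e.g. a failed login
observe("203.0.113.7", -1.0)  # e.g. probing for /phpmyadmin
dubious = [ip for ip, score in reputation.items() if score < -0.3]
```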
Scanning for blockable actions and scoring requests in near real-time using request byproducts.
Request logs, netflow data, threat exchange -> kafka -> request scoring, a scanner for known bad behaviour, and tools for manual evaluation.
Average IP reputation gives an early indicator to monitor for application level attack.
Provides list of good, bad, and dubious IPs.
- originally provided by iptables rules on haproxy hosts
- then tried rule on loadbalancer
- then tried null routing on routers
- finally created waffles
Using BGP flowspec to send data from the routers to waffles, which then decides what path a request takes: error, app or challenge. The waffles hosts live in a separate network with limited access.
Waffles is redis and nginx.
John Stanford, Solinea - Fake It Until You Make It
Monitoring an openstack cluster - 1 controller and 6 compute nodes - collecting logs with heka and sending them to elasticsearch. Can I scale this up to a thousand nodes? How big can it get?
How do you go about figuring that out?
- took 7 days of logs from lab, 25k messages/hr
- number of logs coming from a node
- number of logs coming from a component
- 7 day message rate, look at histograms, identifies recurring outlier
- message size, percentiles of payload size
What models look like what we’re doing for simulation? Add some random noise.
Flood process, monitoring everything, repeat until it breaks. System sustained 4k x 1k messages/sec, started to pause above that, but no messages were dropped.
- find bottlenecks
- improve the model
Tammy Butow, Dropbox - Database Monitoring at Dropbox
Achieving any goal requires honest and regular monitoring of your progress.
Originally used nagios, thruk (web ui) and ganglia
Created own tool vortex in 2013
why create in house monitoring?
- performance and reliability issues; the number of metrics was scaling fast
- Time Series Database with dashboards, alerting, aggregation
- Rich metric metadata, tag a metric with lots of attributes
Monthra: single way of scheduling and relaying metrics, discourage scheduling with cron
- what are your durability and reliability goals? align monitoring to those goals
- threads running/ threads connected
Run a Monthly Metrics Review (great idea)
Dave Josephsen, Librato - 5 Lines I couldn’t draw
Making the cognitive leap to use monitoring tools to recognise system behaviour independent of alerting. There is a misapprehension about what monitoring is and whom it is for.
Monitoring is not for alerting. Nobody owns monitoring - it’s a ‘tape measure that I share with every engineer I work with’. Ops owns monitoring vs everyone owns monitoring. Monitoring is for asking questions.
Complexity isolates. Effective monitoring gives you the things that allow you to ‘Cynefin’ - make things more familiar and knowable. Reduce complexity rather than embracing it. Monitoring can build bridges to help people understand things across boundaries.
Effective monitoring can bring about cultural change, how people interact between each other.
The fifth line repeated point 4.
Jessie Frazelle, Google - Everything is broken
Talked about problems with Software Engineering and Operations
Demonstrated how they monitored community and external maintainer PR statistics for Docker project.
James Fryman, Auth0 - Metrics are for Chumps - Understanding and overcoming the roadblocks to implementing instrumentation
Story of implementation of instrumentation at Auth0
Wanted data-driven conversations. A metrics implementation had happened in the past, but was ripped out because it was not well understood and was thought to cause latency. This created aversions.
Make the case. Get buy-in.
To have decent conversations with someone you need to have metrics. Common objections, and rebuttals:
- Not the most important feature - but it is!
- Cannot start until we understand the data retention requirements - premature optimisation
- We don’t run a SaaS - need to understand what your software is doing regardless
Make decisions based on knowledge, not intuition or luck.
Be opportunistic - success is 90% planning, 10% timing and luck. Find opportunities to accelerate efforts.
Needed to get something going fast - went for the full-service SaaS Datadog, but with common interfaces and shims to allow moving things in-house later (see the sketch below). Don’t delay, jump in and iterate.
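One way to read “common interfaces and shims” (this shape is my assumption, not Auth0’s code):

```python
# Metrics facade: application code depends on the interface, so the
# Datadog backend can later be swapped for an in-house one.
from abc import ABC, abstractmethod

class MetricsBackend(ABC):
    @abstractmethod
    def increment(self, name: str, value: int = 1) -> None: ...

class DatadogBackend(MetricsBackend):
    def increment(self, name: str, value: int = 1) -> None:
        pass  # would forward to the datadog/statsd client

class InHouseBackend(MetricsBackend):
    def increment(self, name: str, value: int = 1) -> None:
        pass  # future replacement behind the same interface

metrics: MetricsBackend = DatadogBackend()  # the single place to swap
metrics.increment("logins")
```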
Keep in sync with developers - change is difficult and there will be resistance, pay attention to feedback. Need to support interpretation of data.
Build out data flows, find potential choke points in the system, take a baseline measurement, check systems in isolation
Fix and Repair bottlenecks. Solved 3 major bottlenecks, went from 500 to 10k RPS.
This year I was lucky enough to attend Monitorama in Portland. Thanks to Sohonet for sending me! I’d wanted to attend again since going to Berlin in 2013, because the quality of the talks is the highest I’ve seen in any conference that’s relevant to my interests. I wasn’t disappointed, it was awesome again.
Here are my notes from the conference:
Adrian Cockcroft, Battery Ventures - Monitoring Challenges
This talk reflected on new trends, and how things have changed since Adrian talked about monitoring “new rules” in 2014.
What problems does monitoring address?
- measuring business value (customer happiness, cost efficiency)
Why isn’t it solved?
- Lots of change, each generation has different vendors and tools.
- New vendors have new schemas, cost per node is much lower each generation so vendors get disrupted
Talked about serverless model - now monitorable entities only exist during execution. Leads to zipkin style distributed tracing inside the functions.
Current Monitoring Challenges:
- There’s too much new stuff
- Monitored entities are too ephemeral
- Price disruption in compute resources - how can you make money from monitoring it?
## Greg Poirier, Opsee - Monitoring is Dead
Greg gave a history and definition of monitoring, and argued that how we think about monitoring needs to change.
Historically monitoring is about taking a single thing in isolation and making assertions about it.
- resource utilisation, process aliveness, system aliveness
Made a definition of monitoring:
Observability: A system is observable if you can determine the behaviour of the system based on its outputs
Behaviour: Manner in which a system acts
Outputs: Concrete results of its behaviours
Sensors: Emit data
Agents: Interpret data
Monitoring is the action of observing and checking the behaviour and outputs of a system and its components over time
Failures in distributed systems are now: responds too slowly, fails to respond.
Monitoring should now be about Service Level Objectives - can it respond in a certain time, handle a certain throughput, better health checks
We need to better Understand Problems (of distributed systems), and to Build better tools (event correlation particularly)
Nicole Forsgren, Chef - How Metrics Shape Your Culture
Measurement is culture. Something to talk about, across silos/boundaries
Good ideas must be able to seek an objective test. Everyone must be able to experiment, learn and iterate. For innovation to flourish, measurement must rule. - Greg Linden
Data over opinions
You can’t improve what you don’t measure. Always measure things that matter. That which is measured gets managed. If you capture only one metric, you know which one will be gamed.
Metrics inform incentives, shape behaviour:
- Give meaningful names
- Define well
- Communicate them across boundaries
Cory Watson, Stripe - Building a Culture of Observability at Stripe
To create a culture of observability, how can we get others to agree and work toward it?
Where to begin? Spend time with the tools, improve if possible, replace if not, leverage past knowledge of teams
Empathy - People are busy, doing best with what they have, help people be great at their jobs
Nemawashi - Move slowly. Lay a foundation and gather feedback. (Write down and attribute feedback). Ask how you can improve.
Identify Power Users - Find interested parties, give them what they need, empower them to help others
What are you improving? How do you measure it?
Get started. Be willing to do the work, shave the preposterous line of yaks. Strike when opportunities arise (incidents). Stigmergy - how uncoordinated systems work together.
Advertise - promote your accomplishments, and the accomplishments of others.
Alerts with context - link to info, runbooks etc. Get feedback on alerts, was it useful?
Start small, seek feedback, think about your value, measure effectiveness
Kelsey Hightower, Google - healthz
Kelsey gave a demo of the /healthz pattern, and how that can protect you from deploying non-functional software on a platform that can leverage internal health checks.
Stop reverse engineering apps and start monitoring from the inside
Move health/db checks and functional/smoke tests inside app, and expose over a HTTP endpoint
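A stdlib-only sketch of the pattern (the database check is a placeholder):

```python
# Minimal /healthz: the app runs its own checks and reports health over
# HTTP, so the platform doesn't have to reverse engineer it.
from http.server import BaseHTTPRequestHandler, HTTPServer

def database_ok() -> bool:
    return True  # placeholder for a real connectivity check

class Healthz(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        ok = database_ok()
        self.send_response(200 if ok else 503)
        self.end_headers()
        self.wfile.write(b"ok" if ok else b"db check failed")

HTTPServer(("", 8080), Healthz).serve_forever()
```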
Ops need to move closer to the application.
## Brian Smith, Facebook - The Art of Performance Monitoring
Gave an overview of some of the guiding ideas behind monitoring at facebook
- High Cardinality - same notifications for 100x machines
- Reactive Alarms - alarms which are no longer relevant
- Tool Fatigue - too few/too many tools
It can be Mechanical, Simple and Obvious to do these things at the time, but the cumulative effect is a thing that’s hard to maintain.
Properties of Good Alarms:
Your Dashboards are a debugger - metrics are debugger in production.
When alerts are more often false than true, people become desensitised to alerts.
Unhappy customers are the result, but false alerts are also unplanned work, and a distraction from focusing on your core business.
Same problem experienced by nurses responding to alarms in hospitals. What they have done:
- Increase thresholds
- Only crisis alarms would emit audible aleters
- Nursing staff required to tune false positive alerts
What Caitie’s team did:
Runbook and alert audits - ensure there are runbooks for alerts, templated, a single page for all alerts; each alert has customer impact and remediation steps. Importantly, this also includes notification steps.
Empower oncall - tune alert thresholds, delete alerts or re-time them (only alert during business hours)
Weekly on-call retro - handoff ongoing issues, review alerts, schedule work to improve on-call
This resulted in fewer alerts, and improved visibility on systems that alert a lot.
To prevent alert fatigue:
- Critical alerts need to be actionable
- Do not alert on machine specific metrics
- Tech lead or Eng manager should be on call
Mark Imbriaco, Operable - Human Scale Systems
It’s common to say now that “Tools don’t matter” … but they do. We sweat the details of our tools because they matter. All software is horrible.
We operate in a complex Socio-Technical System. Human practitioners are the adaptable element of complex systems.
Make sure you think about the interface and interactions (human - software interactions)
- Think about the intent, what problem are you likely to be solving (use cases)
- Consistency is really important
- Will it blend - how does it interact with other systems
- Consider state of mind - high intensity situations/ tired operators
## Sarah Hagan, Redfin - Going for Brokerage: People analytics at Redfin
Redfin is an online Estate Agency with agents on ground
- Capture lots of data on the market
- Where should we move?
- How many staff should we have in each location?
- Useful tooling for the audience
- Hire employees rather than contractors, analyse sold house price data to make sure employees earn enough vs. commission agents
- Customer reviews for agents
- Agents paid based on rating
- Let the customer monitor the business
- Monitor loading capacity of agents
- Internal forums for feedback on tooling.
Pete Cheslock, Threat Stack - Everything @obfuscurity Taught Me About Monitoring
Told the story of his history of learning about monitoring, and how he has approached monitoring problems at his current startup.
A telemetry and alerting system is not a core competency.
- Do simple things early when it makes sense (put metrics in logs).
- When it’s necessary to get more data - just buy something.
- A hosted TSDB is useful and just works, but there are faster, non-durable metrics which are important. So he used graphite for 10s-interval metrics, with 2 collectd processes writing to two outputs
- Ended up with a full graphite deployment
Day one is available here
Charity Majors, Parse/Facebook - Building a world class ops team
This was a talk focusing on bootstrapping an ops team for startups.
Do you need an ops team?
Ops engineering at scale is a specialised skillset. It is not someone to do all the annoying parts of running systems. Or do you need software engineers to get better at ops?
You need an ops team if you have hard problems:
- extreme reliability
- extreme scalability (3x-10x year over year)
- extreme security
- solving operational problems for the whole internet
What makes a good startup ops hire?
It’s not possible to hire people who are good at everything - unicorns. What you can get are engineers who are good at some things, bad at others. People who can learn on the fly are valuable.
“A good operations engineer is broadly literate and can go deep on at least one or two areas”
Great ops engineers:
- strong automation instincts
- ownership over their systems
- strong opinions, weakly held
- excellent communication skills, calm in a crisis
- value process (as that is what stops you making the same mistakes over again)
Things that aren’t good indicators:
- whiteboarding code
- any particular technology or language
- any particular degree
- big company pedigree
Succeeds at a big company:
- structured roadmap
- execute well on small coherent slices
- classical cs backgrounds
- value cleanliness & correctness
- technical depth
Succeeds at startup:
- comfortable with chaos
- knows when to solve 80% and move on
- total responsibility for outcomes
- good judgement
- highly reactive
- technical breadth
How do you interview and sort for these qualities?
Don’t hire for lack of weaknesses. Figure out what strengths you really need and hire for those.
Good questions are:
- leading and broad, probing the candidate’s self-reported strengths
- related to your real problems
- culture questions, screening for learned helplessness
Avoid questions that:
- depend on a specific technology
- are designed to trip them up, looking for a reason to say no
- deny candidates the resources they would use to solve something in the real world
You hired an ops engineer, now what?
How to spot a bad ops engineer:
- tweaking indefinitely and pointlessly
- walling off production from developers
- adding complexity
- won’t admit they don’t know things
- disconnected from customer experience
How to lose good ops engineers:
- all the responsibility, none of the authority
- all the tedious shitwork
- blameful culture
- no interesting operational problems
David Mytton, Server Density - Human Ops - Scaling teams and handling incidents
This talk covered how incidents are handled at Server Density.
We should expect downtime - prepare, respond, postmortem.
Things that need to be in place before an incident
- on call schedule with primary & secondary
- off call - 24hr recovery after overnight incident
- docs, which must be located independently from the primary infrastructure
- key info must be available: team contacts, vendor contacts, credentials
- plan for unexpected situations: loss of communication, loss of internet access
- use war games to practise for incidents
Process to follow during an incident:
- First responder
- load incident response checklist
- log into ops war room
- log incident in jira
- Follow Checklist(s)
- due to complexity
- easy to follow in times of stress and fatigue
- take a beginner’s mind - ego can get in the way, don’t wing it
- Key Principles
- log everything (all commands run, by who and where and what the result was)
- communicate frequently
- gather the whole team for major incidents
- Do within a few days
- Tell the story of what happened - from your logs
- Cover the appropriate technical detail
- What failed, and why? How is it going to be fixed?
Emma Jane Hogbin Westby, Trillium Consultancy - Emphatically Empathetic
Emma talked about how she taught herself to be more empathetic.
It’s normal to have a lack of empathy; it’s a skill that can be practised and learned.
What is empathy: ability to understand the feelings of another
Level 1: Care just enough to learn about a person’s life
Doing this improves team cohesion, but requires a time investment.
Collect stories - learn about people by asking them questions. Shut up and listen. Respond in a way to encourage more info gathering.
Later, refer back to stories and follow up for more information.
Level 2: Strategies to structure interactions
Doing this you can engineer successful outcomes, and improve capacity for diverse thinking. But you risk being perceived as manipulative.
It’s a mistake to believe there is only one way to have a connection. Try to uncover motivation, why do people behave the way they do?
There are three types of thinking strategies, and you’ll see language patterns that match each of them.
creative thinking: ‘can we try…’ ‘what about…’
understanding thinking: ‘so what you’re saying is…’ ‘just to clarify…’
decision thinking: ‘I’m ready to move on to…’ ‘last time we tried this…’
How can you create outcome based interactions for these sort of people? Perhaps you can plan for specific types of discussions in meeting agendas.
Find a system to use with your team to make communications more explicit, and to take advantage of the thinking strategies they use.
Level 3, engage with work from another’s perspective
This can foster creative problem solving. The risks are it is potentially overwhelming, and can cause doubt for self worth.
Seek to understand - complain about yourself from the other’s perspective, or situation. Live your day through the other’s constraints.
Thinking process should be no more left to chance than the delivery practise of a skill.
There was an interesting question after Emma’s talk - “How do I make Bob care about Dave from another team”. Her suggestion was to create a situation where they can bond over a common enemy - i.e. say something you know to be untrue and that they would both respond to in a similar way.
Scott Klein, statuspage.io - Effective Incident Communication
Remember that there is someone on the other end of our incidents who is affected personally.
The talk covered what to do before, during and after an incident. You need a dedicated place to communicate system status to your users.
Get a status page. It needs the following:
- to be very fast, very reliable
- keep away from primary infrastructure, even DNS
- contact info - give a way to get in touch
- communicate early. say you are investigating - it means ‘we have no clue but at least we’re not asleep’
- communicate often. always communicate when the next update is.
- communicate precisely: be very declarative
- don’t do ETAs, they will disappoint people
- don’t speculate: ‘we’re still tracking down the cause’
- ‘verification of the fix is underway’, not ‘we think we fixed it’
- communicate together. have pre-written templates.
- one person needs to be assigned as incident communicator
- apologise first
- don’t name names
- be personal. “I’m very sorry”. Take responsibility.
- details inspire confidence
- close the loop - what we’re doing about it
Why do this?
- gain trust with users/customers
- turn bad experience into good experience
- service recovery paradox - people think more highly of a company if they respond properly.
- show that you do your job well
Rich Archbold, Intercom.io - Leading a Team with Values
Talk covered Rich’s experience of introducing core values to drive performance of the team. They reduced downtime and infrastructure costs, and number of ops pages.
Enabled autonomy, distributed decision making.
Problems they were facing:
- roadmap randomisation: easily distracted from what they planned to do
- projects took a long time and were delivered late
- not feeling like a tribe
Criteria for values:
- fit with the business
- personal, specific
- aspirational and inspirational
- drive daily decision making
- not dogma - needed to be flexible
These are the Values they came up with:
- Security, Availability, Performance, Scalability, Cost - prioritize for maximum impact
- Faster, Safer, Easier Shipping
- Zero Touch Ops
- Run Less Software
Afterwards they gathered lots of metrics of unplanned work. From this they worked out that they need to multiply estimates by 2.7 to get accurate roadmap planning.
Matthew Skelton, Skeleton Thatcher Consulting - Un-Broken Logging, the foundation of software operability
The way we use logging is broken; how do we make it more awesome?
What is logging for? It provides an execution trace.
How is logging usually broken? It’s often unloved, discontinuous, contains errors only, bolted on, doesn’t have aggregation and search, and severities aren’t useful because they need to be determined up front.
Also, logs aren’t free. You need to allocate budget and time to make them useful.
Why do we log? For verification, traceability, accountability, and charting the waters.
How to make logging awesome
Continuous Event IDs - use them to represent distinct states. Describe what’s useful for the team to know and describe that as a separate state. Use enums.
Transaction tracing - Create a unique-ish identifier for each request, and pass it through the layers.
Decouple Severity - allow configurable severity levels. Log level should not be fixed at compile or build time. Map Event IDs to a severity (see the sketch below).
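A sketch pulling those three ideas together (the event names and severity mapping are invented for illustration):

```python
# Enum event IDs + a request-scoped transaction ID + severity mapped at
# runtime rather than fixed at build time.
import logging
import uuid
from enum import Enum

class Event(Enum):
    ORDER_RECEIVED = 1001
    PAYMENT_ACCEPTED = 1002
    PAYMENT_DECLINED = 1003

SEVERITY = {Event.PAYMENT_DECLINED: logging.WARNING}  # configurable mapping

logging.basicConfig(format="%(message)s", level=logging.INFO)
log = logging.getLogger("shop")

def emit(event: Event, txn_id: str, **fields) -> None:
    log.log(SEVERITY.get(event, logging.INFO),
            "event=%s txn=%s %s", event.name, txn_id, fields)

txn = uuid.uuid4().hex  # passed through every layer of one request
emit(Event.ORDER_RECEIVED, txn, items=3)
emit(Event.PAYMENT_DECLINED, txn, reason="expired card")
```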
Log aggregation and search tools - As we move from monolith to microservices, the debugger does not have the full view anymore.
Need an aggregated view of logs across a system. Develop software using log aggregation as a first class thing.
Design for logging - logging is another system component, and needs to be testable.
NTP - Time sync is crucially important for correlating log entries
Referenced the following video:
Evan Phoenix - Structured Logging
Gareth Rushgrove, Puppet Labs - Taking the Operating System out of operations
The age of the general purpose operating system is over. What does this mean for operators?
Lots of new OSes have appeared in the last year
- Atomic (RedHat)
- Snappy (ubuntu, replaces dpkg with containers)
- RancherOS (docker all the way down)
- Nano - tiny alternative to Windows Server
- VMWare Bonneville
- Cluster native
- RO file systems
- Transactional updates
- Integrated with containers
Why the interest in New OSes?
- Lots of homogeneous workloads
- Security is front page news
- Size as a proxy for complexity
- Utilisation matters at scale
- Increasingly interacting with higher level abstractions anyway
Compile an application down to a kernel, there is no userspace. Only include the capabilities and libraries you need - everything is opt-in.
- Hypervisor/hardware isolation
- Smaller attack surface
- Less running code
- Enforced immutability
- No default remote access
What happens to operators?
Hypervisor becomes the “platform”.
Everything else as an application. Firewalls. Network Switches. IDS. Remote access.
Everyone not running the hypervisor is an application developer.
Standards required: Platforms, Containers, Monitoring. Publish more schemas than incompatible implementations in code.
Infrastructure is code.
Revolution not evolution. Distance between old infrastructure and new will be huge. Models of interaction and the skills required to operate.
We have fundamental problems that date back more than 40 years. It might take a different evolutionary process to build better infrastructure. We may have to throw away things we care about, such as Linux. This is all driven by security concerns.
Ben Hughes, Etsy - Security for Non-Unicorns
Security is hard. Tiny little bugs turn into giant things.
You’re already being probed for security holes, do you want to know or not? Bug Bounties are a way of getting attackers working for you.
You need to prepare a lot for bug bounties. Try and get all the low hanging fruit yourself. The first few weeks will be hell.
With much of our infrastructure in the cloud, it’s easy to expose sensitive information, such as credentials, on places like github. Gitrob helps to analyse git repos for you.
People trust random files off the internet - like docker images, vagrant images, and curl | bash installs etc.