Optimising for learning - Logan McDonald
expert intuition is achievable. improve memory.
- prep
hierarchy of learning. important one is rule learning.
- gaining knowledge
reading docs/textbooks not an effective way to embed knowledge in long term memory. do low stakes testing. best way of learning if to retrieve data.
- mental models
observabiity and reflection
move from events to patterns to structure
engaging in incident reviews gave opportunity to quiz engineers on their actions during incident
- learning together
symmathesy - systems of understanding
embrace cultural memory
Serverless and CatOps: Balancing trade-offs in operations and instrumentation - Pam Selle
I sadly missed most of this because I was debugging a problem
Mentoring Metrics Engineers: How to Grow and Empower Your Own Monitoring Experts - Zach Musgrave, Angelo Licastro (Yelp)
growth in teams, and managing growth in individuals within a team
mandate growth - more skills needed, more knowledge gaps, more planning (meetings), trade offs.
knowledge silos - reaction is mentoring
breadth: build confidence. defined mentor relationship.
depth: make an advanced contribution to one system.
next steps: end of mentor relationship, can start on-call. hold a retrospective - provide feedback for improvement
impostor syndrome - accept compliments for your work and acknowledge other’s contributions
consulting: monitoring is a specialised field. there’s a lot of nuance. make people aware of what tools can solve their problems. talk with other teams, ask insightful questions: what problem are you trying to solve? listen, as assume you know nothing about their problem.
The Power of Storytelling - Dawn Parzych (Catchpoint)
Cognitive bias
- too much info
- not enough meaning
- not enough time
- not enough memory
To tell better stories, when presenting data view it from the perspective of other parts of the organisation
Persuasive storytelling doesn’t need to be complicated - keep it simple
Use simpler language - it doesn’t make it less powerful
Present the most important information earlier
- Use visuals when telling a story, keep them simple to understand
- Don’t overcomplicate things
- Remember the power of 3
Principia SLOdica - A Treatise on the Metrology of Service Level Objectives - Jamie Wilkinson
Overloaded team, tasked with reducing load to build a sustainable system they could manage
Symptom Based Alerting - focus on very few expectations about your service, so as system grows you are still focusing on the same things
Symptom: a user is having a bad time. Causes: internal views on the service.
What’s your tolerance for failure? Error budget. Set expectations: SLO
Symptom anything that can be measured by the SLO. Symptom based alert, is when SLO is in danger of being missed.
Debugging tools - metrics, tracing, logs - replace cause based alerts
On-call Simulator! - Building an interactive game for teaching incident response - Franka Schmidt (Mapbox)
On call onboarding
Safe and low stakes place to practice handing on call incidents
Onboarding:
- Buddy System: Observing
- Bucket list: checklist, how far along you are with experiences of being on call
- Alarm scrum: review last days alarms
Simulator - “choose your adventure” type text adventure game. Tool: Twine.
Craft a story - use Incident Reviews, past notes and enrich with detail. Or just make it up.
Iterate and get feedback.
Observability: the Hard Parts - Peter Bourgon (Fastly)
Level up monitoring story at fastly
Senior engineering org, 120 engineers. Heterogeneous solutions throughout the organisation. Organic growth is good - up to a point.
Goals: are production systems healthy? if not why not?
Non goals: long term storage, replacing logging. (Observed metric usage VERY different from self-reported usage)
Strategy: prometheus, add instrumentation, curate new set of dashboards and alerts, run in parallel to build confidence, decom old systems.
Rollout: Inventory all services. Rank by criticality & friendliness.
The grind…
Embedded expert model. Pair with responsible engineer to get migration done. Hour or two of pair programming. Let documentation emerge naturally. Deploy prometheus, plumb service into prometheus (time consuming).
Technical autonomy is a form of technical debt. Focus on local vs. global optimization.
Dashboards and alerts. Import from other systems. Make sure to only build things in service to original goals. Reduce, keep high level views.
Some services, ownership can be indistinct. Senior leadership needed to wield stick, need their buy in.
Warning: This Talk Contains Content Known to the State of California to Reduce Alert Fatigue - Aditya Mukerjee (Stripe)
We can learn from clinical healthcare
Alert Fatigue - frequency/severity - causes responder to ignore or make mistakes
Decision Fatigue - frequency/complexity of decision points, causes person to avoid devices or make mistakes
Certain patterns of alerts and decisions contribute disproportionately to fatigue. Multiple false positives for an individual patient, impacts that alert over all patients.
Reduce alert fatigue: STAT. Supported, Trustworthy, Actionable, Triaged.
Supported: who owns this alert? Responders should own, or feel ownership over end result.
Trustworthy: do I trust this alert to notify me when a problem happens? stays silent with all is well? give sufficient information to diagnose problems?
Anomaly detection: if you don’t understand why an alert triggered, you don’t understand if it’s real.
Actionable: One decision require to respond. Alerts difficult to action are ignored. Make alerts specific, add decision trees. Alerts must have a specific owner.
Triaged: Triage alerts. Type should reflect urgency. Urgency can change. Commonly understood tiers. Regular re-evaluation process.
Monitory Report: I Have Seen Your Observability Future. You Can Choose a Better One. - Ian Bennett (Twitter)
Twitter has a big monitoring system and migrating was hard