Monitorama 2016 Portland Day 3

Justin Reynolds, Netflix - Intuition Engineering at Netflix

Video

Discussed the problem at Netflix that regions were siloed; they worked on being able to serve users out of any region.

To fail out of a region, they need to scale up the other regions to serve all traffic.

Dashboards are good at looking back, but you need to know what’s happening now. How do you provide intuition of the now?

Created vizceral - see the blog post for screenshots & video: http://techblog.netflix.com/2015/10/flux-new-approach-to-system-intuition.html

Brian Brazil, Robust Perception - Prometheus

Slides Video

Prometheus is a TSDB offering ‘whitebox monitoring’ for looking inside applications. It supports labels, and alerting and graphing are unified, using the same query language.

Pull-based system that links into service discovery. There’s an HTTP API for graphing, and it supports persistent queries, which are used for alerting.

Provides instrumentation libraries; it’s incredibly simple to instrument functions and expose metrics to Prometheus. The client libraries don’t tie you into Prometheus - you can also use Graphite.
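
A minimal sketch of that instrumentation using the official Python client, prometheus_client (the metric names here are illustrative):

    # pip install prometheus_client
    import random
    import time

    from prometheus_client import Counter, Summary, start_http_server

    # Illustrative metric names; labels add dimensions you can query on.
    REQUESTS = Counter('app_requests_total', 'Total requests handled', ['method'])
    LATENCY = Summary('app_request_latency_seconds', 'Time spent handling a request')

    @LATENCY.time()  # observes how long each call takes
    def handle_request(method='GET'):
        REQUESTS.labels(method=method).inc()
        time.sleep(random.random() / 10)  # stand-in for real work

    if __name__ == '__main__':
        # Exposes http://localhost:8000/metrics for Prometheus to pull.
        start_http_server(8000)
        while True:
            handle_request()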

Can use Prometheus as a clearing house to translate between different data formats.

Doesn’t have a built-in notion of a machine. HA is achieved by duplicating servers, with the Alertmanager deduplicating the alerts. The Alertmanager can also group alerts.

Data is stored in files on disk, not in a round-robin database - all data is kept without downsampling.

Torkel Ödegaard, Raintank - Grafana Master Class

Video

Gave a demo on how to use Grafana, and showed recently added and upcoming features.

Katherine Daniels, Etsy - How to Teach an Old Monitoring System New Tricks

Slides Video

Old Monitoring System == Nagios

Adding New Servers

  • Use Deployinator to deploy Nagios configs. Chef provides the inventory used to generate a current list of hosts and hostgroups.
  • Run validation via Jenkins using nagios -v, as well as a custom-written Nagios validation tool.
  • New hosts are added with scheduled downtime so they don’t alert until the next day (see the sketch after this list). Chat bots send reminders when the downtime is about to finish.
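
Scheduling that downtime can go through Nagios’s external command file; a minimal sketch, assuming a 24-hour window and a typical command-file path (both are guesses, not Etsy’s actual setup):

    import time

    CMD_FILE = '/usr/local/nagios/var/rw/nagios.cmd'  # assumed path; varies per install

    def schedule_host_downtime(host, hours=24, author='provisioning', comment='new host burn-in'):
        now = int(time.time())
        end = now + hours * 3600
        # Nagios external command format:
        # SCHEDULE_HOST_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
        line = f'[{now}] SCHEDULE_HOST_DOWNTIME;{host};{now};{end};1;0;{end - now};{author};{comment}\n'
        with open(CMD_FILE, 'w') as f:  # the command file is a named pipe
            f.write(line)

    schedule_host_downtime('web042.example.com')  # hypothetical host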

Making Alerts (Marginally) Less Annoying

  • Created nagdash to provide a federated view of multiple Nagios instances.
  • Created nagios-herald to add context to Nagios alerts. It also lets people sign up for alerts for things they’re interested in. A rough sketch of the context idea follows this list.
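
nagios-herald itself is a Ruby tool with its own formatter API; this is only a conceptual Python sketch of what “adding context” means (the graph URL and alert fields are hypothetical):

    import shutil

    def add_context(alert: dict) -> str:
        """Enrich a raw alert with context before it's sent to a human."""
        lines = [f"{alert['host']}: {alert['service']} is {alert['state']}"]
        if alert['service'] == 'disk':
            # Attach current usage so the responder can judge urgency at a glance.
            usage = shutil.disk_usage('/')
            lines.append(f"  / is {usage.used / usage.total:.0%} full")
        lines.append(f"  graphs: https://graphs.example.com/{alert['host']}")  # assumed URL
        return '\n'.join(lines)

    print(add_context({'host': 'db01', 'service': 'disk', 'state': 'WARNING'}))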

Tracking Sleep

  • The Ops Weekly tool provides on-call reports; engineers flag what they had to do with each alert.
  • Sleep tracking and alert tracking for on-call staff, to understand how many alerts they’re facing and how it’s impacting their sleep.

An On Call Bedtime Story

  • Plenty of alerts fired because scheduled downtime for ongoing work had expired.
  • Created daily reports of which downtimes will soon expire and which will raise alerts (sketched below).
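
One way to build such a report is to scan Nagios’s status.dat for downtime entries ending soon; a rough sketch (the file path is a common default and the field names follow the usual status.dat layout - verify against your install):

    import re
    import time

    STATUS_FILE = '/usr/local/nagios/var/status.dat'  # assumed path

    def expiring_downtimes(horizon_hours=24):
        """Yield (host, end_time) for downtimes expiring within the horizon."""
        with open(STATUS_FILE) as f:
            blocks = re.findall(r'(?:host|service)downtime\s*\{(.*?)\}', f.read(), re.S)
        cutoff = time.time() + horizon_hours * 3600
        for block in blocks:
            fields = dict(re.findall(r'(\w+)=(.*)', block))
            end = int(fields.get('end_time', 0))
            if time.time() < end < cutoff:
                yield fields.get('host_name', '?'), end

    for host, end in expiring_downtimes():
        print(f'{host}: downtime ends {time.ctime(end)}')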

Joe Damato, packagecloud.io - All of Your Network Monitoring is (probably) Wrong

Slides Video

There’s too much stuff to know about

  • Ever copy-pasted config or tuned settings you didn’t understand?
  • Do you really understand every graph you’re generating?
  • What makes you think you can monitor this stuff?

Claim: the more complex the system is, the harder it is to monitor.

What’s pretty complicated? The Linux networking stack! Lots of features, lots of bugs - and no docs!

  • /proc/net stats can be buggy
  • ethtool inconsistent, not always implemented
  • the meaning of driver stats is not standardised
  • the stats’ meaning for a driver/device can change over time
  • /proc/net/snmp has bugs: double counting, stats not being counted correctly (see the parser sketch below for what these counters look like)
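
As a concrete example of what you’re up against, /proc/net/snmp pairs a header line with a value line per protocol; a small parser shows the raw counters behind your graphs (whether each counter means what you think is exactly the talk’s point):

    def read_proc_net_snmp(path='/proc/net/snmp'):
        """Parse /proc/net/snmp: alternating 'Proto: names' / 'Proto: values' lines."""
        stats = {}
        with open(path) as f:
            lines = f.readlines()
        for header, values in zip(lines[::2], lines[1::2]):
            proto = header.split(':')[0]
            names = header.split(':')[1].split()
            nums = [int(v) for v in values.split(':')[1].split()]
            stats[proto] = dict(zip(names, nums))
        return stats

    tcp = read_proc_net_snmp()['Tcp']
    print(tcp['RetransSegs'])  # a counter whose exact semantics still need verifying per kernel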

Monitoring something requires a very deep understanding of what you’re monitoring.

Properly monitoring and setting alerts requires significant investment.

Megan Kanne, Justin Nguyen, and Dan Sotolongo, Twitter - Building Twitter’s Next-Gen Alerting System

Slides Video

3.5B metrics per minute

Old Alerting System

  • 25k alerts/minute, 3m alert monitors
  • single config language, lots of existing examples, easy to write and add
  • all those points were good and bad!
  • lots of orphaned and unmaintained configs, no validation
  • alerts and dashboards were separate
  • reliability problems when zones suffered issues

Solution

  • combined alerts and dashboard configuration
  • dashboards defined in Python, with common libraries that can be included
  • Python allows testing configs (a hypothetical sketch follows this list)
  • created a multi-zone alerting system
  • reduced time to detect from 2.5 minutes to 1.75 minutes
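
Twitter’s config DSL is internal, so this is only a hypothetical illustration of the “configs in Python, therefore reusable and testable” point:

    from dataclasses import dataclass

    @dataclass
    class Alert:
        metric: str
        threshold: float
        duration_min: int

    def service_alerts(service: str, error_threshold: float = 0.01) -> list:
        """A shared library function that teams can reuse for standard alerts."""
        return [
            Alert(f'{service}/http/error_rate', error_threshold, 5),
            Alert(f'{service}/http/p99_latency_ms', 500, 10),
        ]

    # Because configs are plain Python, they can be unit tested before deploy:
    def test_service_alerts():
        alerts = service_alerts('timeline')
        assert all(a.threshold > 0 for a in alerts)
        assert alerts[0].metric == 'timeline/http/error_rate'

    test_service_alerts()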

Helping Human Reasoning

  • bring together global, dependency, and local context
  • including runbooks, contacts and escalations directly in the UI

Lessons Learned

  • distributed systems bring challenges around consistency, structural complexity, and reasoning about time
  • sharding choices are hard; it’s impossible to always avoid mistakes
  • support and collaborate with users; try to reduce the support burden with good information at interaction points (UI, CLI, etc.), good user guides, and docs
  • migrating: some users were happy to move, others pushed back - they had to accept schedule compromises and extra work

Joey Parsons, Airbnb - Monitoring and Health at Airbnb

Video

Prefers to buy stuff:

  • New Relic
  • Datadog, instrument apps using dogstatsd
  • Alerting through metrics

Created an open source tool, Interferon, to store alert configuration as code. A rough illustration of the idea is below.
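
Interferon’s own alert DSL is Ruby; this is a Python-flavored sketch of the alerts-as-code idea it embodies (generate Datadog monitor definitions from a host inventory and keep them in version control):

    HOSTS = ['web01', 'web02', 'db01']  # would come from your real host inventory

    def disk_alert(host: str) -> dict:
        """One generated Datadog monitor definition; the threshold is arbitrary."""
        return {
            'name': f'High disk usage on {host}',
            'query': f'avg(last_5m):avg:system.disk.in_use{{host:{host}}} > 0.9',
            'notify': '@ops-team',
        }

    # Definitions live in version control and are applied idempotently,
    # so alert changes get code review like any other change.
    alerts = [disk_alert(h) for h in HOSTS]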

Volunteer on-call system. SREs make sure things are in place so that anyone can be on call.

  • Sysops training for volunteers: monitoring systems, how to be effective, and learning from historical incidents
  • Shadow on call, learn from current primary/secondary
  • Promoted to on call

Weekly sysops meeting: go through incidents and hand-offs, and discuss scheduled maintenance.

On call health:

  • are alerting trends appropriate?
  • do we understand impact on engineers?
  • do we need to tune false positives?
  • are notifications and notification policies appropriate?

Dashboards for:

  • incident numbers over time
  • counts by service
  • total notifications per user and how many come at night
  • false positive incident counts

Heinrich Hartmann, Circonus - Statistics for Engineers

Slides Video

Monitoring Goals

  • Measure user experience / quality of service
  • Determine implications of service degradation
  • Define sensible SLA targets

External API Monitoring:

  • synthetic requests measure availability, but are a bad proxy for user experience
  • over long time ranges, rolled-up data is commonly displayed, which erodes spikes

Log Analysis:

  • write request stats to log file
  • rich information, but expensive, and there’s a delay for indexing

Monitoring Latency Averages:

  • mean values: cheap to collect, store, and analyse, but skewed by outliers/low volumes
  • percentiles: cheap to collect, store, and analyse, and robust to outliers, but an up-front percentile choice is required and they cannot be aggregated

Percentiles: keep all your data - don’t take averages of percentiles! Store percentiles for all reporting periods you are interested in (i.e. per minute/hour/day), and store all percentiles you’ll ever be interested in.

Monitoring with Histograms:

  • divide latency scale into bands
  • divide time scale into reporting periods
  • count the number of samples in each band × period

Histograms can be aggregated across time periods, and visualised as heatmaps. A sketch of the banding-and-merging idea follows.
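
A minimal sketch of the banding idea: count samples per (latency band, period), merge the counts across periods - which you cannot do with stored percentiles - and estimate percentiles from the merged histogram (the band edges are an arbitrary choice):

    import bisect

    BANDS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]  # ms upper bounds, arbitrary

    def to_histogram(samples_ms):
        """Count samples per latency band; the last slot catches everything above."""
        counts = [0] * (len(BANDS) + 1)
        for s in samples_ms:
            counts[bisect.bisect_left(BANDS, s)] += 1
        return counts

    def merge(h1, h2):
        # Valid for histograms (unlike percentiles): just add per-band counts.
        return [a + b for a, b in zip(h1, h2)]

    def percentile(hist, q):
        """Return the upper bound of the band containing the q-th percentile."""
        target = q / 100 * sum(hist)
        running = 0
        for band_upper, count in zip(BANDS + [float('inf')], hist):
            running += count
            if running >= target:
                return band_upper

    minute1 = to_histogram([3, 8, 8, 45, 120])
    minute2 = to_histogram([2, 4, 9, 300])
    print(percentile(merge(minute1, minute2), 99))  # p99 across both periods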

John Banning, Google - Monarch, Google’s Planet-Scale Monitoring Infrastructure

Video

Huge volume, global span, many teams - constant change

Previously Borgmon. Each group had its own Borgmon, which put a large load on anyone doing monitoring. Hazing ritual: the new engineer gets to do Borgmon config maintenance.

Wanted:

  • can handle the scale
  • small/no load to get up and running
  • capable of handling the largest services

Monitor locally. Collect and store the data near where it’s generated. Each zone has a Monarch.

  • Targets collect data with the streamz library. Metrics are multi-dimensional and can hold histograms.
  • Metrics are sent to the Monarch ingestion router, which forwards them to a leaf - an in-memory data store - and also writes them to a recovery log; from the log they move to a long-term disk repository.
  • Streams are stored in a table, which is the basis for queries.
  • An evaluator runs queries, storing new data for streams or sending notifications.

Integrate globally. Global Monarch is distributed across zones, but provides a single place to configure and query all the Monarchs in all zones.

Provides both Web UI and Python interfaces.

Monarch is the backend for Stackdriver.

Monitoring as a service is the right idea. Make the service a platform to build monitoring solutions.