Monitorama 2016 Portland Day 2

Brian Overstreet, Pinterest - Scaling Pinterest’s Monitoring

Video

Started with Ganglia, Pingdom

Deployed Graphite, single box

Second Graphite architecture - Load Balancer, 2x relay servers, multiple cache/web boxes etc.

Suffered lots of UDP packet receive errors

Put statsd everywhere

fixed packet loss, unique metric names per host
latency only per host, too many metrics

Sharded statsd

not unique per host now,
shard mapping in client, client version needs to be same everywhere

Multiple graphite clusters - one per application (python/java)

More maintenance, more routing rules etc.

Problems with reads, multiple glob searches can be slow

Deployed OpenTSDB

Replace statsd

local metrics agent, kafka, storm - send to graphite/opentsdb

Metrics-agent:

interface for opentsdb and statsd
sends metrics to kafka
processed by storm

120k/sec graphite, 1.5m/sec opentsdb. no more graphite, move to opentsdb.

Create statsboard - integrates graphite and opentsdb for dashboards and alerts

Graphite User Education - underlying info about how metrics are collected, precision, aggregation etc.

Protect System from Clients

alert on unique metrics
block metrics using zookeeper and shared blacklist (created on fly)

Lessen Operational Overhead

more tools, more overhead
more monitoring systems, more monitoring of the monitoring system
removing a tool in prod is hard

Set expectations

data has a lifetime
not magical data warehouse tool that returns data instantly
not all metrics will be efficient

Summary:

match monitoring system to where the company is at
user editation is key to scale tools organizationally
tools scale with the number of engineers, not users

Emily Nakashima, Bugsnag - What your javascript does when you’re not around

Video

Lots of app moving to frontend, so running in browser not on backend servers

Toolkit:

capture load performance from browser, send to app server, use statsd + grafana & google analytics
capture uncaught exections in the browser, using their own product

Sorry, Javascript is just not that relevant in my line of work

Eron Nicholson and Noah Lorang, Basecamp - CHICKEN and WAFFLES: Identifying and Handling Malice

Slides

Video

Suffered DDoS and blackmail. 80 gigabits - DNS reflection, NTP reflection, SYN floods, ICMP flood

Defense and Mitigation:

DC partner filters for them
More 10G circuits and routers
Arrangements with vendors to provide emergency access and other mitigation tools

Experience got them serious about more subtle application level attacks:

vulnerability scanners
repeat slow page requests
brute force attempts
nefarious crawlers

What do we want from a defense system?

Protection against application-level attacks
Keep user access uninterrupted
Take advantage of the data we have available
Transparent in what gets blocked and why

Chicken: who is a real user and who is malicious?

Considered Machine Learning classification. Problems: really hard to get a good training set. Need to be able to explain why an IP was blocked.

Simpler approach:

Some behaviours are known to be from people up to no good. crawling phpmyadmin, path reversal, repeated failed login attempts etc.
Request history gives a good idea of whether someone is a normal user, broken script or a malicious actor.
External indicators: geoip databases, badip dbs, facebook threat exchange

Removing simple things reduces noise. Every incoming request is scored and per-IP aggregate score calculated based on return code. Create Exponentially Weighted Moving Average from that data. About 12% had negative reputation.

Scaning for blockable actions and scoring requests in near real-time using request byproducts.

Request logs, netflow data, threat exchange -> kafka -> request scoring, scanner for known bad bahaviour, tools for manual evaluation.

Average IP reputation gives an early indicator to monitor for application level attack.

Provides list of good, bad, and dubious IPs.

Enforcement:

originally provided by iptables rules on haproxy hosts
then tried rule on loadbalancer
then tried null routing on routers
finally created waffles

Using BGP flowspec to send data from routers to waffles, which then decides what path to take: error, app or challenge. Waffles host live in a seperate network with limited access.

Waffles is redis and nginx.

John Stanford, Solinea - Fake It Until You Make It

Video

Monitoring an openstack cluster, 1 controller and 6 compute nodes, taking logs with heja and sending it to elasticsearch. Can I scale this up to a thousand nodes? How big can it get?

How do you go about figuring that out?

took 7 days of logs from lab, 25k messages/hr
number of logs coming from a node
number of logs coming from a component
7 day message rate, look at histograms, identifies recurring outlier
message size, percentiles of payload size

What models look like what we’re doing for simulation? Add some random noise.

Flood process, monitoring everything, repeat until it breaks. System sustained 4k x 1k messages/sec, started to pause above that, but no messages were dropped.

Next steps:

find bottlenecks
improve the model

Tammy Butow, Dropbox - Database Monitoring at Dropbox

Slides

Video

Achieving any goal requires honest and regular monitoring of your progress.

Originally used nagios, thruk (web ui) and ganglia

Created own tool vortex in 2013

why create in house monitoring?

performance, reliability iissues, scaling number of metrics fast

Create Vortex:

Time Series Database with dashboards, alerting, aggregation
Rich metric metadata, tag a metric with lots of attributes

Monthra: single way of scheduling and relaying metrics, discourage scheduling with cron

Service Metrics:

what durability, reliability goals? align monitoring to goals?
threads running/ threads connected

Run a Monthly Metrics Review (great idea)

Dave Josephsen, Librato - 5 Lines I couldn’t draw

Video

Making cofnitive leap to use monitoring tools to recognise system behaviour independent of alerting. misapprehension about what monitoring was and whom it was for.
Monitoring is not for alerting. Nobody owns moitoring. ‘Tape measure that I share with every engineer I work with’. Ops owns monitoring vs everyone owns monitoring. Monitoring is for asking questions.
Complexity isolates. Effective monitoring gives you the things that allow you to ‘Cynefin’ - make things more familiar and knowable. Reduce complexity rather than embracing it. Monitoring can build bridges to help people understand things across boundaries.
Effective monitoring can bring about cultural change, how people interact between each other.
Repeated point 4

Jessie Frazelle, Google - Everything is broken

Video

Talked about problems with Software Engineering and Operations

Demonstrated how they monitored community and external maintainer PR statistics for Docker project.

James Fryman, Auth0 - Metrics are for Chumps - Understanding and overcoming the roadblocks to implementing instrumentation

Video

Story of implementation of instrumentation at Auth0

Wanted data driven conversations. Metrics implementation happened in past, was ripped out because not well understood, thought to cause latency. Created adversions.

Make the chase. Get buy-in.

To have good decent conversations with someone you need to have metrics.

Pushback:

Not the most important feature - but it is!
Cannot start until we understand the data retention requirements - premature optimisations
We don’t run a SaaS - need to understand what your software is doing regardless

Make decisions based on knowledge, not intuition or luck.

Be opportunistic - success is 90% planning, 10% timing and luck. Find opportunites to accellerate efforts.

Needed to get something going fast - went for full service SaaS Datadog, but with common interfaces and shims to allow moving things in house later. Don’t delay, jump in and iterate.

Keep in sync with developers - change is difficult and there will be resistance, pay attention to feedback. Need to support interpretation of data.

Build out data flows, find potention choke points in system, take a baseline measurement, check systems in isolation

Fix and Repair bottlenecks. Solved 3 major bottlenecks, went from 500 to 10k RPS.

conferences monitoring

Brian Overstreet, Pinterest - Scaling Pinterest’s Monitoring

Emily Nakashima, Bugsnag - What your javascript does when you’re not around

Eron Nicholson and Noah Lorang, Basecamp - CHICKEN and WAFFLES: Identifying and Handling Malice

John Stanford, Solinea - Fake It Until You Make It

Tammy Butow, Dropbox - Database Monitoring at Dropbox

Dave Josephsen, Librato - 5 Lines I couldn’t draw

Jessie Frazelle, Google - Everything is broken

James Fryman, Auth0 - Metrics are for Chumps - Understanding and overcoming the roadblocks to implementing instrumentation

Related Posts

Scaling Graphite Clusters 09 Mar 2019

Monitorama 2018 AMS Day 2 07 Sep 2018

Monitorama 2018 AMS Day 1 07 Sep 2018