- Focused on scaling time-series data.
- 75+ billion ideas categorized by people into more than 1 billion boards
- Early tools: Ganglia for metrics, and Pingdom for up/down checks
- April 2012: 10 outages per day. June 2012: fewer outages, but longer. :-(
- Pushed into Graphite.
- StatsD -> single Graphite server (client-side sketch below)
- UDP joke: you might not get it.
- 8 hours to bring up a replacement... but this setup lasted 1 year.
- Scaled by # of engineers, not # of customers.
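- Roughly what the client side of that first pipeline looks like: fire-and-forget UDP writes in the StatsD line protocol. Host, port, and metric name below are made up for illustration, not values from the talk:

```python
import socket

# Hypothetical StatsD host/port -- not values from the talk.
STATSD_HOST, STATSD_PORT = "statsd.example.com", 8125

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(metric, count=1, sample_rate=1.0):
    # StatsD line protocol: "<name>:<value>|c[|@<sample_rate>]"
    payload = f"{metric}:{count}|c"
    if sample_rate < 1.0:
        payload += f"|@{sample_rate}"
    # sendto() returns immediately; if the packet is dropped, nothing tells you.
    sock.sendto(payload.encode("ascii"), (STATSD_HOST, STATSD_PORT))

incr("web.signup.success")
```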
- Next gen: consistent hash of metric name to backend server, via haproxy (hashing sketch after this block).
- UDP still a problem
- Two solutions: run StatsD on each host, or include the hostname in the metric name.
- Autoscaling makes this hard.
- Hostnames blow up the metric namespace.
- ALL STATSD CLIENTS MUST BE IN SYNC.
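- Roughly how hashing a metric name to a backend works; a generic consistent-hashing sketch (backend names and vnode count invented), not Pinterest's actual haproxy config:

```python
import bisect
import hashlib

BACKENDS = ["graphite-01", "graphite-02", "graphite-03"]  # made-up names
VNODES = 100  # virtual nodes per backend to smooth the distribution

def _hash(key):
    return int(hashlib.md5(key.encode("ascii")).hexdigest(), 16)

# Build the hash ring once: (hash, backend) pairs sorted by hash.
ring = sorted(
    (_hash(f"{backend}#{i}"), backend)
    for backend in BACKENDS
    for i in range(VNODES)
)
keys = [h for h, _ in ring]

def backend_for(metric):
    # The first vnode clockwise from the metric's hash owns the metric.
    idx = bisect.bisect(keys, _hash(metric)) % len(ring)
    return ring[idx][1]

print(backend_for("stats.web01.api.requests"))
```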
- Sharded statsd.
- Metric names didn't need to be unique.
- Fixed most packet loss issues for a while
- Cons: Shard mapping
- Shard out clusters... manually.
- Complexity pushed onto each client: which cluster do you send to?
- Adding clusters is hard.
- Gives greater availability, though.
- Graphite can't handle multiple globs per query.
- OpenTSDB based on HBase
- Not as pretty, but far easier to query, and far faster.
- UDP is still dropping stuff.
- Graphs are just wrong...
- "I'd give anything for a timeseries database I can trust."
- Monitoring system metrics didn't add up.
- Time to replace statsd:
- Local metrics agent
- Kafka
- Storm
- -> Graphite or OpenTSDB
- Metrics-agent is the gatekeeper.
- Interface for OpenTSDB and StatsD
- sends metrics to kafka
- Needed to convert to the Kafka pipeline with no downtime.
- Double-write to the existing StatsD and to Kafka (agent sketch after this block).
- Aggregation still important. Decided on 30 seconds.
- More aggregated data has wins too... e.g., it tolerates server reboots.
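- A rough sketch of the double-write and pre-aggregation idea in the metrics agent, assuming kafka-python and a 30-second flush; broker addresses, topic name, and the wire format are invented:

```python
import json
import socket
import time
from collections import defaultdict

from kafka import KafkaProducer  # kafka-python; brokers below are hypothetical

FLUSH_INTERVAL = 30  # seconds, the interval settled on in the talk
LEGACY_STATSD = ("statsd.example.com", 8125)

producer = KafkaProducer(bootstrap_servers=["kafka01:9092"])
legacy = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
counters = defaultdict(float)

def record(metric, value=1):
    # During the migration, double-write: old StatsD path plus a local buffer.
    legacy.sendto(f"{metric}:{value}|c".encode("ascii"), LEGACY_STATSD)
    counters[metric] += value

def flush():
    # Pre-aggregate locally, then ship one batch per interval to Kafka.
    now = int(time.time())
    batch = [{"metric": m, "value": v, "ts": now} for m, v in counters.items()]
    counters.clear()
    if batch:
        producer.send("metrics", json.dumps(batch).encode("utf-8"))

record("api.requests")
flush()  # in the real agent this would run on a timer every FLUSH_INTERVAL seconds
```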
- Wins:
- fewer gaps now
- graphite gets 120k points/sec
- opentsdb 1.5mil points/sec
- Statsboard
- Define alerts and dashboards in YAML (illustrative example below)
- Handles routing requests to various backends/clusters
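- For flavor, a made-up alert definition and a tiny evaluator; the YAML schema here is an illustration of the idea, not Statsboard's actual format:

```python
import yaml  # PyYAML

# Invented schema -- only meant to show the "alerts defined in YAML" idea.
ALERT_YAML = """
alerts:
  - name: high_5xx_rate
    target: stats.web.status.5xx
    backend: graphite          # Statsboard routes queries to the right backend
    threshold: 100
    compare: ">"
"""

def firing(alert, latest_value):
    # Evaluate a single threshold alert against the latest data point.
    if alert["compare"] == ">":
        return latest_value > alert["threshold"]
    return latest_value < alert["threshold"]

config = yaml.safe_load(ALERT_YAML)
for alert in config["alerts"]:
    print(alert["name"], "FIRING" if firing(alert, 250) else "ok")
```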
- RRD data needs a fair bit of context to interpret it.
- What about OpenTSDB?
- "Something is wrong; my lines are often unnaturally straight. Can you fix it???"
- Started a user education campaign for these tools.
- What are RRDs, and how to normalize them
- Metric summarization into the next interval
- Getting request/sec from a timer
- Difference between stats and stats_counts?
- Should I use hitcount() or integral() to calculate totals?
- OpenTSDB:
- Getting data from continually incrementing counters (rate sketch after this list)
- interpolation of data points
- How aggregation works
- Query optimization
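- The counter question above in miniature: turning a monotonically increasing counter into a per-second rate while skipping resets. A generic sketch of the idea, not OpenTSDB's actual implementation:

```python
def counter_to_rate(samples):
    """Turn (timestamp, cumulative_count) samples into per-second rates.

    A counter reset (value going backwards, e.g. after a process restart)
    yields no point rather than a huge negative spike.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta < 0 or t1 <= t0:
            continue  # counter reset or bad timestamps: skip this interval
        rates.append((t1, delta / (t1 - t0)))
    return rates

# e.g. requests served, sampled every 30s, with a restart in the middle
print(counter_to_rate([(0, 1000), (30, 1600), (60, 40), (90, 640)]))
```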
- What else have we learned?
- Protect system from clients
- Alert on unique metrics
- Block metrics using ZooKeeper (sketch below)
- Including a metrics blacklist
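- A sketch of a ZooKeeper-driven blacklist using the kazoo client; the znode path and hosts are assumptions, not the production values:

```python
from kazoo.client import KazooClient

# Hypothetical ZooKeeper ensemble and znode path.
zk = KazooClient(hosts="zk01:2181,zk02:2181")
zk.start()

blacklist = set()

@zk.DataWatch("/metrics/blacklist")
def update_blacklist(data, stat):
    # The znode holds one metric name per line; every host sees updates.
    blacklist.clear()
    if data:
        blacklist.update(data.decode("utf-8").split())

def should_write(metric):
    return metric not in blacklist
```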
- Cannot control how users use the data... don't want business decisions made off of wrong data.
- Measuring data accuracy is hard.
- Count metrics generated vs. metrics written at every phase (sketch below)
- Lots of places a metric can get lost without anyone knowing.
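- One way to do that counting, sketched generically; the wrapper and stage names are invented, not the actual pipeline code:

```python
class CountingStage:
    """Wrap a pipeline stage so metrics-in vs. metrics-out can be compared.

    Comparing `seen` and `written` for every phase (agent, Kafka consumer,
    Storm bolt, TSDB writer) shows where points are silently lost.
    """

    def __init__(self, name, write_fn):
        self.name = name
        self.write_fn = write_fn
        self.seen = 0
        self.written = 0

    def process(self, point):
        self.seen += 1
        try:
            self.write_fn(point)
            self.written += 1
        except Exception:
            pass  # dropped; the seen/written gap makes the loss visible

    def lost(self):
        return self.seen - self.written
```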
- Lesson: cost of aggregation (Overhead):
- StatsD performs network call to update, even if over localhost
- Java services use the Ostrich lib for in-process aggregation (Python sketch of the idea below).
- Problem: you only get instance-level data, not cluster-level.
- Users should never see the sample rate.
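- The in-process aggregation idea, sketched in Python rather than Ostrich: no network call per update, and sampled increments are scaled up at record time so users never have to think about the sample rate:

```python
import threading
from collections import defaultdict

class InProcessCounters:
    """Sketch of Ostrich-style in-process aggregation.

    Increments are a dict update under a lock -- no network call per update --
    and sampled increments are scaled by 1/sample_rate when recorded.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(float)

    def incr(self, metric, count=1, sample_rate=1.0):
        with self._lock:
            self._counts[metric] += count / sample_rate

    def flush(self):
        # Called once per interval by a background reporter (not shown);
        # only here does anything leave the process.
        with self._lock:
            snapshot = dict(self._counts)
            self._counts.clear()
        return snapshot
```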
- Lesson: operational overhead:
- We keep adding tools. Ack!
- More systems = more work to monitor the monitors.
- Removing tools from prod is hard.
- As product gains more 9s, so must metrics platform.
- Set user expectations ###
- Data has a lifetime, which needs to be conveyed.
- Not magical data warehouse tools that return data instantly
- Not all metrics will be efficient.
- Summary:
- Match the monitoring system to where the company is at.
- User education is key to scaling these tools organizationally.
- Tools scale with number of engineers, not users of site. ###
QnA:
- Blacklists?
- A list in ZooKeeper; updates propagate to all hosts.