Tue01__Scaling.Pinterest’s.Monitoring.System__by_Brian.Overstreet.md

Scaling Pinterest’s Monitoring System

  • Focused on scaling time-series data.
  • 75+ billion ideas categorized by people into more than 1 billion boards.
  • Early tools: Ganglia for metrics, and Pingdom for up/down checks.
  • April 2012: 10 outages per day. June 2012: fewer outages, but longer. :-(
  • Pushed into graphite.
  • StatsD -> a single Graphite server.
  • UDP joke: you might not get it.
  • 8 hours to bring up a replacement... but this setup lasted 1 year.
  • Scaled by # of engineers, not # of customers.
  • Next gen: consistent hash of metric to backend server, by HAProxy.
  • UDP still a problem
  • Two solutions: StatsD on each host, include hostname in metric.
    • autoscaling makes this hard.
    • Hostnames are huge in metric space.
    • ALL STATSD CLIENTS MUST BE IN SYNC.
  • Sharded statsd.
    • Metric names did not need to be unique.
    • Fixed most packet loss issues for a while
    • Cons: Shard mapping
  • Shard out clusters... manually.
    • Complexity on each cluster? which to send to?
    • adding clusters is hard.
    • Gives greater availability.
  • Graphite can't handle multiple globs per query.
  • OpenTSDB based on HBase
    • Not as pretty, but far easier to query, and far faster.
  • UDP is still dropping stuff.
    • Graphs are just wrong...
  • "I'd give anything for a timeseries database I can trust."
  • Monitoring system metrics didn't add up.
  • Time to replace statsd:
    • Local metrics agent
    • Kafka
    • storm
      • -> graphite or opentsdb
  • Metrics-agent is the gatekeeper.
    • interface for opentsdb and statsd
    • sends metrics to kafka
    • Needed to convert to the Kafka pipeline with no downtime.
      • Double write to existing StatsD and Kafka.
    • Aggregation still important. Decided on 30 seconds.
    • More aggregate data has wins too... allows for server reboots.
  • wins:
    • Fewer gaps now
    • Graphite gets 120k points/sec
    • OpenTSDB gets 1.5 million points/sec
  • Statsboard
    • Define alerts and dashboards in YAML
    • Handles routing requests to various backends/clusters
  • RRD data needs a fair bit of context to interpret it.
  • What about opentsdb?
    • "Something is wrong: my lines are often unnaturally straight. Can you fix it???"
  • Started a user education campaign for these tools.
    • What RRDs are and how to normalize
    • Metric summarization into the next interval
    • Getting request/sec from a timer
    • Difference between stats and stats_counts?
    • Should I use hitcount or integral to calculate totals?
    • OpenTSDB: Getting data from continually incrementing counters
      • interpolation of data points
      • How aggregation works
      • Query optimization
  • What else have we learned?
    • Protect system from clients
    • Alert on unique metrics
    • Block metrics using zookeeper
    • Including a metrics blacklist
    • Cannot control how users use the data... do not want business decisions off of wrong data.
    • Measuring data accuracy is hard.
      • Count metrics generated vs metrics written at every phase
      • Lots of places a metric can get lost and not known about
    • Lesson: cost of aggregation (Overhead):
    • StatsD performs network call to update, even if over localhost
    • Java uses ostrich lib for in process aggregation.
      • problem: can't get cluster level data, only instance.
    • Users need to never see the sample rate.
    • Lessen operational overhead.
      • We keep adding tools. Ack!
      • More systems = more work to monitor the monitors.
      • Removing tools from prod is hard.
      • As product gains more 9s, so must metrics platform.
    • Set user expectations
      • Data has a lifetime, which needs to be conveyed.
      • Not magical data warehouse tools that return data instantly
      • Not all metrics will be efficient.
  • Summary:
    • Match the monitoring system to where the company is.
    • User education is key to scaling these tools organizationally.
    • Tools scale with number of engineers, not users of site.
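
A minimal sketch of the "consistent hash of metric to backend server" idea from the next-gen Graphite setup above. The talk only says the hashing was done by HAProxy; the ring layout, the `replicas` count, the hash function, and the backend names here are all assumptions for illustration, not Pinterest's configuration.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps metric names to backend servers using a hash ring (illustrative only)."""

    def __init__(self, backends, replicas=100):
        # Each backend gets `replicas` virtual points on the ring so that
        # metrics spread roughly evenly across backends.
        self._ring = sorted(
            (self._hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        # Any stable hash works; SHA-256 is used here only because it is in the stdlib.
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def backend_for(self, metric_name):
        # Pick the first ring point at or after the metric's hash, wrapping around.
        idx = bisect.bisect_left(self._keys, self._hash(metric_name)) % len(self._keys)
        return self._ring[idx][1]


# Hypothetical backend names.
ring = ConsistentHashRing(["graphite-1", "graphite-2", "graphite-3"])
print(ring.backend_for("web.requests.count"))
```

The same metric name always lands on the same backend, and removing a backend only moves the metrics that lived on it, which is the property that makes this easier to operate than a plain modulo hash.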

QnA:

  • Blacklists?
    • List in zookeeper updates all hosts.
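
A rough sketch of how a host-side blacklist like the one in the answer above might behave. The notes only say that a list in ZooKeeper updates all hosts; the newline-separated glob format, the `MetricBlacklist` class, and the `update` callback (which a ZooKeeper watch would be expected to call) are hypothetical.

```python
import fnmatch
import threading


class MetricBlacklist:
    """Holds the current blacklist patterns on one host (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._patterns = ()

    def update(self, raw_bytes):
        # Assumed to be called whenever the shared list changes, e.g. from a
        # ZooKeeper watch. Assumed format: one glob per line, '#' starts a comment.
        patterns = []
        for line in raw_bytes.decode("utf-8").splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                patterns.append(line)
        with self._lock:
            self._patterns = tuple(patterns)

    def is_blocked(self, metric_name):
        with self._lock:
            patterns = self._patterns
        return any(fnmatch.fnmatchcase(metric_name, p) for p in patterns)


blacklist = MetricBlacklist()
blacklist.update(b"# noisy metrics\ndebug.*\nweb.*.tmp\n")
print(blacklist.is_blocked("debug.cache.hits"))
```

A metrics agent could check `is_blocked` before forwarding each metric, so a bad client can be cut off everywhere by editing one shared list.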