Skip to content
This repository has been archived by the owner on Oct 11, 2022. It is now read-only.

Add StatsD tracking #4444

Merged
merged 29 commits into from
Dec 19, 2018
Merged

Add StatsD tracking #4444

merged 29 commits into from
Dec 19, 2018

Conversation

mxstbr
Copy link
Contributor

@mxstbr mxstbr commented Dec 11, 2018

Status

  • WIP
  • Ready for review
  • Needs testing

Deploy after merge (delete what needn't be deployed)

  • api
  • hyperion (frontend)

Adds StatsD tracking to our HTTP servers (api and hyperion).

Todo

  • Set up a middleware to track request times
  • Handle GraphQL operation tracking
  • Submit metrics to DataDog in production
  • Log metrics in development

I'm currently waiting to get the DataDog API key to post this data to DataDog!

Max Stoiber added 2 commits December 11, 2018 09:43
This is the recommended library to use for metrics reporting to
datadog, see https://docs.datadoghq.com/integrations/node
This tracks response times and status codes, and also specificaly
tracks GraphQL operations happening in the API.
@mxstbr
Copy link
Contributor Author

mxstbr commented Dec 11, 2018

This is what the logs look like for hyperion (the server-side renderer) and the API:

hyperion.http.get.spectrum.general.response_time:3012|ms
hyperion.http.get.spectrum.general.status_code.200:1|c

api.http.post.api.getCommunityThreads.response_time:35|ms
api.http.post.api.getCommunityThreads.status_code.200:1|c

Note how it adds the GraphQL operation name to the API logs, which should make it easier to debug expensive GraphQL queries and optimize them, and the request pathname to all requests.

Max Stoiber added 2 commits December 11, 2018 11:15
This is based on #4444 and adds tracking of the database connection
count and query queue size over time.

I'm not sure we should ship this as it happens _very_ often…
mxstbr and others added 3 commits December 11, 2018 12:32
…and-queue-length

Track database connection count and queue size in DataDog
Forked express-statsd to withspectrum/express-hot-shots to add support
for tags in the metrics.
@mxstbr
Copy link
Contributor Author

mxstbr commented Dec 11, 2018

Latest couple commits switch to a fork of express-statsd I just made (https://github.com/withspectrum/express-hot-shots) to add support for tags in the metrics, which allows us to tag the GraphQL operation name instead of putting it into the metric key as recommended by the observability team! 💯

That means our metrics now look like this, note that the server name and operation name are moved to the tags (after the #):

http.get.spectrum.general.id-123.response_time:3012|ms|#server:hyperion
http.get.spectrum.general.id-123.status_code.200:1|c|#server:hyperion
db.query_queue.size:2|g|#server:hyperion

http.post.api.response_time:35|ms|#server:api,graphqlOperationName:getCurrentUser
http.post.api.status_code.200:1|c|#server:api,graphqlOperationName:getCurrentUser
db.connection_pool.size:58|g|#server:api

@mxstbr
Copy link
Contributor Author

mxstbr commented Dec 11, 2018

This is not quite ready yet, I was recommended to change towards tracking response times and status codes per view rather than per pathname (i.e. http.get.community, http.get.thread, http.get.dashboard instead of http.get.figma, http.get.figma.general.asdf-123). Going to dig into that, will take a bit more work since I have to reverse-engineer our React routing setup somehow.

@mxstbr
Copy link
Contributor Author

mxstbr commented Dec 18, 2018

Latest commit switches to https://github.com/dbader/node-datadog-metrics, which avoids the need to run the DataDog agent and simply pings the DataDog API directly! This works and is live in production for testing

@brianlovin
Copy link
Contributor

@mxstbr looks like we just need to clean up flow and merge alpha, then let's merge this!

@mxstbr
Copy link
Contributor Author

mxstbr commented Dec 19, 2018

Have already learned a bunch of stuff from having this run in production, will rework a little bit to make it more useful for all the workers and then ship!

Max Stoiber added 4 commits December 19, 2018 15:13
This is an anti-pattern, as it'll blow up our DataDog storage costs.
Instead, we should be using a logging service like Splunk to dig into
specific slow requests after noticing abnormalities in the
metrics. Will tackle that next most likely, but this should be a good start!

Thanks to the folks in #observability for pointing that out!
@mxstbr
Copy link
Contributor Author

mxstbr commented Dec 19, 2018

The latest commits rework the tracking a bit:

  • Remove paths from the metrics: this is an anti-pattern according to the #observability folks, as it's a) not very useful (DataDog doesn't let you aggregate sub-metrics) and b) blows up storage size, making it very expensive. Rather than tracking e.g. http.get.auth.twitter.response_time we now simply track http.get.response_time across all our servers
  • Track db query response size and time: we're now tracking how long database queries take and how much data they return, allowing us to notice abnormalities across all our servers much quicker
  • Stop tracking connection pool size, as it's constant now
  • Adds nicer logging in dev

I think this is probably good to ship, going to wait for tests to pass and do a test deploy on alpha first though!

@mxstbr
Copy link
Contributor Author

mxstbr commented Dec 19, 2018

This has been up on api.alpha for a bit and looks fine, wanna give it a review and then deploy to all the servers today @brianlovin? That'd be great

@brianlovin
Copy link
Contributor

brianlovin commented Dec 19, 2018

This is awesome @mxstbr - are you going to set up some sensible default dashboards in datadog that we can start using going forward? I imagine you have a more clear picture of how you'd like this reporting to be visualized and prioritized.

Talked offline: Max to create dashboards for us tomorrow :)

@brianlovin brianlovin merged commit 2ef82a1 into alpha Dec 19, 2018
@brianlovin brianlovin deleted the statsd branch December 19, 2018 17:46
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants