apollo-server-core: unified Studio reporting (#4142)

The usage reporting plugin in `apollo-server-core` is not the first tool Apollo built to report usage to Studio. Previous iterations such as `optics-agent` and `engineproxy` reported a combination of detailed per-field single-operation performance *traces* and summarized *stats* of operations to Apollo's servers.

When we built this TypeScript usage reporting plugin in 2018, for the sake of expediency we did something different: it only sent traces to Apollo's servers. This meant that the performance of every single user operation was described in detail to Apollo's servers.

Studio is not an exhaustive trace warehouse: we have always *sampled* the traces received, making only some of them available via Studio's Traces UI. The other traces were converted to stats inside Studio's servers.

While this meant that the reporting agent was simpler than the previous implementations (it did not need to be able to describe performance statistics), it also meant that the protocol used to talk to Studio consumed a lot more bandwidth (as well as CPU time for encoding traces).

This PR returns us to the world where Studio usage is reported as a combination of stats and traces. It takes a slightly different approach than the previous implementations: instead of reporting stats and traces in parallel, a single usage report contains both stats and traces. Each GraphQL operation is described either as a trace or as stats, not both.

We expect this to significantly reduce the network and CPU requirements of sending usage reports to Studio. It should not significantly affect the experience of using Studio: we have always heavily sampled traces in Studio before saving them to the trace warehouse, and the default heuristic for which operations to send as traces works similarly to the heuristic used in Studio's servers.

This PR introduces an option `experimental_sendOperationAsTrace` to allow you to control whether a given operation is sent as a trace or as stats. This is truly an experimental option that may change at any time. For example, you should not rely on the fact that this will be called on all operations after the operation is done with a full trace, or on its signature, or even that it exists. It is likely that future improvements to the usage reporting plugin will change how operations are observed so that we don't have to collect a full trace before deciding how to represent the operation.

Some other notes:

- Upgrade our fork `@apollo/protobufjs` with a few improvements:
  - New `js_use_toArray` option which lets you encode repeated fields from objects that aren't stored in memory as arrays but expose `toArray` methods. We use this so that we can build up `DurationHistogram`s and map-like objects in a non-array fashion and only convert to array at encoding time.
  - New `js_preEncoded` option which allows you to encode messages in repeated fields as buffers (Uint8Arrays). This helps amortize the encoding cost of a large message over time instead of freezing the event loop to encode the whole message at once. This replaces an old hack we used for one field with something built in to the protobuf compiler (including correct TypeScript typings).
  - New `--no-from-object` flag which we use to reduce the size of generated code (as we don't use the `fromObject` protobuf.js API).
- In order to help us validate that the trace->stats code in this PR matches similar code in Studio's servers, the flag `internal_includeTracesContributingToStats` sends the traces that contribute to stats in a special field. This is something we only use as part of our own validation in our servers; for your graphs it will have no effect other than increasing message size.
- Viewing traces in Studio is only available on paid plans. The usage-reporting endpoint now tells the plugin whether traces are supported on your graph's plan; if they are not supported, the plugin will switch to sending all operations as stats (regardless of the value of `experimental_sendOperationAsTrace`) after the first report.
- We compare the report size against `maxUncompressedReportSize` using a rough estimate of how big the leaf nodes of the stats messages will be, rather than carefully counting how much space is used by each number and histogram. We do take the lengths of all strings into account.
- By mistake, this plugin previously never sent the cache policy on traces, meaning that visualizing cache-specific stats in Studio did not work. This is now fixed.

This project was begun by @jsegaran and completed by @glasser.
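
To make the trace-or-stats decision concrete, here is a minimal sketch of the kind of callback one might pass as `experimental_sendOperationAsTrace`. It is not the plugin's default heuristic. The `TraceLike` shape, the `(trace, statsReportKey) => boolean` signature, and the `makeSendOperationAsTrace` helper are all assumptions made for illustration; as noted above, the real option's signature may change or disappear.

```typescript
// Hypothetical, trimmed-down stand-in for a finished trace. The real
// protobuf Trace type has many more fields; these two are assumptions.
interface TraceLike {
  endTimeMs: number; // wall-clock end time of the operation, in milliseconds
  hasError: boolean; // whether any error was recorded on the operation
}

// Builds a decision function: true means "send as a full trace",
// false means "fold this operation into stats".
function makeSendOperationAsTrace(): (
  trace: TraceLike,
  statsReportKey: string,
) => boolean {
  // Remembers which (operation, minute) pairs already produced a trace.
  // Unbounded here for brevity; a long-running server would want a bounded
  // cache (for example an LRU) instead.
  const seen = new Set<string>();

  return (trace, statsReportKey) => {
    // Always keep full detail for operations that failed.
    if (trace.hasError) return true;

    // Otherwise send at most one trace per operation per minute.
    const minute = Math.floor(trace.endTimeMs / 60000);
    const cacheKey = `${statsReportKey}\n${minute}`;
    if (seen.has(cacheKey)) return false;
    seen.add(cacheKey);
    return true;
  };
}
```

With this sketch, the first successful run of an operation in a given minute is reported as a trace and later runs in the same minute are summarized as stats, which keeps some examples in the Traces UI while cutting most of the bandwidth.
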
apollographql · Apr 28, 2021 · 8ce26dd
1 parent 78304ec
Showing 14 changed files with 7,355 additions and 138 deletions.
diff --git a/package-lock.json b/package-lock.json
diff --git a/packages/apollo-reporting-protobuf/package.json b/packages/apollo-reporting-protobuf/package.json
@@ -7,7 +7,7 @@
   "scripts": {
     "clean": "git clean -fdX -- dist",
     "prepare": "npm run clean && mkdir dist && npm run pbjs && npm run pbts && cp src/* dist",
-    "pbjs": "apollo-pbjs --target static-module --out dist/protobuf.js --wrap commonjs --force-number src/reports.proto",
+    "pbjs": "apollo-pbjs --target static-module --out dist/protobuf.js --wrap commonjs --force-number --no-from-object src/reports.proto",
     "pbts": "apollo-pbts -o dist/protobuf.d.ts dist/protobuf.js",
     "update-proto": "curl -sSfo src/reports.proto https://usage-reporting.api.apollographql.com/proto/reports.proto"
   },
@@ -29,6 +29,6 @@
   },
   "homepage": "https://github.com/apollographql/apollo-server#readme",
   "dependencies": {
-    "@apollo/protobufjs": "^1.0.3"
+    "@apollo/protobufjs": "1.2.0"
   }
 }
diff --git a/packages/apollo-reporting-protobuf/src/index.js b/packages/apollo-reporting-protobuf/src/index.js
@@ -3,29 +3,9 @@ const protobufJS = require('@apollo/protobufjs/minimal');
 
 // Remove Long support.  Our uint64s tend to be small (less
 // than 104 days).
+// XXX Just remove this in our fork?
 // https://github.com/protobufjs/protobuf.js/issues/1253
 protobufJS.util.Long = undefined;
 protobufJS.configure();
 
-// Override the generated protobuf Traces.encode function so that it will look
-// for Traces that are already encoded to Buffer as well as unencoded
-// Traces. This amortizes the protobuf encoding time over each generated Trace
-// instead of bunching it all up at once at sendReport time. In load tests, this
-// change improved p99 end-to-end HTTP response times by a factor of 11 without
-// a casually noticeable effect on p50 times. This also makes it easier for us
-// to implement maxUncompressedReportSize as we know the encoded size of traces
-// as we go.
-const originalTracesAndStatsEncode = protobuf.TracesAndStats.encode;
-protobuf.TracesAndStats.encode = function(message, originalWriter) {
-  const writer = originalTracesAndStatsEncode(message, originalWriter);
-  const encodedTraces = message.encodedTraces;
-  if (encodedTraces != null && encodedTraces.length) {
-    for (let i = 0; i < encodedTraces.length; ++i) {
-      writer.uint32(/* id 1, wireType 2 =*/ 10);
-      writer.bytes(encodedTraces[i]);
-    }
-  }
-  return writer;
-};
-
 module.exports = protobuf;
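
The comment removed above describes why traces are encoded one at a time: the encoding cost is spread across requests, and the report size is known as it grows. A minimal sketch of that bookkeeping follows. The `ReportBuffer` class and its method names are hypothetical and are not the plugin's code; the per-entry size assumes standard protobuf framing for a length-delimited field with a field number below 16 (one tag byte plus a varint length prefix).

```typescript
// Number of bytes a non-negative integer occupies as a protobuf varint.
function varintSize(n: number): number {
  let size = 1;
  while (n >= 128) {
    n = Math.floor(n / 128);
    size++;
  }
  return size;
}

// Hypothetical buffer of already-encoded traces for one pending report.
class ReportBuffer {
  private encodedTraces: Uint8Array[] = [];
  private sizeEstimate = 0;
  private readonly maxUncompressedReportSize: number;

  constructor(maxUncompressedReportSize: number) {
    this.maxUncompressedReportSize = maxUncompressedReportSize;
  }

  // Called once per finished operation with its already-encoded trace.
  // Returns true when the report has reached the size limit and should be sent.
  addTrace(encoded: Uint8Array): boolean {
    this.encodedTraces.push(encoded);
    // One tag byte + varint length prefix + payload bytes.
    this.sizeEstimate += 1 + varintSize(encoded.length) + encoded.length;
    return this.sizeEstimate >= this.maxUncompressedReportSize;
  }

  // Hands back the buffered traces and resets the buffer for the next report.
  drain(): Uint8Array[] {
    const traces = this.encodedTraces;
    this.encodedTraces = [];
    this.sizeEstimate = 0;
    return traces;
  }
}
```

Because each trace arrives as bytes, building the final report only needs to copy buffers rather than walk every trace tree at send time.
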
diff --git a/packages/apollo-reporting-protobuf/src/reports.proto b/packages/apollo-reporting-protobuf/src/reports.proto
@@ -375,6 +375,10 @@ message ContextualizedStats {
 
 // A sequence of traces and stats. An individual trace should either be counted as a stat or trace
 message TracesAndStats {
-  repeated Trace trace = 1;
+  repeated Trace trace = 1 [(js_preEncoded)=true];
   repeated ContextualizedStats stats_with_context = 2 [(js_use_toArray)=true];
+  // This field is used to validate that the algorithm used to construct `stats_with_context`
+  // matches similar algorithms in Apollo's servers. It is otherwise ignored and should not
+  // be included in reports.
+  repeated Trace internal_traces_contributing_to_stats = 3 [(js_preEncoded)=true];
 }
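
The `(js_use_toArray)=true` annotation on `stats_with_context` lets the field be backed by an object that is not an array until encoding time. As a rough illustration, a map-like container exposing `toArray` might look like the sketch below. `StatsEntry` and `StatsByContext` are hypothetical names with made-up fields, not the types generated from `reports.proto`.

```typescript
// Hypothetical, simplified stats record keyed by a context string.
interface StatsEntry {
  context: string;
  requestCount: number;
}

// Map-like accumulator: cheap keyed updates while operations are recorded,
// converted to an array only when the report is encoded.
class StatsByContext {
  private readonly map = new Map<string, StatsEntry>();

  // Records one operation observed under the given context.
  record(context: string): void {
    let entry = this.map.get(context);
    if (!entry) {
      entry = { context, requestCount: 0 };
      this.map.set(context, entry);
    }
    entry.requestCount += 1;
  }

  // Called at encoding time to produce the repeated field's contents.
  toArray(): StatsEntry[] {
    return Array.from(this.map.values());
  }
}
```

Keeping the data in a `Map` avoids a linear scan of an array on every recorded operation; the array is materialized once per report.
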
diff --git a/packages/apollo-server-core/src/plugin/traceTreeBuilder.ts b/packages/apollo-server-core/src/plugin/traceTreeBuilder.ts
@@ -261,7 +261,7 @@ function errorToProtobufError(error: GraphQLError): Trace.Error {
 }
 
 // Converts a JS Date into a Timestamp.
-function dateToProtoTimestamp(date: Date): google.protobuf.Timestamp {
+export function dateToProtoTimestamp(date: Date): google.protobuf.Timestamp {
   const totalMillis = +date;
   const millis = totalMillis % 1000;
   return new google.protobuf.Timestamp({