Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NodeJS 20: dd-trace reports HTTP requests with duration above 5 seconds as errors #3515

Closed
map34 opened this issue Aug 7, 2023 · 4 comments · Fixed by #3841
Closed

NodeJS 20: dd-trace reports HTTP requests with duration above 5 seconds as errors #3515

map34 opened this issue Aug 7, 2023 · 4 comments · Fixed by #3841
Labels
bug Something isn't working

Comments

@map34
Copy link

map34 commented Aug 7, 2023

When we upgraded to Node.js 20, dd-trace doesn't report HTTP requests with durations above 5 seconds correctly. An example trace in Datadog:

Screenshot 2023-08-07 at 11 42 37 AM

I added an additional log to see if the request times out. It doesn't seem to be the case. The request took a long time ( > 5 seconds), but succeeded.

[2] [1691433342726] INFO (73069 on MyLaptop):
[2]     dd: {
[2]       "service": "orgMgmtWorker",
[2]       "version": "1.0.0",
[2]       "env": "localhost"
[2]     }
[2]     url: "https://<redacted>/graphql"
[2]     method: "POST"
[2]     responseStatus: 200
[2]     durationMs: 11572.146042

I verified that this doesn't happen when the requests take less than 5 seconds.

Expected behaviour
dd-trace should accurately report requests that take above 5 seconds with their duration and status code.

Actual behaviour
dd-trace report requests that take above 5 seconds as errors, and report their durations as ~5 seconds.

Steps to reproduce

  • Use the following setup:
import tracer from 'dd-trace'

tracer.init({
    profiling: true,
    logInjection: true,
    runtimeMetrics: true,
    service: serviceName,
    sampleRate: 0.1
})
  • Make an external HTTP call from the traced service by using cross-fetch (or similar HTTP client libraries). Make sure that the requests take more than 5 seconds.

Environment

  • Operation system: Debian 12 (Bookworm)
  • Node.js version: 20.5.0
  • Tracer version: 4.9.0
  • Agent version: 7.46.0
  • Relevant library versions: [email protected]
@map34 map34 added the bug Something isn't working label Aug 7, 2023
@map34
Copy link
Author

map34 commented Aug 7, 2023

Glancing through the code, does dispatching a finish and error events here make sense on a timeout event?

According to https://nodejs.org/api/http.html#event-timeout:

Emitted when the underlying socket times out from inactivity. This only notifies that the socket has been idle. The request must be destroyed manually.

This doesn't mean that the HTTP request itself is timing out. For example, if a request takes a long time (> 5 s), the timeout event will be emitted depending on the socket.timeout value provided. For example, if the socket.timeout value is 2000 ms, and the request takes 10000 ms, then 5 timeout events will be emitted.

Should we be listening to the close or error event instead and checking the argument for that event?

@tlhunter
Copy link
Member

Here's a quick app that makes a request to a service that takes 6 seconds to respone:

require('dd-trace/init');

fetch('https://postman-echo.com/delay/6').then((d) => {
    console.error('success:', d.status);
}).catch((err) => {
    console.error('error:', err);
});

This app can be executed like so:

DD_TRACE_DEBUG=1 DD_TRACE_BEAUTIFUL_LOGS=1 node app.js

In the output I see the following:

Inject into carrier: {"x-datadog-trace-id":"2762931287846283238","x-datadog-parent-id":"2762931287846283238","x-datadog-sampling-priority":"1"}.
Encoding payload: [
  {
    "trace_id": "2657e75e1ee3e7e6",
    "span_id": "2657e75e1ee3e7e6",
    "parent_id": "0000000000000000",
    "name": "http.request",
    "resource": "GET",
    "error": 0,
    "meta": {
      "_dd.p.dm": "-0",
      "service": "timerz",
      "version": "1.0.0",
      "runtime-id": "066ed323-d55a-41e0-a8ce-9b30b81d035d",
      "component": "fetch",
      "span.kind": "client",
      "http.method": "GET",
      "http.url": "https://postman-echo.com/delay/6",
      "out.host": "postman-echo.com",
      "http.status_code": "200",
      "language": "javascript"
    },
    "metrics": {
      "_dd.agent_psr": 1,
      "_dd.top_level": 1,
      "_dd.measured": 1,
      "process_id": 68662,
      "_sampling_priority_v1": 1
    },
    "start": 1692201252172012500,
    "duration": 6691944336,
    "service": "timerz",
    "type": "http"
  }
]
success: 200
Request to the agent: {"path":"/v0.4/traces","method":"PUT","headers":{"Content-Type":"application/msgpack","Datadog-Meta-Tracer-Version":"4.12.0","X-Datadog-Trace-Count":"1","Datadog-Meta-Lang":"nodejs","Datadog-Meta-Lang-Version":"v18.16.0","Datadog-Meta-Lang-Interpreter":"v8"},"url":"http://127.0.0.1:8126/"}
Response from the agent: {"rate_by_service":{"service:,env:":1,"service:timerz,env:":1,"service:timerz,env:none":1}}

These results suggest everything worked as expected.

I also tried modifying the app to include a timeout:

require('dd-trace/init');

fetchWithTimeout('https://postman-echo.com/delay/6', {timeout: 2000}).then((d) => {
    console.error('success:', d.status);
}).catch((err) => {
    console.error('error:', err);
});


async function fetchWithTimeout(resource, options = {}) {
  const { timeout = 8000 } = options;

  const controller = new AbortController();
  const id = setTimeout(() => controller.abort(), timeout);

  const response = await fetch(resource, {
    ...options,
    signal: controller.signal
  });
  clearTimeout(id);

  return response;
}

The output doesn't seem to show any repeated events:

Inject into carrier: {"x-datadog-trace-id":"6179527386960947679","x-datadog-parent-id":"6179527386960947679","x-datadog-sampling-priority":"1"}.          [14/14]
Encoding payload: [
  {
    "trace_id": "55c2176afb9ae9df",
    "span_id": "55c2176afb9ae9df",
    "parent_id": "0000000000000000",
    "name": "http.request",
    "resource": "GET",
    "error": 0,
    "meta": {
      "_dd.p.dm": "-0",
      "service": "timerz",
      "version": "1.0.0",
      "runtime-id": "6721ac85-c5bc-478d-ada4-dd34d3a0ac34",
      "component": "fetch",
      "span.kind": "client",
      "http.method": "GET",
      "http.url": "https://postman-echo.com/delay/6",
      "out.host": "postman-echo.com",
      "language": "javascript"
    },
    "metrics": {
      "_dd.agent_psr": 1,
      "_dd.top_level": 1,
      "_dd.measured": 1,
      "process_id": 69701,
      "_sampling_priority_v1": 1
    },
    "start": 1692201453754014500,
    "duration": 2009973389,
    "service": "timerz",
    "type": "http"
  }
]
error: DOMException [AbortError]: This operation was aborted
    at Object.fetch (node:internal/deps/undici/undici:11457:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async fetchWithTimeout (/private/tmp/timerz/app.js:16:20)
Request to the agent: {"path":"/v0.4/traces","method":"PUT","headers":{"Content-Type":"application/msgpack","Datadog-Meta-Tracer-Version":"4.12.0","X-Datadog-Trace-Count":"1","Datadog-Meta-Lang":"nodejs","Datadog-Meta-Lang-Version":"v18.16.0","Datadog-Meta-Lang-Interpreter":"v8"},"url":"http://127.0.0.1:8126/"}
Response from the agent: {"rate_by_service":{"service:,env:":1,"service:timerz,env:":1,"service:timerz,env:none":1}}

When I modify the apps to use your more complex .init() calls I do see more output but the behaviors are the same.

Can you modify my sample app to reproduce the 5 second error that you're encountering? I didn't see any 5 second timeouts in dd-trace itself.

@map34
Copy link
Author

map34 commented Aug 23, 2023

@tlhunter Thanks for taking a look. We have the same setup except that you hard-coded the timeout to be 2000 ms there. Try to set the timeout to be greater than 5000 ms (leaving the timeout as 8000 ms is fine). Also, try to use this for the fetch package:

import fetch from 'cross-fetch'

from: https://www.npmjs.com/package/cross-fetch
The output I have gotten is the following. As you can see, the first entry has the error property equaling 1.

Encoding payload: [
  {
    "trace_id": "27a34474476c53e6",
    "span_id": "27a34474476c53e6",
    "parent_id": "0000000000000000",
    "name": "http.request",
    "resource": "GET",
    "error": 1,
    "meta": {
      "service": "ddTraceTest",
      "version": "1.5.0",
      "runtime-id": "60a91a11-23f5-4ab1-aa13-0bfa89f550b4",
      "component": "http",
      "span.kind": "client",
      "http.method": "GET",
      "http.url": "https://postman-echo.com/delay/6",
      "out.host": "postman-echo.com",
      "language": "javascript"
    },
    "metrics": {
      "_dd.rule_psr": 0.1,
      "_dd.top_level": 1,
      "_dd.measured": 1,
      "process_id": 37507,
      "_sampling_priority_v1": -1
    },
    "start": 1692820660739010600,
    "duration": 5304725342,
    "service": "ddTraceTest",
    "type": "http"
  },
  {
    "trace_id": "27a34474476c53e6",
    "span_id": "63376bdd65def9b3",
    "parent_id": "27a34474476c53e6",
    "name": "tcp.connect",
    "resource": "postman-echo.com:443",
    "error": 0,
    "meta": {
      "service": "ddTraceTest",
      "version": "1.5.0",
      "runtime-id": "60a91a11-23f5-4ab1-aa13-0bfa89f550b4",
      "component": "net",
      "span.kind": "client",
      "tcp.remote.host": "postman-echo.com",
      "tcp.family": "IPv4",
      "tcp.local.address": "172.16.227.131",
      "out.host": "postman-echo.com",
      "language": "javascript"
    },
    "metrics": {
      "_dd.measured": 1,
      "tcp.remote.port": 443,
      "tcp.local.port": 59803,
      "network.destination.port": 443,
      "process_id": 37507,
      "_sampling_priority_v1": -1
    },
    "start": 1692820660755469300,
    "duration": 109916260,
    "service": "ddTraceTest"
  },
  {
    "trace_id": "27a34474476c53e6",
    "span_id": "634b38cce2212f6d",
    "parent_id": "63376bdd65def9b3",
    "name": "dns.lookup",
    "resource": "postman-echo.com",
    "error": 0,
    "meta": {
      "service": "ddTraceTest",
      "version": "1.5.0",
      "runtime-id": "60a91a11-23f5-4ab1-aa13-0bfa89f550b4",
      "component": "dns",
      "span.kind": "client",
      "dns.hostname": "postman-echo.com",
      "dns.address": "34.193.58.5",
      "dns.addresses": "34.193.58.5,34.197.229.242,52.1.47.226,54.236.127.160,54.81.161.182,54.81.173.137",
      "language": "javascript"
    },
    "metrics": {
      "_dd.measured": 1,
      "process_id": 37507,
      "_sampling_priority_v1": -1
    },
    "start": 1692820660755731700,
    "duration": 29480469,
    "service": "ddTraceTest"
  }
]
success: 200
Stopped wall profiler
Stopped space profiler
Request to the agent: {"path":"/v0.4/traces","method":"PUT","headers":{"Content-Type":"application/msgpack","Datadog-Meta-Tracer-Version":"4.11.1","X-Datadog-Trace-Count":"1","Datadog-Meta-Lang":"nodejs","Datadog-Meta-Lang-Version":"v20.5.0","Datadog-Meta-Lang-Interpreter":"v8"},"url":"http://127.0.0.1:8126/"}

@map34
Copy link
Author

map34 commented Jan 5, 2024

@sabrenner @tlhunter Thank you for fixing this issue. You both rock!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants