-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segfault in node::StreamBase::GetFD #2847
Comments
/cc @indutny |
@heilage-nsk may I ask you to publish a backtrace(s) of the crash? This is pretty odd to see it |
Oh, one more thing. Is there any way to reproduce it? |
@indutny
Almost no useful info from backtrace, and also I couldn't generate core dump :( What else can I do to gather more info? I'm not sure if one can reproduce this segfault easily, but I'll try to create a test case. |
Oh. And my naive fix didn't help :) Seems like it's not that easy. |
@heilage-nsk thank you! How did you collect this information? It looks like a debug mode build, but... the stack trace has lots of gaps. It can't just get straight from JS to the C++ code, there should be several v8 API C++ helpers on stack. A couple more questions:
Thanks! |
@indutny
I tried to do |
@heilage-nsk thank you very much for this. I appreciate your help. |
@indutny
I tried to set a conditional breakpoint before this line, but in case when So I reapplied my naive fix and re-ran load testing under gdb, and it still works fine after several hours. For now I'm going to inspect nodejs sources a bit deeper to know what could go wrong here, any ideas are appreciated :) |
@heilage-nsk I think I have an idea on why it could be happening, working on the test case. In a few words, Are you using |
@indutny |
@heilage-nsk no |
@indutny |
The error is pretty easy to reproduce, but how it could be happening in natural environment is less evident to me at the moment: var net = require('net');
var socket = net.connect(1443);
var handle = socket._handle;
socket.destroy();
socket.on('close', function() {
setImmediate(function() {
console.log('hm..');
console.log(handle.fd);
});
}); Investigating |
Perhaps it is related to |
@heilage-nsk do you know if it happens in master or in worker process? |
Actually, I have one patch that may reveal some useful information about this crash: diff --git a/src/stream_base-inl.h b/src/stream_base-inl.h
index 81114a2..114ba39 100644
--- a/src/stream_base-inl.h
+++ b/src/stream_base-inl.h
@@ -71,6 +71,17 @@ template <class Base>
void StreamBase::GetFD(Local<String> key,
const PropertyCallbackInfo<Value>& args) {
StreamBase* wrap = Unwrap<Base>(args.Holder());
+ if (wrap == nullptr) {
+ Environment* env = Environment::GetCurrent(args.GetIsolate());
+ Local<Object> global = env->context()->Global();
+ Local<Object> console = global->Get(
+ String::NewFromUtf8(env->isolate(), "console")).As<Object>();
+ Local<v8::Function> trace = console->Get(
+ String::NewFromUtf8(env->isolate(), "trace")).As<v8::Function>();
+
+ v8::MaybeLocal<Value> res =
+ trace->Call(env->context(), console, 0, nullptr);
+ }
if (!wrap->IsAlive())
return args.GetReturnValue().Set(UV_EINVAL); @heilage-nsk may I ask you to give it a try? It should just print a stacktrace before crashing on |
@indutny Trace has shown this:
So it also looks like a bug in PM2 process management code. I tried to run the app with just cluster module and it doesn't fail (but has some other bug not related to current issue :) ). I've made some more debugging and basic tracing and came to v8's ReadField function. In
The output was:
For me it looks like some (internal?) fields of an object were already cleaned up when process attempted to exit, but the object itself was still there. Maybe this will help somehow. It's very strange behavior, but I think node should not crash in this case, maybe it should throw an exception or just do nothing? |
@heilage-nsk I have a possible fix: diff --git a/lib/internal/child_process.js b/lib/internal/child_process.js
index 328a67d..890f80a 100644
--- a/lib/internal/child_process.js
+++ b/lib/internal/child_process.js
@@ -446,9 +446,8 @@ function setupChannel(target, channel) {
} else {
this.buffering = false;
- target.disconnect();
channel.onread = nop;
- channel.close();
+ target._disconnect();
maybeClose(target);
}
};
May I ask you to give it a try? Thanks! |
Updated the patch. |
Wait, there are some test failures with this patch. Please do not apply just yet. |
Ok, this patch is better: diff --git a/lib/internal/child_process.js b/lib/internal/child_process.js
index 328a67d..bea139b 100644
--- a/lib/internal/child_process.js
+++ b/lib/internal/child_process.js
@@ -449,6 +449,7 @@ function setupChannel(target, channel) {
target.disconnect();
channel.onread = nop;
channel.close();
+ target._channel = null;
maybeClose(target);
}
}; Please give it a try ;) |
@indutny |
This is a good result, I guess? ;) |
Absoultely :) |
Thanks! |
@heilage-nsk heya! Any news? ;) |
@indutny |
The test is still failing sometimes because when trying to establish the second connection, the server is already closed. Bring back the code that handled this case and was removed in the last refactoring of the test. Also ignore the errors that might happen when sending the second handle to the worker because it may already have exited. PR-URL: #5422 Reviewed-By: Rich Trott <[email protected]>
The test would sometimes time out on some platforms. Take the initial version of the test and instead of sending 100 handles, just send 2. It still fails when run with the code before the change was introduced and passes afterwards. Remove test from parallel.status. PR-URL: #5121 Reviewed-By: Brian White <[email protected]> Reviewed-By: Rich Trott <[email protected]> Reviewed-By: Michael Dawson <[email protected]> Fixes: #3635
Windows is still sometimes failing with ECONNRESET. Bring back the handling of this error as was initially introduced in PR #4442. PR-URL: #5179 Reviewed-By: Rich Trott <[email protected]> Fixes: #3635
The test is still failing sometimes because when trying to establish the second connection, the server is already closed. Bring back the code that handled this case and was removed in the last refactoring of the test. Also ignore the errors that might happen when sending the second handle to the worker because it may already have exited. PR-URL: #5422 Reviewed-By: Rich Trott <[email protected]>
The test is still failing sometimes because when trying to establish the second connection, the server is already closed. Bring back the code that handled this case and was removed in the last refactoring of the test. Also ignore the errors that might happen when sending the second handle to the worker because it may already have exited. PR-URL: #5422 Reviewed-By: Rich Trott <[email protected]>
Windows would die with ECONNRESET most times when running this particular test. This commit makes handling these errors more tolerable. PR-URL: nodejs#4442 Reviewed-By: Fedor Indutny <[email protected]> Reviewed-By: Rich Trott <[email protected]>
It's not guaranteed that the first socket that tries to connect is the first that succeeds so the rest of assumptions made in the test are not correct. Fix it by making sure the second socket does not try to connect until the first has succeeded. The IPC channel can already be closed when sending the second socket. It should be allowed. Also, don't start sending messages until the worker is online. Fixes: nodejs#8950 PR-URL: nodejs#8954 Reviewed-By: Michael Dawson <[email protected]> Reviewed-By: Brian White <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Colin Ihrig <[email protected]>
It's not guaranteed that the first socket that tries to connect is the first that succeeds so the rest of assumptions made in the test are not correct. Fix it by making sure the second socket does not try to connect until the first has succeeded. The IPC channel can already be closed when sending the second socket. It should be allowed. Also, don't start sending messages until the worker is online. Fixes: #8950 PR-URL: #8954 Reviewed-By: Michael Dawson <[email protected]> Reviewed-By: Brian White <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Colin Ihrig <[email protected]>
It's not guaranteed that the first socket that tries to connect is the first that succeeds so the rest of assumptions made in the test are not correct. Fix it by making sure the second socket does not try to connect until the first has succeeded. The IPC channel can already be closed when sending the second socket. It should be allowed. Also, don't start sending messages until the worker is online. Fixes: #8950 PR-URL: #8954 Reviewed-By: Michael Dawson <[email protected]> Reviewed-By: Brian White <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Colin Ihrig <[email protected]>
Throw an error if either `err.message` or `err.code` is wrong. Previously, it was only throwing an error if one or the other was happening.
It's not guaranteed that the first socket that tries to connect is the first that succeeds so the rest of assumptions made in the test are not correct. Fix it by making sure the second socket does not try to connect until the first has succeeded. The IPC channel can already be closed when sending the second socket. It should be allowed. Also, don't start sending messages until the worker is online. Fixes: #8950 PR-URL: #8954 Reviewed-By: Michael Dawson <[email protected]> Reviewed-By: Brian White <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Colin Ihrig <[email protected]>
Rename test-dgram-regress-4496 to test-dgram-typeerror-buffer to test-dgram-send-invalid-msg-type Rename test-http-regr-nodejsgh-2821 to test-http-request-large-payload Rename test-child-process-fork-regr-nodejsgh-2847 to test-child-process-fork-closed-channel-segfault Rename test-http-pipeline-regr-2639 to test-http-pipeline-serverresponse-assertionerror Rename test-http-pipeline-regr-3332 to test-http-pipeline-requests-connection-leak Rename test-http-pipeline-regr-3508 to test-http-pipeline-socket-parser-typeerror Refs: nodejs#19105 Refs: https://github.com/nodejs/node/blob/master/doc/guides/writing-tests.md#test-structure
Hello,
While performing the load testing on my application I've noticed a segfault. My environment and conditions are:
So, under load (about 50-100 rps) when some process runs out of memory, it's killed and restarted by PM2, but sometimes after that the parent PM2 process just fails with "Segmentation fault".
I've researched that for a while (compiled debug version, ran under gdb, etc) and found the source of segfault in
src/stream_base-inl.h
:I assumed that somehow the
wrap
pointer becomes NULL. I didn't research deeper, so I assumed thatUnwrap
function failed to do what it's needed for and just returned NULL, but the code ofGetFD
didn't handle this correctly. So I tried this small fix:And it seems to work so far - no single failure for an hour of load testing, and I'm going to leave it working for some days to be sure.
Does this solution seem to be correct? Should I make a PR for it?
The text was updated successfully, but these errors were encountered: