Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib: fix cluster handle leak #3510

Closed
wants to merge 2 commits into from
Closed

Conversation

Trott
Copy link
Member

@Trott Trott commented Oct 24, 2015

It is possible to cause a resource leak in SharedHandle if a worker is
added after all other workers have been removed
. This commit fixes the
leak.

Fixes: #2510

@Trott
Copy link
Member Author

Trott commented Oct 24, 2015

@Trott Trott added the cluster Issues and PRs related to the cluster subsystem. label Oct 24, 2015
@Trott
Copy link
Member Author

Trott commented Oct 24, 2015

Meh. Relevant test failure on Windows. Back to the drawing board for a bit...

@Trott Trott force-pushed the shared-handle branch 3 times, most recently from 3229ac8 to 0291472 Compare October 25, 2015 16:44
@Trott
Copy link
Member Author

Trott commented Oct 25, 2015

OK, all squashed into a single commit and working on all platforms now.

Results of the added test in Node 4.2.1:

$ node test/parallel/test-cluster-shared-worker-added-after-disconnect.js 

assert.js:89
  throw new assert.AssertionError({
  ^
AssertionError: Resource leak detected.
    at removeWorker (cluster.js:328:9)
    at ChildProcess.<anonymous> (cluster.js:348:34)
    at ChildProcess.g (events.js:260:16)
    at emitTwo (events.js:87:13)
    at ChildProcess.emit (events.js:172:7)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:200:12)
$

With this patch, no assertion is thrown.

@Trott
Copy link
Member Author

Trott commented Oct 25, 2015


process.on('message', function listener() {
server.close(function() {
process.exit();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the process.exit necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I comment out process.exit() and leave the rest of the test as is, the test never terminates.

@bnoordhuis
Copy link
Member

Change itself LGTM but the test can probably be a little tighter.

@Trott
Copy link
Member Author

Trott commented Oct 25, 2015

I made some changes based on @bnoordhuis comments and also renamed the test and added an explanatory comment. New CI: https://ci.nodejs.org/job/node-test-pull-request/613/

@bnoordhuis
Copy link
Member

You should be able to get rid of the c.unref() and process.exit() calls with the below patch:

diff --git a/test/parallel/test-cluster-shared-leak.js b/test/parallel/test-cluster-shared-leak.js
index 680709f..1cabe6f 100644
--- a/test/parallel/test-cluster-shared-leak.js
+++ b/test/parallel/test-cluster-shared-leak.js
@@ -9,23 +9,25 @@ const cluster = require('cluster');
 cluster.schedulingPolicy = cluster.SCHED_NONE;

 if (cluster.isMaster) {
-  var worker1, worker2;
+  var conn, worker1, worker2;

   worker1 = cluster.fork();
   worker1.on('message', common.mustCall(function() {
     worker2 = cluster.fork();
-    const c = net.connect(common.PORT, common.mustCall(function() {
-      c.unref();
+    conn = net.connect(common.PORT, common.mustCall(function() {
       worker1.send('die');
       worker2.send('die');
     }));
-    c.on('error', function(e) {
-      // ECONNRESET is OK
-      if (e.code !== 'ECONNRESET')
-        throw e;
-    });
   }));

+  cluster.on('exit', function(worker, exitCode, signalCode) {
+    assert(worker === worker1 || worker === worker2);
+    assert.strictEqual(exitCode, 0);
+    assert.strictEqual(signalCode, null);
+    if (Object.keys(cluster.workers).length === 0)
+      conn.destroy();
+  });
+
   return;
 }

@@ -40,6 +42,6 @@ server.listen(common.PORT, function() {
 process.on('message', function(msg) {
   if (msg !== 'die') return;
   server.close(function() {
-    process.exit();
+    setImmediate(() => process.disconnect());
   });
 });

Your test passes locally for me however, with and without the change to lib/cluster.js and with or without my changes to the test.

Trott added a commit to Trott/io.js that referenced this pull request Oct 26, 2015
It is possible to cause a resource leak in SharedHandle if a worker is
added after all other workers have been removed. This commit fixes the
leak.

Fixes: nodejs#2510
PR-URL: nodejs#3510
@Trott
Copy link
Member Author

Trott commented Oct 26, 2015

@bnoordhuis You must be running Linux. I commented out the new line in cluster.js and ran CI to see which operating systems fail with the test and which don't.

Every single operating system triggered the error on all builds except Linux. It did trip up CentOS 6 and Fedora 22. It could be that Linux is affected, but not reliably affected. Or it could be that only those Linux distro/versions are affected.

It makes sense to me that the problem could be OS-dependent as the test needs to turn off round-robin scheduling and rely on system scheduling. (Otherwise, it doesn't fail for anything.)

@bnoordhuis
Copy link
Member

Interesting. I run Linux - FC22 actually - but it sounds plausible that it's a platform-specific issue. LGTM if you steal my changes.

It is possible to cause a resource leak in SharedHandle. This commit
fixes the leak.

Fixes: nodejs#2510
PR-URL: nodejs#3510
Reviewed-By: Ben Noordhuis <[email protected]>
@Trott
Copy link
Member Author

Trott commented Oct 26, 2015

With your changes, I still get a reliable (expected) test failure with Node 4.2.1 on OS X and reliable success with the change to cluster.js, so hooray! I also updated the comment to explain that the problem manifests on operating systems other than Linux.

And one (hopefully final) CI: https://ci.nodejs.org/job/node-test-pull-request/617/

@Trott
Copy link
Member Author

Trott commented Oct 26, 2015

@bnoordhuis With your changes to the test, everything passes except Windows. But Windows is not reporting the Resource Leak issue anymore. It's instead reporting this:

not ok 27 test-cluster-shared-leak.js
#events.js:141
#      throw er; // Unhandled 'error' event
#      ^
#
#Error: read ECONNRESET
#    at exports._errnoException (util.js:867:11)
#    at TCP.onread (net.js:544:26)

I'm inclined to put back the error handler I had in there that swallowed ECONNRESET. Would you say that ECONNRESET firing may depend on the OS scheduler? Or is this a bug that needs to be filed and further explored?

@bnoordhuis
Copy link
Member

Would you say that ECONNRESET firing may depend on the OS scheduler?

I think so. Probably best to put the error listener back.

@Trott
Copy link
Member Author

Trott commented Oct 26, 2015

Error listener is back. CI: https://ci.nodejs.org/job/node-test-pull-request/618/

@Trott
Copy link
Member Author

Trott commented Oct 26, 2015

A few unrelated test failures on CI, looks good otherwise. Thanks!

@Trott
Copy link
Member Author

Trott commented Oct 26, 2015

I'll land this in 12-24 hours unless someone objects.

Trott added a commit that referenced this pull request Oct 27, 2015
It is possible to cause a resource leak in SharedHandle. This commit
fixes the leak.

Fixes: #2510
PR-URL: #3510
Reviewed-By: Ben Noordhuis <[email protected]>
@Trott
Copy link
Member Author

Trott commented Oct 27, 2015

Landed in da21dba

@Trott Trott closed this Oct 27, 2015
Trott added a commit that referenced this pull request Oct 28, 2015
It is possible to cause a resource leak in SharedHandle. This commit
fixes the leak.

Fixes: #2510
PR-URL: #3510
Reviewed-By: Ben Noordhuis <[email protected]>
@jasnell
Copy link
Member

jasnell commented Oct 28, 2015

Landed in v4.x-staging in e4417a1

rvagg pushed a commit to rvagg/io.js that referenced this pull request Oct 29, 2015
It is possible to cause a resource leak in SharedHandle. This commit
fixes the leak.

Fixes: nodejs#2510
PR-URL: nodejs#3510
Reviewed-By: Ben Noordhuis <[email protected]>
@rvagg rvagg mentioned this pull request Oct 29, 2015
Trott added a commit that referenced this pull request Oct 29, 2015
It is possible to cause a resource leak in SharedHandle. This commit
fixes the leak.

Fixes: #2510
PR-URL: #3510
Reviewed-By: Ben Noordhuis <[email protected]>
@mscdex
Copy link
Contributor

mscdex commented Feb 9, 2016

It looks like at least v0.12 has this same issue and needs this fix too. I am not sure about v0.10 yet.

mscdex pushed a commit to mscdex/io.js that referenced this pull request Feb 9, 2016
It is possible to cause a resource leak in SharedHandle. This commit
fixes the leak.

Fixes: nodejs#2510
PR-URL: nodejs#3510
Reviewed-By: Ben Noordhuis <[email protected]>
@Trott Trott deleted the shared-handle branch January 13, 2022 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cluster Issues and PRs related to the cluster subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test-tls-ticket-cluster.js is flaky
4 participants