Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix test-tls-ticket-cluster.js #52431

Conversation

huseyinacacak-janea
Copy link
Contributor

Problem

I've investigated the related issue in depth and found out that this test sometimes fails because of the handshake process. When the error happens and I look at the Wireshark logs, I see that the client doesn't send a Change Cipher Spec packet, but instead sends a FIN packet after a while. The FIN packet is sent in approximately 10ms. This may be some timeout, but I wasn't able to find proof for that.

I've added 2 screenshots from the Wireshark logs (which I can provide too if requested). The first one shows the expected packet flow when the test passes, and the second one shows the packet flow when the test fails.

Pass:

2510_log2

Failure:

2510_log1

As can be seen, the Change Cipher Spec packet was missing, and the FIN packet had already been sent before the handshake was complete.

I've seen discussions about the issues in the handshake process of Node.js in the following links: openssl/openssl#8069, openssl/openssl#7199, #18770.

As I don't have in-depth knowledge about OpenSSL, I'm not sure if this issue is caused by the way OpenSSL is used by Node. If there is a need to investigate it, a new issue can be opened.

Fix

In this PR, I’ve made the Primary (client) wait for the FIN packet from the Worker (server) to close the connection. This approach ensures that the Primary waits until it receives a FIN packet from the Worker and doesn’t time out during the handshake. I've run the test with the fix roughly 11k times and it hasn't failed once. Without the fix, I would usually get a failure with 1k runs easily.

Fixes: #2510

Ensured connection end is initiated by worker (server)

Fixes: nodejs#2510
@nodejs-github-bot nodejs-github-bot added needs-ci PRs that need a full CI run. test Issues and PRs related to the tests. labels Apr 9, 2024
@lpinca lpinca added the request-ci Add this label to start a Jenkins CI on a PR. label Apr 9, 2024
Comment on lines 49 to 51
c.on('end', () => {
c.end();
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
c.on('end', () => {
c.end();
});

Removing the whole 'secureConnect' listener should have the same effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review. I applied your suggestion and tested it locally. I ran it over 4000 times and got 2 timeouts. The following is the test log:

E:\github\node>python tools\test.py test\parallel\test-tls-ticket-cluster
=== release test-tls-ticket-cluster ===
Path: parallel/test-tls-ticket-cluster
spawn file: E:\github\node\out\Release\node.exe
spawn file: E:\github\node\out\Release\node.exe
spawn file: E:\github\node\out\Release\node.exe
spawn file: E:\github\node\out\Release\node.exe
[primary] got "listening"
[primary] connecting 49193 session? false
[primary] got "listening"
[primary] got "listening"
[primary] got "listening"
[worker] connection reused? false
[primary] got "not-reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[primary] connecting 49193 session? true
[worker] connection reused? true
[primary] got "reused"
[worker] got "die"
[worker] server close
[worker] exit
[worker] got "die"
[worker] server close
[worker] exit
[worker] got "die"
[worker] server close
[worker] exit
[primary] worker died
[worker] got "die"
[worker] server close
[worker] exit
[primary] worker died
[primary] worker died
[primary] worker died
Command: E:\github\node\out\Release\node.exe E:\github\node\test\parallel\test-tls-ticket-cluster.js
--- TIMEOUT ---


[02:32|% 100|+   0|-   1]: Done

Failed tests:
E:\github\node\out\Release\node.exe E:\github\node\test\parallel\test-tls-ticket-cluster.js

Since the initial solution doesn't have that issue, I'd go with it to avoid timeouts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can reproduce the timeouts also with the initial version then as the only difference is that c.end() is called in the next tick. See

process.nextTick(endWritableNT, stream);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine either way, as is or with the 'secureListener' event listener removed. The result is the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left the test running overnight and saw the error, which is related to the handshaking. It appears that removing secureConnect doesn't have the same effect as adding an end callback. Do you have any other suggestion I could try here?
P.S. I applied your other suggestion, the code looks better now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left the test running overnight and saw the error, which is related to the handshaking. It appears that removing secureConnect doesn't have the same effect as adding an end callback. Do you have any other suggestion I could try here?

I have no suggestions. What is currently done in the 'end' callback, is already done automatically even if the callback is not added. The only difference is that c.end() is called in the next tick. This makes it explicit and I'm fine with it.

@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Apr 10, 2024
@nodejs-github-bot
Copy link
Collaborator

@lpinca
Copy link
Member

lpinca commented Apr 17, 2024

# https://github.com/nodejs/node/pull/52563
$ python tools/test.py -J --repeat=1000 parallel/test-tls-ticket-cluster
=== release test-tls-ticket-cluster ===

...

node:assert:126
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:

15 !== 16

    at process.<anonymous> (C:\Users\lpinca\node\test\parallel\test-tls-ticket-cluster.js:92:12)
    at process.emit (node:events:532:35) {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: 15,
  expected: 16,
  operator: 'strictEqual'
}

Node.js v22.0.0-pre
Command: C:\Users\lpinca\node\out\Release\node.exe C:\Users\lpinca\node\test\parallel\test-tls-ticket-cluster.js

...

[09:13|% 100|+ 986|-  14]: Done
# This rebased on top of https://github.com/nodejs/node/pull/52563
$ python tools/test.py -J --repeat=1000 parallel/test-tls-ticket-cluster
[08:43|% 100|+ 1000|-   0]: Done

All tests passed.

@lpinca lpinca added request-ci Add this label to start a Jenkins CI on a PR. commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. labels Apr 17, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Apr 17, 2024
@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@lpinca lpinca added the commit-queue Add this label to land a pull request using GitHub Actions. label Apr 18, 2024
@nodejs-github-bot nodejs-github-bot removed the commit-queue Add this label to land a pull request using GitHub Actions. label Apr 18, 2024
@nodejs-github-bot nodejs-github-bot merged commit cfca9e3 into nodejs:main Apr 18, 2024
61 checks passed
@nodejs-github-bot
Copy link
Collaborator

Landed in cfca9e3

aduh95 pushed a commit that referenced this pull request Apr 29, 2024
Ensured connection end is initiated by worker (server)

Fixes: #2510
PR-URL: #52431
Reviewed-By: Luigi Pinca <[email protected]>
marco-ippolito pushed a commit that referenced this pull request May 2, 2024
Ensured connection end is initiated by worker (server)

Fixes: #2510
PR-URL: #52431
Reviewed-By: Luigi Pinca <[email protected]>
marco-ippolito pushed a commit that referenced this pull request May 3, 2024
Ensured connection end is initiated by worker (server)

Fixes: #2510
PR-URL: #52431
Reviewed-By: Luigi Pinca <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. needs-ci PRs that need a full CI run. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test-tls-ticket-cluster.js is flaky
3 participants