Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: move events sock path to conf #10732

Closed
wants to merge 11 commits into from

Conversation

bzp2010
Copy link
Contributor

@bzp2010 bzp2010 commented Dec 29, 2023

Description

In my last PR #10550, I place the unix socket files in logs directory. It may be inappropriate because if we use a container environment with log-rotate turned on, we may mount the logs folder to an external volume, and I think we may run into communication and permissions issues.

So this PR moves the sock files to the conf folder, which is probably a better choice because that's where we're generating our nginx.conf right now, and the same way we used to create conf_server sock file there previously.

In addition, this PR adds some code for cleaning up the sock file to ensure that when an Nginx process crashes or is killed, the sock file can be cleaned up before the next startup, preventing such errors (this is because the file already exists):

✗ bin/apisix start
/usr/local/openresty//luajit/bin/luajit ./apisix/cli/apisix.lua start
nginx.pid exists but there's no corresponding process with pid  1860232 , the file will be overwritten
nginx: [emerg] bind() to unix:/home/bzp/code/apisix/conf/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/home/bzp/code/apisix/conf/worker_events.sock failed (98: Address already in use)

This often occurs when using containers, and we have to be able to handle it correctly.

This feature has not yet been released, so fixing it now is timely and involves no compatibility issues. In the meantime, no test cases themselves need to be changed.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@bzp2010 bzp2010 self-assigned this Dec 29, 2023
@bzp2010 bzp2010 marked this pull request as draft December 29, 2023 05:50
@bzp2010
Copy link
Contributor Author

bzp2010 commented Dec 30, 2023

The errors in the test is the result of the behavior in test::nginx that cleans up the conf directory, which looks like this:

start test =>
start nginx and listen to UNIX socket (create file in conf dir) =>
case 1 =>
reload (test::nginx cleanup conf dir and regenerate nginx.conf file) =>
case 2 (will try to connect a not exist sock file that cleaned by test::nginx) =>
connect failed and cannot receive events =>
test failed

In fact, this was already present when we were still using the conf server (config_listen.sock will be in conf), it's just that the tests back then didn't strongly rely on this behavior and it couldn't be detected. I do see it on the 3.2.2 branch.

There are several ways to fix it:

  1. Provide a configuration or environment variable to override the default sock file path for testing (which can also be set by the user). For testing, we place the sock file under /tmp.
  2. Stop setting USE_HUP for cases that fail in the test, so Nginx will restart between tests [1], the unix socket will indeed be recreated, and the test will work.
    a. Cases t/node/healthcheck-multiple-worker.t and t/node/healthcheck-ipv6.t involved
    b. These tests are simple and do not rely on the reload behavior. Therefore, no impact is expected.
  3. Create a new folder such as tmp alongside the conf folder location to store these temporary files.

This PR itself mentions issues that we should be wary of, including but not limited to files being exposed by container mounts and not being cleaned up on abnormal exits resulting in failure to start next time.
So we need to choose a way to solve this problem, either one of the above or your better suggestion.

Waiting for discussion. cc @monkeyDluffy6017 @shreemaan-abhishek

Ref:
[1] According to Nginx's behavior, when a reload is performed, the newly started worker will not perform a listen operation again, but will simply reuse the file descriptor already created by the old worker. This means that if test::nginx cleans up that sock file between tests, you can still see the unix socket listening via netstat -lx, but the file doesn't exist and you can't connect. Therefore, it can be said that the cleanup behavior of test::nginx and the Nginx reload feature together broke the test.

@shreemaan-abhishek
Copy link
Contributor

This PR itself mentions issues that we should be wary of,

But your changes don't introduce these problems, do they? What I am trying to say is, this problem can arise anytime even without your PR being merged.

I think, introducing a configuration variable is fair.

@monkeyDluffy6017
Copy link
Contributor

IMHO, i prefer option 3

@bzp2010
Copy link
Contributor Author

bzp2010 commented Jan 8, 2024

In fact, I've implemented the third point mentioned (create a tmp dir alongside conf and logs) and it still doesn't work.

The test::nginx makes some pretty dirty assumptions, it completely disregards any directories (conf, logs, html, etc) that may exist under the servroot other than the ones it confirms exist, and therefore its cleanup doesn't work properly. https://github.com/openresty/test-nginx/blob/40aa6364b0a428779cbf96cd10f49b20166483ee/lib/Test/Nginx/Util.pm#L867-L891

And the test::nginx will only do special things to paths like *_cache and *_temp to remove them. https://github.com/openresty/test-nginx/blob/40aa6364b0a428779cbf96cd10f49b20166483ee/lib/Test/Nginx/Util.pm#L870-L871

Instead it tries again to delete and recreate the servroot folder before the start of each test case, and there are no hooks available there for early cleanup files. So servroot/tmp will keep there and block remove servroot.

I have tested the cleanup added via add_cleanup_handler and it will run after an error occurs, so it won't work.

Therefore the use of special cases in testing is indispensable (in my perception). For example, use an environment variable to overwrite the directory used for testing. Is there a better way to do this?

ping @shreemaan-abhishek @monkeyDluffy6017 🤔

@monkeyDluffy6017
Copy link
Contributor

LGTM

@shreemaan-abhishek
Copy link
Contributor

@bzp2010 did you attempt using other solution? IMHO, at this stage since solution 3 is not available you can choose any solution and give us a fix.

Copy link

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 24, 2024
Copy link

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Apr 22, 2024
@darkSheep404
Copy link
Contributor

Description

In my last PR #10550, I place the unix socket files in logs directory. It may be inappropriate because if we use a container environment with log-rotate turned on, we may mount the logs folder to an external volume, and I think we may run into communication and permissions issues.↳

So this PR moves the sock files to the conf folder, which is probably a better choice because that's where we're generating our nginx.conf right now, and the same way we used to create conf_server sock file there previously.↳

In addition, this PR adds some code for cleaning up the sock file to ensure that when an Nginx process crashes or is killed, the sock file can be cleaned up before the next startup, preventing such errors (this is because the file already exists):↳

✗ bin/apisix start
/usr/local/openresty//luajit/bin/luajit ./apisix/cli/apisix.lua start
nginx.pid exists but there's no corresponding process with pid  1860232 , the file will be overwritten
nginx: [emerg] bind() to unix:/home/bzp/code/apisix/conf/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/home/bzp/code/apisix/conf/worker_events.sock failed (98: Address already in use)

This often occurs when using containers, and we have to be able to handle it correctly.

This feature has not yet been released, so fixing it now is timely and involves no compatibility issues. In the meantime, no test cases themselves need to be changed.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

we may mount the logs folder to an external volume ==>
we mount the logs folder to an external volume and we occur this issue worker_events.sock failed (98: Address already in use) after updgrate from 3.2 to 3.9
It seems like moving to the conf folder would be a better solution, since mount the logs folder is indeed a common behavior

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants