DAOS-16922 dfuse: Avoid assertion on shutdown #15972

Merged: 3 commits, Feb 28, 2025

jolivier23
Contributor

When dfuse is shut down with umount -f while file handles are still open, it would assert because resources were still in use. Rather than asserting, just print a warning that we are shutting down ungracefully.
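
The change described above can be illustrated with a minimal sketch. This is not the code from the patch: `struct inode_entry`, `check_open_count_on_shutdown`, and the warning text are hypothetical names made up for illustration. Only the `ie_open_count` field name and the relaxed atomic load come from the ticket title. The idea is that a non-zero open count at shutdown is reported and teardown continues, instead of aborting the process.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical stand-in for a dfuse inode entry; only the open counter is modeled. */
struct inode_entry {
	atomic_uint ie_open_count;
};

/*
 * Illustrative shutdown check (an assumption, not the actual dfuse code).
 * Returns the number of handles still open on the entry. A non-zero count
 * is logged as a warning rather than treated as a fatal assertion failure.
 */
unsigned int
check_open_count_on_shutdown(struct inode_entry *ie, FILE *log)
{
	unsigned int open_count;

	open_count = atomic_load_explicit(&ie->ie_open_count, memory_order_relaxed);

	if (open_count != 0)
		fprintf(log, "WARN: shutting down ungracefully, %u open handle(s) remain\n",
			open_count);

	return open_count;
}
```

A caller in the teardown path would invoke this for each remaining entry and carry on releasing resources regardless of the return value, which is the behavior a forced unmount needs.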

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15972/1/execution/node/321/log

@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15972/1/execution/node/375/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15972/1/execution/node/370/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15972/1/execution/node/334/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15972/1/execution/node/261/log

github-actions bot commented Feb 25, 2025

Ticket title is ' Assertion 'atomic_load_relaxed(&ie->ie_open_count) == 0' '
Status is 'Awaiting backport'
Labels: 'GCP,google-cloud-daos,request_for_2.6.4'
https://daosio.atlassian.net/browse/DAOS-16922

When there are open file handles and dfuse is shutdown using umount -f,
it would assert due to resources being used.  Rather than asserting,
just print a warning that we are shutting down ungracefully.

Features: dfuse

Signed-off-by: Jeff Olivier <[email protected]>
@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15972/2/execution/node/1541/log

@jolivier23 added the forced-landing label ("The PR has known failures or has intentionally reduced testing, but should still be landed.") on Feb 26, 2025
@jolivier23
Contributor Author

Master branch has some failures that this PR hit but they are clearly unrelated to dfuse.
https://daosio.atlassian.net/browse/DAOS-17094 among others

@jolivier23 requested a review from a team on February 26, 2025 21:08
@daltonbohning
Contributor

> Master branch has some failures that this PR hit but they are clearly unrelated to dfuse. https://daosio.atlassian.net/browse/DAOS-17094 among others

Actually, one of the tests timed out while checking the dfuse mount point:
https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15972/2/artifact/Functional%20Hardware%20Medium/server/replay.py/job.log

2025-02-26 13:47:04,441 dfuse_utils      L0109 INFO | Checking which hosts have the mount point directory created
2025-02-26 13:47:04,441 run_utils        L0471 DEBUG| Running on wolf-[229,248] with a 120 second timeout: test -d /tmp/daos_dfuse_test_replay_posix_1 -a ! -L /tmp/daos_dfuse_test_replay_posix_1
2025-02-26 13:48:06,464 stacktrace       L0039 ERROR| 
2025-02-26 13:48:06,464 stacktrace       L0042 ERROR| Reproduced traceback from: /localhome/jenkins/venv/lib64/python3.6/site-packages/avocado/core/test.py:767
...
2025-02-26 13:48:06,466 stacktrace       L0045 ERROR| RuntimeError: Test interrupted by SIGTERM

It could be a network/system issue, but I'm not sure.

@jolivier23
Contributor Author

Is there a dfuse log? I'm very skeptical this patch could be the root cause.

@daltonbohning
Contributor

> Is there a dfuse log? I'm very skeptical this patch could be the root cause.

https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15972/2/artifact/Functional%20Hardware%20Medium/server/replay.py/daos_logs.wolf-229/dfuse_daos.log/*view*/

End has:

02/26-13:46:02.55 wolf-229 DAOS[289719/289741/0] rpc  WARN src/cart/crt_context.c:1267 crt_context_timeout_check(0x7f58f8001050) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009a rank:tag=1:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (1:0)
02/26-13:46:02.55 wolf-229 DAOS[289719/289741/0] rpc  INFO src/cart/crt_context.c:1201 crt_req_timeout_hdlr(0x7f58f8001050) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009a rank:tag=1:0] aborting in-flight to group daos_server, rank 1, tgt_uri (null)
02/26-13:46:02.55 wolf-229 DAOS[289719/289741/0] hg   WARN src/cart/crt_hg.c:1376 crt_hg_req_send_cb(0x7f58f8001050) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009a rank:tag=1:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:47:02.55 wolf-229 DAOS[289719/289741/0] rpc  WARN src/cart/crt_context.c:1267 crt_context_timeout_check(0x7f58f8001050) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009b rank:tag=0:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (0:0)
02/26-13:47:02.55 wolf-229 DAOS[289719/289741/0] rpc  INFO src/cart/crt_context.c:1201 crt_req_timeout_hdlr(0x7f58f8001050) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009b rank:tag=0:0] aborting in-flight to group daos_server, rank 0, tgt_uri (null)
02/26-13:47:02.55 wolf-229 DAOS[289719/289741/0] hg   WARN src/cart/crt_hg.c:1376 crt_hg_req_send_cb(0x7f58f8001050) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009b rank:tag=0:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:48:02.55 wolf-229 DAOS[289719/289741/0] rpc  WARN src/cart/crt_context.c:1267 crt_context_timeout_check(0x7f58f8008f20) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009c rank:tag=1:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (1:0)
02/26-13:48:02.55 wolf-229 DAOS[289719/289741/0] rpc  INFO src/cart/crt_context.c:1201 crt_req_timeout_hdlr(0x7f58f8008f20) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009c rank:tag=1:0] aborting in-flight to group daos_server, rank 1, tgt_uri (null)
02/26-13:48:02.55 wolf-229 DAOS[289719/289741/0] hg   WARN src/cart/crt_hg.c:1376 crt_hg_req_send_cb(0x7f58f8008f20) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009c rank:tag=1:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:48:04.71 wolf-229 DAOS[289719/289734/0] rpc  WARN src/cart/crt_context.c:1267 crt_context_timeout_check(0x7f5908002380) [opc=0x40a0001 (DAOS_OBJ_MODULE:fetch) rpcid=0x379d3f300000009d rank:tag=1:7] ctx_id 1, (status: 0x38) timed out (60 seconds), target (1:7)
02/26-13:48:04.71 wolf-229 DAOS[289719/289734/0] rpc  INFO src/cart/crt_context.c:1201 crt_req_timeout_hdlr(0x7f5908002380) [opc=0x40a0001 (DAOS_OBJ_MODULE:fetch) rpcid=0x379d3f300000009d rank:tag=1:7] aborting in-flight to group daos_server, rank 1, tgt_uri (null)
02/26-13:48:04.71 wolf-229 DAOS[289719/289734/0] hg   WARN src/cart/crt_hg.c:1376 crt_hg_req_send_cb(0x7f5908002380) [opc=0x40a0001 (DAOS_OBJ_MODULE:fetch) rpcid=0x379d3f300000009d rank:tag=1:7] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:48:04.71 wolf-229 DAOS[289719/289734/0] object ERR  src/object/cli_shard.c:791 dc_rw_cb() 281479271677952.0.0.1 (non-EC) RPC 1 to 1/7, flags 8/0, task 0x7f5908001a90 failed, non-DMA: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:48:04.71 wolf-229 DAOS[289719/289734/0] rpc  WARN src/cart/crt_context.c:1267 crt_context_timeout_check(0x7f5908002e70) [opc=0x40a0009 (DAOS_OBJ_MODULE:key_query) rpcid=0x379d3f300000009e rank:tag=0:6] ctx_id 1, (status: 0x38) timed out (60 seconds), target (0:6)
02/26-13:48:04.71 wolf-229 DAOS[289719/289734/0] rpc  INFO src/cart/crt_context.c:1201 crt_req_timeout_hdlr(0x7f5908002e70) [opc=0x40a0009 (DAOS_OBJ_MODULE:key_query) rpcid=0x379d3f300000009e rank:tag=0:6] aborting in-flight to group daos_server, rank 0, tgt_uri (null)
02/26-13:48:04.71 wolf-229 DAOS[289719/289734/0] hg   WARN src/cart/crt_hg.c:1376 crt_hg_req_send_cb(0x7f5908002e70) [opc=0x40a0009 (DAOS_OBJ_MODULE:key_query) rpcid=0x379d3f300000009e rank:tag=0:6] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:48:04.71 wolf-229 DAOS[289719/289734/0] object ERR  src/object/cli_shard.c:2120 obj_shard_query_key_cb() Regular query failed: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:49:02.55 wolf-229 DAOS[289719/289741/0] rpc  WARN src/cart/crt_context.c:1267 crt_context_timeout_check(0x7f58f8004a80) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009f rank:tag=0:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (0:0)
02/26-13:49:02.55 wolf-229 DAOS[289719/289741/0] rpc  INFO src/cart/crt_context.c:1201 crt_req_timeout_hdlr(0x7f58f8004a80) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009f rank:tag=0:0] aborting in-flight to group daos_server, rank 0, tgt_uri (null)
02/26-13:49:02.55 wolf-229 DAOS[289719/289741/0] hg   WARN src/cart/crt_hg.c:1376 crt_hg_req_send_cb(0x7f58f8004a80) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f300000009f rank:tag=0:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:49:04.71 wolf-229 DAOS[289719/289734/0] rpc  WARN src/cart/crt_context.c:1267 crt_context_timeout_check(0x7f58c4011840) [opc=0x2070014 (DAOS_POOL_MODULE:POOL_TGT_QUERY_MAP) rpcid=0x379d3f30000000a0 rank:tag=1:0] ctx_id 1, (status: 0x38) timed out (60 seconds), target (1:0)
02/26-13:49:04.71 wolf-229 DAOS[289719/289734/0] rpc  INFO src/cart/crt_context.c:1201 crt_req_timeout_hdlr(0x7f58c4011840) [opc=0x2070014 (DAOS_POOL_MODULE:POOL_TGT_QUERY_MAP) rpcid=0x379d3f30000000a0 rank:tag=1:0] aborting in-flight to group daos_server, rank 1, tgt_uri ofi+verbs;ofi_rxm://192.168.100.228:31416
02/26-13:49:04.72 wolf-229 DAOS[289719/289734/0] hg   WARN src/cart/crt_hg.c:1376 crt_hg_req_send_cb(0x7f58c4011840) [opc=0x2070014 (DAOS_POOL_MODULE:POOL_TGT_QUERY_MAP) rpcid=0x379d3f30000000a0 rank:tag=1:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:49:07.21 wolf-229 DAOS[289719/289734/0] rpc  WARN src/cart/crt_context.c:1267 crt_context_timeout_check(0x7f58fc0022f0) [opc=0x40a0001 (DAOS_OBJ_MODULE:fetch) rpcid=0x379d3f30000000a1 rank:tag=1:7] ctx_id 1, (status: 0x38) timed out (60 seconds), target (1:7)
02/26-13:49:07.21 wolf-229 DAOS[289719/289734/0] rpc  INFO src/cart/crt_context.c:1201 crt_req_timeout_hdlr(0x7f58fc0022f0) [opc=0x40a0001 (DAOS_OBJ_MODULE:fetch) rpcid=0x379d3f30000000a1 rank:tag=1:7] aborting in-flight to group daos_server, rank 1, tgt_uri (null)
02/26-13:49:07.21 wolf-229 DAOS[289719/289734/0] rpc  INFO src/cart/crt_context.c:1273 crt_context_timeout_check(0x7f58fc018010) [opc=0x40a0009 (DAOS_OBJ_MODULE:key_query) rpcid=0x379d3f30000000a2 rank:tag=0:6] ctx_id 1, (status: 0x38) timed out (60 seconds), target (0:6)
02/26-13:49:07.21 wolf-229 DAOS[289719/289734/0] rpc  INFO src/cart/crt_context.c:1201 crt_req_timeout_hdlr(0x7f58fc018010) [opc=0x40a0009 (DAOS_OBJ_MODULE:key_query) rpcid=0x379d3f30000000a2 rank:tag=0:6] aborting in-flight to group daos_server, rank 0, tgt_uri (null)
02/26-13:49:07.21 wolf-229 DAOS[289719/289734/0] hg   WARN src/cart/crt_hg.c:1376 crt_hg_req_send_cb(0x7f58fc0022f0) [opc=0x40a0001 (DAOS_OBJ_MODULE:fetch) rpcid=0x379d3f30000000a1 rank:tag=1:7] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:49:07.21 wolf-229 DAOS[289719/289734/0] object ERR  src/object/cli_shard.c:791 dc_rw_cb() 281479271677952.0.0.1 (non-EC) RPC 1 to 1/7, flags 8/0, task 0x7f58fc001a00 failed, non-DMA: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:49:07.21 wolf-229 DAOS[289719/289734/0] hg   WARN src/cart/crt_hg.c:1376 crt_hg_req_send_cb(0x7f58fc018010) [opc=0x40a0009 (DAOS_OBJ_MODULE:key_query) rpcid=0x379d3f30000000a2 rank:tag=0:6] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:49:07.21 wolf-229 DAOS[289719/289734/0] object ERR  src/object/cli_shard.c:2120 obj_shard_query_key_cb() Regular query failed: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:50:02.55 wolf-229 DAOS[289719/289741/0] rpc  WARN src/cart/crt_context.c:1267 crt_context_timeout_check(0x7f58f8003020) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f30000000a3 rank:tag=1:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (1:0)
02/26-13:50:02.55 wolf-229 DAOS[289719/289741/0] rpc  INFO src/cart/crt_context.c:1201 crt_req_timeout_hdlr(0x7f58f8003020) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f30000000a3 rank:tag=1:0] aborting in-flight to group daos_server, rank 1, tgt_uri (null)
02/26-13:50:02.55 wolf-229 DAOS[289719/289741/0] hg   WARN src/cart/crt_hg.c:1376 crt_hg_req_send_cb(0x7f58f8003020) [opc=0x2070003 (DAOS_POOL_MODULE:POOL_QUERY) rpcid=0x379d3f30000000a3 rank:tag=1:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
02/26-13:50:04.72 wolf-229 DAOS[289719/289734/0] rpc  WARN src/cart/crt_context.c:1267 crt_context_timeout_check(0x7f58c4012850) [opc=0x2070014 (DAOS_POOL_MODULE:POOL_TGT_QUERY_MAP) rpcid=0x379d3f30000000a4 rank:tag=0:0] ctx_id 1, (status: 0x38) timed out (60 seconds), target (0:0)
02/26-13:50:04.72 wolf-229 DAOS[289719/289734/0] rpc  INFO src/cart/crt_context.c:1201 crt_req_timeout_hdlr(0x7f58c4012850) [opc=0x2070014 (DAOS_POOL_MODULE:POOL_TGT_QUERY_MAP) rpcid=0x379d3f30000000a4 rank:tag=0:0] aborting in-flight to group daos_server, rank 0, tgt_uri ofi+verbs;ofi_rxm://192.168.100.227:31416
02/26-13:50:04.72 wolf-229 DAOS[289719/289734/0] hg   WARN src/cart/crt_hg.c:1376 crt_hg_req_send_cb(0x7f58c4012850) [opc=0x2070014 (DAOS_POOL_MODULE:POOL_TGT_QUERY_MAP) rpcid=0x379d3f30000000a4 rank:tag=0:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'

@jolivier23
Contributor Author

The log link is useless for me. Can you attach the full output in Slack or the Jira ticket?

This is almost certainly a different issue, as those errors are happening while dfuse is still running. It looks like it has lost connectivity with the server for whatever reason. My patch simply avoids an assertion during shutdown.
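
One way to sanity-check that reading of the log is to pull out which RPCs are reported as timed out. The helper below is only a sketch: `timed_out_op` is a hypothetical name, and the parsing assumes the line layout seen in the excerpt above, where the operation name appears in parentheses right after `opc=` and failed sends carry `DER_TIMEDOUT`.

```c
#include <stddef.h>
#include <string.h>

/*
 * If the line reports an RPC that failed with DER_TIMEDOUT, copy the
 * operation name found in the parentheses after "opc=" (for example
 * "DAOS_POOL_MODULE:POOL_QUERY") into out and return 1. Otherwise return 0
 * and leave out untouched. Names longer than out_len - 1 are truncated.
 */
int
timed_out_op(const char *line, char *out, size_t out_len)
{
	const char *opc, *open, *close;
	size_t      len;

	if (out_len == 0 || strstr(line, "DER_TIMEDOUT") == NULL)
		return 0;

	opc = strstr(line, "opc=");
	if (opc == NULL)
		return 0;

	open = strchr(opc, '(');
	if (open == NULL)
		return 0;

	close = strchr(open, ')');
	if (close == NULL)
		return 0;

	len = (size_t)(close - open - 1);
	if (len >= out_len)
		len = out_len - 1;

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 1;
}
```

Feeding each log line through this function and counting the names gives a quick view of whether the failures are pool and object RPC timeouts, which would point at client-to-server connectivity rather than at the shutdown path.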

@daltonbohning
Contributor

> The log link is useless for me. Can you attach the full output in Slack or the Jira ticket?
>
> This is almost certainly a different issue, as those errors are happening while dfuse is still running. It looks like it has lost connectivity with the server for whatever reason. My patch simply avoids an assertion during shutdown.

I don't doubt you. I'll send the logs in Slack.

@jolivier23 merged commit a027712 into master on Feb 28, 2025
57 checks passed
@jolivier23 deleted the jvolivie/dfuse branch on February 28, 2025 18:36
Labels
forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.
5 participants