fabtests: Synchronize on Initialization #10108
bot:aws:retest
@shijin-aws can you please tell me why the first run of AWS CI failed?
We have a multi-client test involving fi_rdm that failed.
I think the failure is related. I need to look into it further.
I agree that these changes broke the test. I think it might need some manual progression, like what I had to add to fi_rdm_multi_client. It looks like this test just runs fi_rdm, which goes through the main codepath, so the second client hangs while waiting to receive on the OOB socket from the server (the server only sends once). Do you have any ideas for how to solve this problem?
OOB for multiple clients is handled in
Make the socket to synchronize on an argument so that any socket can be specified for synchronization instead of only sock. Signed-off-by: Zach Dworkin <[email protected]>
@j-xiong how is this?
fabtests/functional/rdm.c
	if (oob_sock >= 0 && !opts.dst_addr) {
		ret = ft_sock_sync(oob_sock, 1);
		if (ret)
			return ret;
	}
This should be moved into the if block above; otherwise the server would call an extra sync for the first client.
Looks good now. Let's see if CI finds any issues.
@shijin-aws can you share the AWS CI failure?
The same test failed again; looking at it now.
So the test is run like this: the server expects to get messages from 2 clients.
Clients 1 and 2 each come and go, doing ping-pong with the server (just run the following command twice).
The server does OK when the first client comes and goes, but when the second client arrives it starts to get the error.
I can reproduce the error with the TCP provider as well.
So it looks like the change broke the multi-client mode of fi_rdm.
fabtests/functional: Add manual init sync to fi_rdm_multiclient and fi_rdm

Some providers (verbs ud) might require the server to be fully initialized before the client process calls getinfo with the server address. Otherwise the fi_info call fails during initialization with a No Data Available error because it cannot find the name on the server. This is seen most often when a socket (usually OOB, out of band) is initialized before getinfo in the ft_init_fabric sequence. Adding a sync, only if an OOB socket has been initialized, orders the initialization correctly and prevents this from happening. The synchronization makes the client start getinfo only after the server is done initializing everything.

The fi_rdm_multiclient test needs a manual sync in its client startup because it does not go through ft_init_fabric like the server and other tests do. This sync only happens for the first client that connects, because it is only needed to give the server enough time to spin up all of its resources. It is not necessary for later clients because the server has already started all of those resources.

fi_rdm needs the synchronization added after accepting a new client because the client will be waiting for a sock_send and the server only does it once, during its initialization. Adding an ft_sock_sync forces the server to do it for each new client.

Signed-off-by: Zach Dworkin <[email protected]>
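The per-client sync described for fi_rdm can be illustrated with a minimal, standalone sketch. It does not use fabtests code: `demo_client`, `demo_server_sync`, and `demo_serve_clients` are hypothetical names, a `socketpair` stands in for the accepted OOB connection, and the one-byte exchange is an assumption about what a sync looks like, not the actual `ft_sock_sync` implementation. Each client blocks until the server sends its token, so a server that only syncs once would leave the second client hanging.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical client side: block until the server signals, then acknowledge. */
static int demo_client(int fd)
{
	char token;

	if (recv(fd, &token, 1, MSG_WAITALL) != 1)
		return -1;
	if (send(fd, &token, 1, 0) != 1)
		return -1;
	return 0;
}

/* Hypothetical server side: signal the client, then wait for its acknowledgment. */
static int demo_server_sync(int fd)
{
	char token = 1;

	if (send(fd, &token, 1, 0) != 1)
		return -1;
	if (recv(fd, &token, 1, MSG_WAITALL) != 1)
		return -1;
	return 0;
}

/* Serve clients one after another, syncing with every new client.
 * Returns the number of clients that completed the handshake. */
int demo_serve_clients(int nclients)
{
	int done = 0;
	int i;

	for (i = 0; i < nclients; i++) {
		int sv[2];
		int ret, status;
		pid_t pid;

		/* Stand-in for accepting a new OOB connection. */
		if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
			return done;

		pid = fork();
		if (pid < 0) {
			close(sv[0]);
			close(sv[1]);
			return done;
		}
		if (pid == 0) {
			/* Child process plays the client. */
			close(sv[0]);
			_exit(demo_client(sv[1]) ? 1 : 0);
		}

		close(sv[1]);
		ret = demo_server_sync(sv[0]);	/* repeated for each client */
		close(sv[0]);

		if (waitpid(pid, &status, 0) == pid && !ret &&
		    WIFEXITED(status) && WEXITSTATUS(status) == 0)
			done++;
	}
	return done;
}
```

Calling `demo_serve_clients(2)` runs two clients back to back, each released by its own sync.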
fabtests: Synchronize on Initialization
fabtests/functional: Add manual init sync to fi_rdm_multiclient
Some providers (verbs ud) might require the server to be fully initialized before the client process calls getinfo with the server address. Otherwise the fi_info call fails during initialization with a No Data Available error because it cannot find the name on the server. This is seen most often when a socket (usually OOB, out of band) is initialized before getinfo in the ft_init_fabric sequence. Adding a sync, only if an OOB socket has been initialized, orders the initialization correctly and prevents this from happening.
The synchronization makes the client start getinfo only after the server is done initializing everything.
fi_rdm_multiclient test needs its client startup to have a manual sync since it does not follow the normal codepath of going through ft_init_fabric like the server and other tests do. This sync will only happen on the first client that connects because it is only necessary to give the server enough time to spin up all the resources. It is not necessary for future clients because the server has already started all of those resources.
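The ordering described above (the client waits until the server has finished initializing) can be sketched outside of fabtests as follows. This is only an illustration under stated assumptions: `demo_sock_sync` and `demo_init_order` are hypothetical names, the `socketpair` stands in for the OOB socket, and the one-byte exchange is an assumed handshake rather than the real `ft_sock_sync` code.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sync helper: the server sends one byte when it is ready and
 * waits for the echo; the client waits for the byte and echoes it back. */
static int demo_sock_sync(int fd, int is_server)
{
	char token = 1;

	if (is_server) {
		if (send(fd, &token, 1, 0) != 1)
			return -1;
		if (recv(fd, &token, 1, MSG_WAITALL) != 1)
			return -1;
	} else {
		if (recv(fd, &token, 1, MSG_WAITALL) != 1)
			return -1;
		if (send(fd, &token, 1, 0) != 1)
			return -1;
	}
	return 0;
}

/* Returns 0 if the client was released only through the server's sync. */
int demo_init_order(void)
{
	int sv[2];
	int ret, status;
	pid_t pid;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -1;

	pid = fork();
	if (pid < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}
	if (pid == 0) {
		/* Client: block here; a real client would call getinfo
		 * with the server address only after this returns. */
		close(sv[0]);
		ret = demo_sock_sync(sv[1], 0);
		_exit(ret ? 1 : 0);
	}

	/* Server: resource initialization would happen here, before the sync. */
	close(sv[1]);
	ret = demo_sock_sync(sv[0], 1);
	close(sv[0]);

	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (ret || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;
	return 0;
}
```

Because the client's `recv` cannot complete before the server's `send`, anything the server does before calling the sync is guaranteed to have finished by the time the client proceeds.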