Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binding manager thinks it can connect to multiple things in parallel, but it can't #21606

Closed
bzbarsky-apple opened this issue Aug 3, 2022 · 12 comments · Fixed by #22133
Closed
Assignees
Labels
Interaction Model Work spec Mismatch between spec and implementation V1.0

Comments

@bzbarsky-apple
Copy link
Contributor

Problem

BindingManager::EstablishConnection does:

    mInitParams.mCASESessionManager->FindOrEstablishSession(nodeId, &mOnConnectedCallback, &mOnConnectionFailureCallback);

This will move the mOnConnectedCallback and mOnConnectionFailureCallback callbacks to the callback list of the new session establishment process, if there is no session already.

We call BindingManager::NotifyBoundClusterChanged in a loop over bindings in BindingManager::NotifyBoundClusterChanged. This means we will only get notified for the last connection we started, not any of the other ones.

In many cases this might be getting covered up by connection attempts that complete synchronously, because we have pre-warmed CASE sessions... But it can lead to messages not being sent to some bindings if we have no pre-warmed sessions around.

Proposed Solution

Move the connection callbacks to somewhere that is per-connection-attempt. Perhaps the pending notification? Or some other data structure?

@mrjerryjohns
Copy link
Contributor

Why is this labeled 'spec'?

@woody-apple
Copy link
Contributor

@bzbarsky-apple ?

@bzbarsky-apple
Copy link
Contributor Author

Because:

But it can lead to messages not being sent to some bindings

which would be a spec violation, no?

@gjc13
Copy link
Contributor

gjc13 commented Aug 17, 2022

Had a quick test on linux platform. If we intitiate two concurrent CASE sessions, the second connection will get stuck at waiting for Sigma1 msg.

I'd suggest to add a queue for the connections in the CASESessionManager.

@tehampson
Copy link
Contributor

@gjc13 can you please upload a sample of your test. I might be able to take a look sometime this week, it will be helpful to have a test in hand to reproduce the issue

@tehampson tehampson assigned tehampson and unassigned franck-apple Aug 23, 2022
@tehampson
Copy link
Contributor

Had a quick test on linux platform. If we intitiate two concurrent CASE sessions, the second connection will get stuck at waiting for Sigma1 msg.

I'd suggest to add a queue for the connections in the CASESessionManager.

This does not line up with my understanding of this issue at all. CASESessionManager is capable of handling multiple connections at one. You can totally call FindOrEstablishSession twice and it should resolve without issue and call both callbacks should be called when the connection is established. To me this sounds like potentially a completely different issue that would need to be addressed even before this one.

From my understanding the issue here is that when BindingManager::NotifyBoundClusterChanged is called twice with the same mPendingNotificationMap.AddPendingNotification we overwrite the entry since PendingNotificationMap::AddPendingNotification first removes the entry. The issue is that we don't allow multiple entries in PendingNotificationMap

@franck-apple
Copy link

@tehampson :
@gjc13 has left the Matter project. So I wouldn't expect a response from him.

@mrjerryjohns
Copy link
Contributor

Chatted with @tehampson about this. He's going to run some experiments to see if this issue does indeed exist.

@tehampson
Copy link
Contributor

Just want to quickly touch on waiting for Sigma1 msg. This message is perfectly natural. We always have a CASEClient that is ready to receive the first Sigma1 message sent by a peer node. This is a none issue.


Based on discussion with Jerry, arguably when someone calls PendingNotificationMap::AddPendingNotification with a new context the right things to do is to overwrite the entry so this is not a problem.

Based on a code audit there is no issue with calling BindingManager::EstablishConnection from loop in BindingManager::NotifyBoundClusterChanged for the following reasons:

  • If sessions is not already primed there will be multiple entries in mPendingNotificationMap (since the index will be different)
    • When BindingManager::HandleDeviceConnected is eventual called upon session establishing, we iterate through all the pending entries in mPendingNotificationMap. Outside of the for loop we clear all the entries in mPendingNotificationMap associated with the ScopedNodeId.
    • Even if BindingManager::HandleDeviceConnected is called a second time there is nothing in the mPendingNotificationMap the second time around.
  • If session is already primed, there was already deemed to be no issue (I have confirmed during audit but don't want to spend the effort explaining it here).

To see if there really is an issue in practice I added two hacks:

  • src/app/clusters/bindings/bindings.cpp to add an entry twice in CreateBindingEntry
  • In BindingManager::NotifyBoundClusterChanged created a hack to mark a session as defunct so to test out the sequence outlined by Boris.

This sequence I did (I am documenting this for my own future reference):

  • Build all clusters app with command scripts/examples/gn_build_example.sh examples/all-clusters-app/linux out/debug/standalone chip_config_network_layer_ble=false chip_build_libshell=true
  • Terminal 1 run rm -rf /tmp/chip_* && ./out/debug/standalone/chip-all-clusters-app --secured-device-port 72564 (port was picked by mashing keys)
  • Terminal 2 run ./out/debug/standalone/chip-tool pairing onnetwork 1 20202021
  • Terminal 3 run rm -f kvs1 && out/debug/standalone/chip-all-clusters-app --KVS kvs1
  • Terminal 2 run ./out/debug/standalone/chip-tool pairing onnetwork 2 20202021
  • Terminal 2 run ./out/debug/standalone/chip-tool binding write binding '[{"fabricIndex": 1, "node": 1, "endpoint": 1, "cluster": 6}]' 2 1
  • Terminal 3 run switch on
  • Terminal 1 confirm it received notification

What I discovered is that we only ever call BindingManager::EstablishConnection once from loop in BindingManager::NotifyBoundClusterChanged because SuccessOrExit(error == CHIP_NO_ERROR); should actually be SuccessOrExit(error);


While auditing code for this issue I have spotted some use after free and a buffer overrun. So I am going to add these fixes as well to the issue mentioned above. PR should be up soon

@mrjerryjohns
Copy link
Contributor

@tehampson I'd recommend trying with at least two instances of the 'bulb' that the switch is controlling, and to then have two entries in the bindings cluster pointing to two different nodes.

Also, I'd recommend trying with both the model where the switch boots up first, fails to connect to the two bulbs, THEN, launch the two bulbs (to prevent the initial CASE session establishment from succeeding so that we can observe the establishment process at the time the switch is flipped), as well as the model where the bulbs boot up first, then the switch to observe warmed CASE session usagse.

@tehampson
Copy link
Contributor

Okay so when I try talking to two different nodes that are not primed I am getting the issue. The issue has to do with mOnConnectedCallback and mOnConnectionFailureCallback that are provided to FindOrEstablishSession when you call it multiple times waiting for the callback, you have BindingManager::HandleDeviceConnected only called from the last one registered. This is what Boris was already saying, but I don't think I understood that part until now 🤦‍♂️

@bzbarsky-apple
Copy link
Contributor Author

Right, the key is that a Callback can only be in a single list, so adding mOnConnectedCallback and mOnConnectionFailureCallback to the list for the new session establishment removes them from the old one, and hence they don't get notifications for that one....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Interaction Model Work spec Mismatch between spec and implementation V1.0
Projects
None yet
6 participants