-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FastRTPS drops messages under stress #258
Comments
@dejanpan I've found a race condition. It happens when new samples are notified to rmw_fastrtps by FastRTPS after rmw_fastrtps takes the sample from FastRTPS and before rmw_fastrtps's internal counter ( In this branch I've implemented a proposal, always use the FastRTPS's internal counter. Can you also test it yourself? |
@richiware we tested this internally. The fix does not appear to work (see below). Could you re-produce this for you before the fix and could you test that it does not break for you after the fix? If yes, how did you test it? How we tested : pete.baughman@ade:~$ mkdir -p stress_test_ws/src
pete.baughman@ade:~$ cd stress_test_ws/src/
pete.baughman@ade:~/stress_test_ws/src$ git clone [email protected]:ApexAI/experiments.git
Cloning into 'experiments'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 9 (delta 0), reused 9 (delta 0), pack-reused 0
Receiving objects: 100% (9/9), done.
pete.baughman@ade:~/stress_test_ws/src$ git clone [email protected]:richiware/rmw_fastrtps.git
Cloning into 'rmw_fastrtps'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 2256 (delta 1), reused 6 (delta 1), pack-reused 2247
Receiving objects: 100% (2256/2256), 733.42 KiB | 5.35 MiB/s, done.
Resolving deltas: 100% (1574/1574), done.
pete.baughman@ade:~/stress_test_ws/src$ cd rmw_fastrtps/
pete.baughman@ade:~/stress_test_ws/src/rmw_fastrtps (master u=)$ git checkout hotfix/race_condition
Branch 'hotfix/race_condition' set up to track remote branch 'hotfix/race_condition' from 'origin' by rebasing.
Switched to a new branch 'hotfix/race_condition'
pete.baughman@ade:~/stress_test_ws/src/rmw_fastrtps (hotfix/race_condition u=)$ cd ../../
pete.baughman@ade:~/stress_test_ws$ source /opt/ros2/crystal/setup.bash
pete.baughman@ade:~/stress_test_ws$ colcon build
Starting >>> rmw_fastrtps_shared_cpp
Starting >>> stress_test
Finished <<< stress_test [7.61s]
Finished <<< rmw_fastrtps_shared_cpp [7.64s]
Starting >>> rmw_fastrtps_cpp
Starting >>> rmw_fastrtps_dynamic_cpp
Finished <<< rmw_fastrtps_cpp [8.55s]
Finished <<< rmw_fastrtps_dynamic_cpp [8.88s]
Summary: 4 packages finished [16.9s]
pete.baughman@ade:~/stress_test_ws$ source install/setup.bash
pete.baughman@ade:~/stress_test_ws$ ./build/stress_test/stress_test_exe This ran until I did "apt-get update" in another window (I was trying to install 'stress' because I was in a fresh ADE) and then it hung at 99 like it did before |
@deeplearningrobotics and @pbaughman @richiware I also tested this today and found out that the fix seems to work. What I needed to do is compile whole workspace from src. If I only compiled https://github.com/ros2/rmw_fastrtps, the test would still hang at 99. So this is what I did: Term 1: cd ~
mkdir ~/ros2_ws/src
cd ~/ros2_ws
wget https://gist.githubusercontent.com/dejanpan/cc83dff7692e41de5086db5ef69262bd/raw/95b3567691c2e0ce1785b706d8d8d010825bed97/patch1.repos
vcs import src < patch1.repos
cd src/ros2/rmw_fastrtps/
git remote add richiware https://github.com/richiware/rmw_fastrtps
git fetch
git checkout richiware/hotfix/race_condition
cd -
colcon build --symlink-install
source install/setup.bash
./build/stress_test/stress_test_exe If you wanna switch back to a broken version: cd src/ros2/rmw_fastrtps/
git checkout a5675c45103f7c8312f4e991e60a346ca5fe4800
cd -
colcon build --symlink-install
source install/setup.bash
./build/stress_test/stress_test_exe Term2: stress-ng --matrix 0 The test has now been running for >1h and didn't get hung up. I think the problem is that |
When I was testing and implementing the solution I always compiled the whole workspace. I'm a bit lost about relationship between ROS2 packages. But it sounds like my patch was not being applied. |
Ok - I guess this is something else's build dependency, so if you don't re-compile the higher level thing you don't get the new behavior. I tested again by rebuilding the whole workspace and it seems to be running successfully now |
If you are happy with the solution, I would be able to create a PR to rmw_fastrtps. |
@dejanpan Is the solution good to you? Can I create the PR? |
@richiprosima we confirmed that it works, I ran the test for 24h. Please create a PR. But I think that we also need a test for it. Do you have an idea how to write such a test? |
Issue References: * ros2/rmw_fastrtps#257 * ros2/rmw_fastrtps#258 Signed-off-by: Steven! Ragnarök <[email protected]>
* Update the known issues section in Crystal. Issue References: * ros2/rmw_fastrtps#257 * ros2/rmw_fastrtps#258 Signed-off-by: Steven! Ragnarök <[email protected]> * remove duplicate periods
No idea. I don't know how to write a deterministic test. The error is erratic and depends on a sample being received after rmw_fastrtps calls FastRTPS |
* Update the known issues section in Crystal. Issue References: * ros2/rmw_fastrtps#257 * ros2/rmw_fastrtps#258 Signed-off-by: Steven! Ragnarök <[email protected]> * remove duplicate periods
Bug report
Required Info:
Steps to reproduce issue
stress-ng --matrix 0
Expected behavior
Messages in this line should not get lost: https://github.com/ApexAI/experiments/blob/master/stress_test/src/stress_test.cpp#L77
Actual behavior
Messages in this line get lost: https://github.com/ApexAI/experiments/blob/master/stress_test/src/stress_test.cpp#L77
Additional information
The issue might be connected to #157.
@deeplearningrobotics and @pbaughman found the issue and created a reproducer.
The text was updated successfully, but these errors were encountered: