Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If loan message is enabled, subscriber will send ack to publisher in take function instead of return loan function, data corruption may happen [13672] #2080

Closed
crazyhank opened this issue Jul 19, 2021 · 6 comments

Comments

@crazyhank
Copy link

crazyhank commented Jul 19, 2021

Consider the following case:
Publisher send message to subscribers in 10HZ, but one of subscribers process the received message in 5HZ, in other words, the speed of subcriber is slower than publisher. If message type is plain, then subsriber will use shared memory without copying data. I checked the source code the ack from subscriber is sent during take function, it is ok with no plain data type, but for plain data type, we want to use this shared memory until application's callback is done, otherwise there may be a data corrupt!

@MiguelCompany MiguelCompany changed the title If loan message is enabled, subscriber will send ack to publisher in take function instead of return loan function, data corruption will happen If loan message is enabled, subscriber will send ack to publisher in take function instead of return loan function, data corruption may happen Jul 19, 2021
@MiguelCompany
Copy link
Member

@crazyhank True, this is how the mechanism has been designed. Even sending the ack when returning the loan may produce data corruption, as the same samples could be loaned more than once, if the read API is used instead of the take one.

This is why is_sample_valid method exists, to allow checking if the data was overwritten after being processed.

Some could argue that having a subscriber twice as slow as the publisher means your system is bad designed, but to help supporting a use-case as the one you depict here, we could implement an application level ACK in the future.

@crazyhank
Copy link
Author

Unfortunately, we are programming based on ROS2 API, it seems that is_sample_valid is not exported in ROS2 APIs.
I think this question is how to choose strategy when suscriber is slower than publisher, in this case, shared memory payload queue will defenitely be filled without any free payload to be loaned. In our product, it is acceptable to drop the latest message until free payload is available.
In my mind, it it not that difficult to implement the above function. For example, we can place a reference count in sample data header before sending. The reference count is the count of subscribers for this message. Each one of subscriber will decrease by 1 after processing the sample data. At the publisher side, the reference count will be checked untill it become zero. Then we can put it to free payload queue.
Of course, semaphore will be used to do sync between subsribers.

Correct me if I am wrong, or give me some advice about how to deal with this kind of problem now, many thanks!

@crazyhank
Copy link
Author

@MiguelCompany , in ROS2 galactic release, it seems that data sharing is disabled by default, so this problem will never happen, but we will not get benefit from data sharing. Do you have any comment on this?

@MiguelCompany
Copy link
Member

@crazyhank Data sharing is ready to use on ROS 2 rolling. We are backporting it to galactic here, and are still preparing some documentation on how to use it.

Regarding your suggestion on having a counter of subscribers, it is not as simple as it may seem. New subscribers could get in just after the sample was written. A subscriber could crash, and then never decrement the counter. The mechanism we have is similar, but it is based on the sequence number of the samples.

That said, I think there could be a way to achieve what you pretend, but it may be a bit cumbersome.

  • First, you will need your writer's history QoS to be KEEP_ALL. This will make the write call on the writer wait for the samples to be acknowledged before they are reused.
  • Second, use two readers instead of one. Call take on the first reader, process the sample, then call take on the second reader. This way, full acknowledge will not take place till the second reader take call is made.

@crazyhank
Copy link
Author

True, a crash may happen for a subsriber. I think we can add a timeout mechanism, if timeout happened, we can remove this subscriber from current subscriber list and get the payload back.

@MiguelCompany
Copy link
Member

@crazyhank The mechanism you mention is already in place with the participant lease duration.

@JLBuenoLopez JLBuenoLopez changed the title If loan message is enabled, subscriber will send ack to publisher in take function instead of return loan function, data corruption may happen If loan message is enabled, subscriber will send ack to publisher in take function instead of return loan function, data corruption may happen [13672] Jan 31, 2022
@eProsima eProsima locked and limited conversation to collaborators Jan 31, 2022
@JLBuenoLopez JLBuenoLopez converted this issue into discussion #2449 Jan 31, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants