
fix(bitswap/client/msgq): prevent duplicate requests #691

Merged
gammazero merged 6 commits into ipfs:main from message-queue-duplicates on Nov 25, 2024

Conversation

@Wondertan Wondertan commented Oct 17, 2024

Previously, in-progress requests could be re-requested during the periodic rebroadcast: the queue sends requests and, while a response is still pending, the rebroadcast event fires. The rebroadcast changes previously sent WANTs back to pending and sends them again in a new message, duplicating some WANT requests.

The solution here is to ensure a WANT has been in sent status for long enough before bringing it back to pending. It uses the existing sendAt map, which tracks when every CID was sent: on each rebroadcast event, a WANT is only moved back to pending if it has been outstanding for longer than rebroadcastInterval.
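
A minimal sketch of that check (the standalone helper wantsToRebroadcast and its signature are illustrative assumptions, not the literal diff in messagequeue.go):

	// Hypothetical illustration of the age check described above.
	// sendAt is the existing map of CID -> time the WANT was last sent.
	package msgqsketch

	import (
		"time"

		"github.com/ipfs/go-cid"
	)

	// wantsToRebroadcast returns only the CIDs that have been in "sent" status
	// for at least one full rebroadcast interval; only these are moved back to
	// "pending" and re-sent.
	func wantsToRebroadcast(sendAt map[cid.Cid]time.Time, now time.Time, rebroadcastInterval time.Duration) []cid.Cid {
		var stale []cid.Cid
		for c, sentAt := range sendAt {
			if now.Sub(sentAt) >= rebroadcastInterval {
				stale = append(stale, c)
			}
		}
		return stale
	}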

@Wondertan Wondertan requested a review from a team as a code owner October 17, 2024 18:42
	if mq.bcstWants.sent.Len() == 0 && mq.peerWants.sent.Len() == 0 {
		return false
	}

	mq.rebroadcastIntervalLk.RLock()
	rebroadcastInterval := mq.rebroadcastInterval
@Wondertan (Member, Author) Oct 17, 2024
Alternatively, this could be a different new parameter/constant.

@Wondertan (Member, Author)

I tested this on a k8s cluster and with a local node connected to it. It works as expected, but I believe this would benefit a lot from a proper test. Unfortunately, I can't allocate time to writing one. It's not that straightforward.

@Wondertan (Member, Author) commented Oct 17, 2024

For context, I detect duplicates with a custom multihash that logs when the same data is hashed again. This essentially uncovered #690, as well as this issue.
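
A rough sketch of that debugging trick (illustration only: a hash.Hash wrapper that logs when identical data is hashed again; wiring it into the go-multihash registry as a custom multihash is omitted, and none of these names come from the PR):

	package main

	import (
		"crypto/sha256"
		"encoding/hex"
		"hash"
		"log"
		"sync"
	)

	// dupDetectingHasher wraps a hash.Hash and logs whenever the same digest is
	// produced more than once, i.e. whenever identical data is hashed again.
	type dupDetectingHasher struct {
		hash.Hash
	}

	var (
		seenMu sync.Mutex
		seen   = map[string]int{}
	)

	func (d *dupDetectingHasher) Sum(b []byte) []byte {
		digest := d.Hash.Sum(nil)
		key := hex.EncodeToString(digest)

		seenMu.Lock()
		seen[key]++
		n := seen[key]
		seenMu.Unlock()

		if n > 1 {
			log.Printf("same data hashed %d times: %s...", n, key[:12])
		}
		return append(b, digest...)
	}

	func main() {
		// Hash the same block twice with fresh hasher instances;
		// the second round logs a duplicate.
		for i := 0; i < 2; i++ {
			h := &dupDetectingHasher{Hash: sha256.New()}
			h.Write([]byte("same block"))
			h.Sum(nil)
		}
	}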

@Wondertan Wondertan force-pushed the message-queue-duplicates branch 3 times, most recently from d193c2f to 9020b71 Compare October 19, 2024 23:20

codecov bot commented Oct 19, 2024

Codecov Report

Attention: Patch coverage is 96.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 60.39%. Comparing base (37756ce) to head (d67691d).
Report is 1 commit behind head on main.

Files with missing lines                                Patch %   Lines
...tswap/client/internal/messagequeue/messagequeue.go   96.00%    1 Missing ⚠️


@@            Coverage Diff             @@
##             main     #691      +/-   ##
==========================================
+ Coverage   60.36%   60.39%   +0.03%     
==========================================
  Files         244      244              
  Lines       31034    31044      +10     
==========================================
+ Hits        18734    18750      +16     
+ Misses      10630    10626       -4     
+ Partials     1670     1668       -2     
Files with missing lines                                Coverage Δ
bitswap/client/wantlist/wantlist.go                     90.90% <ø> (-0.88%) ⬇️
...tswap/client/internal/messagequeue/messagequeue.go   84.06% <96.00%> (+0.53%) ⬆️

... and 13 files with indirect coverage changes

@lidel lidel added the need/triage Needs initial labeling and prioritization label Oct 22, 2024
@gammazero gammazero added need/analysis Needs further analysis before proceeding need/maintainers-input Needs input from the current maintainer(s) and removed need/triage Needs initial labeling and prioritization labels Oct 22, 2024
Comment on lines 491 to 492
	if mq.bcstWants.sent.Len() == 0 && mq.peerWants.sent.Len() == 0 {
		return false
Contributor:

This is probably good to leave since it avoids Lock/Unlock of mq.rebroadcastIntervalLk and time.Now().

	if mq.bcstWants.sent.Len() == 0 && mq.peerWants.sent.Len() == 0 {
		return 0
	}

@Wondertan (Member, Author):

The lock exists only for testing. The interval is never changed outside of the unit test. Thus, I don't see any contention that the zero-length check could prevent.

Contributor:

I think the comment is not about contention but about saving unnecessary lock/unlock calls, but if this only happens every 30 seconds, then it's probably not very important.

@gammazero (Contributor)

triage note: This is a good candidate for testing in rainbow staging to observe performance differences.

@gammazero gammazero added status/blocked Unable to be worked further until needs are met need/author-input Needs input from the original author and removed need/maintainers-input Needs input from the current maintainer(s) labels Oct 29, 2024
@Wondertan Wondertan force-pushed the message-queue-duplicates branch from 9020b71 to 5dc309b Compare October 29, 2024 19:17
Previously, in-progress requests could be re-requested during periodic rebroadcast.
The queue sends requests and, while awaiting a response, the rebroadcast event happens.
The rebroadcast event changes previously sent WANTs back to pending and sends them again in a new message.

The solution here is to ensure a WANT was in sent status for long enough before bringing it back to pending.
This utilizes the existing `sendAt` map, which tracks when every CID was sent.
@Wondertan Wondertan force-pushed the message-queue-duplicates branch from 5dc309b to 993c48c Compare October 29, 2024 19:29
@lidel lidel requested a review from hsanjuan November 12, 2024 17:35
@hsanjuan (Contributor) left a comment

The main thing to consider here is that:

  • before, a "want" would be rebroadcast at most 30 seconds after it was sent (could be as little as 0.1s)
  • after, a "want" is rebroadcast only once at least 30 seconds have passed since it was sent (could be as much as 59.9s).

In that respect the code looks good.

I am not sure how much of an improvement this is in practice (perhaps clients were sometimes lucky to hit a short rebroadcast period), but it at least makes clients more respectful, and performance should not be based on "luck".

I think we can test on staging and discuss in the next triage if we accept the change.
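
To make that bound concrete, a toy simulation (not code from the PR; assumes a rebroadcast tick every interval plus the new "at least one full interval" age check):

	package main

	import (
		"fmt"
		"time"
	)

	// resendDelay simulates a rebroadcast timer that fires every interval and
	// only resends a want once it has been outstanding for at least one full
	// interval. sentOffset is how long after the previous tick the want was sent.
	func resendDelay(sentOffset, interval time.Duration) time.Duration {
		for tick := interval; ; tick += interval {
			if tick-sentOffset >= interval { // the new age check
				return tick - sentOffset
			}
		}
	}

	func main() {
		interval := 30 * time.Second
		offsets := []time.Duration{
			100 * time.Millisecond,                // resent ~59.9s later (worst case)
			15 * time.Second,                      // resent 45s later
			29*time.Second + 900*time.Millisecond, // resent ~30.1s later (best case)
		}
		for _, off := range offsets {
			fmt.Printf("sent %v after the last tick -> resent %v later\n", off, resendDelay(off, interval))
		}
	}

The resend delay always lands in [interval, 2*interval), matching the "at least 30 seconds ... could be 59.9s" bound above.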


@gammazero gammazero added need/maintainers-input Needs input from the current maintainer(s) and removed need/analysis Needs further analysis before proceeding need/author-input Needs input from the original author status/blocked Unable to be worked further until needs are met labels Nov 19, 2024
@gammazero (Contributor)

Need to test on staging before merge.

@gammazero (Contributor) commented Nov 25, 2024

This PR does make sure that the client does not resend wants to any peer before the rebroadcast interval has elapsed. In doing this, it also makes some peers that were just short of the interval wait for another rebroadcast interval.

In summary, it changes from "wait no more than X to resend wants" to "wait at least X, but no more than 2X, to resend wants".

If we want the PR, then we should consider calling rebroadcastWantlist at half (or less) of the rebroadcast interval. @hsanjuan WDYT?

Consider changing line 410 to

const checksPerInterval = 2
mq.rebroadcastTimer = mq.clock.Timer(mq.rebroadcastInterval / checksPerInterval)

That will change the logic to "wait at least X, but no more than X+(X/checksPerInterval), to resend wants".

@gammazero gammazero self-assigned this Nov 25, 2024
@gammazero gammazero merged commit e2d2f36 into ipfs:main Nov 25, 2024
13 checks passed