Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove entries from wantlists when their related requests are cancelled #3182

Merged
merged 2 commits into from
Sep 4, 2016

Conversation

whyrusleeping
Copy link
Member

This PR has been a long time coming. A recent race condition fix has lead to us actually rebroadcasting our wantlist every ten seconds as we claim to do (instead of just pretending to). As a result, the gateways have started catching on fire. The reason for this is that since we never remove elements from the wantlist, we now rebroadcast thousands of wantlist entries every ten seconds. Thats kinda bad.

The solution, is to reference count wantlist entries and remove them when no active requests are referencing them any longer.

A few notable sub-fixes happened to make this work:

  • wantlist Entries are now stored as pointers. The reference count wasnt being changed before
  • GetBlocks now tracks which blocks it has successfully fetch so it can cancel the rest as needed.
  • Remove contexts from wantlist entries, it wasnt useful

@whyrusleeping whyrusleeping added status/in-progress In progress need/review Needs a review and removed status/in-progress In progress labels Sep 2, 2016
@whyrusleeping whyrusleeping added this to the ipfs-0.4.3-rc4 milestone Sep 3, 2016
@@ -75,6 +75,7 @@ func (pm *WantManager) WantBlocks(ctx context.Context, ks []key.Key) {
}

func (pm *WantManager) CancelWants(ks []key.Key) {
log.Infof("cancel wants: %s", ks)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably shouldn't be info.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehh, I think its best to keep it at Info for now. Its one of the few things thats accurately reporting whats going on in this ipfs node (hey, we're requesting a block! oh, now we're not looking for that block anymore)

@Kubuxu
Copy link
Member

Kubuxu commented Sep 3, 2016

SGTM, 2nd commit above is failing tests so it needs to be squashed or fixed.

@whyrusleeping whyrusleeping force-pushed the fix/bitswap/want-cancel branch from 2210ef6 to 1548c8a Compare September 4, 2016 13:45
@whyrusleeping
Copy link
Member Author

Squashed.

@whyrusleeping whyrusleeping merged commit d6092eb into version/0.4.3-rc4 Sep 4, 2016
@whyrusleeping whyrusleeping deleted the fix/bitswap/want-cancel branch September 4, 2016 14:05
@jbenet
Copy link
Member

jbenet commented Sep 5, 2016

actually rebroadcasting our wantlist every ten seconds as we claim to do

A long time ago we agreed not to send wantlists so often, and to instead endeavor to send correct diffs. (i.e. sending a wantlist should be a rare event, once on connect and then maybe every 10m to be safe (but again, diffs alone should work, because they happen over a reliable channel, so if a diff is lost the reliable channel would break, triggering a reconnection and thus new wantlist. I think re-sending want lists should in general not be necessary, and it's only a safety measure)

a good metric to collect is when a received wantlist differs from our view on it (i.e. when it was useful to have sent a wantlist update)

we never remove elements from the wantlist

What do you mean by that? we should and i thought we DID remove elements from wantlist after they're retrieved or cancelled.

@Kubuxu
Copy link
Member

Kubuxu commented Sep 5, 2016

What do you mean by that? we should and i thought we DID remove elements from wantlist after they're retrieved or cancelled.

We weren't removing elements on context cancel.

@jbenet
Copy link
Member

jbenet commented Sep 5, 2016

@Kubuxu thanks

@whyrusleeping
Copy link
Member Author

@jbenet My bad, this isnt actually sending wantlists over again... It used to be that code, but now its only purpose is to periodically search for new providers for content we want. So every ten seconds, for each key in our wantlist, we call FindProviders and connect to each of them (which causes us to sync up wantlists).

Some options here:

  • Do this less frequently (every 30 seconds would likely help significantly)
  • Don't search for entire wantlist every time, maybe select a subset each time?

@daviddias
Copy link
Member

So, since we already look for the root hash of every request when the blocks wanted are first added, that should give us a pretty good amount of walking the DHT to connect to peers that will have the blocks (unless we are unlucky and connected to the best peer already and that peer only has a really small set of the blocks).

Doing something like 30 seconds, send a query for a random block from the wantlist will reduce the bandwidth dramatically, the problem is that we can't tell if that will impact a ton our ability to search, but it is a good experiment. A node well peered should not have these issues. so it might be good to have a smaller interval for short lived nodes and then increase the interval as the node has more and more connections (i.e from 10 secs to 60 seconds). The outcomes of this will change a lot once the network grows bigger, right now, it is mostly easy to get well peered (except of the symmetric NAT traversal cases)

@jbenet
Copy link
Member

jbenet commented Sep 9, 2016

  • Do this less frequently (every 30 seconds would likely help significantly)

yes, but downside is also longer downloads. the "urgency of a request" is not currently captured but could inform these decisions

  • Don't search for entire wantlist every time, maybe select a subset each time?

yes. +1 to this. another option is to use "trickle" (exponential) backoffs. They work really well in networking in general:

When nodes agree, they slow their communication
rate exponentially, such that in a stable state
nodes send at most a few packets per hour. Instead of
flooding a network with packets, the algorithm controls the
send rate so each node hears a small trickle of packets, just
enough to stay consistent.

Trickle is used differently -- more for broadcasts to regain consistency in routing protocols. But here, "agree" would mean "i searched the dht but found nothing". each successive query like that could have an exponential backoff, with a maximum (of say 30min or 1hr). That way, things that are "not around right now" don't place a strain on the network.

The way to implement this here is:

type trickleCfg struct {
  min time.Duration
  max time.Duration
}

var bsGetProvsTrickle := trickleCfg{time.Second * 5, time.Hour}

// keep one of these per key in wantlist
type trickleTimer struct {
  interval time.Duration  // initialized to bsGetProvsTrickle.Min
  next time.Time // initialized to now + interval
}

func (t *trickleTimer) Reset(cfg *TrickleCfg) {
  t.interval = cfg.Min
  t.next = time.Now().Add(cfg.Min)
}

// algorithm for updating.
toSearch = []cid.Cid
for cid, timer := keyTimers {
  if now > timer.next {
    toSearch = append(toSearch, cid)
    timer.next = timer.next.Add(timer.interval)
    timer.interval = timer.interval * 2 // exponential backoff (trickle)
    if timer.interval > bsGetProvsTrickle.Max {
      timer.interval = bsGetProvsTrickle.Max
    }
  }
}
go getProviders(toSearch)

// IMPORTANT:
// - if new request comes in, search for providers then AND
// call timer.Reset(bsGetProvsTrickle)
// - dont need to bring it back down ever, because if we FIND 
// or cancel the key, then the timer goes away entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants