prevent wantmanager from leaking goroutines (and memory) #1356

Merged Jun 12, 2015 (2 commits)
61 changes: 37 additions & 24 deletions exchange/bitswap/wantmanager.go
@@ -137,36 +137,49 @@ func (mq *msgQueue) runQueue(ctx context.Context) {
	for {
		select {
		case <-mq.work: // there is work to be done

			err := mq.network.ConnectTo(ctx, mq.p)
			if err != nil {
				log.Noticef("cant connect to peer %s: %s", mq.p, err)
				// TODO: cant connect, what now?
				continue
			}

			// grab outgoing message
			mq.outlk.Lock()
			wlm := mq.out
			if wlm == nil || wlm.Empty() {
				mq.outlk.Unlock()
				continue
			}
			mq.out = nil
			mq.outlk.Unlock()

			// send wantlist updates
			err = mq.network.SendMessage(ctx, mq.p, wlm)
			if err != nil {
				log.Noticef("bitswap send error: %s", err)
				// TODO: what do we do if this fails?
			}
			mq.doWork(ctx)
		case <-mq.done:
			return
		}
	}
}

func (mq *msgQueue) doWork(ctx context.Context) {
	// allow ten minutes for connections
	// this includes looking them up in the dht
	// dialing them, and handshaking
	conctx, cancel := context.WithTimeout(ctx, time.Minute*10)
	defer cancel()
Member

understood that it's important to have a timeout here, but 1min is way too short. 15-30min is more ok. these timeouts affect people in the worst of situations, where a pageload could take >5min. It's awful, yes, but that's the rest of the world. We must work there too.

Member Author

alright, ten minutes then? tcp times out at 5.

Contributor

On Thu, Jun 11, 2015 at 01:26:40PM -0700, Jeromy Johnson wrote:

> +func (mq *msgQueue) doWork(ctx context.Context) {
> +	// allow a minute for connections
> +	// this includes looking them up in the dht
> +	// dialing them, and handshaking
> +	conctx, cancel := context.WithTimeout(ctx, time.Minute)
> +	defer cancel()
>
> alright, ten minutes then? tcp times out at 5.

Do we need explicit sub-part timeouts at all? Can't the caller specify their own timeout when they submit a task? Looking into the context handling here, runQueue gets launched with one context, and it passes that context down to doWork. But maybe the doWork context should be a union of the runQueue context and a second context that gets passed in via the channel? With a check in runQueue to see whether any context-related doWork errors were due to a msgQueue-context timeout/cancel (in which case msgQueue should close down) or to the channel-passed context (in which case the error gets passed back to the caller).

Member

> Do we need explicit sub-part timeouts at all? Can't the caller specify their own timeout when they submit a task?

we could if all the callers/roots had timeouts enforced somehow. we don't do this perfectly yet.

> But maybe the doWork context should be a union of the runQueue context and a second context that gets passed in via the channel?

could use https://github.com/jbenet/go-context/blob/master/ext/dagctx.go for this.
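For illustration, here is a minimal sketch of that union-of-two-contexts idea using today's standard context package. The mergeCancel helper and the ctxutil package name are made up for this sketch; this is not the dagctx API linked above.

package ctxutil

import "context"

// mergeCancel derives a context from parent a that is also cancelled
// when parent b is done. The spawned goroutine exits as soon as either
// context finishes or cancel is called, so it does not itself leak.
func mergeCancel(a, b context.Context) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(a)
	go func() {
		select {
		case <-b.Done():
			cancel()
		case <-ctx.Done():
		}
	}()
	return ctx, cancel
}

With something like this, doWork could run against mergeCancel(queueCtx, taskCtx); after a failure, checking queueCtx.Err() versus taskCtx.Err() would tell whether the msgQueue itself should shut down or the error should simply be reported back to the task's caller.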


	err := mq.network.ConnectTo(conctx, mq.p)
	if err != nil {
		log.Noticef("cant connect to peer %s: %s", mq.p, err)
		// TODO: cant connect, what now?
		return
	}

	// grab outgoing message
	mq.outlk.Lock()
	wlm := mq.out
	if wlm == nil || wlm.Empty() {
		mq.outlk.Unlock()
		return
	}
Member

this part is different than the original source -- the `mq.out = nil` now gets executed every time. this means that if `wlm.Empty()` is true, then `mq.out` will always get dropped and forgotten (`mq.out = nil`). is this still correct? Checked `Empty()` and this should be ok, but not sure if this pointer is expected elsewhere.

Member Author

hrm, fair. i can put in:

if !wlm.Empty() {
    mq.out = nil
}

instead of blindly setting it nil.

Setting it nil here doesn't change the behaviour; not nil'ing it out was just an allocation optimization.

	mq.out = nil
	mq.outlk.Unlock()

	sendctx, cancel := context.WithTimeout(ctx, time.Minute*5)
	defer cancel()
Member

maybe 5min?

i'm not sure there should be a timeout on this one -- won't the underlying connection just break and unroll all the way up here?


	// send wantlist updates
	err = mq.network.SendMessage(sendctx, mq.p, wlm)
	if err != nil {
		log.Noticef("bitswap send error: %s", err)
		// TODO: what do we do if this fails?
		return
	}
}

func (pm *WantManager) Connected(p peer.ID) {
	pm.connect <- p
}
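
The leak-prevention pattern visible in runQueue above (a select that watches a done channel alongside the work channel) looks roughly like this in isolation. This is a minimal, self-contained sketch: the queue type, the Shutdown method name, and the field names are assumptions for illustration, not the PR's actual API.

package main

import (
	"fmt"
	"time"
)

// queue illustrates the done-channel pattern: the worker's select
// watches both work and done, so closing done lets the goroutine
// return instead of blocking forever.
type queue struct {
	work chan struct{}
	done chan struct{}
}

func newQueue() *queue {
	q := &queue{work: make(chan struct{}, 1), done: make(chan struct{})}
	go q.runQueue()
	return q
}

func (q *queue) runQueue() {
	for {
		select {
		case <-q.work:
			fmt.Println("doing work")
		case <-q.done:
			fmt.Println("shutting down")
			return
		}
	}
}

// Shutdown (hypothetical name) closes done, releasing the goroutine.
func (q *queue) Shutdown() { close(q.done) }

func main() {
	q := newQueue()
	q.work <- struct{}{}
	time.Sleep(50 * time.Millisecond)
	q.Shutdown()
	time.Sleep(50 * time.Millisecond)
}

Without the done case, a queue whose peer never produces more work would keep its goroutine (and any message it still references) alive for the life of the process, which is the leak this PR targets.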