Skip to content
This repository has been archived by the owner on Sep 9, 2022. It is now read-only.

call Stream.Reset instead of Stream.Close #76

Merged
merged 7 commits into from
May 22, 2019
Merged

Conversation

Stebalien
Copy link
Member

@Stebalien Stebalien commented May 8, 2019

May fix ipfs/kubo#6237

Basically,

  1. We hang while closing a stream (because Close waits). Also, Closing a stream should cancel pending writes. go-mplex#9
  2. This blocks the connection manager because it assumes that close doesn't wait.

This may also fix a stream leak.

@ghost ghost assigned Stebalien May 8, 2019
@ghost ghost added the status/in-progress In progress label May 8, 2019
@vyzo vyzo self-requested a review May 8, 2019 22:24
@vyzo
Copy link
Contributor

vyzo commented May 8, 2019

tests fail...

@Stebalien
Copy link
Member Author

Yeah, we expect a nice EOF.

@vyzo
Copy link
Contributor

vyzo commented May 8, 2019

we need the Shutdown call already.

@Stebalien
Copy link
Member Author

  1. We need Close to mean "close both ways" (with a sane default timeout), Reset to mean "throw everything away", and then yeah, CloseWrite and CloseRead.

@vyzo
Copy link
Contributor

vyzo commented May 9, 2019

The test failure seems insurmountable, we'll need to change the stream API for this to work :(

@Stebalien
Copy link
Member Author

Not necessarily. The test failure is due to the assumption that conn.Close() flushes. However, conn.Close() actually resets in our TCP transport as well (for efficiency) so this isn't really an issue.

@Stebalien
Copy link
Member Author

But I agree this sucks.

@@ -28,29 +29,53 @@ func (n *NetAddr) String() string {
return fmt.Sprintf("relay[%s-%s]", n.Remote, n.Relay)
}

func (c *Conn) Close() error {
return c.stream.Reset()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is relevant here. This is how I ended up doing it in gostream :

https://github.com/hsanjuan/go-libp2p-gostream/blob/master/conn.go#L38

// Close closes the connection.
// Any blocked Read or Write operations will be unblocked and return errors.
func (c *conn) Close() error {
	if err := c.s.Close(); err != nil {
		c.s.Reset()
		return err
	}
	go pnet.AwaitEOF(c.s)
	return nil
}

If I remember well, bluntly resetting libp2p streams on Close() caused errors on the other side on situations where the stream closing was supposed to be a clean operation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but we need to be careful:

  1. Concurrent reads aren't safe. We need to interrupt any concurrent reads (somehow) before we can do this. We should be able to do this by setting a read deadline but... we'll have to make sure to fix any/all of our multiplexers/transports (deadlines currently don't interrupt).
  2. We probably want to set some timeouts/deadlines.
  3. We should probably call Close() in the background as well. That will mimic a normal close(tcpFileDescriptor) call (which will let the kernel flush in the background).

IMO, the real issue here is that TCPConn.Close is actually a very fancy best-effort function that does a bunch of dirty work in the background. Ideally, even if we make Close close both directions, we wouldn't be that sloppy.

@Stebalien
Copy link
Member Author

Stebalien commented May 21, 2019

@vyzo, this is getting really critical and both the TCP and the QUIC transports reset the underlying connection on close without flushing anything (I specifically modified the TCP transport to do this to avoid unnecessary work). This bug likely just crashed our gateways and has been causing problems for a ton of people.

The only other solution I can think of is to:

  1. Close the stream. We'll have to fix any bugs we have where closing streams doesn't necessarily interrupt any in-progress writes (an issue for both yamux and mplex).
  2. Set a read deadline on the stream. Make deadlines interrupt and coalesce writes whyrusleeping/yamux#28 fixes read deadlines for yamux but they're still broken in mplex.
  3. Wait for the current reader to exit (read lock).
  4. Read on the stream waiting for an EOF or a timeout.
  5. On timeout, reset.
  6. Do all this in a goroutine. so we don't stall anything.

But that requires fixing multiple annoying bugs. I've started on this path (with the yamux fix) but it feels like a waste given that we don't actually make any guarantees about Close flushing.

@vyzo
Copy link
Contributor

vyzo commented May 21, 2019

Sigh... If it has been identified as the critical bug, then we got to merge it.

Copy link
Contributor

@vyzo vyzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's fix the test and merge it as it is a critical bug.

@Stebalien
Copy link
Member Author

Sigh... If it has been identified as the critical bug, then we got to merge it.

Well, Close is stalling. I'll try shipping a patch to some of our partners to make sure this is the bug.

@vyzo
Copy link
Contributor

vyzo commented May 22, 2019

This needs a rebase for the go mod debacle (go.sum conflicts)

Stebalien added 2 commits May 22, 2019 16:30
May fix ipfs/kubo#6237

Basically,

1. We hang while closing a stream (because `Close` waits).
2. This blocks the connection manager because it assumes that close _doesn't_ wait.

This may also fix a stream leak.
@vyzo vyzo force-pushed the fix/stream-close branch from 46b4e95 to 4a96ae7 Compare May 22, 2019 13:30
@vyzo
Copy link
Contributor

vyzo commented May 22, 2019

rebased and resolved the conflict; should be ready for merge.

@vyzo
Copy link
Contributor

vyzo commented May 22, 2019

we also need the companion in #77, so that the connection manager doesn't kill the underlying hop relay connections when they have stop streams.

Copy link
Member

@raulk raulk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial feedback; still reviewing.

conn.go Outdated Show resolved Hide resolved
conn.go Show resolved Hide resolved
conn.go Outdated Show resolved Hide resolved
raulk
raulk previously requested changes May 22, 2019
Copy link
Member

@raulk raulk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous status should've been "request changes".

@vyzo vyzo dismissed raulk’s stale review May 22, 2019 15:40

fixed bug identfied, the wrapping is not an issue.

@Stebalien Stebalien merged commit 1f1395c into master May 22, 2019
@Stebalien Stebalien deleted the fix/stream-close branch May 22, 2019 17:43
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Too many open files (regression?)
4 participants