
net: read from TCP socket with read deadline, under load, loses data on darwin/amd64 and Windows/amd64 #70395

Closed
glycerine opened this issue Nov 17, 2024 · 7 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin

Comments

@glycerine

glycerine commented Nov 17, 2024

Go version

go 1.23.3 on darwin/amd64 (edit: and Windows/amd64; see comment below)

Output of go env in your module/workspace:

$ go env
GO111MODULE='auto'
GOARCH='amd64'
GOBIN=''
GOCACHE='/Users/jaten/Library/Caches/go-build'
GOENV='/Users/jaten/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/jaten/go/pkg/mod'
GONOPROXY='github.com/glycerine/*'
GONOSUMDB='github.com/glycerine/*'
GOOS='darwin'
GOPATH='/Users/jaten/go'
GOPRIVATE='github.com/glycerine/*'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/darwin_amd64'
GOVCS=''
GOVERSION='go1.23.3'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/jaten/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD=''
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/b6/q5ms8t4x5cjcpcpyjrb1h91c0000gn/T/go-build2671965393=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

On MacOS Sonoma 14.0, amd64 (Intel) architecture, I observe
occasional TCP data loss when reading from a net.Conn
with a read deadline.

I recorded packets and can see their delivery to
the client (at the OS level) in Wireshark, but the
Go client would not receive that data. The expected data
was the response to an RPC call, so my client
would, at times, time out waiting for the call response.
This caused me to investigate, and I've attempted to boil
it down to a minimal reproducer, shown below. The loss
was only observable under load; I needed to run
at least 12 clients on my 4-core Mac to start getting
losses. A larger value of GOMAXPROCS (say 10 * cores) makes it
happen faster.
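
For example, inflating the thread count via the environment makes the failure appear sooner (a hypothetical invocation; the filename matches the stack traces below):

GOMAXPROCS=40 go run darwin_lossy_tcp_read_timeout.go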

If need be, I can also provide reproduction in the original RPC library at
a particular historical point, but the commands to reproduce
and the interpretation are more complex. I'm hoping
the below will suffice. Let me know if it does not.

I did not observe the same behavior on Linux. I did not
detect the same behavior under quic-go. So it seems to
be specific to MacOS and TCP sockets.

package main

import (
	"encoding/binary"
	"fmt"
	"net"
	"runtime"
	"strings"
	"time"
)

func service(conn net.Conn) {

	buff := make([]byte, 364)
	var i uint64
	var by [8]byte

	for i = 0; i < 1e8; i++ {

		by = toBytes(i)
		copy(buff[:8], by[:])

		// send sequential numbers to client, starting at 0.
		if _, err := conn.Write(buff); err != nil {
			panic(err)
		}
		runtime.Gosched()
	}
}

func listener(ln net.Listener) {
	for {
		c, err := ln.Accept()
		if err != nil {
			panic(err)
		}
		go service(c)
	}
}

func toBytes(n uint64) (r [8]byte) {
	binary.BigEndian.PutUint64(r[:], n)
	return
}

func fromBytes(by []byte) uint64 {
	return binary.BigEndian.Uint64(by)
}

func client(id int) {
	conn, err := net.Dial("tcp", "127.0.0.1:5555")
	if err != nil {
		panic(err)
	}

	buff := make([]byte, 364)
	var i uint64
	timeout := time.Millisecond

	t0 := time.Now()
	defer func() {
		fmt.Printf("elapsed since starting: %v\n", time.Since(t0))
	}()
	for i = 0; i < 1e8; i++ {

		// The repeated-read loop. Here we
		// read with timeouts; as if we also
		// need to periodically check for other events, e.g. shutdown, pause, etc.
		//
		// When we do so, we observe data loss under load.
		// Originally the library that revealed this issue used
		// 100 msec timeouts and 50 client processes.
		// Under this reproducer, when we put everything in
		// one process for ease of repro, we cranked it down
		// to 1 msec to make it happen faster and more reliably.
		//
		// On darwin, at go1.23.3, the read timeouts occasionally
		// cause us to lose TCP data. Estimated at around every 1 in 500K (maybe)
		// reads when under load. We could see the TCP data packets being sent
		// to the client in Wireshark, but the Go client would never get
		// delivery of the response to their RPC call.
		// MacOS Sonoma 14.0 on darwin/amd64.
		//
		// I was expecting to see a skipped number and not zero (to match
		// the original library's experience of not getting an RPC
		// reply), but in this compressed reproducer, observing
		// a zero instead of the next expected number
		// still qualifies as a bug. This may be as close as we can
		// easily get with this little code (the original library is much bigger).
		for {
			err := readFull(conn, buff, &timeout)
			if err != nil {
				//fmt.Printf("err = '%v'\n", err)
				r := err.Error()
				if strings.Contains(r, "i/o timeout") || strings.Contains(r, "deadline exceeded") {
					continue // normal, expected, timeout
				}
				panic(err)
			}
			// INVAR: err == nil, so we can exit the repeated-read loop.
			break
		}
		j := fromBytes(buff[:8])
		if i != j {
			panic(fmt.Sprintf("expected next number: %v, but we got %v", i, j))
		}
		if i%100000 == 0 {
			fmt.Printf("at i = %v\n", i)
		}
	}
}

var zeroTime = time.Time{}

// readFull reads exactly len(buf) bytes from conn
func readFull(conn net.Conn, buf []byte, timeout *time.Duration) error {

	if timeout != nil && *timeout > 0 {
		conn.SetReadDeadline(time.Now().Add(*timeout))
	} else {
		// do not let previous deadlines contaminate this one.
		conn.SetReadDeadline(zeroTime)
	}

	need := len(buf)
	total := 0
	for total < len(buf) {
		n, err := conn.Read(buf[total:])
		total += n
		if total == need {
			if err != nil {
				// read complete; any error here is probably just io.EOF.
				panic(err)
			}
			return nil
		}
		if err != nil {
			return err
		}
	}
	return nil
}

func startClients() {
	for i := 0; i < 50; i++ {
		go client(i)

		time.Sleep(time.Millisecond * 2)
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:5555")
	if err != nil {
		panic(err)
	}
	go listener(ln)

	startClients()
	select {}
}

/*
sample output 1
...
at i = 200000
elapsed since starting: 15.691497196s
panic: expected next number: 203963, but we got 0

goroutine 111 [running]:
main.client(0x0?)
	/Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:98 +0x2fa
created by main.startClients in goroutine 1
	/Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:139 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sat Nov 16 21:28:33


sample output 2
...
at i = 400000
elapsed since starting: 35.007915405s
panic: expected next number: 420721, but we got 0

goroutine 68 [running]:
main.client(0x0?)
	/Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:98 +0x2fa
created by main.startClients in goroutine 1
	/Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:139 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sat Nov 16 21:29:57

sample output 3

at i = 100000
elapsed since starting: 12.453088565s
panic: expected next number: 148173, but we got 0

goroutine 125 [running]:
main.client(0x0?)
	/Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:105 +0x2fa
created by main.startClients in goroutine 1
	/Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:146 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sat Nov 16 21:37:24

sample output 4

...
at i = 800000
elapsed since starting: 1m10.034763283s
panic: expected next number: 829960, but we got 0

goroutine 130 [running]:
main.client(0x0?)
	/Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:112 +0x2fa
created by main.startClients in goroutine 1
	/Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:153 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sat Nov 16 21:41:49

*/

What did you see happen?

See above. It takes a variable amount of time to manifest. Sometimes 5 seconds, sometimes over a minute. Example output 4 from above:

...
at i = 800000
elapsed since starting: 1m10.034763283s
panic: expected next number: 829960, but we got 0

goroutine 130 [running]:
main.client(0x0?)
	/Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:112 +0x2fa
created by main.startClients in goroutine 1
	/Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:153 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sat Nov 16 21:41:49

What did you expect to see?

Continuous reads of the sequential integers (in the first 8 bytes), returned without error, rather than a 0.
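
An aside on the repeated-read loop in the reproducer: it classifies timeouts by substring-matching the error text. The standard library has structured checks for this; a minimal sketch (assumes imports "errors", "net", and "os"; not part of the reproducer):

// Instead of strings.Contains(err.Error(), "i/o timeout"):
var ne net.Error
if errors.As(err, &ne) && ne.Timeout() {
	// the read deadline expired; retry
}
// or, since Go 1.15:
if errors.Is(err, os.ErrDeadlineExceeded) {
	// the read deadline expired; retry
}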

@glycerine
Author

glycerine commented Nov 17, 2024

I'll record notes on how to reproduce with my RPC library, rpc25519, while they are still fresh in mind, in case the above small reproducer does not suffice.

repo: https://github.com/glycerine/rpc25519

check out the tag v1.1.42-alpha-repro-issue-70395 (should give a branch called repro-70395)

steps:

0. cd to the rpc25519 directory. We will skip using the Makefile so you don't need to install the greenpack serialization library.

1. mkdir -p certs; go build -o cli ./cmd/cli; go build -o srv ./cmd/srv

2. In the first terminal, run the server:

$ ./srv -quiet -stats 100000 -tcp -max 40 -s 127.0.0.1:9999

3. In the second terminal, run 12 clients:

$ for i in `seq 1 12`; do  ./cli -tcp -quiet -n 600000  -s 127.0.0.1:9999  & sleep 0.001; done

4. Watch the first terminal for the first number after the a=c( part to go negative. When it does, it indicates that the client timed out after 10 seconds waiting for an RPC response. You can press ctrl-c to kill the server, or watch as other clients time out. The negative numbers are simply the negative of the positive count of successful RPC calls before we saw a hang and then the 10 second timeout.

The number outside the square brackets gives the number of successful RPC calls before the hang and client death. The number inside the square brackets is the goroutine number for correlation with stack traces. They are sorted in ascending order by call count, so the negatives (starved out clients) will be first.

sample output

jaten@Js-MacBook-Pro ~/go/src/github.com/glycerine/rpc25519 ((HEAD detached at ae7b060)) $ srv -quiet -stats 100000 -tcp -max 40 -s 127.0.0.1:9999
2024/11/17 04:22:16 server.go:101: rpc25519.server Start() returned serverAddr = '127.0.0.1:9999'

srv.go:343 2024-11-16 22:22:19.438 -0600 CST sendLoopGoroNum = [35] for pairID = '1'
(2.983257461s since start). count of jobs done by each client:
 a=c(1[38])


srv.go:343 2024-11-16 22:22:19.439 -0600 CST sendLoopGoroNum = [37] for pairID = '2'

srv.go:343 2024-11-16 22:22:19.439 -0600 CST sendLoopGoroNum = [39] for pairID = '3'

srv.go:343 2024-11-16 22:22:19.441 -0600 CST sendLoopGoroNum = [22] for pairID = '4'

srv.go:343 2024-11-16 22:22:19.441 -0600 CST sendLoopGoroNum = [28] for pairID = '7'

srv.go:343 2024-11-16 22:22:19.441 -0600 CST sendLoopGoroNum = [24] for pairID = '5'

srv.go:343 2024-11-16 22:22:19.442 -0600 CST sendLoopGoroNum = [30] for pairID = '8'

srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [32] for pairID = '9'

srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [26] for pairID = '6'

srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [100] for pairID = '11'

srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [98] for pairID = '10'

srv.go:343 2024-11-16 22:22:19.445 -0600 CST sendLoopGoroNum = [102] for pairID = '12'
(5.051670882s since start). count of jobs done by each client:
 a=c(3881[99], 7739[103], 7870[36], 8233[25], 8484[38], 8664[23], 8801[101], 9044[97], 9056[27], 9240[40], 9344[29], 9645[31])

(7.154977927s since start). count of jobs done by each client:
 a=c(3881[99], 12948[36], 16697[25], 17487[38], 17538[103], 17988[23], 18005[31], 18132[101], 18379[97], 19029[40], 19790[29], 20127[27])

(9.196302984s since start). count of jobs done by each client:
 a=c(3881[99], 12948[36], 26494[38], 27300[97], 27437[25], 28015[103], 28275[29], 28394[23], 28414[31], 28659[101], 29820[40], 30364[27])

(11.187412775s since start). count of jobs done by each client:
 a=c(3881[99], 12948[36], 37285[31], 37428[38], 37494[29], 37679[23], 37841[97], 38111[103], 38477[25], 38601[101], 39869[27], 40387[40])

(13.229637568s since start). count of jobs done by each client:
 a=c(3881[99], 12948[36], 46382[31], 46715[23], 47350[97], 47594[29], 48305[25], 48696[38], 48785[103], 48968[101], 49757[27], 50620[40])


srv.go:454 2024-11-16 22:22:30.401 -0600 CST server sees io.EOF from receiveMessage
(15.226440599s since start). count of jobs done by each client:
 a=c(-3881[99], 12948[36], 56835[31], 57258[97], 57686[29], 57710[103], 57816[25], 57869[38], 58334[101], 58476[23], 59752[27], 61436[40])


srv.go:454 2024-11-16 22:22:32.774 -0600 CST server sees io.EOF from receiveMessage
(17.198700926s since start). count of jobs done by each client:
 a=c(-12948[36], -3881[99], 59320[103], 67595[23], 68176[31], 68258[97], 68926[38], 69577[29], 69616[101], 69854[25], 70306[27], 71544[40])

In the last log line above, two clients (serviced by server goroutines 36 and 99) have timed out so far. One client successfully made 12948 roundtrips before hanging. The other client got in only 3881 roundtrips before hanging.

For others encountering this issue: my current workaround is simply to never use a timeout on a read, and instead to close the connection from another goroutine to terminate the read (a sketch follows). Sub-optimal, but it works and alleviates the starvation of clients.
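
A minimal sketch of that workaround (illustrative only, with assumed names; not code from the library):

// Terminate a deadline-free Read by closing the conn from another goroutine.
done := make(chan struct{}) // closed when shutdown is requested
go func() {
	<-done
	conn.Close() // forces the pending Read to return an error
}()
buf := make([]byte, 364)
if _, err := io.ReadFull(conn, buf); err != nil {
	// errors.Is(err, net.ErrClosed) identifies our own shutdown
	// versus a genuine network error.
}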

@glycerine glycerine changed the title net: read from TCP socket under load losses data on darwin/amd64 net: read from TCP socket under load losses data on darwin/amd64 with read deadline Nov 17, 2024
@seankhliao seankhliao added OS-Darwin NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 17, 2024
@glycerine
Author

I can also see data loss on Windows amd64, using the first, short reproducer. I do not see the 2nd reproducer fail under Windows, which also suggests that the two may be getting at distinct issues.

~/go/src/github.com/glycerine/rpc25519/attic (master) $ go run darwin_lossy_tcp_read_timeout.go
...
at i = 0
at i = 0
elapsed since starting: 128.5632ms
panic: expected next number: 1623, but we got 0

goroutine 136 [running]:
main.client(0x0?)
        C:/Users/jaten/go/src/github.com/glycerine/rpc25519/attic/darwin_lossy_tcp_read_timeout.go:112 +0x2fa
created by main.startClients in goroutine 1
        C:/Users/jaten/go/src/github.com/glycerine/rpc25519/attic/darwin_lossy_tcp_read_timeout.go:153 +0x3d
exit status 2
jaten@DESKTOP-689SS63 ~/go/src/github.com/glycerine/rpc25519/attic (master) $ go version
go version go1.23.2 windows/amd64

@glycerine glycerine changed the title net: read from TCP socket under load losses data on darwin/amd64 with read deadline net: read from TCP socket under load loses data on darwin/amd64 with read deadline Nov 17, 2024
@glycerine glycerine changed the title net: read from TCP socket under load loses data on darwin/amd64 with read deadline net: read from TCP socket with read deadline, under load, loses data on darwin/amd64 and Windows/amd64 Nov 17, 2024
@glycerine
Author

the short reproducer might(?) be getting at the same issue as: #67748

@glycerine
Author

glycerine commented Nov 18, 2024

I did some further debugging. It convinced me that the two reproducers above are indeed getting at the same issue. I can observe a faulty read returning data that is 8 bytes (commonly) or 12 bytes (less commonly) further down in the TCP receive buffer than it should be.

See the three sample output runs at the end of the following code. It is a small variation of the first reproducer. I made two changes: a) I shrunk the message size buff to 16 bytes; b) I wrote random bytes into buff[8:16], and these give us a fingerprint of the origin of the faulty read bytes. The fingerprint (and the matching goroutine number) lets us conclude that the faulty read started 8 or 12 bytes further into the underlying TCP buffer than it should have. Put another way, 8 or 12 bytes are being dropped or skipped in a read that happens after a read deadline has expired. (But not after every timeout, obviously, as this does seem racy: we get several thousand correct reads before encountering a bad one.)

Run on go version go1.23.3 darwin/amd64:

package main

import (
	"strconv"
	//cryrand "crypto/rand"
	"encoding/binary"
	"fmt"
	"io"
	mathrand2 "math/rand/v2"
	"net"
	"runtime"
	"strings"
	"syscall"
	"time"
)

func init() {
	var seed [32]byte
	chacha8rand = mathrand2.NewChaCha8(seed)
}

var chacha8rand *mathrand2.ChaCha8

// const buffSize = 364 // crasher size
const buffSize = 16

func service(conn net.Conn) {

	cliGoroNumBytes := make([]byte, 8)
	err := readFull(conn, cliGoroNumBytes, nil)
	if err != nil {
		panic(err)
	}
	cliGoroNumber := fromBytes(cliGoroNumBytes)

	buff := make([]byte, buffSize)

	// fill with random bytes to see if that changes
	// our wrong receive from 0 to something else. Yes! It does.
	// We are getting back an incorrectly offset buffer! See
	// the sample output below.
	chacha8rand.Read(buff)

	var i uint64
	var by [8]byte

	for i = 0; i < 1e8; i++ {

		by = toBytes(i)
		copy(buff[:8], by[:])

		if i == 0 {
			//j := fromBytes(buff[8:16])
			fmt.Printf("buff = '%x'  (goroutine %v)\n", buff, cliGoroNumber)
		}

		// verify it
		check := fromBytes(buff[:8])
		if check != i {
			panic("did not encode i into buff!")
		}

		// send sequential numbers to client, starting at 0.
		if _, err := conn.Write(buff); err != nil {
			panic(err)
		}
		runtime.Gosched()
	}
}

func listener(ln net.Listener) {
	for {
		c, err := ln.Accept()
		if err != nil {
			panic(err)
		}
		go service(c)
	}
}

func toBytes(n uint64) (r [8]byte) {
	binary.BigEndian.PutUint64(r[:], n)
	return
}

func fromBytes(by []byte) uint64 {
	return binary.BigEndian.Uint64(by)
}

func client(id int) {
	conn, err := net.Dial("tcp", "127.0.0.1:5555")
	if err != nil {
		panic(err)
	}

	myGoroNumAsBytes := toBytes(uint64(GoroNumber()))
	_, err = conn.Write(myGoroNumAsBytes[:])
	if err != nil {
		panic(err)
	}

	buff := make([]byte, buffSize)
	var i, j uint64
	timeout := time.Millisecond

	t0 := time.Now()
	defer func() {
		fmt.Printf("elapsed since starting: %v\n", time.Since(t0))
	}()
	crashNext := false
	var unexpected uint64
	_ = unexpected
	for i = 0; i < 1e8; i++ {

		// The repeated-read loop. Here we
		// read with timeouts; as if we also
		// need to periodically check for other events, e.g. shutdown, pause, etc.
		//
		// When we do so, we observe data loss.
		for {
			err := readFull(conn, buff, &timeout)
			if err != nil {
				//fmt.Printf("err = '%v'; current i=%v; prev j=%v\n", err, i, j)
				r := err.Error()
				if strings.Contains(r, "i/o timeout") || strings.Contains(r, "deadline exceeded") {
					continue // normal, expected, timeout
				}
				panic(err)
			}
			// INVAR: err == nil, so we can exit the repeated-read loop.
			break
		}
		j = fromBytes(buff[:8])

		if crashNext {
			errConn := connCheck(conn)
			//isCon := isConnected(conn) // always false, bah. ignore.
			writable := canWrite(conn) // always seeing true.
			panic(fmt.Sprintf("crashNext true: prev unexpected was: %v; on the one after we see j = %v, while i (after increment at top of loop) is now = %v; errConn ='%v'; canWrite = %v; buff='%x'; last 8 bytes decoded as uint64: '%v'", unexpected, j, i, errConn, writable, buff, fromBytes(buff[8:16])))
		}

		if i != j {
			//errConn := connCheck(conn)
			//panic(fmt.Sprintf("expected next number: %v, but we got %v;  errConn='%v'\n", i, j, errConn))
			fmt.Printf("expected next number: %v, but we got %v; buff = '%x' (buff[8:16] decoded as uint64: '%v')\n", i, j, buff, fromBytes(buff[8:16]))
			unexpected = j
			crashNext = true
			runtime.Gosched()
		}
		//		if i%100000 == 0 {
		//			fmt.Printf("at i = %v\n", i)
		//		}
	}
}

var zeroTime = time.Time{}

// readFull reads exactly len(buf) bytes from conn
func readFull(conn net.Conn, buf []byte, timeout *time.Duration) error {

	if timeout != nil && *timeout > 0 {
		conn.SetReadDeadline(time.Now().Add(*timeout))
	} else {
		// do not let previous deadlines contaminate this one.
		conn.SetReadDeadline(zeroTime)
	}

	need := len(buf)
	total := 0
	for total < len(buf) {
		n, err := conn.Read(buf[total:])
		total += n
		if total == need {
			if err != nil {
				// read complete; any error here is probably just io.EOF.
				panic(err)
			}
			return nil
		}
		if err != nil {
			return err
		}
	}
	return nil
}

func startClients() {
	for i := 0; i < 50; i++ {
		go client(i)

		time.Sleep(time.Millisecond * 2)
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:5555")
	if err != nil {
		panic(err)
	}
	go listener(ln)

	startClients()
	select {}
}

func connCheck(conn net.Conn) error {
	var sysErr error
	rc, err := conn.(syscall.Conn).SyscallConn()
	if err != nil {
		return err
	}
	err = rc.Read(func(fd uintptr) bool {
		buf := make([]byte, 1)
		n, _, err := syscall.Recvfrom(int(fd), buf, syscall.MSG_PEEK|syscall.MSG_DONTWAIT)
		switch {
		case n == 0 && err == nil:
			sysErr = io.EOF
		case err == syscall.EAGAIN || err == syscall.EWOULDBLOCK:
			// no-op
		default:
			sysErr = err
		}
		return true
	})
	if err != nil {
		return err
	}

	return sysErr
}

func isConnected(conn net.Conn) bool {
	f, err := conn.(*net.TCPConn).File()
	if err != nil {
		return false
	}

	b := []byte{0}
	_, _, err = syscall.Recvfrom(int(f.Fd()), b, syscall.MSG_PEEK|syscall.MSG_DONTWAIT)
	return err != nil
}

func canWrite(c net.Conn) bool {
	_, err := c.Write([]byte("OK"))
	if err != nil {
		//c.Close() // close if problem
		return false
	}
	return true
}

func panicOn(err error) {
	if err != nil {
		panic(err)
	}
}

// GoroNumber returns the calling goroutine's number.
func GoroNumber() int {
	buf := make([]byte, 48)
	nw := runtime.Stack(buf, false) // false => just us, no other goro.
	buf = buf[:nw]

	// prefix "goroutine " is len 10.
	i := 10
	for buf[i] != ' ' && i < 30 {
		i++
	}
	n, err := strconv.Atoi(string(buf[10:i]))
	panicOn(err)
	return n
}

/*
sample output 1 (on go version go1.23.3 darwin/amd64)

-*- mode: compilation; default-directory: "~/trash/tmp/" -*-
Compilation started at Sun Nov 17 22:34:34

go run darwin_lossy_tcp_read_timeout.go
buff = '00000000000000001a6f419ec627c76b'  (goroutine 20)
buff = '0000000000000000a46add6a48d89474'  (goroutine 22)
buff = '00000000000000004cf4929ef54f635d'  (goroutine 33)
buff = '0000000000000000dcb8a99468ef7de3'  (goroutine 65)
buff = '000000000000000043cea3e828a15c70'  (goroutine 66)
buff = '0000000000000000e3eb28cb670f0e97'  (goroutine 67)
buff = '0000000000000000d8d1001f787c1704'  (goroutine 68)
buff = '000000000000000068b5c68785829264'  (goroutine 81)
buff = '000000000000000055b970d3ecbf14bd'  (goroutine 82)
buff = '0000000000000000b7dafc64deb68410'  (goroutine 83)
buff = '0000000000000000d3e59bdef423e884'  (goroutine 84)
buff = '0000000000000000314a42a39f8abf25'  (goroutine 85)
buff = '0000000000000000314a42a39f8abf25'  (goroutine 87)
buff = '0000000000000000da86b0011deb21bb'  (goroutine 88)
buff = '0000000000000000387bfdb84b56e5d8'  (goroutine 89)
buff = '00000000000000000e0cfe42821e68aa'  (goroutine 86)
buff = '0000000000000000837fefe5279cfc9e'  (goroutine 90)
buff = '00000000000000001e4cb12690cafb7a'  (goroutine 93)
buff = '0000000000000000a94911c891eeb225'  (goroutine 91)
buff = '0000000000000000ecc97fc250c8ac13'  (goroutine 92)
buff = '00000000000000003e01c3d906865c6f'  (goroutine 94)
buff = '0000000000000000aacf9b68a5267c6c'  (goroutine 96)
buff = '00000000000000006eccf1bba9c5284e'  (goroutine 129)
buff = '00000000000000005ff4c8fd651e175a'  (goroutine 130)
buff = '000000000000000021473e89362f469b'  (goroutine 95)
buff = '0000000000000000d50f51f96e722347'  (goroutine 131)
buff = '0000000000000000a1ced239de959f76'  (goroutine 132)
buff = '00000000000000000c72ee7e53331554'  (goroutine 133)
buff = '000000000000000033f157282afa440d'  (goroutine 134)
buff = '0000000000000000e61d1fb237cde886'  (goroutine 135)
buff = '0000000000000000c8c839492da3b333'  (goroutine 136)
buff = '00000000000000005e4890521e53b60d'  (goroutine 137)
buff = '0000000000000000628d82cf7ecd4691'  (goroutine 138)
buff = '0000000000000000e8b3595252b39038'  (goroutine 139)
buff = '0000000000000000177ed7038108641d'  (goroutine 140)
buff = '0000000000000000e685815248a7a297'  (goroutine 144)
buff = '000000000000000059f50c2bf3db897f'  (goroutine 142)
buff = '00000000000000008e6ace6ba25365a2'  (goroutine 143)
buff = '0000000000000000e29f79c7ead1f149'  (goroutine 141)
buff = '000000000000000008f3fe1ad87a97e8'  (goroutine 145)
buff = '0000000000000000fb144320b847203f'  (goroutine 146)
buff = '0000000000000000dd2ec5603046950c'  (goroutine 149)
buff = '000000000000000078a7fccc2a97681f'  (goroutine 147)
buff = '00000000000000002ac9312fdf823181'  (goroutine 151) <<<< NOTICE! goro 151 is panic-ing
buff = '00000000000000002fa4883dfaeedcef'  (goroutine 153)
buff = '000000000000000044374e8b34b4eb1b'  (goroutine 152)
buff = '0000000000000000643cff2836d178e8'  (goroutine 148)
buff = '00000000000000007d294126d3a9fccc'  (goroutine 150)
expected next number: 2438, but we got 3083049501594890625; buff = '2ac9312fdf8231810000000000000987' (buff[8:16] decoded as uint64: '2439')
elapsed since starting: 499.613109ms
panic: crashNext true: prev unexpected was: 3083049501594890625; on the one after we see j = 3083049501594890625, while i (after increment at top of loop) is now = 2439; errConn ='<nil>'; canWrite = true; buff='2ac9312fdf8231810000000000000988'; last 8 bytes decoded as uint64: '2440'

goroutine 151 [running]:
main.client(0x0?)
	/Users/jaten/trash/tmp/darwin_lossy_tcp_read_timeout.go:164 +0x599
created by main.startClients in goroutine 1
	/Users/jaten/trash/tmp/darwin_lossy_tcp_read_timeout.go:214 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sun Nov 17 22:34:35

My commentary:

When we shrink the message size down from 364 to just 16 bytes,
then we can get a glimpse into what is happening when we get a
wrong read on the client end. The above shows us a one word
(8 byte) shift:

goroutine 151 observed after read into its buff this set of bytes,
shown as two words:
'2ac9312fdf823181 0000000000000987'

but we expected to get these two words:
'0000000000000986 2ac9312fdf823181'

so the faulty read has shifted the read from the underlying
TCP buffer incorrectly forward by 8 bytes (one word on amd64).

I have also seen more than an 8 byte shift in a
run. See sample output 3 below for a 12 byte shift example.

sample output 2: This shows the same issue as above. We happen
to see 4 client reads being incorrect before the
panic took down the process. (also go version go1.23.3 darwin/amd64)

-*- mode: compilation; default-directory: "~/trash/tmp/" -*-
Compilation started at Sun Nov 17 22:45:20

go run darwin_lossy_tcp_read_timeout.go
buff = '00000000000000001a6f419ec627c76b'  (goroutine 8)
buff = '0000000000000000a46add6a48d89474'  (goroutine 17)
buff = '00000000000000004cf4929ef54f635d'  (goroutine 10)
buff = '0000000000000000dcb8a99468ef7de3'  (goroutine 49)
buff = '000000000000000043cea3e828a15c70'  (goroutine 65)
buff = '0000000000000000e3eb28cb670f0e97'  (goroutine 66)
buff = '0000000000000000d8d1001f787c1704'  (goroutine 68)
buff = '000000000000000068b5c68785829264'  (goroutine 69)
buff = '000000000000000055b970d3ecbf14bd'  (goroutine 71)
buff = '0000000000000000b7dafc64deb68410'  (goroutine 73)
buff = '0000000000000000d3e59bdef423e884'  (goroutine 72)
buff = '0000000000000000aadc8f829e0ce35c'  (goroutine 75)
buff = '0000000000000000a775a4810dd20a80'  (goroutine 76)
buff = '0000000000000000df31b5282bbc3d9a'  (goroutine 77)
buff = '000000000000000070aa5fa26ca01c38'  (goroutine 78)
buff = '0000000000000000ba340af96279bf01'  (goroutine 79)
buff = '00000000000000000e427b01e9fa49a3'  (goroutine 80)
buff = '0000000000000000a0fe81b34c7f8050'  (goroutine 81)
buff = '000000000000000014bdb20f9675ae25'  (goroutine 82)
buff = '000000000000000000d0e937a152a4fd'  (goroutine 83)
buff = '000000000000000037491e5f0ae4ba1c'  (goroutine 85)
buff = '00000000000000005fc087cb70df5b51'  (goroutine 86)
buff = '0000000000000000429173d13425349a'  (goroutine 87)
buff = '0000000000000000eeb74f5a77e95d3b'  (goroutine 88)
buff = '000000000000000073acc6c3565825a0'  (goroutine 84)
buff = '0000000000000000d72b59b8227d4506'  (goroutine 90)
buff = '0000000000000000dcf76aa0aa1a4dc4'  (goroutine 94)
buff = '0000000000000000fb679aac0dc09769'  (goroutine 95)
buff = '0000000000000000938a75b111a52d46'  (goroutine 96)
buff = '0000000000000000bf5ab58c8e61eb90'  (goroutine 91)
buff = '0000000000000000bb25d77bb39f40c5'  (goroutine 92)
buff = '00000000000000005382898c849e5abe'  (goroutine 113)
buff = '00000000000000006d06240c1e08b578'  (goroutine 93)
buff = '00000000000000003e83149a5f2feb67'  (goroutine 89)
buff = '0000000000000000499af66bed112e78'  (goroutine 114)
buff = '000000000000000009e2007b87270bb1'  (goroutine 115)
buff = '000000000000000060eeddc708e2d75f'  (goroutine 116)
buff = '00000000000000005332d537b95eef80'  (goroutine 127)
buff = '0000000000000000c2cd89d8a4799eef'  (goroutine 121)
buff = '00000000000000000929d71336fdf5b0'  (goroutine 128)
buff = '000000000000000064ec0f9968bbaa21'  (goroutine 126)
buff = '0000000000000000f70e90b8f3003857'  (goroutine 120)
buff = '0000000000000000bdc400b69b983b9e'  (goroutine 117)
buff = '000000000000000015f7619aaa157a1f'  (goroutine 118)
buff = '000000000000000012f238a56d597369'  (goroutine 119)
buff = '00000000000000009bc20ccbc82cea86'  (goroutine 145)
buff = '00000000000000005c17c3998763199d'  (goroutine 148)
buff = '0000000000000000fe568512a242b356'  (goroutine 146)
buff = '00000000000000003976e4ef048cbe3d'  (goroutine 147)
buff = '00000000000000004bf44cc939358b5e'  (goroutine 149)
expected next number: 2700, but we got 6984763919015532383; buff = '60eeddc708e2d75f0000000000000a8d' (buff[8:16] decoded as uint64: '2701')
expected next number: 2430, but we got 13674055152898161566; buff = 'bdc400b69b983b9e000000000000097f' (buff[8:16] decoded as uint64: '2431')
expected next number: 4974, but we got 12311873272484258652; buff = 'aadc8f829e0ce35c000000000000136f' (buff[8:16] decoded as uint64: '4975')
expected next number: 4429, but we got 8335255552625485216; buff = '73acc6c3565825a0000000000000114e' (buff[8:16] decoded as uint64: '4430')
elapsed since starting: 464.649704ms
panic: crashNext true: prev unexpected was: 6984763919015532383; on the one after we see j = 6984763919015532383, while i (after increment at top of loop) is now = 2701; errConn ='<nil>'; canWrite = true; buff='60eeddc708e2d75f0000000000000a8e'; last 8 bytes decoded as uint64: '2702'

goroutine 116 [running]:
main.client(0xc000204028?)
	/Users/jaten/trash/tmp/darwin_lossy_tcp_read_timeout.go:139 +0x599
created by main.startClients in goroutine 1
	/Users/jaten/trash/tmp/darwin_lossy_tcp_read_timeout.go:189 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sun Nov 17 22:45:21

------
sample output 3: shows an off-by-12-byte shift, rather than an off-by-8-byte shift:

-*- mode: compilation; default-directory: "~/go/src/github.com/glycerine/rpc25519/attic/" -*-
Compilation started at Sun Nov 17 22:49:49

go run darwin_word_shift.go
buff = '00000000000000001a6f419ec627c76b'  (goroutine 20)
buff = '0000000000000000a46add6a48d89474'  (goroutine 33)
buff = '00000000000000004cf4929ef54f635d'  (goroutine 49)
buff = '0000000000000000dcb8a99468ef7de3'  (goroutine 50)
buff = '000000000000000043cea3e828a15c70'  (goroutine 51)
buff = '0000000000000000e3eb28cb670f0e97'  (goroutine 53)
buff = '0000000000000000d8d1001f787c1704'  (goroutine 52)
buff = '000000000000000068b5c68785829264'  (goroutine 56)
buff = '000000000000000055b970d3ecbf14bd'  (goroutine 57)
buff = '0000000000000000d3e59bdef423e884'  (goroutine 58)
buff = '0000000000000000b7dafc64deb68410'  (goroutine 59)
buff = '0000000000000000aadc8f829e0ce35c'  (goroutine 60)
buff = '0000000000000000a775a4810dd20a80'  (goroutine 54)
buff = '0000000000000000df31b5282bbc3d9a'  (goroutine 55)
buff = '000000000000000070aa5fa26ca01c38'  (goroutine 61) <<<<<< 6ca01c38 occurs here.
buff = '0000000000000000ba340af96279bf01'  (goroutine 63)
buff = '00000000000000000e427b01e9fa49a3'  (goroutine 99)
buff = '0000000000000000a0fe81b34c7f8050'  (goroutine 64)
buff = '000000000000000014bdb20f9675ae25'  (goroutine 98)
buff = '000000000000000000d0e937a152a4fd'  (goroutine 97)
buff = '000000000000000037491e5f0ae4ba1c'  (goroutine 62)
buff = '00000000000000005fc087cb70df5b51'  (goroutine 106)
buff = '0000000000000000429173d13425349a'  (goroutine 100)
buff = '0000000000000000eeb74f5a77e95d3b'  (goroutine 101)
buff = '000000000000000073acc6c3565825a0'  (goroutine 102)
buff = '0000000000000000d72b59b8227d4506'  (goroutine 103)
buff = '0000000000000000bf5ab58c8e61eb90'  (goroutine 104)
buff = '0000000000000000dcf76aa0aa1a4dc4'  (goroutine 105)
buff = '0000000000000000fb679aac0dc09769'  (goroutine 107)
buff = '0000000000000000938a75b111a52d46'  (goroutine 108)
buff = '0000000000000000bb25d77bb39f40c5'  (goroutine 109)
buff = '00000000000000005382898c849e5abe'  (goroutine 110)
buff = '00000000000000003e83149a5f2feb67'  (goroutine 111)
buff = '00000000000000006d06240c1e08b578'  (goroutine 112)
buff = '0000000000000000499af66bed112e78'  (goroutine 129)
buff = '000000000000000009e2007b87270bb1'  (goroutine 130)
buff = '000000000000000060eeddc708e2d75f'  (goroutine 131)
buff = '00000000000000005332d537b95eef80'  (goroutine 132)
buff = '000000000000000012f238a56d597369'  (goroutine 133)
buff = '0000000000000000c2cd89d8a4799eef'  (goroutine 137)
buff = '00000000000000000929d71336fdf5b0'  (goroutine 136)
buff = '000000000000000064ec0f9968bbaa21'  (goroutine 134)
buff = '0000000000000000f70e90b8f3003857'  (goroutine 135)
buff = '0000000000000000bdc400b69b983b9e'  (goroutine 148)
buff = '000000000000000015f7619aaa157a1f'  (goroutine 138)
buff = '00000000000000009bc20ccbc82cea86'  (goroutine 139)
buff = '00000000000000005c17c3998763199d'  (goroutine 149)
buff = '0000000000000000fe568512a242b356'  (goroutine 151)
buff = '00000000000000003976e4ef048cbe3d'  (goroutine 150)
buff = '00000000000000004bf44cc939358b5e'  (goroutine 152)
expected next number: 3345, but we got 7827287179213668352; buff = '6ca01c380000000000000d1270aa5fa2' (buff[8:16] decoded as uint64: '14372850786210')
elapsed since starting: 503.292711ms
panic: crashNext true: prev unexpected was: 7827287179213668352; on the one after we see j = 7827287179213668352, while i (after increment at top of loop) is now = 3346; errConn ='<nil>'; canWrite = true; buff='6ca01c380000000000000d1370aa5fa2'; last 8 bytes decoded as uint64: '14377145753506'

goroutine 61 [running]:
main.client(0xc000094038?)
	/Users/jaten/go/src/github.com/glycerine/rpc25519/attic/darwin_word_shift.go:139 +0x599
created by main.startClients in goroutine 1
	/Users/jaten/go/src/github.com/glycerine/rpc25519/attic/darwin_word_shift.go:189 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sun Nov 17 22:49:50

*/

@glycerine
Author

This was my bug. In my readFull(), when a timeout occurs, a partial read may also have occurred. I wasn't accounting for that situation: the retry discarded the bytes already read, which is exactly the 8 or 12 byte skip observed above. Thanks to Steven Hartland for pointing this out on the golang-nuts list.
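
For anyone landing here, a minimal sketch of one way to account for the partial read (my assumption of the fix shape, not code reviewed on the thread; assumes imports "errors", "net", "os", and "time"):

// readFullRetry is a sketch, not the reproducer's readFull: it keeps
// the running offset across deadline expiries, so bytes already read
// when a timeout fires are preserved rather than overwritten.
func readFullRetry(conn net.Conn, buf []byte, timeout time.Duration) error {
	total := 0
	for total < len(buf) {
		if timeout > 0 {
			conn.SetReadDeadline(time.Now().Add(timeout))
		}
		n, err := conn.Read(buf[total:])
		total += n // n can be > 0 even when err is a timeout.
		if err != nil {
			if errors.Is(err, os.ErrDeadlineExceeded) {
				continue // retry from offset total, keeping partial data.
			}
			return err
		}
	}
	return nil
}

In a real event loop one would instead return on each timeout (to check for shutdown, pause, etc.) and carry the offset across calls; the retry is folded into the function here for brevity.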

@ianlancetaylor
Member

Thanks for following up. A good example of community debugging.
