-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test for runTerraformCmd leaked go-routine #299
Conversation
608cf7f
to
27f77eb
Compare
…timeframe Add a timeout to the test to ensure that the terraform apply cancels within a reasonable time of the 5s timeout. Currently, this test is not canceling the terraform apply as expected. In the logs you can see that the test takes 1 min rather than ~5s: ``` --- PASS: TestContext_sleepTimeoutExpired/sleep-0.12.31 (62.13s) ``` ``` === RUN TestContext_sleepTimeoutExpired/sleep-0.12.31 util_test.go:113: [INFO] running Terraform command: /var/folders/6y/gy9gggt14379c_k39vwb50lc0000gn/T/terraform_1378921380/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true util_test.go:103: CLI Output: // truncated ... time_sleep.sleep: Creating... time_sleep.sleep: Still creating... [10s elapsed] time_sleep.sleep: Still creating... [20s elapsed] time_sleep.sleep: Still creating... [30s elapsed] time_sleep.sleep: Still creating... [41s elapsed] time_sleep.sleep: Still creating... [51s elapsed] time_sleep.sleep: Creation complete after 1m0s [id=2022-05-06T17:40:20Z] Apply complete! Resources: 1 added, 0 changed, 0 destroyed. ```
Processes were not being killed because cmd.ProcessState was nil. With this change, processes will be able to make the request to Kill(). Added a temporary log to printout cmd.ProcessState to demonstrate. Will be removed in next commit. Note: this will cause hanging `TestContext_sleepTimeoutExpired` due to a known Golang issue with killing a command when Stdout or Stderr are set to anything besides `nil` or `*os.File`. This is because the Kill does not notify the stdout/stderr subprocesses to stop. `cmd.Wait` (called by `cmd.Run`) waits indefinitely for those subprocesses to stop.
`TestContext_sleepTimeoutExpired` can occasionally hang when killing a command that has Stdout or Stderr set to anything besides `nil` or `*os.File`. golang/go#23019 Use workaround to read from StdoutPipe and StderrPipe rather than setting Stdout / Stderr
Currently, when runTerraformCmd is called, it starts a go-routine to kill the Terraform CLI on context.Done(). However, when the Terraform CLI completes and runTerraformCmd() finishes, the go-routine continues running unnecessarily. If the caller cancels the context down the line, this will stop the go-routine and it will log the error: "error from kill: os: process already finished" because the Terraform CLI has already finished. Added a test for this in cmd_default.go and cmd_linux.go. Have not tried it in linux yet though. When running with the race detector: ``` ================== WARNING: DATA RACE Read at 0x00c0002516c8 by goroutine 7: bytes.(*Buffer).String() /usr/local/go/src/bytes/buffer.go:65 +0x36a github.com/hashicorp/terraform-exec/tfexec.Test_runTerraformCmd_default() /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default_test.go:35 +0x360 testing.tRunner() // truncated ... Previous write at 0x00c0002516c8 by goroutine 8: bytes.(*Buffer).grow() /usr/local/go/src/bytes/buffer.go:147 +0x3b1 bytes.(*Buffer).Write() /usr/local/go/src/bytes/buffer.go:172 +0xcd log.(*Logger).Output() /usr/local/go/src/log/log.go:184 +0x466 log.(*Logger).Printf() /usr/local/go/src/log/log.go:191 +0x6e github.com/hashicorp/terraform-exec/tfexec.(*Terraform).runTerraformCmd.func1() /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default.go:24 +0x2a5 // truncated ... ==================
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @lornasong
Thanks for the PR.
It seems there are still some race conditions pointing to the newly introduced code, see https://gist.github.com/radeksimko/a199a28daa4eb8225aa6886686bac7fd
Would you mind looking into it?
The implementation and tests generally look reasonable to me though. I'm not sure what's the root cause of that race condition.
The main reason we were catching the context cancellation and killing the process was because it provided extra context beyond just "context deadline exceeded" error. However we don't really have any existing tests and don't communicate that as a feature, so I'm okay with loosing that. We can revisit error handling generally as part of #107
In terms of other PRs - I think we can focus on this one, which seems to be a superset of #301 and importantly has tests, so feel free to close #301
This also fixes a race condition caused by using the pointer to the io.EOF
@radeksimko - thank you for reviewing and catching the race condition. It looks like the race condition is caused by threads sharing access to I was able to reproduce the race condition using
Switching to |
Thanks for the update @lornasong This seemed to have reduced the race conditions to a point that I can rarely reproduce it via
but unfortunately I still see some race conditions with this
See attached log output: I also noticed an error |
string.Builder is a non-comparable type which is not safe for concurrent use when shared by Cmd.Stdout and Cmd.Stderr. Causes a race condition when accessing the the builder when Cmd is running.
- Use waitgroups for more readability - Improve handling errors from writeOutput - Finish reading from pipes before calling cmd.Wait - fixes a race condition that leads to an error :`read |0: file already closed` - Because now waiting for pipes to finish reading, need to update waitGroup to close buf.Read on context cancel. Otherwise buf.Read blocks until next line before stopping. Causes TestContext_sleepTimeoutExpired takes a little too long to cancel (~20s)
Thanks for catching all of these issues @radeksimko. I pushed two commits to try to address the issues in the gist you shared. The first commit tries to resolve the race condition issues. It looks like the race condition occurs when
Please let me know if you have concerns with going this route. The second commit should hopefully resolve the Please me know any additional feedback or issues you find. Thank you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for all the updates and for the added context. Tests are now repeatedly passing even under
go test ./... -race -count=5 -timeout=30m
👍🏻
func writeOutput(ctx context.Context, r io.ReadCloser, w io.Writer) error { | ||
// ReadBytes will block until bytes are read, which can cause a delay in | ||
// returning even if the command's context has been canceled. Use a separate | ||
// goroutine to prompt ReadBytes to return on cancel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the comment ❤️
Thank you for the review @radeksimko. Appreciate your time and finding all of those issues 🙏 |
* Update TestContext_sleepTimeoutExpired to check for canceling within timeframe Add a timeout to the test to ensure that the terraform apply cancels within a reasonable time of the 5s timeout. Currently, this test is not canceling the terraform apply as expected. In the logs you can see that the test takes 1 min rather than ~5s: ``` --- PASS: TestContext_sleepTimeoutExpired/sleep-0.12.31 (62.13s) ``` ``` === RUN TestContext_sleepTimeoutExpired/sleep-0.12.31 util_test.go:113: [INFO] running Terraform command: /var/folders/6y/gy9gggt14379c_k39vwb50lc0000gn/T/terraform_1378921380/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true util_test.go:103: CLI Output: // truncated ... time_sleep.sleep: Creating... time_sleep.sleep: Still creating... [10s elapsed] time_sleep.sleep: Still creating... [20s elapsed] time_sleep.sleep: Still creating... [30s elapsed] time_sleep.sleep: Still creating... [41s elapsed] time_sleep.sleep: Still creating... [51s elapsed] time_sleep.sleep: Creation complete after 1m0s [id=2022-05-06T17:40:20Z] Apply complete! Resources: 1 added, 0 changed, 0 destroyed. ``` * Remove runTerraformCmd check for cmd.ProcessState Processes were not being killed because cmd.ProcessState was nil. With this change, processes will be able to make the request to Kill(). Added a temporary log to printout cmd.ProcessState to demonstrate. Will be removed in next commit. Note: this will cause hanging `TestContext_sleepTimeoutExpired` due to a known Golang issue with killing a command when Stdout or Stderr are set to anything besides `nil` or `*os.File`. This is because the Kill does not notify the stdout/stderr subprocesses to stop. `cmd.Wait` (called by `cmd.Run`) waits indefinitely for those subprocesses to stop. * Read logs from Stderr/out Pipe to avoid hanging `TestContext_sleepTimeoutExpired` can occasionally hang when killing a command that has Stdout or Stderr set to anything besides `nil` or `*os.File`. golang/go#23019 Use workaround to read from StdoutPipe and StderrPipe rather than setting Stdout / Stderr * Test for runTerraformCmd leaked go-routine Currently, when runTerraformCmd is called, it starts a go-routine to kill the Terraform CLI on context.Done(). However, when the Terraform CLI completes and runTerraformCmd() finishes, the go-routine continues running unnecessarily. If the caller cancels the context down the line, this will stop the go-routine and it will log the error: "error from kill: os: process already finished" because the Terraform CLI has already finished. Added a test for this in cmd_default.go and cmd_linux.go. Have not tried it in linux yet though. When running with the race detector: ``` ================== WARNING: DATA RACE Read at 0x00c0002516c8 by goroutine 7: bytes.(*Buffer).String() /usr/local/go/src/bytes/buffer.go:65 +0x36a github.com/hashicorp/terraform-exec/tfexec.Test_runTerraformCmd_default() /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default_test.go:35 +0x360 testing.tRunner() // truncated ... Previous write at 0x00c0002516c8 by goroutine 8: bytes.(*Buffer).grow() /usr/local/go/src/bytes/buffer.go:147 +0x3b1 bytes.(*Buffer).Write() /usr/local/go/src/bytes/buffer.go:172 +0xcd log.(*Logger).Output() /usr/local/go/src/log/log.go:184 +0x466 log.(*Logger).Printf() /usr/local/go/src/log/log.go:191 +0x6e github.com/hashicorp/terraform-exec/tfexec.(*Terraform).runTerraformCmd.func1() /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default.go:24 +0x2a5 // truncated ... ================== * Use CommandContext to kill instead of manually doing it * Fix EOF error check to use error.Is() This also fixes a race condition caused by using the pointer to the io.EOF * Update tests to use separate string.Builder-s for stdout and stderr string.Builder is a non-comparable type which is not safe for concurrent use when shared by Cmd.Stdout and Cmd.Stderr. Causes a race condition when accessing the the builder when Cmd is running. * Fixes to runTerraformCmd for race conditions - Use waitgroups for more readability - Improve handling errors from writeOutput - Finish reading from pipes before calling cmd.Wait - fixes a race condition that leads to an error :`read |0: file already closed` - Because now waiting for pipes to finish reading, need to update waitGroup to close buf.Read on context cancel. Otherwise buf.Read blocks until next line before stopping. Causes TestContext_sleepTimeoutExpired takes a little too long to cancel (~20s) Co-authored-by: Kyle Carberry <[email protected]>
This PR intends to showcase the issue fixed by #276:
If the changes in #276 look good, can close this PR in favor of merging #276. The tests added in this PR aren't very meaning and lose context once the fix is merged.
Issue details:
Currently, when runTerraformCmd is called, it starts a go-routine to kill the
Terraform CLI on context.Done(). However, when the Terraform CLI completes and
runTerraformCmd() finishes, the go-routine continues running unnecessarily.
If the caller cancels the context down the line, this will stop the go-routine
and it will log the error: "error from kill: os: process already finished" because
the Terraform CLI has already finished.
Added a test for this in cmd_default.go and cmd_linux.go. Have not tried it in
linux yet though.
Note: it's possible to see the race with pre-existing tests e.g.
go test -race -run TestContext_sleepTimeoutExpired ./tfexec/internal/e2etest -v
When running with the race detector for
Test_runTerraformCmd_default
:Closes #258
Closes #276