
explicitly close socat socket #12632

Merged: 2 commits into master on May 6, 2022

Conversation

git-phu (Contributor) commented on May 5, 2022

What

closes #11621

Currently the way we handle sending messages from the source pod to the orchestrator pod is:

  • The orchestrator loops forever trying to read messages from the source until it detects that the source pod has terminated
  • The source pod uses socat to transfer a stream of data to the orchestrator.
    • Once the stream's EOF is reached, socat shuts down its write stream.
    • After a 0.5 second timeout, socat closes the connection to the orchestrator as well and the source pod terminates

We can see this in a sample of the socat logs.

2022/05/05 02:03:29 socat[9] I transferred 8192 bytes from 0 to 5
2022/05/05 02:03:29 socat[9] I transferred 5306 bytes from 0 to 5
2022/05/05 02:03:29 socat[9] N socket 1 (fd 0) is at EOF
2022/05/05 02:03:29 socat[9] I shutdown(5, 1)
2022/05/05 02:03:30 socat[9] I poll timed out (no data within 0.500000 seconds)
2022/05/05 02:03:30 socat[9] I shutdown(5, 2)
2022/05/05 02:03:30 socat[9] N exiting with status 0

The problem with this approach is that we have seen cases where the source pod appears to terminate before the orchestrator has fully processed all the incoming data. This leaves the network socket in a bad state and the orchestrator pod simply hangs. See #11621 (comment) for more details on how this might happen.

A more reliable way to handle this interaction would be:

  • The orchestrator loops forever trying to read messages from the source
  • The source pod uses socat to transfer a stream of data to the orchestrator.
    • Once the stream's EOF is reached, socat shuts down its write stream.
  • Once the orchestrator receives the EOF from socat, the orchestrator should initiate shutting down the network socket
    • After socat receives the signal from the orchestrator to shut down its read stream, the network socket is closed and the source pod exits.
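The proposed half-close handshake can be sketched with plain Java sockets. This is a minimal, self-contained illustration, not the actual Airbyte or socat code; all names here (ShutdownFlowDemo, runDemo) are invented for the example. One thread plays the source (write, shut down only the write stream, then wait), while the main thread plays the orchestrator (read to EOF, then initiate the close).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ShutdownFlowDemo {

  // Returns the number of bytes the "orchestrator" read before it initiated the close.
  public static int runDemo() throws Exception {
    try (ServerSocket orchestratorPort = new ServerSocket(0)) {
      // Source side: write the payload, then signal EOF on the write stream only
      // (this is what socat does when its stdin reaches EOF), and wait for the
      // orchestrator to close the connection.
      final Thread source = new Thread(() -> {
        try (Socket s = new Socket("127.0.0.1", orchestratorPort.getLocalPort())) {
          final OutputStream out = s.getOutputStream();
          out.write("record-1\n".getBytes(StandardCharsets.UTF_8));
          out.flush();
          s.shutdownOutput();                       // half-close: reader sees EOF
          final int r = s.getInputStream().read();  // blocks until the peer closes
          System.out.println("source saw peer close: " + (r == -1));
        } catch (final IOException e) {
          throw new RuntimeException(e);
        }
      });
      source.start();

      // Orchestrator side: read everything, then initiate the shutdown by closing.
      int total = 0;
      try (Socket conn = orchestratorPort.accept()) {
        final InputStream in = conn.getInputStream();
        while (in.read() != -1) {
          total++;
        }
      } // try-with-resources closes conn here; the source observes EOF and exits
      source.join();
      return total;
    }
  }

  public static void main(final String[] args) throws Exception {
    System.out.println("orchestrator read " + runDemo() + " bytes");
  }
}
```

The key difference from the old flow is who closes: the writer only half-closes with shutdownOutput(), and the reader performs the final close once it has drained the stream.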

This new flow should avoid the problems arising from the source initiating the shutdown. This is also how socat is intended to work.
http://www.dest-unreach.org/socat/doc/socat.html

When one of the streams effectively reaches EOF, the closing phase begins. Socat transfers the EOF condition to the other stream, i.e. tries to shutdown only its write stream, giving it a chance to terminate gracefully. For a defined time socat continues to transfer data in the other direction, but then closes all remaining channels and terminates.

How

  • Change DefaultReplicationWorker#getReplicationRunnable so that when there are no more messages from the source, the orchestrator initiates closing the network connection.
  • Increase the timeout on socat to 60 seconds (from the default 0.5 seconds) to give the orchestrator time to receive all messages and then initiate shutdown of the network connection.
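The worker-side change amounts to "drain, then close". Here is a hedged sketch: the Source interface and every name below are simplified stand-ins for illustration, not Airbyte's real AirbyteSource or DefaultReplicationWorker API, and the socat timeout bump is a separate change on the socat command line (its -t option), not shown here.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

public class ReplicationLoopSketch {

  // Hypothetical stand-in for Airbyte's AirbyteSource (names are illustrative).
  interface Source extends AutoCloseable {
    boolean isFinished();
    Optional<String> attemptRead();
  }

  // Core of the change: once the source reports no more messages, the
  // orchestrator side calls close(), which tears down the network socket.
  // socat then sees EOF on its read stream and exits cleanly.
  static int drainAndClose(final Source source) throws Exception {
    int messagesRead = 0;
    while (!source.isFinished()) {
      if (source.attemptRead().isPresent()) {
        messagesRead++;
      }
    }
    // close() should be idempotent: the source is AutoCloseable and may also
    // be closed by a surrounding try-with-resources.
    try {
      source.close();
    } catch (final Exception e) {
      throw new RuntimeException("Source cannot be stopped!", e);
    }
    return messagesRead;
  }

  // In-memory fake so the sketch is runnable without a real socket.
  static final class FakeSource implements Source {
    private final Iterator<String> messages = List.of("a", "b", "c").iterator();
    boolean closed = false;

    public boolean isFinished() { return !messages.hasNext(); }

    public Optional<String> attemptRead() {
      return messages.hasNext() ? Optional.of(messages.next()) : Optional.empty();
    }

    public void close() { closed = true; } // safe to call more than once
  }

  static int demo() throws Exception {
    return drainAndClose(new FakeSource());
  }

  public static void main(final String[] args) throws Exception {
    System.out.println("messages read: " + demo());
  }
}
```

The larger socat timeout matters because it gives the orchestrator up to 60 seconds to drain the stream and initiate this close before socat gives up and tears the connection down itself.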

With this change the socat logs now look like

2022/05/05 16:09:02 socat[8] N reading from and writing to stdio
2022/05/05 16:09:02 socat[8] N opening connection to AF=2 172.25.37.3:9877
2022/05/05 16:09:02 socat[8] N successfully connected from local address AF=2 172.25.32.134:38950
2022/05/05 16:09:02 socat[8] N starting data transfer loop with FDs [0,1] and [5,5]
2022/05/05 16:14:17 socat[8] N socket 1 (fd 0) is at EOF
2022/05/05 16:14:19 socat[8] N socket 2 (fd 5) is at EOF
2022/05/05 16:14:19 socat[8] N exiting with status 0

(Note that socket 2 now also reaches EOF.)

@github-actions github-actions bot added area/platform issues related to the platform area/worker Related to worker labels May 5, 2022
@git-phu git-phu linked an issue May 5, 2022 that may be closed by this pull request
@git-phu git-phu force-pushed the peter/close-socat-socket-explicitly branch from 09dce22 to dad85ff Compare May 6, 2022 15:56
@git-phu git-phu marked this pull request as ready for review May 6, 2022 17:09
@git-phu git-phu requested review from a team and xiaohansong and removed request for a team May 6, 2022 17:09
supertopher (Contributor) left a comment:

looks great Peter

try {
  source.close();
} catch (final Exception e) {
  throw new SourceException("Source cannot be stopped!", e);
}
Contributor commented:

I love this exception. it sounds like Godzilla is coming from the water and cannot be contained

git-phu (Contributor, Author) commented on May 6, 2022

Did a final verification in a dev env that everything works as expected (discovers, syncs, etc.).

@git-phu git-phu merged commit 0291d62 into master May 6, 2022
@git-phu git-phu deleted the peter/close-socat-socket-explicitly branch May 6, 2022 19:29
suhomud pushed a commit that referenced this pull request May 23, 2022
* explicitly close socat socket

* source can be closed multiple times

because AirbyteSource is an AutoCloseable
close should be idempotent though so calling it multiple times is fine
Labels
area/platform issues related to the platform area/worker Related to worker

Successfully merging this pull request may close these issues.

fix socket timeout exceptions on failed stream close on kubernetes
3 participants