🐛 Source S3: fixed bug where sync could hang indefinitely #5197
Conversation
/test connector=connectors/source-s3
I don't want to block this, so if you can find solutions to the following issues, judge them to be insignificant/irrelevant, or if I'm just plain missing something, feel free to merge (the other comments are readability focused):
- 20s constant timeout -- is this reliable? what if it takes more time to read the data?
- how can we be certain that reading `blockSize*2` actually gives us the entire first row? (what if it's very big?)
schema_dict = {field.name: field.type for field in streaming_reader.schema}
return schema_dict

# boto3 stuff can't be pickled and so we can't multiprocess with the actual fileobject on Windows systems
what's a better word than "stuff" ? :P
you know... stuff... like things and whatever... 😅
Made this more clear!
# boto3 stuff can't be pickled and so we can't multiprocess with the actual fileobject on Windows systems
# we're reading block_size*2 bytes here, which we can then pass in and infer schema from block_size bytes
# the *2 is to give us a buffer as pyarrow figures out where lines actually end so it gets schema correct
file_sample = file.read(self._read_options()["block_size"] * 2)
how can we be certain that this block size completely covers the first row of data?
The easy answer would be to check for newline chars, but in cases where newlines are allowed within the data itself that isn't bulletproof. So really you need to parse the csv to be certain you've got enough of a block to parse the csv... bit of a logic loop!
I've decided to expose block_size in spec so the user can configure it (with instructions to increase it from default if having issues with schema detection).
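For reference, a minimal sketch of how a user-configurable block_size could feed pyarrow's streaming reader for inference (the function name and option values here are illustrative, not the connector's exact code):

```python
import io

from pyarrow import csv as pa_csv


def infer_schema_from_sample(sample_bytes: bytes, block_size: int) -> dict:
    # open_csv only needs to parse the first block to expose a schema, so a larger
    # user-supplied block_size gives inference more rows (and longer rows) to work with
    streaming_reader = pa_csv.open_csv(
        io.BytesIO(sample_bytes),
        read_options=pa_csv.ReadOptions(block_size=block_size),
        parse_options=pa_csv.ParseOptions(),
        convert_options=pa_csv.ConvertOptions(),
    )
    return {field.name: field.type for field in streaming_reader.schema}
```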
# we're reading block_size*2 bytes here, which we can then pass in and infer schema from block_size bytes
# the *2 is to give us a buffer as pyarrow figures out where lines actually end so it gets schema correct
file_sample = file.read(self._read_options()["block_size"] * 2)
schema_dict = None
for readability could we refactor this bit to be:
file_sample = file.read(self._read_options()["block_size"] * 2)
schema_dict = self.run_in_external_process(get_schema, file_sample, self._read_options(), self._parse_options(), self._convert_options())
return self.json_schema_to_pyarrow_schema(schema_dict, reverse=True)
basically taking all the external process logic outside into a helper method?
I tried it in my IDE, here's what I had for reference in case it helps:
def run_in_external_process(self, fn, *args):
    schema_dict = None
    fail_count = 0
    while schema_dict is None:
        q_worker = mp.Queue()
        proc = mp.Process(
            target=multiprocess_queuer,
            args=(
                dill.dumps(fn),  # use dill to pickle the get_schema function for Windows-compatibility
                q_worker,
                *args,
            ),
        )
        proc.start()
        try:
            # this attempts to get return value from function with a 20 second timeout
            return q_worker.get(timeout=20)
        except mp.queues.Empty:
            fail_count += 1
            if fail_count > 2:
                raise TimeoutError("Timed out 3 times while trying to infer schema")
            self.logger.info("timed out on schema inference, retrying...")
        finally:
            try:
                proc.terminate()
            except Exception as e:
                self.logger.info(f"infer schema proc unterminated, error: {e}")
yeah like it, changed!
proc.start()
try:
    # this attempts to get return value from function with a 20 second timeout
    schema_dict = q_worker.get(timeout=20)
it might be useful to do an increasing backoff until 60 seconds or something just in case the schema is genuinely taking a long time to load
yeah I think the balancing act here is waiting long enough that we don't time out any long but non-hanging processes while failing relatively quickly in cases of hang or badly formed csvs that should fail inference.
Changed this to a doubling timeout in range 4 -> 60 seconds
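Roughly, the doubling timeout can be expressed as a small generator (a sketch of the idea, not the exact implementation):

```python
def backoff_timeouts(start: int = 4, cap: int = 60):
    """Yield 4, 8, 16, 32, 60, 60, ... so each retry waits longer, capped at 60s."""
    timeout = start
    while True:
        yield min(timeout, cap)
        timeout *= 2
```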
Now propagating any actual non-hang errors back up, so it will fail fast unless it's hanging, in which case we retry with increasing backoff as above.
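The fail-fast behaviour can be sketched like this: the child puts either the result or the exception it hit on the queue, and the parent re-raises anything that isn't a timeout. The helper name is illustrative, not the connector's exact code:

```python
import queue


def fetch_or_raise(q, timeout: int):
    try:
        result = q.get(timeout=timeout)
    except queue.Empty:
        # no answer at all -> treat as a hang and let the caller retry with backoff
        raise
    if isinstance(result, Exception):
        raise result  # the child hit a real error: propagate it and fail fast
    return result
```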
/test connector=connectors/source-s3
# Conflicts:
#   airbyte-integrations/connectors/source-s3/integration_tests/spec.json
/publish connector=connectors/source-s3
future work: we should maybe explore not using pyarrow if that's the problem
# Conflicts:
#   airbyte-config/init/src/main/resources/config/STANDARD_SOURCE_DEFINITION/69589781-7828-43c5-9f63-8925b1c1ccc2.json
#   airbyte-config/init/src/main/resources/seed/source_definitions.yaml
What
will close #5160
I'm about 90% confident this issue was the cause of the problem. The hang only ever happens on schema inference rather than streaming records, and that issue indicates `csv::ColumnDecoder` as the root cause of the non-cancelling behaviour. When we're streaming records we're applying a known schema and no longer inferring, so this lines up.
How
Since the problem is that pyarrow csv schema inference isn't responding to signal interrupts, a standard timeout interrupt (either a manually built `signal` interrupt or a timeout on threads) has no effect and the hang still occurs.

To get around this, I've used multiprocessing to spawn a new process and run the schema inference in there. We can then kill that process after a timeout (20 seconds) if it still hasn't returned. I've set this to retry 3 times before raising an error. In my testing it has worked no later than the 2nd attempt (mostly on the first), but if it were to reach the retry limit, the sync will now actually error out rather than hang forever.
I've built this around Windows limitations and tested it on Windows as well, since Windows plays funny with python multiprocessing due to its lack of fork() support. These considerations are commented in the code (and are the reason it's less simple than it could be if this were unix-only).
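For context, a hedged sketch of what the spawned worker (referenced as multiprocess_queuer in the review above) can look like; the body is illustrative rather than the connector's exact code. On Windows the child is spawned rather than forked, so the target function is shipped across as dill-serialised bytes:

```python
import dill


def multiprocess_queuer(serialized_fn: bytes, q, *args) -> None:
    """Runs in the child process: rebuild the function with dill, execute it,
    and put either its return value or the raised exception on the queue."""
    fn = dill.loads(serialized_fn)
    try:
        q.put(fn(*args))
    except Exception as e:  # forward real errors so the parent can fail fast
        q.put(e)
```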
Note
Given the non-deterministic nature of the bug and the limited testing that affords, I can't guarantee with 100% certainty this solves it but it seems very likely based on my tests.
Updating a connector

Community member or Airbyter

- Secrets in the connector's spec are annotated with `airbyte_secret`
- Integration tests are passing: `./gradlew :airbyte-integrations:connectors:<name>:integrationTest`
- Connector's `README.md` is updated
- `docs/integrations/<source or destination>/<name>.md` is updated, including the changelog (see changelog example)

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

- The `/test connector=connectors/<name>` command is passing.
- The `/publish` command described here has been run.