🎉 Source S3: support of Parquet format #5305

Merged
antixar merged 23 commits into master from antixar/5102-source-s3-support-parquet
Sep 4, 2021

Contributor

@antixar antixar commented Aug 10, 2021

How

Parquet parsing uses the same library, 'pyarrow', that the connector already uses for CSV parsing.

Recommended reading order

  1. formats/parquet_spec.py
  2. formats/parquet_parser.py

Pre-merge Checklist

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally, e.g. a screenshot or copy-pasted unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For Java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Changelog updated in docs/integrations/<source or destination>/<name>.md. See changelog example
  • PR name follows PR naming conventions
  • Connector version bumped as described here

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here

@antixar antixar linked an issue Aug 10, 2021 that may be closed by this pull request
@antixar antixar self-assigned this Aug 10, 2021
@github-actions github-actions bot added the area/connectors Connector related issues label Aug 10, 2021
Contributor Author

antixar commented Aug 12, 2021

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1125652227
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1125652227

@jrhizor jrhizor temporarily deployed to more-secrets August 12, 2021 22:26 Inactive
@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Aug 13, 2021
Contributor Author

antixar commented Aug 13, 2021

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1126887837
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1126887837

@jrhizor jrhizor temporarily deployed to more-secrets August 13, 2021 07:54 Inactive
Contributor Author

antixar commented Aug 13, 2021

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1127075915
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1127075915

@jrhizor jrhizor temporarily deployed to more-secrets August 13, 2021 09:05 Inactive
Contributor Author

antixar commented Aug 13, 2021

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1127128527
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1127128527

@jrhizor jrhizor temporarily deployed to more-secrets August 13, 2021 09:21 Inactive
@antixar antixar requested review from bazarnov and midavadim August 13, 2021 09:47
Collaborator

@bazarnov bazarnov left a comment

Please resolve the branch conflicts, including the conflict in airbyte-integrations/connectors/source-hubspot/source_hubspot/api.py on this branch.

Contributor Author

antixar commented Aug 23, 2021

/test connector=connectors/source-s3

Contributor Author

antixar commented Aug 23, 2021

/test connector=connectors/source-s3

Contributor Author

antixar commented Aug 23, 2021

/test connector=connectors/source-s3

@antixar antixar requested a review from bazarnov August 23, 2021 12:46
antixar and others added 2 commits August 30, 2021 15:27
…es_abstract/formats/parquet_spec.py

Co-authored-by: George Claireaux <[email protected]>
…es_abstract/formats/parquet_spec.py

Co-authored-by: George Claireaux <[email protected]>
@airbytehq airbytehq deleted a comment from Phlair Aug 30, 2021
@antixar antixar requested a review from Phlair August 30, 2021 22:23
Contributor

@Phlair Phlair left a comment

Great, lgtm! One small note on buffer_size to make it clearer.

Because we iterate through individual files at the abstract level, I anticipate issues with partitioned parquet datasets. I think we should make clear in the documentation that partitioned parquet datasets are unsupported for now.
For more context: it should work, but performance could be very bad, and the columns used for partitioning would be missing from the output (I think).
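To illustrate why partition columns could go missing, here is a minimal sketch that assumes Hive-style partitioning, where partition values are encoded in directory names such as `year=2021/` rather than stored inside each file. Reading a file on its own therefore never sees those values. The helper names and the example key are hypothetical and are not part of the connector.

```python
from pathlib import PurePosixPath


def hive_partition_values(key):
    """Extract Hive-style `column=value` directory segments from an object key.

    Hypothetical helper: values are returned as strings, with no type inference.
    """
    values = {}
    for segment in PurePosixPath(key).parent.parts:
        name, separator, value = segment.partition("=")
        if separator and name:
            values[name] = value
    return values


def with_partition_columns(record, key):
    """Return a copy of `record` with partition values from the key merged in.

    Fields already present in the record take precedence over path values.
    """
    return {**hive_partition_values(key), **record}


if __name__ == "__main__":
    key = "data/year=2021/month=08/part-0.parquet"
    print(hive_partition_values(key))
    print(with_partition_columns({"id": 1}, key))
```

A reader that only opens `part-0.parquet` would emit `{"id": 1}` alone; recovering `year` and `month` requires looking at the path, as done above, or using a dataset-level reader that understands the partitioning scheme.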

Contributor Author

antixar commented Aug 31, 2021

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1186397350
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1186397350

@jrhizor jrhizor temporarily deployed to more-secrets August 31, 2021 13:54 Inactive
Contributor Author

antixar commented Aug 31, 2021

/publish connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1186480409
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1186480409

@jrhizor jrhizor temporarily deployed to more-secrets August 31, 2021 14:17 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets August 31, 2021 14:17 Inactive
@sherifnada sherifnada removed their request for review September 1, 2021 00:39
Contributor Author

antixar commented Sep 3, 2021

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1197766878
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1197766878

@jrhizor jrhizor temporarily deployed to more-secrets September 3, 2021 11:12 Inactive
Contributor Author

antixar commented Sep 4, 2021

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1201855623
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1201855623

@jrhizor jrhizor temporarily deployed to more-secrets September 4, 2021 22:59 Inactive
Contributor Author

antixar commented Sep 4, 2021

/publish connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1201885536
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1201885536

@jrhizor jrhizor temporarily deployed to more-secrets September 4, 2021 23:18 Inactive
@antixar antixar merged commit e5c44e6 into master Sep 4, 2021
@antixar antixar deleted the antixar/5102-source-s3-support-parquet branch September 4, 2021 23:40
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation
Development

Successfully merging this pull request may close these issues.

Support Parquet format in S3 source
5 participants