
Spark ReadTask is expensive to serialize #553

Closed
aokolnychyi opened this issue Oct 16, 2019 · 4 comments · Fixed by #569

Comments

@aokolnychyi
Contributor

In some Spark jobs, we see a substantial scheduler delay. I assume it happens in TaskSetManager when Spark serializes Iceberg's ReadTask. The latter contains a couple of large strings (if the table has many columns) and an instance of FileIO (which can contain a full Hadoop conf).

@aokolnychyi
Contributor Author

I can confirm the issue is resolved if we avoid serializing FileIO. The main question is how to achieve that with minimal changes.

@aokolnychyi
Contributor Author

aokolnychyi commented Oct 17, 2019

As a short-term solution, we can broadcast the EncryptionManager and FileIO in IcebergSource. Reader and ReadTask would then store references to the broadcast values and fetch the actual ones in createPartitionReader while creating TaskDataReader. This seems to solve the scheduler delay issue.
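To illustrate why shipping a full Hadoop conf with every task is costly, here is a minimal, self-contained sketch (not Iceberg's actual classes — `HeavyTask`, `LightTask`, and the map of properties are stand-ins) comparing the Java-serialized size of a task that drags a large configuration along versus one that only carries a small broadcast-style handle:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

public class SerializationCost {

    // Stand-in for a ReadTask that embeds a full Hadoop conf:
    // every task serialized by the scheduler pays for all of it.
    static class HeavyTask implements Serializable {
        final HashMap<String, String> conf = new HashMap<>();
        HeavyTask() {
            for (int i = 0; i < 1000; i++) {
                conf.put("spark.hadoop.property." + i, "value-" + i);
            }
        }
    }

    // Stand-in for a task that only carries a broadcast handle;
    // the heavy state is resolved on the executor, not shipped per task.
    static class LightTask implements Serializable {
        final long broadcastId = 42L; // hypothetical handle
    }

    // Measures the Java-serialized size of an object in bytes.
    static int serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("heavy task bytes: " + serializedSize(new HeavyTask()));
        System.out.println("light task bytes: " + serializedSize(new LightTask()));
    }
}
```

In Spark terms, the "light" variant corresponds to storing a `Broadcast<FileIO>` reference in the task and calling `.value()` on the executor side, so the conf is transferred once per executor instead of once per task.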

@rdblue thoughts?

@rdblue
Contributor

rdblue commented Oct 18, 2019

Using a broadcast sounds good to me for now.

Can you open a PR for this?

@aokolnychyi
Contributor Author

Will open a PR today
