
Spark ReadTask is expensive to serialize #553

Closed
aokolnychyi opened this issue Oct 16, 2019 · 4 comments · Fixed by #569

Comments

@aokolnychyi
Contributor

In some Spark jobs, we see a substantial scheduler delay. I assume it happens in TaskSetManager when Spark serializes Iceberg's ReadTask. The latter contains a couple of large strings (if the table has many columns) and an instance of FileIO (which can contain a full Hadoop conf).

@aokolnychyi
Contributor Author

I can confirm the issue is resolved if we avoid serializing FileIO. The main question is how to achieve that with minimal changes.

@aokolnychyi
Contributor Author

aokolnychyi commented Oct 17, 2019

As a short-term solution, we can broadcast the EncryptionManager and FileIO in IcebergSource. Reader and ReadTask would then store references to the broadcast values and fetch the actual ones in createPartitionReader while creating TaskDataReader. This seems to solve the scheduler delay issue.
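To illustrate why shipping a full Hadoop conf with every task is costly, here is a minimal, self-contained sketch (not Iceberg's actual classes — `HeavyTask`, `LightTask`, and the map of properties are stand-ins) comparing the Java-serialized size of a task that drags a large configuration along versus one that only carries a small broadcast-style handle:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

public class SerializationCost {

    // Stand-in for a ReadTask that embeds a full Hadoop conf:
    // every task serialized by the scheduler pays for all of it.
    static class HeavyTask implements Serializable {
        final HashMap<String, String> conf = new HashMap<>();
        HeavyTask() {
            for (int i = 0; i < 1000; i++) {
                conf.put("spark.hadoop.property." + i, "value-" + i);
            }
        }
    }

    // Stand-in for a task that only carries a broadcast handle;
    // the heavy state is resolved on the executor, not shipped per task.
    static class LightTask implements Serializable {
        final long broadcastId = 42L; // hypothetical handle
    }

    // Measures the Java-serialized size of an object in bytes.
    static int serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("heavy task bytes: " + serializedSize(new HeavyTask()));
        System.out.println("light task bytes: " + serializedSize(new LightTask()));
    }
}
```

In Spark terms, the "light" variant corresponds to storing a `Broadcast<FileIO>` reference in the task and calling `.value()` on the executor side, so the conf is transferred once per executor instead of once per task.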

@rdblue thoughts?

@rdblue
Contributor

rdblue commented Oct 18, 2019

Using a broadcast sounds good to me for now.

Can you open a PR for this?

@aokolnychyi
Contributor Author

Will open a PR today
