
[Improve][Connector-file-base] Improved file allocation algorithm for subtasks. #8453

Merged — 7 commits merged into apache:dev on Jan 10, 2025

Conversation

JeremyXin
Contributor

Purpose of this pull request

This pull request solves issue #8451

To address that issue, this PR replaces the current random allocation based on file hash with a polling (round-robin) algorithm for assigning files to subtasks, so that the allocation is load-balanced and performance improves. When using SeaTunnel to synchronize HDFS files, I set the parallelism to 10 with five files in the path. The following screenshots show the file allocation results of the original random allocation algorithm and of the improved polling algorithm:

[Screenshot: file allocation based on file hashing]
With the original hash-based allocation algorithm, when the parallelism is greater than the number of files, one SubTask may need to process multiple files.

[Screenshot: file allocation based on the polling algorithm]
With the optimized polling-based allocation algorithm, when the parallelism is greater than the number of files, each SubTask processes at most one file.

Next, the processing performance of the two allocation algorithms is compared. The following task runtime information shows the performance of the original allocation algorithm and of the polling allocation algorithm:

[Screenshot: processing performance with hash-based allocation]
As you can see, with the original allocation algorithm the task processes 4,520 records per second, and the total runtime is 929 seconds.

[Screenshot: processing performance with polling-based allocation]
With the polling allocation algorithm the task processes 10,719 records per second, and the total runtime is 518 seconds.

In summary, the optimized polling-based allocation algorithm makes the file allocation across subtasks more balanced and effectively improves processing performance, so it is a direction worth considering for optimization.
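The difference between the two ownership functions described above can be sketched as follows. This is an illustrative standalone demo, not the connector's actual code: the method names, file paths, and class name are placeholders, though the two formulas mirror the ones discussed in this PR.

```java
import java.util.HashMap;
import java.util.Map;

public class AllocationDemo {
    // Original scheme: owner derived from the split id's hash code.
    static int hashOwner(String splitId, int numReaders) {
        return (splitId.hashCode() & Integer.MAX_VALUE) % numReaders;
    }

    // Improved scheme: owner derived from a monotonically increasing counter.
    static int pollingOwner(int assignCount, int numReaders) {
        return assignCount % numReaders;
    }

    public static void main(String[] args) {
        String[] files = {"/data/a.txt", "/data/b.txt", "/data/c.txt",
                          "/data/d.txt", "/data/e.txt"};
        int parallelism = 10;

        // Hash-based owners may collide, leaving some subtasks with 2+ files
        // while others sit idle.
        Map<Integer, Integer> hashLoad = new HashMap<>();
        for (String f : files) {
            hashLoad.merge(hashOwner(f, parallelism), 1, Integer::sum);
        }
        System.out.println("hash-based load: " + hashLoad);

        // Round-robin assigns consecutive counter values, so with 5 files and
        // parallelism 10 every owner gets at most one file.
        Map<Integer, Integer> pollLoad = new HashMap<>();
        for (int i = 0; i < files.length; i++) {
            pollLoad.merge(pollingOwner(i, parallelism), 1, Integer::sum);
        }
        System.out.println("round-robin load: " + pollLoad);
    }
}
```

Running this shows that the round-robin map always has five distinct owners with one file each, while the hash-based map depends on how the path hash codes happen to fall.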

Does this PR introduce any user-facing change?

How was this patch tested?

The case above is based on using SeaTunnel to synchronize files from an external HDFS to a local HDFS. In this scenario, I set the task parallelism to 10, the source and sink both to HdfsFile, and placed five files in the upstream HDFS path. The two allocation algorithms were compared by running this actual synchronization task.

I also added unit tests in the FileSourceSplitEnumeratorTest class.

If you have any questions, please contact me. Thanks.

Check list

@@ -107,8 +110,7 @@ private void assignSplit(int taskId) {
         context.assignSplit(taskId, currentTaskSplits);
         // save the state of assigned splits
         assignedSplit.addAll(currentTaskSplits);
-        // remove the assigned splits from pending splits
-        currentTaskSplits.forEach(split -> pendingSplit.remove(split));
Member

Why delete this?

Contributor Author

The polling-based allocation strategy I implemented determines which task a split belongs to from the split's position in the pendingSplit collection, taking the modulus of assignCount and the parallelism. The premise of this method is that pendingSplit remains unchanged during allocation. If allocated splits are removed from the pendingSplit collection, the positions of the remaining splits shift, which can change their owners and cause problems.

I know that the original purpose of this line was to prevent double allocation of a split. However, the submitted code computes the modulus of assignCount and the parallelism, and assigns a split to a task only when the result matches the taskId, so no file is allocated twice. This is verified in the unit tests.

That is why I deleted this line. If it needs to be kept, I can redesign the code flow. If you have any questions, please contact me. Thanks.
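The reasoning above can be sketched as a small standalone example. All names here (class, method, split ids) are simplified placeholders, not the connector's actual code; the point is that position-based ownership is only stable if the collection is iterated in a fixed order and never shrinks during assignment.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class StableAssignSketch {
    // Owner is the split's position modulo the parallelism.
    static int getSplitOwner(int assignCount, int numReaders) {
        return assignCount % numReaders;
    }

    // Returns the splits belonging to taskId. Because the collection is
    // never mutated while iterating, each split's position (and therefore
    // its owner) is stable, and no split can be handed to two tasks.
    static List<String> splitsFor(Set<String> allSplit, int taskId, int parallelism) {
        List<String> result = new ArrayList<>();
        int assignCount = 0;
        for (String split : allSplit) {
            if (getSplitOwner(assignCount, parallelism) == taskId) {
                result.add(split);
            }
            assignCount++;
        }
        return result;
    }

    public static void main(String[] args) {
        // LinkedHashSet keeps insertion order, so positions are deterministic.
        Set<String> allSplit = new LinkedHashSet<>(
                List.of("s0", "s1", "s2", "s3", "s4"));
        // With parallelism 2: task 0 gets even positions, task 1 odd ones.
        System.out.println(splitsFor(allSplit, 0, 2)); // [s0, s2, s4]
        System.out.println(splitsFor(allSplit, 1, 2)); // [s1, s3]
    }
}
```

If splits were removed mid-iteration, the positions of the remaining splits would shift and a later pass could compute a different owner for the same split, which is the problem described above.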

Member

I get your point, thanks for the explanation. This update LGTM.

Member

I think we should rename the field: pendingSplit -> allSplit, so that the new pendingSplit = allSplit - assignedSplit. That is clearer.

Contributor Author

OK, I have renamed the field and committed the change. Thanks for your review.

@@ -0,0 +1,94 @@
package org.apache.seatunnel.connectors.seatunnel.file.split;
Member

The new file needs a license header.

-    private static int getSplitOwner(String tp, int numReaders) {
-        return (tp.hashCode() & Integer.MAX_VALUE) % numReaders;
+    private static int getSplitOwner(int assignCount, int numReaders) {
+        return assignCount % numReaders;
     }

@Override
Member

We should also update the currentUnassignedSplitSize() method; the result should be pendingSplit - assignedSplit.
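The suggested change can be sketched as follows. This is an illustrative standalone class, not the real FileSourceSplitEnumerator; the field and method names simply follow the review discussion.

```java
import java.util.HashSet;
import java.util.Set;

public class UnassignedCountSketch {
    // All splits discovered so far, and the subset already handed out.
    private final Set<String> allSplit = new HashSet<>();
    private final Set<String> assignedSplit = new HashSet<>();

    void discover(String split) { allSplit.add(split); }
    void assign(String split) { assignedSplit.add(split); }

    // The unassigned count is the difference between all discovered
    // splits and those already assigned.
    int currentUnassignedSplitSize() {
        return allSplit.size() - assignedSplit.size();
    }

    public static void main(String[] args) {
        UnassignedCountSketch e = new UnassignedCountSketch();
        e.discover("s0"); e.discover("s1"); e.discover("s2");
        e.assign("s0");
        System.out.println(e.currentUnassignedSplitSize()); // prints 2
    }
}
```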

Contributor Author

OK, I have updated the code as requested. Thanks for your review.

@JeremyXin JeremyXin force-pushed the improve-connector-file-base-0104 branch from 19e589b to 23818c5 Compare January 7, 2025 03:03
Hisoka-X
Hisoka-X previously approved these changes Jan 7, 2025
Member

@Hisoka-X Hisoka-X left a comment

LGTM. cc @hailin0

@hailin0 hailin0 added the don't merge There needs to be a specific reason in the PR, and it cannot be merged for the time being. label Jan 7, 2025
         int splitOwner =
-                getSplitOwner(fileSourceSplit.splitId(), context.currentParallelism());
+                getSplitOwner(assignCount.getAndIncrement(), context.currentParallelism());
Member

Contributor Author

Could you explain in detail the circumstances under which ParallelSource causes duplicate file reading? I will try to reproduce it. So far my test cases have not shown this behavior.

Also, please let me know if you have any other suggested changes. Thanks for your review.

Member

For ParallelSource, the object relationship created is as follows:

Parallelism=5

FileSourceSplitEnumerator-1 -> Reader-1
FileSourceSplitEnumerator-2 -> Reader-2
FileSourceSplitEnumerator-3 -> Reader-3
FileSourceSplitEnumerator-4 -> Reader-4
FileSourceSplitEnumerator-5 -> Reader-5

Contributor Author

@hailin0 I have found the reason for the duplicate file allocation under ParallelSource: the allSplit field of FileSourceSplitEnumerator is a HashSet, so when different ParallelSource objects initialize allSplit in the open method, the iteration order of the files in the collection can differ between instances. As a result, under the algorithm above, files allocated to readers in different ParallelSource instances can be duplicated. I have reproduced this situation in a unit test; you are right.

To solve this, I changed the type of allSplit to a TreeSet with a comparator based on splitId values. This guarantees the same iteration order of the allSplit collection across different ParallelSource instances, which eliminates the duplicate file allocation.

I have added a ParallelSource unit test under the seatunnel-translation-base project. Please check whether any problems remain in the testing process.
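The ordering fix described above can be demonstrated in isolation. This is a minimal sketch with made-up split ids, not the connector's code: the point is only that a HashSet gives no iteration-order guarantee, while a TreeSet ordered by splitId iterates identically in every enumerator instance, so position-based ownership agrees across readers.

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

public class DeterministicOrderSketch {
    public static void main(String[] args) {
        // TreeSet ordered by the split id's natural (string) order.
        Set<String> allSplit = new TreeSet<>(Comparator.naturalOrder());

        // Different enumerator instances may discover files in different
        // orders...
        allSplit.add("file-b_0");
        allSplit.add("file-a_0");
        allSplit.add("file-c_0");

        // ...but iteration order is always sorted by splitId, so every
        // instance computes the same position for each split.
        System.out.println(allSplit); // [file-a_0, file-b_0, file-c_0]
    }
}
```

With a plain HashSet, two JVM instances (or even two enumerators in one JVM) may iterate the same elements in different orders, so `position % parallelism` can map one split to two different readers, which matches the duplication observed here.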

Member

@hailin0 hailin0 Jan 8, 2025

What if the number of files is inconsistent across multiple reads?

Member

LGTM

cc @Hisoka-X

Member

@Hisoka-X Hisoka-X left a comment

LGTM. Thanks @JeremyXin , waiting #8485 to fix ci.

@Hisoka-X
Member

Hisoka-X commented Jan 9, 2025

Please rebase from dev to retrigger ci.

@JeremyXin JeremyXin closed this Jan 9, 2025
@JeremyXin JeremyXin reopened this Jan 9, 2025
@JeremyXin
Contributor Author

> Please rebase from dev to retrigger ci.

I used a merge from dev instead. Is that correct?

@Hisoka-X
Member

Hisoka-X commented Jan 9, 2025

@JeremyXin JeremyXin force-pushed the improve-connector-file-base-0104 branch from 6349e3c to 9c0437a Compare January 9, 2025 12:29
@JeremyXin
Copy link
Contributor Author

@Hisoka-X OK, I've added the license header to fix CI.

Member

@Hisoka-X Hisoka-X left a comment

Thanks @JeremyXin

@Hisoka-X
Member

Hisoka-X commented Jan 10, 2025

I found that many connectors have the same problem, such as Jdbc. Could you help improve them too? @JeremyXin
Of course, this does not affect merging this PR.

@hailin0 hailin0 merged commit d61cba2 into apache:dev Jan 10, 2025
5 checks passed
@JeremyXin
Contributor Author

@Hisoka-X OK, I'd be glad to do this. I will study how the other connectors work, such as Jdbc, and will open a PR or contact you when there is progress. Thanks!

Labels: approved, connectors-v2, don't merge, file, reviewed
Successfully merging this pull request may close these issues.

[Feature][connector-file-base] The number of files allocated to subtasks is unbalanced.