diff --git a/docs/en/connector-v2/source/CosFile.md b/docs/en/connector-v2/source/CosFile.md
index 702439c3062..15b6de0c6f8 100644
--- a/docs/en/connector-v2/source/CosFile.md
+++ b/docs/en/connector-v2/source/CosFile.md
@@ -45,7 +45,7 @@ To use this connector you need put hadoop-cos-{hadoop.version}-{version}.jar and
 
 ## Options
 
-| name                      | type    | required | default value       |
+| name                      | type    | required | default value       |
 |---------------------------|---------|----------|---------------------|
 | path                      | string  | yes      | -                   |
 | file_format_type          | string  | yes      | -                   |
@@ -64,7 +64,7 @@ To use this connector you need put hadoop-cos-{hadoop.version}-{version}.jar and
 | sheet_name                | string  | no       | -                   |
 | xml_row_tag               | string  | no       | -                   |
 | xml_use_attr_format       | boolean | no       | -                   |
-| file_filter_pattern       | string  | no       | -                   |
+| file_filter_pattern       | string  | no       |                     |
 | compress_codec            | string  | no       | none                |
 | archive_compress_codec    | string  | no       | none                |
 | encoding                  | string  | no       | UTF-8               |
@@ -275,6 +275,55 @@ Specifies Whether to process data using the tag attribute format.
 
 Filter pattern, which used for filtering files.
 
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples:
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Example:
+
+**Example 1**: *Match all `.txt` files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with `abc`*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with `abc` whose fourth character is `h` or `g`*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match files in third-level folders starting with `202410` that end with `.csv`*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
 ### compress_codec [string]
 
 The compress codec of files and the details that supported as the following shown:
@@ -372,6 +421,33 @@ sink {
 
 ```
 
+### Filter File
+
+```hocon
+env {
+  parallelism = 1
+  job.mode = "BATCH"
+}
+
+source {
+  CosFile {
+    bucket = "cosn://seatunnel-test-1259587829"
+    secret_id = "xxxxxxxxxxxxxxxxxxx"
+    secret_key = "xxxxxxxxxxxxxxxxxxx"
+    region = "ap-chengdu"
+    path = "/seatunnel/read/binary/"
+    file_format_type = "binary"
+    // file example abcD2024.csv
+    file_filter_pattern = "abc[DX]*.*"
+  }
+}
+
+sink {
+  Console {
+  }
+}
+```
+
 ## Changelog
 
 ### next version
diff --git a/docs/en/connector-v2/source/FtpFile.md b/docs/en/connector-v2/source/FtpFile.md
index ec02f77f9f7..6d114813769 100644
--- a/docs/en/connector-v2/source/FtpFile.md
+++ b/docs/en/connector-v2/source/FtpFile.md
@@ -84,6 +84,59 @@ The target ftp password is required
 
 The source file path.
 
+### file_filter_pattern [string]
+
+Filter pattern, which is used for filtering files.
+
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples:
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Example:
+
+**Example 1**: *Match all `.txt` files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with `abc`*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with `abc` whose fourth character is `h` or `g`*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match files in third-level folders starting with `202410` that end with `.csv`*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
 ### file_format_type [string]
 
 File type, supported as the following file types:
@@ -400,6 +453,33 @@ sink {
 
 ```
 
+### Filter File
+
+```hocon
+env {
+  parallelism = 1
+  job.mode = "BATCH"
+}
+
+source {
+  FtpFile {
+    host = "192.168.31.48"
+    port = 21
+    user = tyrantlucifer
+    password = tianchao
+    path = "/seatunnel/read/binary/"
+    file_format_type = "binary"
+    // file example abcD2024.csv
+    file_filter_pattern = "abc[DX]*.*"
+  }
+}
+
+sink {
+  Console {
+  }
+}
+```
+
 ## Changelog
 
 ### 2.2.0-beta 2022-09-26
diff --git a/docs/en/connector-v2/source/HdfsFile.md b/docs/en/connector-v2/source/HdfsFile.md
index 7413c0428b8..405dfff820f 100644
--- a/docs/en/connector-v2/source/HdfsFile.md
+++ b/docs/en/connector-v2/source/HdfsFile.md
@@ -41,7 +41,7 @@ Read data from hdfs file system.
 
 ## Source Options
 
-| Name | Type | Required | Default | Description |
+| Name | Type | Required | Default | Description |
 |---------------------------|---------|----------|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | path | string | yes | - | The source file path. |
 | file_format_type | string | yes | - | We supported as the following file types:`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
@@ -62,6 +62,7 @@ Read data from hdfs file system.
 | sheet_name | string | no | - | Reader the sheet of the workbook,Only used when file_format is excel. |
 | xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file, only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
+| file_filter_pattern | string | no | | Filter pattern, which is used for filtering files. |
 | compress_codec | string | no | none | The compress codec of files |
 | archive_compress_codec | string | no | none | |
 | encoding | string | no | UTF-8 | |
@@ -71,6 +72,59 @@ Read data from hdfs file system.
 
 **delimiter** parameter will deprecate after version 2.3.5, please use **field_delimiter** instead.
 
+### file_filter_pattern [string]
+
+Filter pattern, which is used for filtering files.
+
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples:
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Example:
+
+**Example 1**: *Match all `.txt` files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with `abc`*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with `abc` whose fourth character is `h` or `g`*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match files in third-level folders starting with `202410` that end with `.csv`*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
 ### compress_codec [string]
 
 The compress codec of files and the details that supported as the following shown:
@@ -146,3 +200,26 @@ sink {
 }
 ```
 
+### Filter File
+
+```hocon
+env {
+  parallelism = 1
+  job.mode = "BATCH"
+}
+
+source {
+  HdfsFile {
+    path = "/apps/hive/demo/student"
+    file_format_type = "json"
+    fs.defaultFS = "hdfs://namenode001"
+    // file example abcD2024.csv
+    file_filter_pattern = "abc[DX]*.*"
+  }
+}
+
+sink {
+  Console {
+  }
+}
+```
diff --git a/docs/en/connector-v2/source/LocalFile.md b/docs/en/connector-v2/source/LocalFile.md
index 6d11b992e3a..65f287f057b 100644
--- a/docs/en/connector-v2/source/LocalFile.md
+++ b/docs/en/connector-v2/source/LocalFile.md
@@ -43,7 +43,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
 
 ## Options
 
-| name                      | type    | required | default value                        |
+| name                      | type    | required | default value                        |
 |---------------------------|---------|----------|--------------------------------------|
 | path                      | string  | yes      | -                                    |
 | file_format_type          | string  | yes      | -                                    |
@@ -58,7 +58,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
 | sheet_name                | string  | no       | -                                    |
 | xml_row_tag               | string  | no       | -                                    |
 | xml_use_attr_format       | boolean | no       | -                                    |
-| file_filter_pattern       | string  | no       | -                                    |
+| file_filter_pattern       | string  | no       |                                      |
 | compress_codec            | string  | no       | none                                 |
 | archive_compress_codec    | string  | no       | none                                 |
 | encoding                  | string  | no       | UTF-8                                |
@@ -254,6 +254,55 @@ Specifies Whether to process data using the tag attribute format.
 
 Filter pattern, which used for filtering files.
 
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples:
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Example:
+
+**Example 1**: *Match all `.txt` files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with `abc`*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with `abc` whose fourth character is `h` or `g`*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match files in third-level folders starting with `202410` that end with `.csv`*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
 ### compress_codec [string]
 
 The compress codec of files and the details that supported as the following shown:
@@ -406,6 +455,30 @@ sink {
 
 ```
 
+### Filter File
+
+```hocon
+env {
+  parallelism = 1
+  job.mode = "BATCH"
+}
+
+source {
+  LocalFile {
+    path = "/data/seatunnel/"
+    file_format_type = "csv"
+    skip_header_row_number = 1
+    // file example abcD2024.csv
+    file_filter_pattern = "abc[DX]*.*"
+  }
+}
+
+sink {
+  Console {
+  }
+}
+```
+
 ## Changelog
 
 ### 2.2.0-beta 2022-09-26
diff --git a/docs/en/connector-v2/source/OssFile.md b/docs/en/connector-v2/source/OssFile.md
index d5326cb86a4..36d998f054c 100644
--- a/docs/en/connector-v2/source/OssFile.md
+++ b/docs/en/connector-v2/source/OssFile.md
@@ -190,7 +190,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
 
 ## Options
 
-| name | type | required | default value | Description |
+| name | type | required | default value | Description |
 |---------------------------|---------|----------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | path | string | yes | - | The Oss path that needs to be read can have sub paths, but the sub paths need to meet certain format requirements. Specific requirements can be referred to "parse_partition_from_path" option |
 | file_format_type | string | yes | - | File type, supported as the following file types: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` |
@@ -211,7 +211,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
 | xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
 | compress_codec | string | no | none | Which compress codec the files used. |
 | encoding | string | no | UTF-8 |
-| file_filter_pattern | string | no | | `*.txt` means you only need read the files end with `.txt` |
+| file_filter_pattern | string | no | | Filter pattern, which is used for filtering files. |
 | common-options | config | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |
 
 ### compress_codec [string]
@@ -233,6 +233,55 @@ The encoding of the file to read. This param will be parsed by `Charset.forName(
 
 Filter pattern, which used for filtering files.
 
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples:
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Example:
+
+**Example 1**: *Match all `.txt` files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with `abc`*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with `abc` whose fourth character is `h` or `g`*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match files in third-level folders starting with `202410` that end with `.csv`*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
 ### schema [config]
 
 Only need to be configured when the file_format_type are text, json, excel, xml or csv ( Or other format we can't read the schema from metadata).
@@ -474,6 +523,33 @@ sink {
 }
 ```
 
+### Filter File
+
+```hocon
+env {
+  parallelism = 1
+  job.mode = "BATCH"
+}
+
+source {
+  OssFile {
+    path = "/seatunnel/orc"
+    bucket = "oss://tyrantlucifer-image-bed"
+    access_key = "xxxxxxxxxxxxxxxxx"
+    access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
+    endpoint = "oss-cn-beijing.aliyuncs.com"
+    file_format_type = "orc"
+    // file example abcD2024.csv
+    file_filter_pattern = "abc[DX]*.*"
+  }
+}
+
+sink {
+  Console {
+  }
+}
+```
+
 ## Changelog
 
 ### 2.2.0-beta 2022-09-26
diff --git a/docs/en/connector-v2/source/OssJindoFile.md b/docs/en/connector-v2/source/OssJindoFile.md
index d5bd6d14fa3..933439edc9f 100644
--- a/docs/en/connector-v2/source/OssJindoFile.md
+++ b/docs/en/connector-v2/source/OssJindoFile.md
@@ -49,7 +49,7 @@ It only supports hadoop version **2.9.X+**.
 
 ## Options
 
-| name                      | type    | required | default value       |
+| name                      | type    | required | default value       |
 |---------------------------|---------|----------|---------------------|
 | path                      | string  | yes      | -                   |
 | file_format_type          | string  | yes      | -                   |
@@ -68,7 +68,7 @@ It only supports hadoop version **2.9.X+**.
 | sheet_name                | string  | no       | -                   |
 | xml_row_tag               | string  | no       | -                   |
 | xml_use_attr_format       | boolean | no       | -                   |
-| file_filter_pattern       | string  | no       | -                   |
+| file_filter_pattern       | string  | no       |                     |
 | compress_codec            | string  | no       | none                |
 | archive_compress_codec    | string  | no       | none                |
 | encoding                  | string  | no       | UTF-8               |
@@ -267,6 +267,55 @@ Reader the sheet of the workbook.
 
 Filter pattern, which used for filtering files.
 
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples:
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Example:
+
+**Example 1**: *Match all `.txt` files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with `abc`*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with `abc` whose fourth character is `h` or `g`*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match files in third-level folders starting with `202410` that end with `.csv`*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
 ### compress_codec [string]
 
 The compress codec of files and the details that supported as the following shown:
@@ -364,6 +413,33 @@ sink {
 
 ```
 
+### Filter File
+
+```hocon
+env {
+  parallelism = 1
+  job.mode = "BATCH"
+}
+
+source {
+  OssJindoFile {
+    bucket = "oss://tyrantlucifer-image-bed"
+    access_key = "xxxxxxxxxxxxxxxxx"
+    access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
+    endpoint = "oss-cn-beijing.aliyuncs.com"
+    path = "/seatunnel/read/binary/"
+    file_format_type = "binary"
+    // file example abcD2024.csv
+    file_filter_pattern = "abc[DX]*.*"
+  }
+}
+
+sink {
+  Console {
+  }
+}
+```
+
 ## Changelog
 
 ### next version
diff --git a/docs/en/connector-v2/source/S3File.md b/docs/en/connector-v2/source/S3File.md
index d280d6dc7f2..4834b025bc3 100644
--- a/docs/en/connector-v2/source/S3File.md
+++ b/docs/en/connector-v2/source/S3File.md
@@ -196,7 +196,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
 
 ## Options
 
-| name | type | required | default value | Description |
+| name | type | required | default value | Description |
 |---------------------------------|---------|----------|-------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | path | string | yes | - | The s3 path that needs to be read can have sub paths, but the sub paths need to meet certain format requirements. Specific requirements can be referred to "parse_partition_from_path" option |
 | file_format_type | string | yes | - | File type, supported as the following file types: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` |
@@ -220,12 +220,66 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
 | compress_codec | string | no | none | |
 | archive_compress_codec | string | no | none | |
 | encoding | string | no | UTF-8 | |
+| file_filter_pattern | string | no | | Filter pattern, which is used for filtering files. |
 | common-options | | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |
 
 ### delimiter/field_delimiter [string]
 
 **delimiter** parameter will deprecate after version 2.3.5, please use **field_delimiter** instead.
 
+### file_filter_pattern [string]
+
+Filter pattern, which is used for filtering files.
+
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples:
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Example:
+
+**Example 1**: *Match all `.txt` files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with `abc`*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with `abc` whose fourth character is `h` or `g`*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match files in third-level folders starting with `202410` that end with `.csv`*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
 ### compress_codec [string]
 
 The compress codec of files and the details that supported as the following shown:
@@ -349,6 +403,33 @@ sink {
 }
 ```
 
+### Filter File
+
+```hocon
+env {
+  parallelism = 1
+  job.mode = "BATCH"
+}
+
+source {
+  S3File {
+    path = "/seatunnel/json"
+    bucket = "s3a://seatunnel-test"
+    fs.s3a.endpoint="s3.cn-north-1.amazonaws.com.cn"
+    fs.s3a.aws.credentials.provider="com.amazonaws.auth.InstanceProfileCredentialsProvider"
+    file_format_type = "json"
+    read_columns = ["id", "name"]
+    // file example abcD2024.csv
+    file_filter_pattern = "abc[DX]*.*"
+  }
+}
+
+sink {
+  Console {
+  }
+}
+```
+
 ## Changelog
 
 ### 2.3.0-beta 2022-10-20
diff --git a/docs/en/connector-v2/source/SftpFile.md b/docs/en/connector-v2/source/SftpFile.md
index 3eadcd3a69e..95c710110a0 100644
--- a/docs/en/connector-v2/source/SftpFile.md
+++ b/docs/en/connector-v2/source/SftpFile.md
@@ -71,7 +71,7 @@ The File does not have a specific type list, and we can indicate which SeaTunnel
 
 ## Source Options
 
-| Name | Type | Required | default value | Description |
+| Name | Type | Required | default value | Description |
 |---------------------------|---------|----------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | host | String | Yes | - | The target sftp host is required |
 | port | Int | Yes | - | The target sftp port is required |
@@ -96,6 +96,59 @@ The File does not have a specific type list, and we can indicate which SeaTunnel
 | encoding | string | no | UTF-8 |
 | common-options | | No | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |
 
+### file_filter_pattern [string]
+
+Filter pattern, which is used for filtering files.
+
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples:
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Example:
+
+**Example 1**: *Match all `.txt` files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with `abc`*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with `abc` whose fourth character is `h` or `g`*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match files in third-level folders starting with `202410` that end with `.csv`*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The matching result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
 ### file_format_type [string]
 
 File type, supported as the following file types:
@@ -305,3 +358,30 @@ SftpFile {
 
 ```
 
+### Filter File
+
+```hocon
+env {
+  parallelism = 1
+  job.mode = "BATCH"
+}
+
+source {
+  SftpFile {
+    host = "sftp"
+    port = 22
+    user = seatunnel
+    password = pass
+    path = "tmp/seatunnel/read/json"
+    file_format_type = "json"
+    result_table_name = "sftp"
+    // file example abcD2024.csv
+    file_filter_pattern = "abc[DX]*.*"
+  }
+}
+
+sink {
+  Console {
+  }
+}
+```
\ No newline at end of file
diff --git a/docs/zh/connector-v2/source/HdfsFile.md b/docs/zh/connector-v2/source/HdfsFile.md
index 0f983a80bcf..9cd254ef808 100644
--- a/docs/zh/connector-v2/source/HdfsFile.md
+++ b/docs/zh/connector-v2/source/HdfsFile.md
@@ -39,7 +39,7 @@
 
 ## 源选项
 
-| 名称 | 类型 | 是否必须 | 默认值 | 描述 |
+| 名称 | 类型 | 是否必须 | 默认值 | 描述 |
 |---------------------------|---------|------|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | path | string | 是 | - | 源文件路径。 |
 | file_format_type | string | 是 | - | 我们支持以下文件类型：`text` `json` `csv` `orc` `parquet` `excel`。请注意，最终文件名将以文件格式的后缀结束，文本文件的后缀是 `txt`。 |
@@ -55,6 +55,7 @@
 | kerberos_principal | string | 否 | - | kerberos 的 principal。 |
 | kerberos_keytab_path | string | 否 | - | kerberos 的 keytab 路径。 |
 | skip_header_row_number | long | 否 | 0 | 跳过前几行，但仅适用于 txt 和 csv。例如，设置如下：`skip_header_row_number = 2`。然后 Seatunnel 将跳过源文件中的前两行。 |
+| file_filter_pattern | string | 否 | - | 过滤模式，用于过滤文件。 |
 | schema | config | 否 | - | 上游数据的模式字段。 |
 | sheet_name | string | 否 | - | 读取工作簿的表格，仅在文件格式为 excel 时使用。 |
 | compress_codec | string | 否 | none | 文件的压缩编解码器。 |
@@ -64,6 +65,60 @@
 
 **delimiter** 参数在版本 2.3.5 后将被弃用，请改用 **field_delimiter**。
 
+### file_filter_pattern [string]
+
+过滤模式，用于过滤文件。
+
+这个过滤规则遵循标准正则表达式。详情请参考 https://en.wikipedia.org/wiki/Regular_expression
+
+这里是一些例子：
+
+文件清单：
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+匹配规则：
+
+**例子 1**：*匹配所有以 txt 为后缀名的文件*，匹配正则为：
+```
+/data/seatunnel/20241001/.*\.txt
+```
+匹配的结果是：
+```
+/data/seatunnel/20241001/report.txt
+```
+**例子 2**：*匹配所有文件名以 abc 开头的文件*，匹配正则为：
+```
+/data/seatunnel/\d*/abc.*
+```
+匹配的结果是：
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**例子 3**：*匹配所有文件名以 abc 开头，且第四个字符是 h 或 g 的文件*，匹配正则为：
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+匹配的结果是：
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**例子 4**：*匹配第三级文件夹以 202410 开头且文件后缀名是 .csv 的文件*，匹配正则为：
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+匹配的结果是：
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
 ### compress_codec [string]
 
 文件的压缩编解码器及支持的详细信息如下所示：
@@ -125,3 +180,25 @@ sink {
 }
 ```
 
+### Filter File
+
+```hocon
+env {
+  parallelism = 1
+  job.mode = "BATCH"
+}
+
+source {
+  HdfsFile {
+    path = "/apps/hive/demo/student"
+    file_format_type = "json"
+    fs.defaultFS = "hdfs://namenode001"
+    file_filter_pattern = "abc[DX]*.*"
+  }
+}
+
+sink {
+  Console {
+  }
+}
+```
\ No newline at end of file
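The `file_filter_pattern` examples this change adds can be sanity-checked mechanically. The sketch below is illustrative only and is not part of any connector: it assumes the pattern must match the full path, and it uses Python's `re` module as a stand-in for the Java regex engine SeaTunnel runs on (the two agree for patterns this simple).

```python
import re

# Sample paths from the documentation's "File Structure Example".
files = [
    "/data/seatunnel/20241001/report.txt",
    "/data/seatunnel/20241007/abch202410.csv",
    "/data/seatunnel/20241002/abcg202410.csv",
    "/data/seatunnel/20241005/old_data.csv",
    "/data/seatunnel/20241012/logo.png",
]

def filter_files(pattern: str, paths: list[str]) -> list[str]:
    """Keep only the paths that the pattern fully matches."""
    regex = re.compile(pattern)
    return [p for p in paths if regex.fullmatch(p)]

# Example 1: all .txt files under the 20241001 directory.
print(filter_files(r"/data/seatunnel/20241001/.*\.txt", files))

# Example 4: third-level folders starting with 202410, files ending in .csv.
print(filter_files(r"/data/seatunnel/202410\d*/.*\.csv", files))
```

Running the script prints the same result lists the documentation shows for Examples 1 and 4.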