[Feature][File] Support config null format for text file read #8109

Merged 1 commit on Nov 27, 2024
10 changes: 9 additions & 1 deletion docs/en/connector-v2/source/FtpFile.md
@@ -38,7 +38,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you

## Options

| name | type | required | default value |
|---------------------------|---------|----------|---------------------|
| host | string | yes | - |
| port | int | yes | - |
@@ -62,6 +62,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| null_format | string | no | - |
| common-options | | no | - |

### host [string]
@@ -336,6 +337,13 @@ The compress codec of archive files and the details that supported as the follow
Only used when file_format_type is json,text,csv,xml.
The encoding of the file to read. This param will be parsed by `Charset.forName(encoding)`.
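
As a rough illustration of what resolving `encoding` through `Charset.forName(encoding)` implies, the sketch below decodes raw bytes with a charset looked up by name. The class and method names (`EncodingSketch`, `decode`) are hypothetical and are not taken from the connector; only `Charset.forName` comes from the text above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the connector's actual reader code.
final class EncodingSketch {

    // Decode raw file bytes using the charset named by the `encoding` option.
    static String decode(byte[] raw, String encoding) {
        // Charset.forName throws an IllegalArgumentException subclass
        // (IllegalCharsetNameException or UnsupportedCharsetException)
        // when the name is malformed or not available in this JVM.
        Charset charset = Charset.forName(encoding);
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(decode(latin1, "ISO-8859-1"));

        // An unknown charset name fails fast instead of silently mis-decoding.
        try {
            decode(latin1, "no-such-charset");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
    }
}
```

A practical consequence, assuming the option is resolved this way: a misspelled charset name would surface as an exception when the option is resolved rather than as corrupted rows later on.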

### null_format [string]

Only used when file_format_type is text.
null_format to define which strings can be represented as null.

e.g: `\N`
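
To make the effect of `null_format` concrete, here is a minimal sketch of how a text reader could map a configured marker such as `\N` to a real null while splitting a line. It assumes exact, whole-field matching and a literal (non-regex) delimiter; the helper `parseLine` and the class name are hypothetical and do not reproduce the connector's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the connector's actual parsing code.
final class NullFormatSketch {

    // Split one text line and turn every field equal to nullFormat into null.
    // A null nullFormat means the option is not set, so no field is converted.
    static List<String> parseLine(String line, String delimiter, String nullFormat) {
        // Pattern.quote treats the delimiter literally; limit -1 keeps trailing empty fields.
        String[] parts = line.split(Pattern.quote(delimiter), -1);
        List<String> fields = new ArrayList<>(parts.length);
        for (String part : parts) {
            boolean isNull = nullFormat != null && nullFormat.equals(part);
            fields.add(isNull ? null : part);
        }
        return fields;
    }

    public static void main(String[] args) {
        // With null_format = \N the middle field becomes a real null.
        System.out.println(parseLine("1,\\N,alice", ",", "\\N")); // prints [1, null, alice]

        // Without null_format the marker stays an ordinary string.
        System.out.println(parseLine("1,\\N,alice", ",", null)); // prints [1, \N, alice]
    }
}
```

Under the exact-match assumption, a field that merely contains the marker (for example `a\Nb`) is left untouched.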

### common options

Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details.
1 change: 1 addition & 0 deletions docs/en/connector-v2/source/HdfsFile.md
@@ -66,6 +66,7 @@ Read data from hdfs file system.
| compress_codec | string | no | none | The compress codec of files |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 | |
| null_format | string | no | - | Only used when file_format_type is text. null_format to define which strings can be represented as null. e.g: `\N` |
| common-options | | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

### delimiter/field_delimiter [string]
8 changes: 8 additions & 0 deletions docs/en/connector-v2/source/LocalFile.md
@@ -62,6 +62,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| null_format | string | no | - |
| common-options | | no | - |
| tables_configs | list | no | used to define a multiple table task |

@@ -330,6 +331,13 @@ The compress codec of archive files and the details that supported as the follow
Only used when file_format_type is json,text,csv,xml.
The encoding of the file to read. This param will be parsed by `Charset.forName(encoding)`.

### null_format [string]

Only used when file_format_type is text.
null_format to define which strings can be represented as null.

e.g: `\N`

### common options

Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details
1 change: 1 addition & 0 deletions docs/en/connector-v2/source/OssFile.md
@@ -211,6 +211,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
| xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
| compress_codec | string | no | none | Which compress codec the files used. |
| encoding | string | no | UTF-8 |
| null_format | string | no | - | Only used when file_format_type is text. null_format to define which strings can be represented as null. e.g: `\N` |
| file_filter_pattern | string | no | | Filter pattern, which used for filtering files. |
| common-options | config | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

8 changes: 8 additions & 0 deletions docs/en/connector-v2/source/OssJindoFile.md
@@ -72,6 +72,7 @@ It only supports hadoop version **2.9.X+**.
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| null_format | string | no | - |
| common-options | | no | - |

### path [string]
@@ -343,6 +344,13 @@ The compress codec of archive files and the details that supported as the follow
Only used when file_format_type is json,text,csv,xml.
The encoding of the file to read. This param will be parsed by `Charset.forName(encoding)`.

### null_format [string]

Only used when file_format_type is text.
null_format to define which strings can be represented as null.

e.g: `\N`

### common options

Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details.
1 change: 1 addition & 0 deletions docs/en/connector-v2/source/S3File.md
@@ -220,6 +220,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
| compress_codec | string | no | none | |
| archive_compress_codec | string | no | none | |
| encoding | string | no | UTF-8 | |
| null_format | string | no | - | Only used when file_format_type is text. null_format to define which strings can be represented as null. e.g: `\N` |
| file_filter_pattern | string | no | | Filter pattern, which used for filtering files. |
| common-options | | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

3 changes: 2 additions & 1 deletion docs/en/connector-v2/source/SftpFile.md
@@ -71,7 +71,7 @@ The File does not have a specific type list, and we can indicate which SeaTunnel

## Source Options

| Name | Type | Required | default value | Description |
|---------------------------|---------|----------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| host | String | Yes | - | The target sftp host is required |
| port | Int | Yes | - | The target sftp port is required |
@@ -94,6 +94,7 @@ The File does not have a specific type list, and we can indicate which SeaTunnel
| compress_codec | String | No | None | The compress codec of files and the details that supported as the following shown: <br/> - txt: `lzo` `None` <br/> - json: `lzo` `None` <br/> - csv: `lzo` `None` <br/> - orc: `lzo` `snappy` `lz4` `zlib` `None` <br/> - parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `None` <br/> Tips: excel type does Not support any compression format |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| null_format | string | no | - | Only used when file_format_type is text. null_format to define which strings can be represented as null. e.g: `\N` |
| common-options | | No | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

### file_filter_pattern [string]
1 change: 1 addition & 0 deletions docs/zh/connector-v2/source/HdfsFile.md
@@ -56,6 +56,7 @@
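| kerberos_keytab_path | string | no | - | The keytab path of kerberos. |
| skip_header_row_number | long | no | 0 | Skip the first few lines, but only for txt and csv. For example, set `skip_header_row_number = 2`; SeaTunnel will then skip the first two lines of the source files. |
| file_filter_pattern | string | no | - | Filter pattern, used for filtering files. |
| null_format | string | no | - | Defines which strings can be represented as null, but only applies to txt and csv. For example: `\N` |
| schema | config | no | - | The schema fields of the upstream data. |
| sheet_name | string | no | - | The sheet of the workbook to read, only used when the file format is excel. |
| compress_codec | string | no | none | The compress codec of the files. |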
@@ -285,11 +285,11 @@ private boolean compareMapValue(Map<?, ?> value, MapType<?, ?> type, Map<?, ?> c

     private Boolean checkType(Object value, SeaTunnelDataType<?> fieldType) {
         if (value == null) {
-            if (fieldType.getSqlType() == SqlType.NULL) {
-                return true;
-            } else {
-                return false;
-            }
+            return true;
         }
+
+        if (fieldType.getSqlType() == SqlType.NULL) {
+            return false;
+        }

         if (fieldType.getSqlType() == SqlType.ROW) {
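
The behavioural change in `checkType` can be summarised with a small stand-alone sketch: a null value now passes for any declared field type, and a non-null value never satisfies a NULL-typed field. The enum `Kind` is a hypothetical stand-in for `SqlType`, and the `instanceof` checks for `STRING` and `INT` are assumptions added only to make the sketch runnable; the rest of the real method is not shown in this hunk.

```java
// Hypothetical illustration only; mirrors the null handling of the hunk above.
final class CheckTypeSketch {

    // Stand-in for SeaTunnel's SqlType, reduced to what the sketch needs.
    enum Kind { NULL, STRING, INT }

    static boolean checkType(Object value, Kind fieldKind) {
        if (value == null) {
            // New behaviour: null is accepted whatever the declared type is.
            return true;
        }
        if (fieldKind == Kind.NULL) {
            // A non-null value cannot match a NULL-typed field.
            return false;
        }
        // Assumed per-type checks, not taken from the original method.
        switch (fieldKind) {
            case STRING:
                return value instanceof String;
            case INT:
                return value instanceof Integer;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(checkType(null, Kind.STRING)); // prints true
        System.out.println(checkType("x", Kind.NULL));    // prints false
        System.out.println(checkType("x", Kind.STRING));  // prints true
    }
}
```

Before the change, the first call in `main` would have corresponded to a failed check, because null was only accepted for NULL-typed fields. That matters once a text reader can emit nulls for typed columns.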
@@ -48,6 +48,12 @@ public class BaseSourceConfigOptions {
                     .withDescription(
                             "The separator between columns in a row of data. Only needed by `text` file format");

+    public static final Option<String> NULL_FORMAT =
+            Options.key("null_format")
+                    .stringType()
+                    .noDefaultValue()
+                    .withDescription("The string that represents a null value");
+
     public static final Option<String> ENCODING =
             Options.key("encoding")
                     .stringType()
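
Because `NULL_FORMAT` is declared with `noDefaultValue()`, a reader has to treat the option as possibly absent. The sketch below shows one way to model that with `Optional`, using a plain `Map<String, String>` as a hypothetical stand-in for the connector's configuration object; none of these helper names come from SeaTunnel.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; a plain Map stands in for the real config type.
final class OptionalOptionSketch {

    // An option without a default value is simply missing when the user does not set it.
    static Optional<String> nullFormat(Map<String, String> options) {
        return Optional.ofNullable(options.get("null_format"));
    }

    // Convert a field to null only when the option is present and equals the field.
    static String convert(String field, Map<String, String> options) {
        boolean matches = nullFormat(options).filter(marker -> marker.equals(field)).isPresent();
        return matches ? null : field;
    }

    public static void main(String[] args) {
        Map<String, String> configured = Collections.singletonMap("null_format", "\\N");
        Map<String, String> unset = Collections.emptyMap();

        System.out.println(convert("\\N", configured)); // prints null
        System.out.println(convert("\\N", unset));      // prints \N
    }
}
```

Leaving the option without a default keeps existing jobs unchanged under this model: unless `null_format` is set explicitly, no string is reinterpreted as null.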