Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement: reuse the tcp connection when plan files #522

Closed
chenzl25 opened this issue Aug 5, 2024 · 13 comments · Fixed by #544
Closed

Improvement: reuse the tcp connection when plan files #522

chenzl25 opened this issue Aug 5, 2024 · 13 comments · Fixed by #544

Comments

@chenzl25
Copy link
Contributor

chenzl25 commented Aug 5, 2024

Considering I am trying to read an iceberg table from S3. Currently, plan_files() seems unable to reuse the TCP connection for HTTP requests. It will lead to a relatively high latency. I am not sure whether it is a good practice to enable Connection: keep-alive by default, what do you think?

@liurenjie1024
Copy link
Contributor

Considering I am trying to read an iceberg table from S3. Currently, plan_files() seems unable to reuse the TCP connection for HTTP requests. It will lead to a relatively high latency. I am not sure whether it is a good practice to enable Connection: keep-alive by default, what do you think?

I think the buffering is handles by opendal? cc @Xuanwo

@Xuanwo
Copy link
Member

Xuanwo commented Aug 5, 2024

Hi, OpenDAL handles those connection-related tasks (by reqwest).

Currently, FileIO builds new operators every time:

#[cfg(feature = "storage-s3")]
Storage::S3 { scheme_str, config } => {
let op = super::s3_config_build(config, path)?;
let op_info = op.info();
// Check prefix of s3 path.
let prefix = format!("{}://{}/", scheme_str, op_info.name());
if path.starts_with(&prefix) {
Ok((op, &path[prefix.len()..]))
} else {
Err(Error::new(
ErrorKind::DataInvalid,
format!("Invalid s3 url: {}, should start with {}", path, prefix),
))
}
}

I think we can improve this by using the same HTTP client instead.

@chenzl25
Copy link
Contributor Author

chenzl25 commented Aug 6, 2024

Great, Looking forward to it!

@chenzl25
Copy link
Contributor Author

chenzl25 commented Aug 6, 2024

I have tested duckdb and clickhouse to scan an iceberg table in MinIO as well. They can reuse a TCP connection to send HTTP requests. BTW, clickhouse can reuse connections aggressively.

DuckDB:
image

Clickhouse:
image

@chenzl25
Copy link
Contributor Author

chenzl25 commented Aug 6, 2024

However, risingwave the database use iceberg-rust seems can't reuse any TCP connection. You can see that every HTTP request is in its own TCP connection.

image

@Xuanwo
Copy link
Member

Xuanwo commented Aug 6, 2024

Thanks for the sharing. I believe this can be addressed by reusing the same http client.

@Xuanwo
Copy link
Member

Xuanwo commented Aug 6, 2024

We will need apache/opendal#4967 for this. I'm working on it now.

@chenzl25
Copy link
Contributor Author

chenzl25 commented Aug 6, 2024

We will need apache/opendal#4967 for this. I'm working on it now.

Which opendal's version do we need to bump into?

@Xuanwo
Copy link
Member

Xuanwo commented Aug 6, 2024

Which opendal's version do we need to bump into?

I'm guessing it will be included in our next release 0.49.

@chenzl25
Copy link
Contributor Author

chenzl25 commented Aug 6, 2024

I have already verified that using a global HTTP client works in this issue and the performance improvement is impressive (about 10 times faster: from 1500+ms to 150+ms).
https://github.com/risingwavelabs/iceberg-rust/blob/24bd5869f2779a8b9786b5a6e1f9723844f5a82c/crates/iceberg/src/io.rs#L438-L440

@Xuanwo
Copy link
Member

Xuanwo commented Aug 6, 2024

Also, cc @sdd, who is focusing on the iceberg benchmark now.

@sdd
Copy link
Contributor

sdd commented Aug 6, 2024

I also have some local code that reuses the same OpenDAL operator rather than creating a new one each time. I'd not submitted it yet as I wasn't sure of the validity of doing that in every possible scenario but it has been working well for me

@Xuanwo
Copy link
Member

Xuanwo commented Aug 6, 2024

I also have some local code that reuses the same OpenDAL operator rather than creating a new one each time. I'd not submitted it yet as I wasn't sure of the validity of doing that in every possible scenario but it has been working well for me

Great! I will address apache/opendal#4967 first, then consider the best approach for iceberg-rust.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants