
feat: multipart copy #319

Merged
merged 3 commits into main from feat/multipart-copy on Jun 5, 2024
Conversation

oliverdaff (Contributor)

📢 What

What changes have been made within this PR?

This PR introduces a new S3MultipartCopier struct for handling multipart copy operations in S3 (a sketch of the underlying S3 flow follows the list below). The changes include:

  • Implementing the S3MultipartCopier struct to efficiently copy large files (greater than 5 GB) using multipart copy.
  • Providing custom error handling for multipart copy operations.
  • Adding integration tests to verify the functionality of multipart copy.
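
For context, multipart copy in S3 is a three-step protocol: CreateMultipartUpload, repeated UploadPartCopy calls over byte ranges, then CompleteMultipartUpload. The sketch below is a minimal sequential illustration assuming a recent version of the aws-sdk-s3 crate; the bucket names, keys, and fixed part size are placeholders, and this is not the PR's actual implementation (which parallelises the part copies and adds custom error handling).

```rust
use aws_sdk_s3::types::{CompletedMultipartUpload, CompletedPart};
use aws_sdk_s3::Client;

// Sketch of the S3 multipart-copy protocol; names and sizes are placeholders.
async fn multipart_copy_sketch(client: &Client) -> Result<(), Box<dyn std::error::Error>> {
    let (src_bucket, src_key) = ("src-bucket", "big-object");
    let (dst_bucket, dst_key) = ("dst-bucket", "big-object-copy");
    let part_size: i64 = 50 * 1024 * 1024; // 50 MiB, arbitrary for illustration

    // 1. Find the source size and start a multipart upload on the destination.
    let size = client
        .head_object()
        .bucket(src_bucket)
        .key(src_key)
        .send()
        .await?
        .content_length()
        .unwrap_or(0);
    let upload = client
        .create_multipart_upload()
        .bucket(dst_bucket)
        .key(dst_key)
        .send()
        .await?;
    let upload_id = upload.upload_id().expect("upload id");

    // 2. Copy each byte range server-side; part numbers are 1-based.
    let mut parts = Vec::new();
    let (mut start, mut part_number) = (0i64, 1i32);
    while start < size {
        let end = (start + part_size).min(size) - 1; // ranges are inclusive
        let copied = client
            .upload_part_copy()
            .bucket(dst_bucket)
            .key(dst_key)
            .upload_id(upload_id)
            .part_number(part_number)
            // In real code the key should be URL-encoded here.
            .copy_source(format!("{src_bucket}/{src_key}"))
            .copy_source_range(format!("bytes={start}-{end}"))
            .send()
            .await?;
        parts.push(
            CompletedPart::builder()
                .part_number(part_number)
                .e_tag(copied.copy_part_result().and_then(|r| r.e_tag()).unwrap_or_default())
                .build(),
        );
        start = end + 1;
        part_number += 1;
    }

    // 3. Stitch the copied parts together into the destination object.
    client
        .complete_multipart_upload()
        .bucket(dst_bucket)
        .key(dst_key)
        .upload_id(upload_id)
        .multipart_upload(CompletedMultipartUpload::builder().set_parts(Some(parts)).build())
        .send()
        .await?;
    Ok(())
}
```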

❓ Why

Why are we submitting this PR?

Currently, S3 copy operations only work for files up to 5 GB. Multipart copy is not only necessary for larger files but is also much faster for files under 5 GB, because the parts are copied in parallel. However, multipart copy makes more API requests, potentially increasing costs. This PR addresses the need for efficient, scalable, and reliable copying of large objects in S3.

🚦 Depends on

Are there any other PRs that need to be merged first?

  • No, this PR is self-contained and does not depend on any other PRs.

😟 Concerns

  • No major concerns at this time. However, it's important to note the potential increase in costs due to the higher number of API requests made by multipart copy operations.

📝 Notes

@oliverdaff oliverdaff marked this pull request as ready for review June 2, 2024 20:37
@oliverdaff oliverdaff requested a review from a team as a code owner June 2, 2024 20:37
@oliverdaff oliverdaff changed the title from Feat/multipart copy to feat: multipart copy Jun 3, 2024
@timleslie (Contributor) left a comment

This looks great. The biggest thing missing, I think, is an example in the published docs showing how to use it. Other than that, the other suggestions are mostly cosmetic.

///
/// The size of the source object in bytes.
#[instrument(skip(client))]
pub async fn get_source_size(
@timleslie (Contributor) · Jun 3, 2024

suggestion: We don't expose this in the public API, so we can remove the pub here 👍

Suggested change
- pub async fn get_source_size(
+ async fn get_source_size(

Comment on lines 211 to 212

let parts: Vec<_> = (1..=part_count)
@timleslie (Contributor)

suggestion 1: We can avoid the .collect() into a Vec here, since stream::iter will accept the iterator itself.

suggestion 2: Seeing a 1-based iterator is a bit unexpected, so a quick comment would help the reader know that it's intentional, and the reason behind it.

Suggested change
- let parts: Vec<_> = (1..=part_count)
+ // Use 1-based indexing to match expectations of `CopyUploadPart`.
+ let parts = (1..=part_count)

Comment on lines 229 to 230
})
.collect();
@timleslie (Contributor)

suggestion: See above.

Suggested change
- })
- .collect();
+ });

Comment on lines 232 to 234
let completed_parts: Vec<CompletedPart> = stream::iter(parts)
.buffer_unordered(self.max_concurrent_uploads)
.try_collect::<Vec<CompletedPart>>()
@timleslie (Contributor)

suggestion: (personal preference). We can avoid the explicit types here, for less busy code.

Suggested change
- let completed_parts: Vec<CompletedPart> = stream::iter(parts)
-     .buffer_unordered(self.max_concurrent_uploads)
-     .try_collect::<Vec<CompletedPart>>()
+ let completed_parts = stream::iter(parts)
+     .buffer_unordered(self.max_concurrent_uploads)
+     .try_collect()

@oliverdaff (Contributor, Author)

I prefer the explicit types to help reduce the cognitive load on the reader of the code. However, we can remove one of the types and still achieve clarity:

let completed_parts: Vec<CompletedPart> = stream::iter(parts)
    .buffer_unordered(self.max_concurrent_uploads)
    .try_collect()

let source_size =
    get_source_size(&self.client, self.source.bucket(), self.source.key()).await?;
let part_count =
    ((f64::value_from(source_size)? / f64::value_from(self.part_size)?).ceil()).approx()?;
@timleslie (Contributor)

suggestion: This piece of code is begging for an edge case bug. Would be good to have explicit tests for either side of the edges to make 100% sure it does the right thing. Either as additional test cases of the full function, or factored out into a helper method with individual tests.

@oliverdaff (Contributor, Author)

Refactored this code and exercised it with property tests and type-checked values.
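
For readers following along, a property test for the part-count arithmetic might look like the sketch below. The `part_count` helper and the `proptest` dependency are assumptions for illustration; the PR's actual refactor uses type-checked values and may differ.

```rust
/// Hypothetical helper (name assumed): number of parts needed to cover
/// `size` bytes with parts of `part_size` bytes, i.e. ceiling division.
fn part_count(size: u64, part_size: u64) -> u64 {
    // Integer ceiling division sidesteps the f64 rounding edge cases
    // the original float-based version was exposed to.
    size.div_ceil(part_size)
}

#[cfg(test)]
mod tests {
    use super::*;
    use proptest::prelude::*;

    proptest! {
        #[test]
        fn parts_cover_object_exactly(
            size in 1u64..(1u64 << 40),
            part_size in 1u64..(1u64 << 30),
        ) {
            let n = part_count(size, part_size);
            // Enough parts to cover every byte of the source...
            prop_assert!(n * part_size >= size);
            // ...and the final part is never entirely empty.
            prop_assert!((n - 1) * part_size < size);
        }
    }
}
```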

    byte_range: (i64, i64),
}

/// A struct to handle S3 multipart copy operations.
@timleslie (Contributor)

suggestion: Could we add some more docs on how to use this feature? It looks like we get some docs for free from TypedBuilder, but they're not great.

[Screenshot: the docs generated by TypedBuilder, 2024-06-03]

In particular, there's no documentation of the individual fields, their types, or how they should be used in the builder. I guess we could work around this by making the fields pub and adding docstrings, or else perhaps an example, like below (but formatted correctly 😅).

In particular, it would be helpful to have some guidance on the part_size and max_concurrent_uploads fields, and when/why the user might want to set them.

Suggested change
- /// A struct to handle S3 multipart copy operations.
+ /// A struct to handle S3 multipart copy operations.
+ ///
+ /// let dest = S3Object::new(test_bucket, dst_key);
+ /// let copier = S3MultipartCopier::builder()
+ ///     .client(client.clone())
+ ///     .source(src)
+ ///     .destination(dest.clone())
+ ///     .part_size((5 * MIB).try_into()?)
+ ///     .max_concurrent_uploads(2)
+ ///     .build();
+ /// copier.send().await?;

You can run the following command to generate the docs locally

RUSTDOCFLAGS="--cfg docsrs" cargo +nightly doc --open --all-features

@timleslie (Contributor)

> However, it's important to note the potential increase in costs due to the higher number of API requests made by multipart copy operations.

Would be useful to leave a comment for the user about this in this documentation block 👍

    Ok(head_object.content_length().unwrap_or(0))
}

const DEFAULT_COPY_PART_SIZE: i64 = 50 * MIB as i64;
@timleslie (Contributor)

question: Is the choice of 50 MiB here significant? Could we leave a small comment justifying this choice? If it's mostly arbitrary then we can just mention that 👍
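
For illustration, the requested justification could be as small as a doc comment on the constant. The wording below is a hypothetical sketch, not the PR's actual rationale; the 5 MiB to 5 GiB bounds are S3's hard part-size limits, the rest is assumption.

```rust
/// Default part size for multipart copies. 50 MiB is a mostly arbitrary
/// middle ground: large enough to keep the number of API requests (and
/// therefore cost) down, small enough to give useful parallelism on
/// multi-gigabyte objects. S3 requires every part except the last to be
/// between 5 MiB and 5 GiB.
const DEFAULT_COPY_PART_SIZE: i64 = 50 * MIB as i64;
```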

@timleslie (Contributor)

Could you also update the CHANGELOG.md with an entry under the next release (0.13.4) to make life easier for whoever makes the next release 🙏

Comment on lines 29 to 36

pub fn bucket(&self) -> &str {
    &self.bucket
}

pub fn key(&self) -> &str {
    &self.key
}
@timleslie (Contributor)

question: What's the motivation for the additional methods here? My preference would be to keep the API as small as possible and have the user access the fields directly for this relatively simple struct.

Suggested change
- pub fn bucket(&self) -> &str {
-     &self.bucket
- }
-
- pub fn key(&self) -> &str {
-     &self.key
- }

@oliverdaff (Contributor, Author)

Exposing values through getters rather than directly accessing public fields ensures better encapsulation. However, removing the pub fields would break backward compatibility, so we must retain them.

@oliverdaff oliverdaff force-pushed the feat/multipart-copy branch from fd04b81 to dd4cd42 on June 4, 2024 06:59
@oliverdaff (Contributor, Author)

Review comments addressed. This includes adding property tests to explore edge cases.

    part_size = self.part_size.as_ref(),
);

if source_size.as_ref() <= self.part_size.as_ref() {
@timleslie (Contributor)

observation: There's a slight contradiction here with line 488. We use source <= part here, but source < part (but spelt backwards) at 488 🤔 I don't know if it makes much difference which one we change. If the block size is equal to the part size, then I guess an atomic copy will be more efficient? In that case we should change 488 to be source <= part. I'd also flip line 488 so that the inequality is spelt in the same order as 466, just to make it easier to see how these two lines relate to each other.
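
To make the suggested alignment concrete, a sketch follows; the field accessors are taken from the quoted snippet above, while `atomic_copy` is a hypothetical method name.

```rust
// Sketch only: spelling both checks the same way, with equality handled
// by the atomic path since a single CopyObject suffices in that case.
if source_size.as_ref() <= self.part_size.as_ref() {
    // Around line 466: the object fits in one part, so copy it atomically.
    return self.atomic_copy().await; // `atomic_copy` is a hypothetical name
}
// Around line 488: the multipart path is only reached when the source is
// strictly larger than one part, written in the same order as above.
debug_assert!(source_size.as_ref() > self.part_size.as_ref());
```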

@timleslie (Contributor) left a comment

Looks good, one non-blocking observation that we might want to clean up, but other than that this is good to go 👍 The second pass clean up with additional tests really took this code up a notch, great work 🚀

@oliverdaff oliverdaff added this pull request to the merge queue Jun 5, 2024
Merged via the queue into main with commit ab36152 Jun 5, 2024
3 checks passed
@oliverdaff oliverdaff deleted the feat/multipart-copy branch June 5, 2024 00:46