
feat: multipart copy #319

Merged
merged 3 commits into main from feat/multipart-copy on Jun 5, 2024
Conversation

oliverdaff (Contributor)

📢 What

What changes have been made within this PR?

This PR introduces a new S3MultipartCopier struct for handling multipart copy operations in S3 (a sketch of the underlying S3 flow follows the list below). The changes include:

  • Implementing the S3MultipartCopier struct to efficiently copy large files (greater than 5 GB) using multipart copy.
  • Providing custom error handling for multipart copy operations.
  • Adding integration tests to verify the functionality of multipart copy.
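
For context, multipart copy in S3 is a three-step protocol: CreateMultipartUpload, repeated UploadPartCopy calls over byte ranges, then CompleteMultipartUpload. The sketch below is a minimal sequential illustration assuming a recent version of the aws-sdk-s3 crate; the bucket names, keys, and fixed part size are placeholders, and this is not the PR's actual implementation (which parallelises the part copies and adds custom error handling).

```rust
use aws_sdk_s3::types::{CompletedMultipartUpload, CompletedPart};
use aws_sdk_s3::Client;

// Sketch of the S3 multipart-copy protocol; names and sizes are placeholders.
async fn multipart_copy_sketch(client: &Client) -> Result<(), Box<dyn std::error::Error>> {
    let (src_bucket, src_key) = ("src-bucket", "big-object");
    let (dst_bucket, dst_key) = ("dst-bucket", "big-object-copy");
    let part_size: i64 = 50 * 1024 * 1024; // 50 MiB, arbitrary for illustration

    // 1. Find the source size and start a multipart upload on the destination.
    let size = client
        .head_object()
        .bucket(src_bucket)
        .key(src_key)
        .send()
        .await?
        .content_length()
        .unwrap_or(0);
    let upload = client
        .create_multipart_upload()
        .bucket(dst_bucket)
        .key(dst_key)
        .send()
        .await?;
    let upload_id = upload.upload_id().expect("upload id");

    // 2. Copy each byte range server-side; part numbers are 1-based.
    let mut parts = Vec::new();
    let (mut start, mut part_number) = (0i64, 1i32);
    while start < size {
        let end = (start + part_size).min(size) - 1; // ranges are inclusive
        let copied = client
            .upload_part_copy()
            .bucket(dst_bucket)
            .key(dst_key)
            .upload_id(upload_id)
            .part_number(part_number)
            // In real code the key should be URL-encoded here.
            .copy_source(format!("{src_bucket}/{src_key}"))
            .copy_source_range(format!("bytes={start}-{end}"))
            .send()
            .await?;
        parts.push(
            CompletedPart::builder()
                .part_number(part_number)
                .e_tag(copied.copy_part_result().and_then(|r| r.e_tag()).unwrap_or_default())
                .build(),
        );
        start = end + 1;
        part_number += 1;
    }

    // 3. Stitch the copied parts together into the destination object.
    client
        .complete_multipart_upload()
        .bucket(dst_bucket)
        .key(dst_key)
        .upload_id(upload_id)
        .multipart_upload(CompletedMultipartUpload::builder().set_parts(Some(parts)).build())
        .send()
        .await?;
    Ok(())
}
```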

❓ Why

Why are we submitting this PR?

Currently, S3 copy operations only work for files up to 5 GB. Multipart copy is not only necessary for larger files but is also much faster for files under 5 GB, because the parts are copied in parallel. However, multipart copy makes more API requests, potentially increasing costs. This PR addresses the need for efficient, scalable, and reliable copying of large objects in S3.

🚦 Depends on

Are there any other PRs that need to be merged first?

  • No, this PR is self-contained and does not depend on any other PRs.

😟 Concerns

  • No major concerns at this time. However, it's important to note the potential increase in costs due to the higher number of API requests made by multipart copy operations.

📝 Notes

@oliverdaff oliverdaff marked this pull request as ready for review June 2, 2024 20:37
@oliverdaff oliverdaff requested a review from a team as a code owner June 2, 2024 20:37
@oliverdaff oliverdaff changed the title from Feat/multipart copy to feat: multipart copy Jun 3, 2024
@timleslie (Contributor) left a comment

This looks great. The biggest thing missing, I think, is an example in the published docs showing how to use it. Other than that, the other suggestions are mostly cosmetic.

///
/// The size of the source object in bytes.
#[instrument(skip(client))]
pub async fn get_source_size(
@timleslie (Contributor) · Jun 3, 2024

suggestion: We don't expose this in the public API, so we can remove the pub here 👍

Suggested change
- pub async fn get_source_size(
+ async fn get_source_size(

Comment on lines 211 to 212

let parts: Vec<_> = (1..=part_count)
@timleslie (Contributor)

suggestion 1: We can avoid the .collect() into a Vec here, since stream::iter will accept the iterator itself.

suggestion 2: Seeing a 1-based iterator is a bit unexpected, so a quick comment would help the reader know that it's intentional, and the reason behind it.

Suggested change
- let parts: Vec<_> = (1..=part_count)
+ // Use 1-based indexing to match expectations of `CopyUploadPart`.
+ let parts = (1..=part_count)

Comment on lines 229 to 230
})
.collect();
@timleslie (Contributor)

suggestion: See above.

Suggested change
- })
- .collect();
+ });

Comment on lines 232 to 234
let completed_parts: Vec<CompletedPart> = stream::iter(parts)
.buffer_unordered(self.max_concurrent_uploads)
.try_collect::<Vec<CompletedPart>>()
@timleslie (Contributor)

suggestion: (personal preference). We can avoid the explicit types here, for less busy code.

Suggested change
- let completed_parts: Vec<CompletedPart> = stream::iter(parts)
-     .buffer_unordered(self.max_concurrent_uploads)
-     .try_collect::<Vec<CompletedPart>>()
+ let completed_parts = stream::iter(parts)
+     .buffer_unordered(self.max_concurrent_uploads)
+     .try_collect()

@oliverdaff (Contributor, Author)

I prefer the explicit types to help reduce the cognitive load on the reader of the code. However, we can remove one of the types and still achieve clarity:

let completed_parts: Vec<CompletedPart> = stream::iter(parts)
    .buffer_unordered(self.max_concurrent_uploads)
    .try_collect()

let source_size =
    get_source_size(&self.client, self.source.bucket(), self.source.key()).await?;
let part_count =
    ((f64::value_from(source_size)? / f64::value_from(self.part_size)?).ceil()).approx()?;
@timleslie (Contributor)

suggestion: This piece of code is begging for an edge case bug. Would be good to have explicit tests for either side of the edges to make 100% sure it does the right thing. Either as additional test cases of the full function, or factored out into a helper method with individual tests.

@oliverdaff (Contributor, Author)

Refactored this code and exercised it with property tests and type-checked values.
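
For readers following along, a property test for the part-count arithmetic might look like the sketch below. The `part_count` helper and the `proptest` dependency are assumptions for illustration; the PR's actual refactor uses type-checked values and may differ.

```rust
/// Hypothetical helper (name assumed): number of parts needed to cover
/// `size` bytes with parts of `part_size` bytes, i.e. ceiling division.
fn part_count(size: u64, part_size: u64) -> u64 {
    // Integer ceiling division sidesteps the f64 rounding edge cases
    // the original float-based version was exposed to.
    size.div_ceil(part_size)
}

#[cfg(test)]
mod tests {
    use super::*;
    use proptest::prelude::*;

    proptest! {
        #[test]
        fn parts_cover_object_exactly(
            size in 1u64..(1u64 << 40),
            part_size in 1u64..(1u64 << 30),
        ) {
            let n = part_count(size, part_size);
            // Enough parts to cover every byte of the source...
            prop_assert!(n * part_size >= size);
            // ...and the final part is never entirely empty.
            prop_assert!((n - 1) * part_size < size);
        }
    }
}
```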

    byte_range: (i64, i64),
}

/// A struct to handle S3 multipart copy operations.
@timleslie (Contributor)

suggestion: Could we add some more docs on how to use this feature? It looks like we get some docs for free from TypedBuilder, but they're not great.

[Screenshot: the docs generated by TypedBuilder, 2024-06-03]

In particular, there's no documentation of the individual fields, their types, or how they should be used in the builder. I guess we could work around this by making the fields pub and adding docstrings, or else perhaps an example, like below (but formatted correctly 😅).

In particular, it would be helpful to have some guidance on the part_size and max_concurrent_uploads fields, and when/why the user might want to set them.

Suggested change
- /// A struct to handle S3 multipart copy operations.
+ /// A struct to handle S3 multipart copy operations.
+ ///
+ /// let dest = S3Object::new(test_bucket, dst_key);
+ /// let copier = S3MultipartCopier::builder()
+ ///     .client(client.clone())
+ ///     .source(src)
+ ///     .destination(dest.clone())
+ ///     .part_size((5 * MIB).try_into()?)
+ ///     .max_concurrent_uploads(2)
+ ///     .build();
+ /// copier.send().await?;

You can run the following command to generate the docs locally

RUSTDOCFLAGS="--cfg docsrs" cargo +nightly doc --open --all-features

@timleslie (Contributor)

> However, it's important to note the potential increase in costs due to the higher number of API requests made by multipart copy operations.

Would be useful to leave a comment for the user about this in this documentation block 👍

    Ok(head_object.content_length().unwrap_or(0))
}

const DEFAULT_COPY_PART_SIZE: i64 = 50 * MIB as i64;
@timleslie (Contributor)

question: Is the choice of 50 MiB here significant? Could we leave a small comment justifying this choice? If it's mostly arbitrary then we can just mention that 👍
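
For illustration, the requested justification could be as small as a doc comment on the constant. The wording below is a hypothetical sketch, not the PR's actual rationale; the 5 MiB to 5 GiB bounds are S3's hard part-size limits, the rest is assumption.

```rust
/// Default part size for multipart copies. 50 MiB is a mostly arbitrary
/// middle ground: large enough to keep the number of API requests (and
/// therefore cost) down, small enough to give useful parallelism on
/// multi-gigabyte objects. S3 requires every part except the last to be
/// between 5 MiB and 5 GiB.
const DEFAULT_COPY_PART_SIZE: i64 = 50 * MIB as i64;
```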

@timleslie (Contributor)

Could you also update the CHANGELOG.md with an entry under the next release (0.13.4) to make life easier for whoever makes the next release 🙏

Comment on lines 29 to 36

pub fn bucket(&self) -> &str {
    &self.bucket
}

pub fn key(&self) -> &str {
    &self.key
}
@timleslie (Contributor)

question: What's the motivation for the additional methods here? My preference would be to keep the API as small as possible and have the user access the fields directly for this relatively simple struct.

Suggested change
- pub fn bucket(&self) -> &str {
-     &self.bucket
- }
-
- pub fn key(&self) -> &str {
-     &self.key
- }

@oliverdaff (Contributor, Author)

Exposing values through getters rather than directly accessing public fields ensures better encapsulation. However, removing the pub fields would break backward compatibility, so we must retain them.

@oliverdaff oliverdaff force-pushed the feat/multipart-copy branch from fd04b81 to dd4cd42 on June 4, 2024 06:59
@oliverdaff (Contributor, Author)

Review comments addressed. This includes adding property tests to explore edge cases.

    part_size = self.part_size.as_ref(),
);

if source_size.as_ref() <= self.part_size.as_ref() {
@timleslie (Contributor)

observation: There's a slight contradiction here with line 488. We use source <= part here, but source < part (but spelt backwards) at 488 🤔 I don't know if it makes much difference which one we change. If the block size is equal to the part size, then I guess an atomic copy will be more efficient? In that case we should change 488 to be source <= part. I'd also flip line 488 so that the inequality is spelt in the same order as 466, just to make it easier to see how these two lines relate to each other.
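
To make the suggested alignment concrete, a sketch follows; the field accessors are taken from the quoted snippet above, while `atomic_copy` is a hypothetical method name.

```rust
// Sketch only: spelling both checks the same way, with equality handled
// by the atomic path since a single CopyObject suffices in that case.
if source_size.as_ref() <= self.part_size.as_ref() {
    // Around line 466: the object fits in one part, so copy it atomically.
    return self.atomic_copy().await; // `atomic_copy` is a hypothetical name
}
// Around line 488: the multipart path is only reached when the source is
// strictly larger than one part, written in the same order as above.
debug_assert!(source_size.as_ref() > self.part_size.as_ref());
```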

@timleslie (Contributor) left a comment

Looks good, one non-blocking observation that we might want to clean up, but other than that this is good to go 👍 The second pass clean up with additional tests really took this code up a notch, great work 🚀

@oliverdaff oliverdaff added this pull request to the merge queue Jun 5, 2024
Merged via the queue into main with commit ab36152 Jun 5, 2024
3 checks passed
@oliverdaff oliverdaff deleted the feat/multipart-copy branch June 5, 2024 00:46