feat(storage): Add CRC32 Checksums to Cloud Storage uploads. #1846

Merged
merged 9 commits into from
Jul 22, 2019

@jdpedrie jdpedrie commented Apr 24, 2019

This cannot be merged until google/crc32 has a stable release.

cc @frankyn.

@jdpedrie added the "do not merge" and "api: storage" labels Apr 24, 2019
@jdpedrie requested a review from dwsupplee as a code owner April 24, 2019 20:37
@googlebot added the "cla: yes" label Apr 24, 2019

codecov bot commented Apr 24, 2019

Codecov Report

No coverage uploaded for pull request base (master@2de1433).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master    #1846   +/-   ##
=========================================
  Coverage          ?   91.58%
  Complexity        ?     4414
=========================================
  Files             ?      305
  Lines             ?    13115
  Branches          ?        0
=========================================
  Hits              ?    12012
  Misses            ?     1103
  Partials          ?        0

Impacted Files                         Coverage Δ       Complexity Δ
Storage/src/Bucket.php                 97.18% <ø> (ø)   70 <0> (?)
Core/src/Upload/ResumableUploader.php  100% <100%> (ø)  21 <0> (?)
Storage/src/Connection/Rest.php        100% <100%> (ø)  46 <3> (?)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2de1433...c069209.

codecov-io commented Apr 29, 2019

Codecov Report

No coverage uploaded for pull request base (master@a0ae202).
The diff coverage is 100%.

@@           Coverage Diff            @@
##             master   #1846   +/-   ##
========================================
  Coverage          ?   92.6%
  Complexity        ?    4396
========================================
  Files             ?     304
  Lines             ?   13054
  Branches          ?       0
========================================
  Hits              ?   12089
  Misses            ?     965
  Partials          ?       0

Impacted Files                         Coverage Δ       Complexity Δ
Storage/src/Bucket.php                 97.18% <ø> (ø)   70 <0> (?)
Core/src/Upload/ResumableUploader.php  100% <100%> (ø)  21 <0> (?)
Storage/src/Connection/Rest.php        100% <100%> (ø)  53 <12> (?)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a0ae202...ed9efcc.

@jdpedrie added the "kokoro:force-run" label Apr 29, 2019
@kokoro-team removed the "kokoro:force-run" label Apr 29, 2019
@jdpedrie added the "kokoro:force-run" label Apr 29, 2019
@kokoro-team removed the "kokoro:force-run" label Apr 29, 2019
@jdpedrie
Contributor Author

Fix for Windows CI failure here: google/php-crc32#6

@jdpedrie added the "kokoro:force-run" label Apr 29, 2019
@kokoro-team removed the "kokoro:force-run" label Apr 29, 2019
@jdpedrie added the "kokoro:force-run" label May 1, 2019
@kokoro-team removed the "kokoro:force-run" label May 1, 2019
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label May 1, 2019
@dwsupplee dwsupplee added needs work This is a pull request that needs a little love. and removed 🚨 This issue needs some love. labels May 2, 2019

Contributor

@frankyn frankyn left a comment

Thanks for your patience @jdpedrie, I've added a few comments.

* disable. Please note that crc32 may have negative performance
* implications in older versions of PHP. Installation of the
* `crc32c` PHP extension is strongly advised for best
* performance. **Defaults to** `true` (md5).

Contributor

@jdpedrie given performance is improved with crc32c, why not set it as the default instead of md5?

Contributor Author

@jdpedrie jdpedrie May 3, 2019

@frankyn crc32 has serious performance drawbacks when the crc32c extension is not installed and the PHP installation does not support it out of the box. In other words, when we have to fall back to the pure-PHP implementation, we see unacceptable performance losses.

The first and second rows below are pure-PHP implementations. The third and fourth use the built-in and extension implementations, respectively.

Test Name                  Chunk Size  Hash Iterations  Throughput
Google\CRC32\PHP           256         24267            1.24 MB/s
Google\CRC32\PHPSlicedBy4  256         24707            1.26 MB/s
Google\CRC32\Builtin       256         1582535          81.03 MB/s
Google\CRC32\Google        256         1812113          92.78 MB/s

In our testing, the pure-PHP fallback took roughly 87 seconds to hash a 130 MB file.

@dwsupplee and I were talking about being a bit smarter when choosing a validation algorithm. In situations where the user does not explicitly choose one or the other, I think it should be possible to determine the best option. In PHP 7.4 and up, or when the crc32c extension is installed, we can default to crc32 without losing performance. In other cases, we can default to md5.

WDYT?

Contributor

Thank you for running the performance analysis. Could you add what each column represents? I'm not following the second and third columns.

How deterministic is the check for PHP version and extension availability? Thank you for describing this issue in the documentation; I missed it.

I want to say, "yes we should add a smart selection for md5, crc32c validation", but that will change the behavior expectations of the library on uploads. An alternative solution could be to add a new option called "auto", but we can shelve that until a later time if we receive feedback on it.

Contributor Author

Sure thing, sorry. 2nd column is the chunk size. 3rd column is the total amount of data hashed in bytes.

It's an easy check, since the php-crc32 library provides methods to determine whether an implementation is supported:

use Google\CRC32\Builtin;
use Google\CRC32\CRC32;

$supported = function_exists('crc32c') || Builtin::supports(CRC32::CASTAGNOLI);
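
A minimal sketch of how a check like this could drive the automatic choice discussed in this thread. It is an illustration, not code from this pull request: the function name `chooseValidationAlgorithm` is hypothetical, and it assumes the extension exposes a global `crc32c()` function and that a native implementation appears in `hash_algos()`. Only core PHP functions are used, so it runs without the php-crc32 library.

```php
<?php

/**
 * Hypothetical helper: resolve the upload validation option to a concrete
 * algorithm. Not part of this pull request.
 *
 * @param bool|string $validate true (pick automatically), false (disable),
 *        or an explicit 'md5' / 'crc32'.
 * @return string|false
 */
function chooseValidationAlgorithm($validate = true)
{
    // Respect an explicit choice, including "no validation".
    if ($validate === false || $validate === 'md5' || $validate === 'crc32') {
        return $validate;
    }

    // Fast CRC32C is assumed to be available when the extension function
    // exists, or when the hash extension lists a 'crc32c' algorithm.
    $fastCrc32c = function_exists('crc32c')
        || in_array('crc32c', hash_algos(), true);

    // Otherwise prefer md5 over the slow pure-PHP CRC32C fallback.
    return $fastCrc32c ? 'crc32' : 'md5';
}
```

With this shape, `true` means "pick what is best for this environment", and explicit values keep their current meaning.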

Contributor Author

I had begun drafting the changes to autoselect a signing method. It works on the assumption that the default is to autoselect, but could easily be changed later to work with a potential auto option. If you're curious about how the logic would work, you can take a look at this commit (not in this pull request)

Contributor

Thanks @jdpedrie, do hashed bytes represent how many bytes are hashed in 87 seconds?

I like the idea of using true to determine what's best for your environment. It reduces overhead for the user. Given the example PR you provided, I'd expect to not see auto and instead rely on true for auto selection.

I'm guessing you'll move to complete this PR and then follow-up with the autoselect update in a separate PR. Is this correct?

Contributor Author

Thanks @jdpedrie, do hashed bytes represent how many bytes are hashed in 87 seconds?

@frankyn I'm sorry, I misread the benchmark script before. The third column is the number of times hash_update() is called in the test.

The 87-second figure is the total time we measured when hashing a 130 MB file. We ran some manual tests to see what it looked like for a large file.

The table comes from the php-crc32 automated benchmark, which you can review here.

I'm guessing you'll move to complete this PR and then follow-up with the autoselect update in a separate PR. Is this correct?

Perhaps. I'll talk to Dave about it. This is blocked by a stable release of the php-crc32 library. Looks like that is getting close.

Contributor

@frankyn frankyn May 8, 2019

@frankyn I'm sorry, I misread the benchmark script before. The third column is the number of times hash_update() is called in the test.

Thanks, that source from php-crc32 helped me understand it better. So it counts how many hash_update() operations were able to occur within the time bounds.

It's a little confusing, so I'm going to ignore that and focus on the MB/s instead.

Perhaps. I'll talk to Dave about it. This is blocked by a stable release of the php-crc32 library. Looks like that is getting close.

sgtm, thank you!

Contributor Author

It's a little confusing, so I'm going to ignore that and focus on the MB/s instead.

Agreed lol!

  * @return array
- * @throws GoogleException
+ * @throws ServiceException

Contributor

Was this a typo in documentation or was it recently changed?

At the moment, I don't suspect it to be an issue.

Contributor Author

It was changed because ServiceException includes utilities for exposing the underlying error to users, while GoogleException does not. Since the uploader does not provide the underlying error detail, and we cannot change the message without potentially causing issues for users who may be relying on it, we swapped the type out. ServiceException extends GoogleException, so we can replace it without breaking changes.
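
To illustrate why such a swap is backwards compatible, here is a self-contained sketch using stand-in classes. The `Demo*` names are hypothetical and are not the library's real classes; the point is only that a caller catching the parent type still catches the subclass.

```php
<?php

// Stand-ins for GoogleException and ServiceException; hypothetical names.
class DemoGoogleException extends Exception
{
}

class DemoServiceException extends DemoGoogleException
{
}

// Simulates an uploader that now throws the more specific subclass.
function demoUpload()
{
    throw new DemoServiceException('Upload failed.');
}

// Existing user code written against the parent type keeps working.
function demoCaller()
{
    try {
        demoUpload();
    } catch (DemoGoogleException $e) {
        return get_class($e) . ': ' . $e->getMessage();
    }

    return 'no exception';
}
```

The message string is unchanged and only the concrete type is narrower, so callers relying on either the parent type or the message are unaffected.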

Contributor

Thanks for clarifying, this makes sense.

$crc32c->update($data->read(1048576));
}

$data->seek($pos);

Contributor

Why does it seek back to the original $pos after rewinding to generate the hash?

Contributor Author

This function is based on the guzzle hash() function. @dwsupplee do you think we need to reset the position in this situation?

Contributor

Interesting. I'm wondering why it rewinds the stream rather than expecting the user to provide a stream at the correct position.

I think it's a nicety to verify the cursor position, but it's not very clear that this happens from the surface documentation.

Contributor

I like that this mirrors what we've already been doing, and think we should keep this as is.
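
For readers following this thread, the rewind, hash, restore pattern can be sketched with plain PHP stream functions. This is an illustration only: it works on a native stream resource rather than a PSR-7 StreamInterface, the function name `hashWholeStream` is hypothetical, and `crc32b` stands in for CRC32C so the example runs on any PHP version.

```php
<?php

/**
 * Hash an entire seekable stream in 1 MiB chunks, then put the cursor back
 * where the caller left it. Hypothetical helper for illustration.
 *
 * @param resource $stream A seekable stream resource.
 * @param string $algo Any algorithm name listed by hash_algos().
 * @return string Raw binary digest.
 */
function hashWholeStream($stream, $algo = 'crc32b')
{
    $pos = ftell($stream);   // remember the caller's position
    rewind($stream);         // hash from the first byte

    $ctx = hash_init($algo);
    while (!feof($stream)) {
        hash_update($ctx, fread($stream, 1048576));
    }

    fseek($stream, $pos);    // restore the original position

    return hash_final($ctx, true);
}

// Usage: the cursor sits mid-stream before and after hashing.
$stream = fopen('php://memory', 'r+');
fwrite($stream, 'hello world');
fseek($stream, 5);

$digest = hashWholeStream($stream);
```

The digest covers the whole payload regardless of where the cursor was, and the caller's position is unchanged afterwards.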

@googleapis googleapis deleted a comment from frankyn May 3, 2019
$args['metadata']['md5Hash'] = base64_encode(Psr7\hash($args['data'], 'md5', true));
} elseif ($args['validate'] === 'crc32' && !isset($args['metadata']['crc32c'])) {
$args['metadata']['crc32c'] = $this->crcFromStream($args['data']);

Contributor Author

@jdpedrie jdpedrie May 3, 2019

@frankyn

$args['data'] = Psr7\stream_for($args['data']);

$args['data'] will always be a StreamInterface.

(apologies, I accidentally deleted the comment to which this note is replying) -_-

Contributor

lol no worries.

Gotcha, Thank you!

frankyn commented May 8, 2019

Pending the crc32c extension release. Otherwise lgtm.

@jdpedrie changed the title from "Add CRC32 Checksums to Cloud Storage uploads." to "feat(storage): Add CRC32 Checksums to Cloud Storage uploads." May 30, 2019

jdpedrie commented Jun 26, 2019

@dwsupplee let's move forward with merging this. I asked for a release of google/crc32 (it was required at dev-master, which needed to change before merge), and that is now done. I'll keep working on getting the extension to PECL. Users have the option of installing it from source, or if they're on 7.4 or higher, a pretty good implementation is available natively in PHP. It still defaults to md5 in any case, so any change is opt-in.

@jdpedrie removed the "do not merge" and "needs work" labels Jul 9, 2019
* calculated hash does not match that of the upstream server the
* upload will be rejected.
* @type bool|string $validate Indicates whether or not validation will
* be applied using md5 or crc32 hashing functionality. If

Contributor

Why don't we refer to this as crc32c?

*/
protected function crc32cExtensionLoaded()
{
return function_exists('crc32c');

Contributor

Could we use extension_loaded here instead?
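
For comparison, both checks can be sketched side by side. This assumes the extension registers itself under the name `crc32c` (the name used in the docblock earlier in this review); the helper names are hypothetical.

```php
<?php

// Asks whether a callable named crc32c() exists, regardless of who defined it.
function crc32cFunctionAvailable()
{
    return function_exists('crc32c');
}

// Asks whether an extension registered as "crc32c" is loaded.
// The extension name is an assumption here.
function crc32cExtensionAvailable()
{
    return extension_loaded('crc32c');
}
```

The first form also passes when userland code defines a `crc32c()` function, which may or may not be desirable.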

*
* @return bool
*/
protected function supportsBuiltinCrc32()

Contributor

Suggested change
- protected function supportsBuiltinCrc32()
+ protected function supportsBuiltinCrc32c()

@@ -204,7 +205,8 @@ public function testInsertObject(
  array $options,
  $expectedUploaderType,
  $expectedContentType,
- array $expectedMetadata
+ array $expectedMetadata,
+ array $metadataDoesNotHave = []

Contributor

Suggested change
- array $metadataDoesNotHave = []
+ array $metadataKeysWhichShouldNotBetSet = []

I know it is a bit verbose, but the current naming wasn't reading very clearly to me :). Does this still fall in line with what your expectation is?

Contributor Author

works for me!

{
$path = __DIR__ . '/data/5mb.txt';

$crc32 = CRC32::create(CRC32::CASTAGNOLI);

Contributor

Suggested change
- $crc32 = CRC32::create(CRC32::CASTAGNOLI);
+ $crc32c = CRC32::create(CRC32::CASTAGNOLI);

@dwsupplee dwsupplee merged commit d4faff3 into googleapis:master Jul 22, 2019
@jdpedrie jdpedrie deleted the storage-crc32 branch July 22, 2019 20:51
Labels
api: storage Issues related to the Cloud Storage API. cla: yes This human has signed the Contributor License Agreement.
7 participants