-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(storage): Add CRC32 Checksums to Cloud Storage uploads. #1846
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1846 +/- ##
=========================================
Coverage ? 91.58%
Complexity ? 4414
=========================================
Files ? 305
Lines ? 13115
Branches ? 0
=========================================
Hits ? 12012
Misses ? 1103
Partials ? 0
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #1846 +/- ##
========================================
Coverage ? 92.6%
Complexity ? 4396
========================================
Files ? 304
Lines ? 13054
Branches ? 0
========================================
Hits ? 12089
Misses ? 965
Partials ? 0
Continue to review full report at Codecov.
|
Fix for Windows CI failure here: google/php-crc32#6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience @jdpedrie, I've added a few comments.
Storage/src/Bucket.php
Outdated
* disable. Please note that crc32 may have negative performance | ||
* implications in older versions of PHP. Installation of the | ||
* `crc32c` PHP extension is strongly advised for best | ||
* performance. **Defaults to** `true` (md5). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jdpedrie given performance is improved with crc32c, why not set it as the default instead of md5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@frankyn crc32 has serious performance drawbacks in situations where the crc32c extension is not installed and where the php installation does not support it out of the box. In other words, when we have to fall back to the pure-PHP implementation, we see unacceptable performance losses:
The first and second rows below are pure-PHP implementations. The third and fourth are using the extension and built-in implementations.
Test Name Chunk Size Hash Iterations Throughput/s
Google\CRC32\PHP 256 24267 1.24 MB/s
Google\CRC32\PHPSlicedBy4 256 24707 1.26 MB/s
Google\CRC32\Builtin 256 1582535 81.03 MB/s
Google\CRC32\Google 256 1812113 92.78 MB/s
This corresponds to roughly 87 seconds for a 130mb file in our testing.
@dwsupplee and I were talking about being a bit smarter when choosing a validation algorithm. In situations where the user does not explicitly choose one or the other, I think it should be possible to determine the best option. In PHP 7.4 and up, or when the crc32c extension is installed, we can default to crc32 without losing performance. In other cases, we can default to md5.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for running performance analysis, could you add what each column represents? I'm not following the second and third columns.
How deterministic is the check for PHP version and extension availability? Thank you for describing this issue in the documentation I missed it.
I want to say, "yes we should add a smart selection for md5, crc32c validation", but that will change the behavior expectations of the library on uploads. An alternative solution could be to add a new option called "auto", but we can shelf that until a later time if we receive feedback on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing, sorry. 2nd column is the chunk size. 3rd column is the total amount of data hashed in bytes.
It's an easy check. Since the php-crc32 library provides methods to determine whether an implementation is supported.
use Google\CRC32\Builtin;
use Google\CRC32\CRC32;
$supported = function_exists('crc32c') || Builtin::supports(CRC32::CASTAGNOLI);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had begun drafting the changes to autoselect a signing method. It works on the assumption that the default is to autoselect, but could easily be changed later to work with a potential auto
option. If you're curious about how the logic would work, you can take a look at this commit (not in this pull request)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jdpedrie, do hashed bytes represent how many bytes are hashed in 87 second?
I like the idea of using true
to determine what's best for your environment. It reduces overhead for the user. Given the example PR you provided, I'd expect to not see auto
and instead rely on true
for auto selection.
I'm guessing you'll move to complete this PR and then follow-up with the autoselect update in a separate PR. Is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jdpedrie, do hashed bytes represent how many bytes are hashed in 87 second?
@frankyn I'm sorry, I misread the benchmark script before. The third column is the number of times hash_update()
is called in the test.
The 87 seconds number comes from the total time for benchmarking of hashing a 130mb file. We ran some manual tests to see what it looked like for a large file.
The table comes from the php-crc32 automated benchmark, which you can review here.
I'm guessing you'll move to complete this PR and then follow-up with the autoselect update in a separate PR. Is this correct?
Perhaps. I'll talk to Dave about it. This is blocked by a stable release of the php-crc32 library. Looks like that is getting close.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@frankyn I'm sorry, I misread the benchmark script before. The third column is the number of times hash_update() is called in the test.
Thanks that source from php-crc32 helped me understand it better. So it's used to determine how many hash_update() operations occur or were able to occur given the time bounds.
It's a little confusing, so I'm going to ignore that and focus on the Mbps instead.
Perhaps. I'll talk to Dave about it. This is blocked by a stable release of the php-crc32 library. Looks like that is getting close.
sgtm, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little confusing, so I'm going to ignore that and focus on the Mbps instead.
Agreed lol!
* @return array | ||
* @throws GoogleException | ||
* @throws ServiceException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this a typo in documentation or was it recently changed?
At the moment, I don't suspect it to be an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was changed because ServiceException includes utilities for exposing the underlying error to users, while GoogleException does not. Since the uploader does not provide the underlying error detail, and we cannot change the message without potentially causing issues for users who may be relying on it, we swapped the type out. ServiceException extends GoogleException, so we can replace it without breaking changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying, this makes sense.
$crc32c->update($data->read(1048576)); | ||
} | ||
|
||
$data->seek($pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it seek back to the original $pos
after rewinding to generate the hash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is based on the guzzle hash()
function. @dwsupplee do you think we need to reset the position in this situation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I'm wondering why does it rewind a stream and not expect a user to provide a stream at the correct position?
I think it's a nicety to verify the cursor position, but it's not very clear that this happens from the surface documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that this mirrors what we've already been doing, and think we should keep this as is.
$args['metadata']['md5Hash'] = base64_encode(Psr7\hash($args['data'], 'md5', true)); | ||
} elseif ($args['validate'] === 'crc32' && !isset($args['metadata']['crc32c'])) { | ||
$args['metadata']['crc32c'] = $this->crcFromStream($args['data']); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$args['data'] = Psr7\stream_for($args['data']); |
$args['data']
will always be a StreamInterface.
(apologies, I accidentally deleted the comment to which this note is replying) -_-
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol no worries.
Gotcha, Thank you!
Pending the crc32c extension release. Otherwise lgtm. |
@dwsupplee let's move forward with merging this. I asked for a release of |
Storage/src/Bucket.php
Outdated
* calculated hash does not match that of the upstream server the | ||
* upload will be rejected. | ||
* @type bool|string $validate Indicates whether or not validation will | ||
* be applied using md5 or crc32 hashing functionality. If |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we refer to this as crc32c
?
$crc32c->update($data->read(1048576)); | ||
} | ||
|
||
$data->seek($pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that this mirrors what we've already been doing, and think we should keep this as is.
Storage/src/Connection/Rest.php
Outdated
*/ | ||
protected function crc32cExtensionLoaded() | ||
{ | ||
return function_exists('crc32c'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use extension_loaded
here instead?
Storage/src/Connection/Rest.php
Outdated
* | ||
* @return bool | ||
*/ | ||
protected function supportsBuiltinCrc32() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
protected function supportsBuiltinCrc32() | |
protected function supportsBuiltinCrc32c() |
@@ -204,7 +205,8 @@ public function testInsertObject( | |||
array $options, | |||
$expectedUploaderType, | |||
$expectedContentType, | |||
array $expectedMetadata | |||
array $expectedMetadata, | |||
array $metadataDoesNotHave = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
array $metadataDoesNotHave = [] | |
array $metadataKeysWhichShouldNotBetSet = [] |
I know it is a bit verbose, but the current naming wasn't reading very clearly to me :). Does this still fall in line with what your expectation is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works for me!
{ | ||
$path = __DIR__ . '/data/5mb.txt'; | ||
|
||
$crc32 = CRC32::create(CRC32::CASTAGNOLI); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$crc32 = CRC32::create(CRC32::CASTAGNOLI); | |
$crc32c = CRC32::create(CRC32::CASTAGNOLI); |
This cannot be merged until
google/crc32
has a stable release.cc @frankyn.