Unicode object metadata #688
Comments
Broken unicode processing is bound to lead to problems in the field, please consider reprioritizing for the next release. @jecluis @vmoutoussamy |
This might also be a boto3 bug/feature; the following link suggests that: ceph/s3-tests#316. The docs on whether object metadata supports unicode are a bit fuzzy: https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html. |
Next step: Validate if this is us. What does the HTTP payload look like? What do we store in the attrs map? |
Here is what is going over the wire (collected with a mitmproxy between s3 tests and s3gw). Request:
Response to a later GET
S3GW receives 'Hello World\xc3\xa9' via the header and sends that exact string back. The sfs side stores the string as part of the object attrs as a byte string. The wire data confirms that the wrong result in the test does not come from our processing. Nice. Still, where does the weird encoding come from? With boto3 debug logging we get:
Something turns the unicode string The debug logs have:
Another data point: The weird decoding is what we get if we decode utf-8 with latin1:
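A minimal sketch of that effect in plain Python, independent of boto3 or s3gw. The variable names are only illustrative; the string is the one used by the test.

```python
# The value the test wants to store as metadata.
original = "Hello World\xe9"

# What goes over the wire when the value is sent as UTF-8.
wire_bytes = original.encode("utf-8")

# Decoding those UTF-8 bytes as latin-1 maps each byte to one character,
# so the two-byte sequence for the accented "e" becomes two characters.
misdecoded = wire_bytes.decode("latin-1")

print(repr(wire_bytes))
print(misdecoded)

# The mistake is reversible: re-encode as latin-1, then decode as UTF-8.
recovered = misdecoded.encode("latin-1").decode("utf-8")
```

Since latin-1 maps every byte value to exactly one code point, no information is lost in the wrong decode, which is why the round trip back to the original string works.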
According to https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html non-ascii metadata is supposed to be RFC 2047 encoded. Indeed if we pass the encoded UTF-8 to python's decoder we get
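As a rough illustration of RFC 2047 handling, here is a sketch using Python's standard `email.header` module. Whether boto3 or S3 use this exact code path is an assumption; the sketch only shows that an encoded word carries its charset and therefore decodes unambiguously.

```python
from email.header import Header, decode_header, make_header

value = "Hello World\xe9"

# Produce an RFC 2047 encoded word; the result is pure ASCII and
# names its charset, e.g. "=?utf-8?...?=".
encoded = Header(value, charset="utf-8").encode()

# decode_header returns (bytes, charset) pairs; make_header reassembles
# them into a Header whose str() is the decoded unicode text.
decoded = str(make_header(decode_header(encoded)))

print(encoded)
print(decoded)
```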
So, it seems to default to latin1 decode if there isn't any encoding information. Is there an easy way to fix the test?
If we change the test to use the put object metadata attribute to send the unicode data
# before
# def set_unicode_metadata(**kwargs):
#     kwargs['params']['headers']['x-amz-meta-meta1'] = u"Hello World\xe9"
#
# client.meta.events.register('before-call.s3.PutObject', set_unicode_metadata)
# after
client.put_object(Bucket=bucket_name, Key='foo', Body='bar', Metadata={'meta1': u"Hello World\xe9"})
we actually get a
See also boto/botocore#2552
Nope:
Does not really work, the header dict expects a string and converts internally to bytes. So we would get into a double encoding problem.
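The double encoding can be sketched in plain Python. This assumes the header layer treats the already-encoded bytes as a one-character-per-byte string and then encodes that string as UTF-8 again; it is an illustration of the problem, not a trace of the actual library code.

```python
value = "Hello World\xe9"

# First encoding, done by us before handing the value to the header dict.
once = value.encode("utf-8")

# Assumed behaviour: the bytes are seen as a string (one char per byte)
# and encoded to UTF-8 a second time on the way out.
twice = once.decode("latin-1").encode("utf-8")

print(repr(once))
print(repr(twice))
```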
->
I guess adding a special string type that doesn't decode would work, but this is getting silly. In total I think we are doing the right thing by just storing metadata as bytes without processing. To double check, here are the RFC 2047 encoded headers:
Close? |
S3 Test: test_object_set_get_unicode_metadata
We store metadata as part of the attrs
This might be an RGW issue. The test notes that RGW/RADOS fails this.