Unicode object metadata #688

Closed
irq0 opened this issue Aug 30, 2023 · 4 comments
Labels: area/rgw-sfs (RGW & SFS related), kind/bug (Something isn't working), priority/1 (Should be fixed for next release)

Comments

@irq0
Member

irq0 commented Aug 30, 2023

S3 Test: test_object_set_get_unicode_metadata

        response = client.get_object(Bucket=bucket_name, Key='foo')
        got = response['Metadata']['meta1']
        print(got)
        print(u"Hello World\xe9")
>       assert got == u"Hello World\xe9"
E       AssertionError: assert 'Hello WorldÃ©' == 'Hello Worldé'
E         - Hello Worldé
E         ?            ^
E         + Hello WorldÃ©
E         ?            ^^

We store metadata as part of the attrs

2023-08-24T13:14:24.293+0000 7f6e19faa6c0 10 req 0 0.003333419s s3:put_obj > atomic_writer::complete accounted_size: 3, etag: 37b51d194a7513e45b56f6524f2d51f2, set_mtime: 1970-01-01T00:00:00.000000000Z, attrs: user.rgw.acl, user.rgw.etag, user.rgw.x-amz-content-sha256, user.rgw.x-amz-date, user.rgw.x-amz-meta-meta1, delete_at: 1970-01-01T00:00:00.000000000Z, if_match: NA, if_nomatch: NA

This might be an RGW issue. The test notes that RGW/RADOS fails this.

@github-project-automation github-project-automation bot moved this to Backlog in S3GW Aug 30, 2023
@irq0 irq0 added the kind/bug Something isn't working label Aug 30, 2023
@github-actions github-actions bot added the triage/waiting Waiting for triage label Aug 30, 2023
@irq0 irq0 added the priority/2 To be prioritized according to impact label Aug 30, 2023
@jhmarina jhmarina removed the triage/waiting Waiting for triage label Sep 7, 2023
@l-mb l-mb added priority/1 Should be fixed for next release area/rgw-sfs RGW & SFS related LH 1.6 and removed priority/2 To be prioritized according to impact labels Oct 17, 2023
@l-mb

l-mb commented Oct 17, 2023

Broken Unicode processing is bound to lead to problems in the field; please consider reprioritizing for the next release. @jecluis @vmoutoussamy

@irq0
Member Author

irq0 commented Oct 23, 2023

This might also be a boto3 bug/feature; the following links suggest that.

ceph/s3-tests#316
boto/boto3#478
boto/botocore#861

The docs on whether object metadata supports Unicode are a bit fuzzy: https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html.
RGW's take on that: https://tracker.ceph.com/issues/908

@jecluis jecluis added this to the v0.23.0 milestone Oct 25, 2023
@irq0
Member Author

irq0 commented Oct 30, 2023

Next step: Validate if this is us. What does the HTTP payload look like? What do we store in the attrs map?

@irq0 irq0 self-assigned this Oct 30, 2023
@irq0
Member Author

irq0 commented Nov 13, 2023

Here is what goes over the wire (collected with mitmproxy between s3-tests and s3gw).

Request:

2023-11-13 18:31:17 PUT http://localhost:7481/sfstest-7cznzosspj982n7zlu611-1/foo
                        ← 200 OK [no content] 19ms
                 Request                                 Response                                  Detail
Host:                   localhost:7480
Accept-Encoding:        identity
User-Agent:             Boto3/1.24.96 Python/3.11.5 Linux/6.5.6-1-default Botocore/1.27.96
Content-MD5:            N7UdGUp1E+RbVvZSTy1R8g==
x-amz-meta-meta1:       Hello World\xc3\xa9
X-Amz-Date:             20231113T173117Z
X-Amz-Content-SHA256:   fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9
Authorization:          AWS4-HMAC-SHA256 Credential=test/20231113//s3/aws4_request,
                        SignedHeaders=content-md5;host;x-amz-content-sha256;x-amz-date;x-amz-meta-meta1,
                        Signature=938849e3f95269998e8ad3514fa6a3605f7ad05ea6799c5876e38ecee6236c2d
amz-sdk-invocation-id:  96fb4539-5e87-4e75-82b4-2363e651eb6c
amz-sdk-request:        attempt=1
Content-Length:         3

Response to a later GET

2023-11-13 18:31:17 GET http://localhost:7481/sfstest-7cznzosspj982n7zlu611-1/foo
                        ← 200 OK binary/octet-stream 3b 44ms
                 Request                                 Response                                  Detail
Content-Length:     3
Accept-Ranges:      bytes
Last-Modified:      Mon, 13 Nov 2023 17:31:17 GMT
x-rgw-object-type:  Normal
ETag:               "37b51d194a7513e45b56f6524f2d51f2"
x-amz-meta-meta1:   Hello World\xc3\xa9
Content-Type:       binary/octet-stream
Date:               Mon, 13 Nov 2023 17:31:17 GMT
Connection:         Keep-Alive
Raw                                                                                                               [m:auto]
bar

s3gw receives 'Hello World\xc3\xa9' in the header and sends exactly that string back. The SFS side stores it as a byte string in the object attrs. The wire data confirms that the wrong result in the test does not come from our processing. Nice.

Still, where does the weird encoding come from?

With boto3 debug logging we get:

DEBUG    botocore.endpoint:endpoint.py:114 Making request for OperationModel(name=PutObject) with params: {'url_path': '/s
fstest-t4axpnukoczbc4tfy5tip-1/foo', 'query_string': {}, 'method': 'PUT', 'headers': {'User-Agent': 'Boto3/1.24.96 Python/
3.11.5 Linux/6.5.6-1-default Botocore/1.27.96', 'Content-MD5': 'N7UdGUp1E+RbVvZSTy1R8g==', 'x-amz-meta-meta1': 'Hello Worldé', 'Expect': '100-continue'}, 'body': <_io.BytesIO object at 0x7f1a031b8770>, 'url': 'http://localhost:7480/sfstest-t4ax
pnukoczbc4tfy5tip-1/foo', 'context': {'client_region': '', 'client_config': <botocore.config.Config object at 0x7f19ffc8a1
d0>, 'has_streaming_input': True, 'auth_type': None, 'signing': {'bucket': 'sfstest-t4axpnukoczbc4tfy5tip-1'}}}

DEBUG    botocore.endpoint:endpoint.py:265 Sending http request: <AWSPreparedRequest stream_output=False, method=PUT, url=
http://localhost:7480/sfstest-t4axpnukoczbc4tfy5tip-1/foo, headers={'User-Agent': b'Boto3/1.24.96 Python/3.11.5 Linux/6.5.
6-1-default Botocore/1.27.96', 'Content-MD5': b'N7UdGUp1E+RbVvZSTy1R8g==', 'x-amz-meta-meta1': b'Hello World\xc3\xa9', 'Expect': b'100-continue', 'X-Amz-Date': b'20231113T175837Z', 'X-Amz-Content-SHA256': b'fcde2b2edba56bf408601fb721fe9b5c338d1
0ee429ea04fae5511b68fbf8fb9', 'Authorization': b'AWS4-HMAC-SHA256 Credential=test/20231113//s3/aws4_request, SignedHeaders
=content-md5;host;x-amz-content-sha256;x-amz-date;x-amz-meta-meta1, Signature=551d8a42e3246ce25a32b4960ef7d0149b8b9d3a24ee
e24156ee1bcad8285f4b', 'amz-sdk-invocation-id': b'856c155f-9999-4ecd-8bf0-45726de52bc5', 'amz-sdk-request': b'attempt=1',
'Content-Length': '3'}>

Something turns the Unicode string 'Hello Worldé' into its UTF-8 encoding b'Hello World\xc3\xa9', which is also what we see on the wire. This means that the response decoder doesn't decode the UTF-8 back as we would expect.

The debug logs have:

DEBUG    botocore.parsers:parsers.py:239 Response headers: {'Content-Length': '3', 'Accept-Ranges': 'bytes', 'Last-Modifie
d': 'Mon, 13 Nov 2023 17:58:37 GMT', 'x-rgw-object-type': 'Normal', 'ETag': '"37b51d194a7513e45b56f6524f2d51f2"', 'x-amz-m
eta-meta1': 'Hello WorldÃ©', 'Content-Type': 'binary/octet-stream', 'Date': 'Mon, 13 Nov 2023 17:58:37 GMT', 'Connection':
 'Keep-Alive'}

Another data point: the weird result is exactly what we get if we decode the UTF-8 bytes as latin1:

>>> 'Hello Worldé'.encode('utf-8').decode("latin1")
'Hello WorldÃ©'
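
To make the byte-level picture explicit, here is a minimal sketch of the round trip as I understand it. The variable names are mine, and the assumption that the client decodes response header bytes as latin1 is inferred from the output above, not confirmed from the botocore source.

```python
# Sketch of the suspected round trip; names are illustrative only.
original = "Hello World\xe9"            # str the test puts into the header

wire = original.encode("utf-8")         # bytes observed on the wire
# Assumption: the client decodes header bytes as latin1 on the way back.
garbled = wire.decode("latin-1")

# The damage is reversible because latin1 maps every byte to one code point.
recovered = garbled.encode("latin-1").decode("utf-8")

print(wire)       # b'Hello World\xc3\xa9'
print(garbled)    # one character longer than the original
print(recovered == original)
```

Since latin1 is a lossless byte-to-code-point mapping, a client that knows the server stored UTF-8 could undo the mis-decoding this way, though that would be a workaround in the test rather than a fix.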

According to https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html, non-ASCII metadata is supposed to be RFC 2047 encoded. Indeed, if we pass the UTF-8 bytes (without any RFC 2047 framing) to Python's decoder, we get

>>> email.header.decode_header('Hello World\xc3\xa9')
[('Hello WorldÃ©', None)]

So, it seems to default to latin1 decode if there isn't any encoding information.
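
For comparison, a small sketch of how Python's `email.header` treats the two forms. As far as I can tell, a value without an RFC 2047 encoded-word is handed back unchanged with charset `None`, while an encoded-word comes back as bytes plus a charset that `make_header` can turn into a proper string. The variable names are mine.

```python
from email.header import decode_header, make_header

# Raw value: no "=?charset?q?...?=" framing, so nothing tells the
# decoder which charset the bytes were in.
raw = "Hello World\xc3\xa9"
parts_raw = decode_header(raw)

# Same text as an RFC 2047 encoded-word (the form shown in the AWS docs).
encoded = "=?utf-8?q?Hello_World=C3=A9?="
parts_encoded = decode_header(encoded)

# make_header joins the (bytes, charset) parts; str() yields the text.
text = str(make_header(parts_encoded))

print(parts_raw)
print(parts_encoded)
print(text)
```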

Is there an easy way to fix the test?

  1. Use put_object Metadata arg

If we change the test to send the Unicode data through the Metadata argument of put_object

# before
# def set_unicode_metadata(**kwargs):
#     kwargs['params']['headers']['x-amz-meta-meta1'] = u"Hello World\xe9"
# client.meta.events.register('before-call.s3.PutObject', set_unicode_metadata)
# after
client.put_object(Bucket=bucket_name, Key='foo', Body='bar', Metadata={'meta1': u"Hello World\xe9"})

we actually get a

E               botocore.exceptions.ParamValidationError: Parameter validation failed:
E               Non ascii characters found in S3 metadata for key "meta1", value: "Hello Worldé".
E               S3 metadata can only contain ASCII characters.

See also boto/botocore#2552
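
The effect of that validation can be approximated with a plain ASCII check. This is only a hypothetical stand-in to show which values would be rejected; `is_ascii_metadata` is my own helper and not botocore's actual code.

```python
def is_ascii_metadata(value: str) -> bool:
    """Return True if value survives an ASCII encode (hypothetical helper)."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii_metadata("Hello World"))        # plain ASCII passes
print(is_ascii_metadata("Hello World\xe9"))    # the test value is rejected
```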

  2. Send RFC 2047

     def set_unicode_metadata(**kwargs):
         kwargs['params']['headers']['x-amz-meta-meta1'] = email.charset.Charset("utf-8").header_encode("Hello World\xe9")

Nope:
E AssertionError: assert '=?utf-8?q?He...World=C3=A9?=' == 'Hello Worldé'

  3. Send latin1 encoded

This does not really work: the header dict expects a string and converts it to bytes internally, so we would run into a double-encoding problem.

     def set_unicode_metadata(**kwargs):
         kwargs['params']['headers']['x-amz-meta-meta1'] = str("Hello World\xe9".encode("latin1"))

->

E       assert "b'Hello World\\xe9'" == b'Hello World\xe9'

I guess adding a special string type that doesn't decode would work, but this is getting silly.
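
To spell out why the latin1 attempt fails: calling `str()` on a bytes object does not decode it, it produces the textual repr including the `b'...'` wrapper, and that repr is what ends up in the header. A short sketch with illustrative names:

```python
value = "Hello World\xe9"

as_bytes = value.encode("latin1")    # b'Hello World\xe9' (one byte for é)
stringified = str(as_bytes)          # repr text, NOT a decode

# Decoding is the actual inverse of encoding, but it just gives back the
# original str, which the header dict would then re-encode itself.
roundtrip = as_bytes.decode("latin1")

print(stringified)
print(roundtrip == value)
```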

Overall, I think we are doing the right thing by storing metadata as bytes without processing.

To double-check, here are the RFC 2047 encoded headers:

PUT
x-amz-meta-meta1:       =?utf-8?q?Hello_World=C3=A9?=
GET
x-amz-meta-meta1:   =?utf-8?q?Hello_World=C3=A9?=
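
Given that the gateway returns the encoded-word byte for byte, a client that wants Unicode metadata would have to encode and decode on its own side. A minimal sketch, assuming the client is free to post-process header values; `encode_meta` and `decode_meta` are hypothetical helpers, not part of boto3 or s3-tests.

```python
from email.charset import Charset
from email.header import decode_header, make_header


def encode_meta(value: str) -> str:
    """RFC 2047 encode a metadata value before sending (hypothetical helper)."""
    return Charset("utf-8").header_encode(value)


def decode_meta(header_value: str) -> str:
    """Decode an RFC 2047 metadata value read back from a response."""
    return str(make_header(decode_header(header_value)))


sent = encode_meta("Hello World\xe9")
# The server stores and returns the header value unchanged.
received = sent
print(sent)
print(decode_meta(received))
```

With this the value on the wire stays pure ASCII, so nothing depends on how either side guesses the charset of raw header bytes.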

Close?

@irq0 irq0 moved this from Backlog to In Progress 🏗️ in S3GW Nov 14, 2023
@irq0 irq0 closed this as not planned Won't fix, can't repro, duplicate, stale Nov 16, 2023
@github-project-automation github-project-automation bot moved this from In Progress 🏗️ to Done in S3GW Nov 16, 2023