Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCS get_object fails with charcode conversion error #533

Closed
mttkay opened this issue Jun 23, 2021 · 10 comments · Fixed by #534
Closed

GCS get_object fails with charcode conversion error #533

mttkay opened this issue Jun 23, 2021 · 10 comments · Fixed by #534

Comments

@mttkay
Copy link

mttkay commented Jun 23, 2021

Broken out of this discussion: https://github.com/fog/fog-google/pull/530/files#r645936386

With fog-google 1.14.0 we are seeing the following error when using the File.copy method and the target object is a binary file (in this case, a Java JAR file):

Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
/opt/gitlab/embedded/lib/ruby/2.7.0/delegate.rb:343:in `write'
/opt/gitlab/embedded/lib/ruby/2.7.0/delegate.rb:343:in `block in delegating_block'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/download.rb:97:in `block in execute_once'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1093:in `block in follow_redirect'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1250:in `block in do_get_block'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:894:in `read_body_length'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:587:in `get_body'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1247:in `do_get_block'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1019:in `block in do_request'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1133:in `protect_keep_alive_disconnected'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1014:in `do_request'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1104:in `follow_redirect'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:854:in `request'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:743:in `get'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/download.rb:77:in `execute_once'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/http_command.rb:113:in `block (2 levels) in execute'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:61:in `block in retriable'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:56:in `times'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:56:in `retriable'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/http_command.rb:110:in `block in execute'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:61:in `block in retriable'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:56:in `times'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:56:in `retriable'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/http_command.rb:102:in `execute'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/base_service.rb:366:in `execute_or_queue_command'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/generated/google/apis/storage_v1/service.rb:1914:in `get_object'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/fog-google-1.14.0/lib/fog/storage/google_json/requests/get_object.rb:58:in `get_object'

Apparently somewhere in the communication between client -> fog-google/json-storage -> google-api-ruby-client -> GCS there is a problem with how byte streams are interpreted.

Possibly related problem: googleapis/google-api-ruby-client#74

I believe what is happening here is that somewhere, that binary byte stream is being interpreted as a UTF-8 string, which it is not:

[13:47:19] ~/Downloads hexdump broken_file.jar | grep c4
0000000 4b50 0403 000a 0000 0000 4bc4 52d7 0000
0000050 494e 4546 5453 4d2e f346 cc4d 4ccb 2d4b
00000d0 000a 0000 0000 4bc4 52d7 0000 0000 0000
0000100 0000 c400 d74b 0052 0000 0000 0000 0000
0000140 0000 4bc4 52d7 0000 0000 0000 0000 0000
0000280 4cd6 d822 8d50 2b14 df43 6270 c42c df2a
0000320 02fc 4b50 0403 000a 0000 0008 4bc4 52d7
0000370 7265 6974 7365 7653 cd4f 2d4b 2c4a 4d49
00003e0 0314 000a 0000 0000 4bc4 52d7 0000 0000
0000460 c400 d74b 0052 0000 0000 0000 0000 0000
0000490 2f6e 4b50 0201 0314 000a 0000 0000 4bc4
00004e0 502f 014b 1402 0a03 0000 0000 c400 d74b
00005a0 4bc4 52d7 0b31 d1ba 0066 0000 006a 0000

Note how there are multiple sequences containing c4XX multi-byte sequences, which denote certain characters with diacritics in Unicode: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=256

Request logs of the failing request:

= Request

GET /storage/v1/b/mk-omnibus-upload-test/o/lost_and_found%2F79%2F02%2F7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451%2Fpackages%2F1%2Ffiles%2F1%2Fmy-app-1.0.jar?alt=media&alt=media HTTP/1.1
User-Agent: fog/1.14.0 google-api-ruby-client/0.50.0 Linux/5.11.0-7614-generic (gzip)
X-Goog-Api-Client: gl-ruby/2.7.2 gdcl/0.50.0
authorization: Bearer <snip>
Accept: */*
Accept-Encoding: gzip,deflate
Date: Wed, 23 Jun 2021 13:23:13 GMT
Host: storage.googleapis.com



= Response

HTTP/1.1 200 OK
X-GUploader-UploadID: ABg5-Uz9vhhuXVa8qQ07HuqtS-6XzUCaj-APwTyXLpZnN7jFud0MU4YwmWmu0RAYERhUORO0dA1dbUCi78STRKoxV3Q
ETag: COTDl/jsrfECEAE=
X-Goog-Hash: crc32c=TqHrHA==,md5=ugIGHxsWpmeASfUTe3PX+A==
Content-Type: application/octet-stream
X-Goog-Generation: 1624454593896932
X-Goog-Metageneration: 1
X-Goog-Storage-Class: STANDARD
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Content-Disposition: attachment
Date: Wed, 23 Jun 2021 13:23:14 GMT
Vary: Origin
Vary: X-Origin
Content-Length: 1550
Server: UploadServer
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"

PK
�K�R	META-INF/PK
K-*��ϳR0�3��r,J��,K-�V�+�$x���RKRSt�*��
                                      �3R|�R�����y��J3sJ���%9�I�E�yy � �]��l�]@��
y�x�PK
�K�RMETA-INF/maven/PK
�K�R!META-INF/maven/com.mycompany.app/PK
�K�R(META-INF/maven/com.mycompany.app/my-app/PK
�J�R��q%W
         /META-INF/maven/com.mycompany.app/my-app/pom.xml�Q�O�0��W���ڡ�H]b��
                                                                            �J�-e����H[6��n)�
                                                                                             ���;4�{���{��T�$��f9׷��9���p�%N���d
                                                                                                                                � ��S֚~�˲��]A��GOv�a�rm0O�! CONNECTION CLOSED
D, [2021-06-23T13:23:14.031026 #36838] DEBUG -- : Error - #<Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8>
@mttkay
Copy link
Author

mttkay commented Jun 23, 2021

Looking at https://cloud.google.com/storage/docs/json_api/v1/objects/get#response, I noticed this paragraph:

By default, this responds with an object resource in the response body. If you provide the URL parameter alt=media, then it will respond with the object data in the response body.

And in the request URL, we indeed send ?alt=media (twice for some reason.)

So GCS will send back a binary file, and it looks like we then try to parse this as JSON?

@mttkay
Copy link
Author

mttkay commented Jun 23, 2021

I verified this isn't happening in fog-google 1.13. Here is how that request looks like:

D, [2021-06-23T13:50:13.269963 #1476] DEBUG -- : Sending HTTP get https://storage.googleapis.com/storage/v1/b/mk-omnibus-upload-test/o/lost_and_found%2F79%2F02%2F7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451%2Fpackages%2F1%2Ffiles%2F1%2Fmy-app-1.0.jar?
D, [2021-06-23T13:50:13.323003 #1476] DEBUG -- : 200

It looks like request logs look completely different, not sure why.

@mttkay
Copy link
Author

mttkay commented Jun 23, 2021

I just noticed that in 1.14 the copy_object implementation has changed w.r.t. option handling: https://github.com/fog/fog-google/pull/520/files

I wonder if it's related.

@mttkay
Copy link
Author

mttkay commented Jun 23, 2021

I wonder if it's related.

Maybe not related. I tested with those changes reverted and the problem persists.

@stanhu
Copy link
Contributor

stanhu commented Jun 23, 2021

@mttkay Could you try reverting e83c58e? I wonder if we just need to open Tempfile with binmode: true.

@stanhu
Copy link
Contributor

stanhu commented Jun 23, 2021

Ok, I reproduced the problem. Indeed, we need to call buf.binmode to activate the binary stream.

@stanhu
Copy link
Contributor

stanhu commented Jun 23, 2021

#534 solves this issue.

@mttkay
Copy link
Author

mttkay commented Jun 24, 2021

#534 solves this issue.

Yes I can confirm this fixes the issue. Thanks @stanhu !

@geemus
Copy link
Member

geemus commented Jun 24, 2021

Fix released in v1.15.0.

@mttkay
Copy link
Author

mttkay commented Jun 24, 2021

Fix released in v1.15.0.

Thank you @geemus ! 🙇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants