Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opening GCS files via smart_open.open fails to add the name field to the file object #505

Closed
todor-markov opened this issue Jun 4, 2020 · 3 comments · Fixed by #506
Closed

Comments

@todor-markov
Copy link
Contributor

todor-markov commented Jun 4, 2020

Problem description

Currently the gcs.open function returns a Reader or Writer object, but does not add a name field to the object. This creates a problem when trying to open compressed files in GCS, since smart open's compression wrapper relies on the name field to choose the decompression algorithm. Thankfully, the problem is relatively straightforward to fix by slightly updating the open function in gcs.py:

Here is the current version of gcs.open:

if mode == constants.READ_BINARY:
        return Reader(
            bucket_id,
            blob_id,
            buffer_size=buffer_size,
            line_terminator=constants.BINARY_NEWLINE,
            client=client,
        )
    elif mode == constants.WRITE_BINARY:
        return Writer(
            bucket_id,
            blob_id,
            min_part_size=min_part_size,
            client=client,
        )
    else:
        raise NotImplementedError('GCS support for mode %r not implemented' % mode)

Here is the updated version that would fix the issue:

if mode == constants.READ_BINARY:
        fileobj = Reader(
            bucket_id,
            blob_id,
            buffer_size=buffer_size,
            line_terminator=constants.BINARY_NEWLINE,
            client=client,
        )
    elif mode == constants.WRITE_BINARY:
        fileobj = Writer(
            bucket_id,
            blob_id,
            min_part_size=min_part_size,
            client=client,
        )
    else:
        raise NotImplementedError('GCS support for mode %r not implemented' % mode)

    fileobj.name = blob_id
    return fileobj

I tried to push a branch & open a pull request that makes the fix, but it looks like I don't have access rights, so I made this issue instead.

@mpenkov
Copy link
Collaborator

mpenkov commented Jun 5, 2020

Thank you for reporting this.

I tried to push a branch & open a pull request that makes the fix, but it looks like I don't have access rights, so I made this issue instead.

I think you may be pushing to the wrong location. In github, the way to do what you want to do is:

  1. Fork the repo, so you have a personal copy of the repository under your account.
  2. Push the branch to your fork. Because the fork belongs to you, there are no permission issues.
  3. Make a PR
  4. PROFIT

For more, see this link or Google around.

Do you want to go ahead and try making a PR by yourself? If you haven't done this kind of thing before, it's a great opportunity to start.

@petedannemann
Copy link
Contributor

Looks like azure suffers from this as well. I didn't look at other transport mechanisms but hopefully they all have them. What is odd is that our compress / decompress tests don't seem to catch this

@mpenkov
Copy link
Collaborator

mpenkov commented Jun 6, 2020

Yeah, sounds like we have a hole in our test coverage.

@petedannemann Are you able to look into it? If not, I will make a separate ticket and deal with it in time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants