Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support previously opened filehandles as input to xopen.open() #150

Closed
wants to merge 7 commits into from

Conversation

geertvandeweyer
Copy link
Contributor

@geertvandeweyer geertvandeweyer commented Feb 8, 2024

Minimal changes were applied to support passing open binary filehandles to xopen.xopen().

The tests were successful for all supported compression types, for both reading and writing, but I might have missed some of library dependencies.

Use case would be (and tested for) :

  • opening a non-compressed connection to a remote file using smart_open :

f_smart = smart_open('s3://mybucket/myfile.fastq.gz','wb', compression='disable')

  • then pass this handle to xopen to perform efficient (de)compression:

f = xopen.xopen(f_smart,mode='wb')

I've noticed some mypy issues in the tox check. These are related to potentially conflicting object types. This could be fixed by explicitly renaming some variables to split flows depending on the input type (afaik). Let me know if you'd want this, or these issues can be ignored/whitelisted (?)

Copy link
Collaborator

@marcelm marcelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi and thanks for this! There’s PR #148 currently being worked on that also touches some of the code that this PR changes. Once #148 is merged, you’ll probably have to fix some merge conflicts. I’ve left a couple of comments already, but maybe you want to wait with addressing them until then.

src/xopen/__init__.py Outdated Show resolved Hide resolved
src/xopen/__init__.py Outdated Show resolved Hide resolved
tests/test_xopen.py Outdated Show resolved Hide resolved
tests/test_xopen.py Outdated Show resolved Hide resolved
tests/test_xopen.py Outdated Show resolved Hide resolved
tests/test_xopen.py Outdated Show resolved Hide resolved
@geertvandeweyer
Copy link
Contributor Author

ok, thanks for the feedback. I didn't rename as this isn't my code :-) But I'll keep an eye on PR 148 and revise mine afterwards!

@marcelm
Copy link
Collaborator

marcelm commented Feb 16, 2024

Hi @geertvandeweyer, PR #148 is now merged and, as expected, there are some conflicts. Let us know if you need help fixing them.

@geertvandeweyer
Copy link
Contributor Author

I've addressed most comments I think. I'd welcome some feedback on a few issues though :

  • I'm hesitant to rename input arguments. It might break downstream applications. I've renamed filename to file, but kept support for filename for now
  • I can't make the tests work correctly. the xopen_without_igzip doesn't seem to work for me. Raising the ValueError on reading pre-opened gzip through pipedCompression, is not tested.
  • I didn't touch the test_piped.py tests. Help on implementing these is welcome.

@rhpvorderman
Copy link
Collaborator

rhpvorderman commented Feb 19, 2024

I'm hesitant to rename input arguments. It might break downstream applications. I've renamed filename to file, but kept support for filename for now

Currently the xopen function is the only function that should be considered as part of the stable API. A massive refactor took place in the last few weeks, where we purposefully broke all the other interfaces to get a massive saving of 40% less code. The other interfaces were always considered as internal use only, but this is made much more explicit now. As a consequence the next version of xopen will be 2.0.0. So it is theoretically possible to break a few things.

That said, I agree with you. The python stdlib gzip.open interface also supports opening filehandles and this parameter is called filename regardless. For the sake of consistency and not breaking users this should be kept as filename.

@rhpvorderman
Copy link
Collaborator

Oh, I forgot to mention that I am willing to help with the tests as soon as you get it in a state that you feel it is ready. Please ping me when this is the case.

@geertvandeweyer
Copy link
Contributor Author

ok, thanks for helping with the tests! If the setup of allowing both file (default) and filename (with a warning) is fine, then the code is ready afaik. I've tested what I can and it seems ok :-)

If either filename or file should be supported, I can change the code again. Just let me know what you prefer.

@rhpvorderman
Copy link
Collaborator

I think we should not support both. That is unnecessary complexity.

I see builtins.open uses file: https://docs.python.org/3/library/functions.html#open

However the compression modules, which offer an interface that is closer to xopen use filename:
https://docs.python.org/3/library/gzip.html#gzip.open
https://docs.python.org/3/library/bz2.html#bz2.open
https://docs.python.org/3/library/lzma.html#lzma.open

And then there is of course the backwards compatibility argument that puts me in the filename camp.

Really, either of these names is fine, but since filename is used historically for xopen, we should stay with that. What are your thoughts @marcelm ?

@marcelm
Copy link
Collaborator

marcelm commented Feb 19, 2024

gzip.open added support for passing in file objects in python/cpython@6872101. I assume the name of the parameter was left unchanged for backwards compatibility even back then. And bz2.open and lzma.open just copy what gzip.open does ...

It’s a little bit annoying that it’s called filename, but I’d be fine with leaving it as-is. That said, we wouldn’t be breaking too many people’s code. But there could be other code not on GitHub.

(Another thought: We could also turn it into a positional-only parameter.

I don’t have a strong opinion either way. I guess filename is the safest option.

@geertvandeweyer
Copy link
Contributor Author

Removed the file/filename in favor of filename only, as discussed.

@rhpvorderman
Copy link
Collaborator

Fixing the tests was easy enough, but the type system has a lot of complaints. io.IOBase has no name attribute. Also I see that there is no clear distinction between TextIO and BinaryIO for the pipedcompressionwriter. Simply "patching" these issues won't work. This needs a fundamentally different approach in order to be successful.

@rhpvorderman
Copy link
Collaborator

I am building on top of this pr in: https://github.com/pycompression/xopen/tree/acceptfilehandles

The major change I did was that internally now everything is a binary stream. The pipedcompressionprogram now also works with reading streams: I created an extra thread that simply reads from the file and writes it to the stdin of the program.

This has numerous advantages:

  1. Internally everything is a binary stream. Easy peasy.
  2. Python has full control of the input file, even when it is piped trough another program. This is nice for instance when using "tell" to check how far the program is.

There are still some bugs I need to iron out, but it is getting there. Slowly.

I really need to work on other things today as well, so I will leave it be for now.

@geertvandeweyer
Copy link
Contributor Author

thanks for picking up on this. I'll leave it to you to finalize this then. Feel free to close this PR.

@rhpvorderman
Copy link
Collaborator

Done. See #152

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants