Began adding an R-native reader for H5AD files. #46
Conversation
Fix to read HDF5-backed arrays properly.
I don't have a lot of experience with HDF5, to be honest, but this seems to work. Interestingly, it is able to read the DE results in the … How do you think it would handle more Python-specific stuff in the file? Do we need to add any other checks? What other things still need to be done? I think at least changing …
That's an interesting question. In theory, HDF5 should be language agnostic, so we wouldn't have to worry about Python-specific stuff if anndata saves things in a standard way. In practice, of course, one could dump all sorts of random crap in there (e.g., a pickled object), in which case we're screwed. I think the current strategy of …
Yes, and testing. Lots of testing. Throw it at every H5AD file you can get your hands on and make sure it gives reasonable results. We should even run this as part of the …
The idea with the next version of anndata is that we have a registry of specifications. When you come across an element, you check in the registry of specifications to find the appropriate read method. If you can't find a method, a fallback is taken. By default, I think we'll go for a warning, but will have a strict mode to error. You could do something similar? Again, this will be much easier to do with the next version of anndata. The question is whether you want to support older file versions as well.
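The registry-of-specifications idea described above can be sketched roughly like this. This is a hypothetical Python sketch, not anndata's actual API; names like `register_reader` and `read_element` are invented for illustration:

```python
import warnings

# Hypothetical sketch: each (encoding-type, version) pair maps to a read
# method, and unknown elements fall back with a warning by default, or an
# error in strict mode.
_REGISTRY = {}

def register_reader(encoding_type, version):
    """Decorator that records a reader for one element specification."""
    def decorator(func):
        _REGISTRY[(encoding_type, version)] = func
        return func
    return decorator

@register_reader("csr_matrix", "0.1.0")
def read_csr(element):
    # A real reader would reconstruct a sparse matrix from the element here.
    return {"kind": "sparse", "raw": element}

def read_element(element, encoding_type, version, strict=False):
    """Look up a reader in the registry; warn or error if none is found."""
    reader = _REGISTRY.get((encoding_type, version))
    if reader is None:
        msg = f"no reader registered for {encoding_type!r} v{version}"
        if strict:
            raise KeyError(msg)
        warnings.warn(msg)
        return element  # fallback: pass the raw element through unchanged
    return reader(element)
```

With `strict=False`, an unrecognised element just passes through with a warning, matching the "warn by default, strict mode to error" behaviour described above.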
Btw, would you be interested in sharing a resource for this? E.g. have some figshare or cloud storage volume that holds use cases for backwards-compatibility tests?
I am definitely up for that. I was going to ask around the lab for old/weird … In more general news, I think I'm done with the other minor updates. I would still like to do #47 but I'm not sure I'll have time. The …
Sorry for the late reply. This is a good idea and I've brought it up a few times. I use this approach in DropletTestFiles, which holds versions of CellRanger output all the way back to v2; I don't trust 10X Genomics to give me stable URLs, hence my own copies. It would be fairly easy to host these files on ExperimentHub; R users can pull them down via a dedicated package, and Python users can access them through the EHub REST API. I brought this up as a possibility for David Fischer's curation project; he ended up working with the CZI on that, but we can recycle the idea for weird old files that mightn't have a home anywhere else.
That sounds good. I'm looking into what we can do for hosting, but if you've got something already set up that would be great. I'm not familiar with the organization of ExperimentHub, how much structure would we need to observe there? I think there's also a question of scope, and what constitutes "test files". For example, I've been wanting a place to host larger datasets for benchmarking. |
* master:
  * Adjust long tests
  * Update NEWS and bump version
  * Convert dgRMatrix to dgCMatrix in AnnData2SCE()
  * Adjust failing tests
  * Add scIB pancreas long tests
  * Set up long tests
  * Bump version
  * Remember and reuse name of the X matrix
  * Enable return conversion of varm
  * Add conversion checks to all slots in AnnData2SCE()
  * Bump version and update NEWS
  * Update anndata Python dependencies
  * Add compression argument to writeH5AD()
  * Styling
  * Bump version
  * Linting
  * Spellcheck
  * Adapt to changes in HDF5Array::HDF5Array()
Needs a lot more...
Not much. You can throw any files on there, provided they are appropriately decorated with metadata. Metadata is formatted like this and needs to be associated with a package. Ideally we would have a separate package like H5ADTestFiles or something appropriately named. I would offer to have scRNAseq be the temporary host package for these files, but given that we're so close to the release, it's unlikely that our request to add files to scRNAseq would be processed in time, so we might as well just wait. In the meantime, I suppose @lazappi can just gather his ragtag bunch of files and do testing locally.
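For reference, hub metadata lives in a `metadata.csv` with one row per resource. A hypothetical entry for an H5AD test file might look like the following; the column names follow the Bioconductor hub documentation, but every value here is invented purely for illustration:

```csv
Title,Description,BiocVersion,Genome,SourceType,SourceUrl,SourceVersion,Species,TaxonomyId,Coordinate_1_based,DataProvider,Maintainer,RDataClass,DispatchClass,RDataPath
example_v0.6.h5ad,Example H5AD written by anndata 0.6 for backwards-compatibility tests,3.12,NA,HDF5,https://example.com/example_v0.6.h5ad,0.6,Homo sapiens,9606,NA,example,Jane Doe <jane@example.com>,character,FilePath,H5ADTestFiles/example_v0.6.h5ad
```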
I got the dates mixed up and didn't realise that final changes for this release are due today. I don't think there is time to test this properly, but it seems to mostly work. At the moment it's available as an option, but the Python version is still the default. We could merge it and iron out the kinks later (maybe with a note/warning somewhere that it is still experimental)? @LTLA do you think it's worth doing that, or should we just put it off until the next release when it should be more stable?
SGTM
I think it makes more sense to start off with something less structured. It will help to scope out the requirements of what kinds of files we collect. I also think this should probably ultimately be handled by …
Whatever floats your boat is fine with me, as long as we have stable URLs to pull for zellkonverter's … (@lazappi in fact, it occurs to me that we could just associate zellkonverter itself with these files, thus avoiding the need for a whole other package. As long as we only …
This creates an R-native reader that directly uses rhdf5 and other Bioconductor packages to read H5AD files. The general idea is to reduce reliance on the basilisk bindings for the 95% of use cases, avoiding the various issues with Python communication, while still retaining our current basilisk infrastructure as a fallback for difficult and weird cases.
Seems to work fairly well, though some more testing is needed. Code is also required to handle `uns`.
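The native-reader-with-fallback strategy described in this PR can be sketched in miniature. This is hypothetical Python for illustration only; `read_with_fallback` and its reader arguments are invented names, not zellkonverter's actual interface:

```python
import warnings

def read_with_fallback(path, native_reader, fallback_reader):
    """Try the fast native reader first; on any failure, warn and
    delegate to the slower but more general fallback reader."""
    try:
        return native_reader(path)
    except Exception as err:
        warnings.warn(f"native reader failed ({err!r}); using fallback")
        return fallback_reader(path)
```

The appeal of this pattern is that the common case never touches the Python bridge, while anything the native reader chokes on still gets a correct (if slower) answer.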