Began adding an R-native reader for H5AD files. #46
Conversation
Fix to read HDF5-backed arrays properly.
I don't have a lot of experience with HDF5, to be honest, but this seems to work. Interestingly, it is able to read the DE results in the … How do you think it would handle more Python-specific stuff in the file? Do we need to add any other checks? What other things still need to be done? I think at least changing …
That's an interesting question. In theory, HDF5 should be language agnostic, so we wouldn't have to worry about Python-specific stuff if anndata saves things in a standard way. In practice, of course, one could dump all sorts of random crap in there (e.g., a pickled object), in which case we're screwed. I think the current strategy of …
Yes, and testing. Lots of testing. Throw it at every H5AD file you can get your hands on and make sure it gives reasonable results. We should even run this as part of the …
The idea with the next version of anndata is that we have a registry of specifications. When you come across an element, you check in the registry of specifications to find the appropriate read method. If you can't find a method, a fallback is taken. By default, I think we'll go for a warning, but will have a strict mode to error. You could do something similar? Again, this will be much easier to do with the next version of anndata. The question is whether you want to support older file versions as well.
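The registry-of-specifications idea described above can be sketched roughly like this. This is a hypothetical Python sketch, not anndata's actual API; names like `register_reader` and `read_element` are invented for illustration:

```python
import warnings

# Hypothetical sketch: each (encoding-type, version) pair maps to a read
# method, and unknown elements fall back with a warning by default, or an
# error in strict mode.
_REGISTRY = {}

def register_reader(encoding_type, version):
    """Decorator that records a reader for one element specification."""
    def decorator(func):
        _REGISTRY[(encoding_type, version)] = func
        return func
    return decorator

@register_reader("csr_matrix", "0.1.0")
def read_csr(element):
    # A real reader would reconstruct a sparse matrix from the element here.
    return {"kind": "sparse", "raw": element}

def read_element(element, encoding_type, version, strict=False):
    """Look up a reader in the registry; warn or error if none is found."""
    reader = _REGISTRY.get((encoding_type, version))
    if reader is None:
        msg = f"no reader registered for {encoding_type!r} v{version}"
        if strict:
            raise KeyError(msg)
        warnings.warn(msg)
        return element  # fallback: pass the raw element through unchanged
    return reader(element)
```

With `strict=False`, an unrecognised element just passes through with a warning, matching the "warn by default, strict mode to error" behaviour described above.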
Btw, would you be interested in sharing a resource for this? E.g. have some figshare or cloud storage volume that holds use cases for backwards-compatibility tests?
I am definitely up for that. I was going to ask around the lab for old/weird … In more general news, I think I'm done with the other minor updates. I would still like to do #47 but I'm not sure I'll have time. The …
Sorry for the late reply. This is a good idea and I've brought it up a few times. I use this approach in DropletTestFiles, which holds versions of CellRanger output all the way back to v2; I don't trust 10X Genomics to give me stable URLs, hence my own copies. It would be fairly easy to host these files on ExperimentHub; R users can pull them down via a dedicated package, and Python users can access them through the EHub REST API. I brought this up as a possibility for David Fischer's curation project; he ended up working with the CZI on that, but we can recycle the idea for weird old files that mightn't have a home anywhere else.
That sounds good. I'm looking into what we can do for hosting, but if you've got something already set up that would be great. I'm not familiar with the organization of ExperimentHub, how much structure would we need to observe there? I think there's also a question of scope, and what constitutes "test files". For example, I've been wanting a place to host larger datasets for benchmarking. |
* master:
  * Adjust long tests
  * Update NEWS and bump version
  * Convert dgRMatrix to dgCMatrix in AnnData2SCE()
  * Adjust failing tests
  * Add scIB pancreas long tests
  * Set up long tests
  * Bump version
  * Remember and reuse name of the X matrix
  * Enable return conversion of varm
  * Add conversion checks to all slots in AnnData2SCE()
  * Bump version and update NEWS
  * Update anndata Python dependencies
  * Add compression argument to writeH5AD()
  * Styling
  * Bump version
  * Linting
  * Spellcheck
  * Adapt to changes in HDF5Array::HDF5Array()
Needs a lot more...
Not much. You can throw any files on there, provided they are appropriately decorated with metadata. Metadata is formatted like this and needs to be associated with a package. Ideally we would have a separate package like H5ADTestFiles or something appropriately named. I would offer to have scRNAseq be the temporary host package for these files, but given that we're so close to the release, it's unlikely that our request to add files to scRNAseq would be processed in time, so we might as well just wait. In the meantime, I suppose @lazappi can just gather his ragtag bunch of files and do testing locally.
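For reference, hub metadata lives in a `metadata.csv` with one row per resource. A hypothetical entry for an H5AD test file might look like the following; the column names follow the Bioconductor hub documentation, but every value here is invented purely for illustration:

```csv
Title,Description,BiocVersion,Genome,SourceType,SourceUrl,SourceVersion,Species,TaxonomyId,Coordinate_1_based,DataProvider,Maintainer,RDataClass,DispatchClass,RDataPath
example_v0.6.h5ad,Example H5AD written by anndata 0.6 for backwards-compatibility tests,3.12,NA,HDF5,https://example.com/example_v0.6.h5ad,0.6,Homo sapiens,9606,NA,example,Jane Doe <jane@example.com>,character,FilePath,H5ADTestFiles/example_v0.6.h5ad
```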
I got the dates mixed up and didn't realise that final changes for this release are due today. I don't think there is time to test this properly, but it seems to mostly work. At the moment it's available as an option, but the Python version is still the default. We could merge it and iron out the kinks later (maybe with a note/warning somewhere that it is still experimental)? @LTLA do you think it's worth doing that, or should we just put it off until the next release when it should be more stable?
SGTM
I think it makes more sense to start off with something less structured. It will help to scope out the requirements of what kinds of files we collect. I also think this should probably ultimately be handled by …
Whatever floats your boat is fine with me, as long as we have stable URLs to pull for zellkonverter's … (@lazappi in fact, it occurs to me that we could just associate zellkonverter itself with these files, thus avoiding the need for a whole other package. As long as we only …
This creates an R-native reader that directly uses rhdf5 and other Bioconductor packages to read H5AD files. The general idea is to reduce reliance on the basilisk bindings for the 95% of use cases, avoiding the various issues with Python communication, while still retaining our current basilisk infrastructure as a fallback for difficult and weird cases.
Seems to work fairly well, though some more testing is needed. Code is also required to handle `uns`.
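The native-reader-with-fallback strategy described in this PR can be sketched in miniature. This is hypothetical Python for illustration only; `read_with_fallback` and its reader arguments are invented names, not zellkonverter's actual interface:

```python
import warnings

def read_with_fallback(path, native_reader, fallback_reader):
    """Try the fast native reader first; on any failure, warn and
    delegate to the slower but more general fallback reader."""
    try:
        return native_reader(path)
    except Exception as err:
        warnings.warn(f"native reader failed ({err!r}); using fallback")
        return fallback_reader(path)
```

The appeal of this pattern is that the common case never touches the Python bridge, while anything the native reader chokes on still gets a correct (if slower) answer.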