Hdf5r rownames rework #166

rcannood · 2024-02-10T07:28:08Z

Since the lack of support for boolean attributes remains a blocking issue with rhdf5, I decided to give hdf5r a try.

The current PR works a lot better, but there are still issues when writing h5ad files with anndataR and then reading them back in Python anndata. I need to write the same h5ad file from both anndataR and Python anndata and compare where they differ. I'm starting to think that in our implementation of hdf5_write_compressed, the dtype and space should not be guessed but instead be specified explicitly, depending on which write_h5ad_* function it was called from.

Luckily, our internal read_*/write_* functions stayed pretty much the same, since most rhdf5::* calls could be substituted with the corresponding hdf5r::* functions.

While making the changes, I struggled with our decision to keep obs_names / var_names separate from obs and var: they are stored inside obs and var, and whenever we make changes to obs or var, the first thing we do is throw them away.

By allowing the rownames of obs and var to be the obs_names and var_names, the code became a lot simpler.
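To illustrate the simplification, here is a minimal base-R sketch. It is not the anndataR implementation: `get_obs_names()` and `subset_obs()` are hypothetical helpers introduced only for this example. The point is that names stored as rownames stay aligned with the data frame without separate bookkeeping.

```r
# Hypothetical obs table whose rownames double as the obs_names.
obs <- data.frame(
  cell_type = c("B", "T", "NK"),
  n_genes = c(120L, 340L, 95L),
  row.names = c("cell_1", "cell_2", "cell_3")
)

# Hypothetical accessor: obs_names are simply the rownames of obs.
get_obs_names <- function(obs) rownames(obs)

# Hypothetical helper: subsetting obs carries the names along,
# so there is no second vector to keep in sync.
subset_obs <- function(obs, keep) obs[keep, , drop = FALSE]

obs_subset <- subset_obs(obs, obs$n_genes > 100L)
get_obs_names(obs_subset)
```

With a separate obs_names vector, the same subset would have to be applied twice, once to the data frame and once to the names.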

hdf5_write_compressed <- function(file, name, value, compression = c("none", "gzip", "lzf"), scalar = FALSE) {
  compression <- match.arg(compression)
  if (!is.null(dim(value))) {
    dims <- dim(value)
  } else {
    dims <- length(value)
  }
  dtype <- hdf5r::guess_dtype(value, scalar = scalar, string_len = Inf)
  space <- hdf5r::guess_space(value, dtype = dtype, chunked = FALSE)

  # TODO: lzf compression is not supported in hdf5r
  # TODO: allow the user to choose compression level
  gzip_level <- if (compression == "none") 0 else 9

  out <- file$create_dataset(
    name = name,
    dims = dims,
    gzip_level = gzip_level,
    robj = value,
    chunk_dims = NULL,
    space = space,
    dtype = dtype
  )
  # todo: create attr?

  out
}

There is currently an issue with the released version of hdf5r (hhoeflin/hdf5r#208) which was the cause of some of the strange errors in packages like MuDataSeurat. We already managed to fix the issue, but it still needs to be merged into the main branch and released.

lazappi · 2024-02-12T14:06:08Z

I think there was a reason we decided obs_names/var_names needed to be stored separately. I can't remember all the details, but I think it had to do with the fact that what you can store in the rownames of a data.frame is pretty limited, so a file written from Python may contain things that can't be stored that way. I'm not sure whether we ran into actual issues or whether it was just a theoretical problem.
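As one concrete example of such a limit, base R rejects duplicated or missing rownames on a data.frame, whereas Python anndata, as far as I know, only warns about non-unique names. This may not be the exact restriction that was discussed at the time. The sketch below uses a hypothetical helper, `try_set_rownames()`, to show which name vectors base R accepts.

```r
# Hypothetical obs table with three observations.
obs <- data.frame(n_genes = c(120L, 340L, 95L))

# Hypothetical helper: try to assign rownames and report whether
# base R accepted them. Warnings are suppressed so only errors count.
try_set_rownames <- function(df, new_names) {
  tryCatch(
    suppressWarnings({
      rownames(df) <- new_names
      "accepted"
    }),
    error = function(e) "rejected"
  )
}

unique_result <- try_set_rownames(obs, c("cell_1", "cell_2", "cell_3"))
duplicate_result <- try_set_rownames(obs, c("cell_1", "cell_1", "cell_2"))
missing_result <- try_set_rownames(obs, c("cell_1", NA, "cell_2"))

c(unique = unique_result, duplicated = duplicate_result, missing = missing_result)
```

Only the unique, non-missing names are accepted, so a reader that maps obs_names onto rownames needs some policy for files whose names violate these rules.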

I didn't write any of the code for handling compression, so I'm not sure about that. I could look into it if needed, but otherwise I don't really have an opinion there.

For me, switching from {rhdf5} to {hdf5r} would be a pretty major change, particularly as it affects whether we submit to CRAN or Bioconductor. I'm not entirely opposed, but I would want to better understand the differences and the pros and cons first. We could probably reach out to the maintainers to try to get boolean attributes added to {rhdf5}, if that's the motivation for switching.

lazappi · 2024-02-13T10:40:02Z

I asked about this, and it turns out we actually need to write an ENUM attribute, not a dedicated boolean type. This still requires some changes to {rhdf5}, but I think they are relatively minor.

In the process I realised that the way we currently write boolean values everywhere is wrong and needs to be replaced with the ENUM approach. That's my mistake, because I didn't know enough about how HDF5 works. It should be relatively easy to fix once the changes in {rhdf5} are settled. Maybe {hdf5r} already does it this way, though?

rcannood · 2024-06-27T21:13:16Z

Closing in favour of #169

lazappi and others added 30 commits December 4, 2023 17:08

Update write_h5ad_categorical

a77035b

fix styling

f739fe2

Merge remote-tracking branch 'origin/main' into write-h5ad-categoricals

4b24b2b

Merge remote-tracking branch 'origin/main' into write-h5ad-categoricals

a05d60c

Update write_h5ad_categorical

a042777

Adjust H5AD categorical write test

d273343

Merge branch 'write-h5ad-categoricals' of github.com:scverse/scversei…

bb2b2cf

…o into write-h5ad-categoricals * 'write-h5ad-categoricals' of github.com:scverse/scverseio: fix styling Update write_h5ad_categorical

Add write_h5ad_attributes function

d4dc429

Replace repeated code in individual writers

ignore cyclomatic complexity warning for write_h5ad_element warning

dbc1dc5

formatting changes

a46a234

in write_h5ad_attributes, allow file to be an open hdf5 file

fd735e4

wip

9720706

wip

d3f1d5c

substitute mentions of rhdf5 with hdf5r

731df9a

strip obs_names and var_names from framework

d0ad1ec

update

16441ff

fix tests and finalize

f01ea73

remove mentions of obs_names and var_names in the constructor

8b58bc8

make sure filenames are always unique

7d4a182

add mode to various functions

c3586e8

manually close anndatas in tests (where needed)

c7b30ca

only close when pointer is valid

e9d16fc

move match

c1dd3c4

use $close() instead of $close_all()

8d73f13

switch to different branch

e3180db

simplify test

9504e0b

gc afterclosing the adata in write_h5ad

0b1850a

guess the dtype and the space

29628a8

update docs

5fd53ad

use hhoeflin's remote

4278c02

rcannood added 2 commits January 16, 2024 10:11

bugfix in hdf5r has been released

bebb3bd

update: nevermind, the fix wasn't included in the release yet

d0371aa

rcannood mentioned this pull request Jun 27, 2024

switch from rhdf5 to hdf5r #169

Merged

rcannood closed this Jun 27, 2024

rcannood added a commit that referenced this pull request Jul 4, 2024

port rownames-related changes from #166 and #169

6980a18

rcannood mentioned this pull request Jul 4, 2024

Make rownames part of obs and var #171

Merged
