Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hdf5r rownames rework #166

Closed
wants to merge 32 commits into from
Closed

Conversation

rcannood
Copy link
Collaborator

Hey @lazappi !

Since the lack of support for boolean attributes remains a blocking issue with rhdf5, I decided to give hdf5r a try.

The current PR seems to work a lot better, still has some issues when writing h5ad files with anndataR and then trying to read them out again in Python anndata. I need to do some experiments with writing the same h5ad file from anndataR and Python anndata and seeing where the differences lie. I'm starting to think that in our implementation of hdf5_write_compressed, the dtype and space should not be guessed but instead be manually specified depending on which write_h5ad_* function it was called from.


Luckily, our internal read_*/write_* functions stayed pretty much the same since all rhdf5::* could mostly be substituted with the corresponding hdf5r::* functions.


While making the changes, I was struggling with our decision to keep the obs_names / var_names separate from obs and var, because they are stored inside the obs and var and when making changes to the obs and var the first thing we do is throw it away.

By allowing the rownames of the obs and var to be the obs_names and var_names, the code did get simplified a lot.


hdf5_write_compressed <- function(file, name, value, compression = c("none", "gzip", "lzf"), scalar = FALSE) {
  compression <- match.arg(compression)
  if (!is.null(dim(value))) {
    dims <- dim(value)
  } else {
    dims <- length(value)
  }
  dtype <- hdf5r::guess_dtype(value, scalar = scalar, string_len = Inf)
  space <- hdf5r::guess_space(value, dtype = dtype, chunked = FALSE)

  # TODO: lzf compression is not supported in hdf5r
  # TODO: allow the user to choose compression level
  gzip_level <- if (compression == "none") 0 else 9

  out <- file$create_dataset(
    name = name,
    dims = dims,
    gzip_level = gzip_level,
    robj = value,
    chunk_dims = NULL,
    space = space,
    dtype = dtype
  )
  # todo: create attr?

  out
}

There is currently an issue with the released version of hdf5r (hhoeflin/hdf5r#208) which was the cause of some of the strange errors in packages like MuDataSeurat. We already managed to fix the issue, but it still needs to be merged into the main branch and released.

lazappi and others added 30 commits December 4, 2023 17:08
…o into write-h5ad-categoricals

* 'write-h5ad-categoricals' of github.com:scverse/scverseio:
  fix styling
  Update write_h5ad_categorical
Replace repeated code in individual writers
@lazappi
Copy link
Collaborator

lazappi commented Feb 12, 2024

I think there was a reason we decided obs_names/var_names needed to be stored separately. I can't remember all the details but I think it had something to do with the fact that what you can store in rownames of a data.frame is pretty limited and it is possible that there are things in a file written from Python that can't be stored that way. Not sure if we ran into actual issues or if it was just a theoretical problem.


I didn't write any of the code for handling compression so I'm not sure about that. I could maybe look if needed but otherwise I don't really have an opinion there.


For me, switching from {rhfd5} to {hdf5r} would be a pretty major change, particularly as it affects if we submit to CRAN/Bioconductor. I'm not entirely opposed but I would want to better understand the differences and pro/cons first. We could probably reach out to the maintainers to try and get the boolean attributes added to {rhdf5} if that's the motivation for switching.

@lazappi
Copy link
Collaborator

lazappi commented Feb 13, 2024

I asked about this and turns out we actually need to write an ENUM attribute not a special boolean thing. This still requires some changes to {rhdf5} but I think relatively minor.

In the process I realised how we currently write boolean values anywhere is wrong and we need to replace it with the ENUM approach. That's my mistake because I didn't know enough about how HDF5 works. Should be relatively easy to fix once the changes in {rhdf5} are settled. Maybe {hdf5r} already does it this way though?

@rcannood
Copy link
Collaborator Author

Closing in favour of #169

@rcannood rcannood closed this Jun 27, 2024
rcannood added a commit that referenced this pull request Jul 4, 2024
rcannood added a commit that referenced this pull request Jul 8, 2024
* port rownames-related changes from #166 and #169

* run styler

* fix test

* style

* style

* fix docs

* fix documentation

* simplify helper functions

* simplify test

* add more documentation to AnnData

* fix docs
LouiseDck pushed a commit that referenced this pull request Oct 3, 2024
* port rownames-related changes from #166 and #169

* run styler

* fix test

* style

* style

* fix docs

* fix documentation

* simplify helper functions

* simplify test

* add more documentation to AnnData

* fix docs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants