Projects using Zarr #19
Zarr is used by MalariaGEN within several large-scale collaborative scientific projects related to the genetic study of malaria. For example, Zarr is used by the Anopheles gambiae 1000 Genomes project to store and analyse data on genotype calls derived from whole genome sequencing of malaria mosquitoes, which led to this publication. |
Xarray has recently developed a Zarr backend (pydata/xarray#1528) for reading and writing datasets to local and remote stores. In the Xarray use case, we hope to use Zarr to interface with Dask to support parallel reads/writes from cloud-based datastores. We hope to find that this approach outperforms traditional file-based workflows that use HDF5/NetCDF (pangeo-data/pangeo#48). Xarray itself is used in a wide variety of fields, from climate science to finance. @rabernat gets 99% of the credit for the Zarr implementation in Xarray. |
Thanks @jhamman. |
I'm participating in a Kaggle competition where the goal is to classify images by the camera models that took them, which is useful for forensics; for example, determining whether an image is a splice of multiple images. I'm using bcolz right now (carrays only), but when I googled whether bcolz can handle simultaneous reads from two processes, one read per process, I came across Zarr. In a Dask issue in 2016, @alimanfoo said that bcolz was not thread-safe. If this is also true of processes, then I'll likely switch to Zarr. I'm a beginner when it comes to concurrency and parallelism (which is why I used the term "simultaneous" above). Could someone tell me why thread-safety and process-safety come into play even when only reads will occur? Does it have something to do with decompression? Even the name of a concept I could google would be helpful. Thank you. |
Hi Matt, thanks for posting. AFAIK it should be safe to read a bcolz carray from multiple processes. IIRC, some issues were previously encountered when reading a bcolz carray from multiple threads, related to the interface with the c-blosc library, which handles decompression. These may have since been resolved, in which case it may also now be safe to read a bcolz carray from multiple threads, but I cannot say for sure. If you only need an array class (and not a table class) then you may be better off switching to Zarr, as Zarr has been designed for concurrency and provides some additional storage and chunking options. HTH.
|
Hi, I'm currently looking at processing Argo float data in the cloud. Argo float basic data comes as single NetCDF files. You could read that data into a big database, but I'm trying to come up with a database-less solution so that the current file processing systems see no change. @jhamman, I like Zarr and find it much easier to understand than HDF5, and more flexible. Where I am struggling a little is with Xarray: I tried just sending each cast to Zarr with Xarray's to_zarr, but this resulted in thousands of small files. Because of the set cluster size, this resulted in poor disk utilisation (on Windows), since each attribute ended up in its own directory as a single float in a 1 KB file. Instead I've been storing an Xarray object as a pickle. I know this could cause problems (security etc.), but it is working very nicely: I create large arrays of profiles (pickled objects) and it works great as a cache. |
@NickMortimer - if there are some xarray/zarr specific issues that would help your use case, I think we'd be keen to engage on the xarray issue tracker. |
@jhamman OK, I'll go there. I think it's mainly due to the file that I'm trying to process: there are lots of single-value attributes that get stored in their own files. I will talk more over at Xarray. |
Dask array has
This is a good point. It could be useful to have a webpage to point to that lists projects using zarr. For example, xarray has an "Xarray related projects" page in their sphinx docs and dask has a "Powered by Dask" section in https://dask.org/. Perhaps zarr could have something similar |
Hello @alimanfoo. In 2019, Zarr was also used to compress the data generated during self-play games in a reinforcement learning setup.
@misc{miles2019zarr,
title = {zarr-developers/zarr-python},
author = {Miles, Alistair},
copyright = {MIT},
url = {https://github.com/zarr-developers/zarr-python},
abstract = {An implementation of chunked, compressed, N-dimensional arrays for Python.},
urldate = {2019-12-19},
publisher = {Zarr Developers},
month = dec,
year = {2019},
note = {original-date: 2015-12-15T14:49:40Z}
} |
I am writing a book for Manning, tentatively called "High Performance Python for Data Analytics", and Zarr is a big part of it, especially because a good chunk of the book is about persistence efficiency. |
Working on a project that is currently exploring the adoption of a stack built around Dask and Zarr (interface via Xarray). This is envisioned as a replacement for a legacy persistent data model underlying the flagging, calibration, synthesis, visualization, and analysis of astronomical data, especially from radio interferometers (e.g., ALMA). @alimanfoo, does the Zarr project have a preferred method of citation in academic publications? |
Just to say thanks everyone for adding projects here, very much appreciated. Re citation, we don't have a preferred method, your suggestion @QueensGambit sounds good in the interim. Writing a short paper about zarr is very much on the wish list. |
I've recently added saving simulation data as Zarr datasets in xarray-simlab. https://xarray-simlab.readthedocs.io/en/latest/io_storage.html#using-zarr Everything went smoothly when working on this. Thanks @alimanfoo and all Zarr developers! (Also thanks @rabernat, @jhamman and others for Zarr integration with Xarray). I'm working now on running batches of simulations in parallel using Dask and save all that data in the same Zarr datasets (along a batch dimension), but I'm struggling with different things so I might ask for some help. |
https://google.github.io/tensorstore/tensorstore/driver/index.html#chunked-storage-drivers
|
I'm running parallel simulations using dask distributed. I'm using Zarr to create and persist input data to disk for fast reading for each simulation. The data is too large to pass between the parallel processes. I was using Feather before but I needed more complex data structures. Switching to Zarr improved simulation speed by roughly 30%. Thanks for the work on this ❤️ |
Lyft Level 5 just released a dataset in zarr format, it consists of over 1000 hours of driving data and observations. Along with it they have open sourced a codebase relating to the prediction and planning task in autonomous vehicles. Zarr is a great fit for ML problems! Links: |
Do they happen to have a tweet? May be worth retweeting |
Thanks @gzuidhof, tweeted from zarr_dev here: https://twitter.com/zarr_dev/status/1277515272270815232 |
Thanks both 🙂 |
Hi, I am building a distributed multibeam sonar processing software suite using dask/xarray/zarr. Big fan of zarr! The format is so easy to work with. Really appreciate the work of this community. |
We are going to be hosting the Solar Dynamics Observatory (SDO) Machine Learning dataset (https://iopscience.iop.org/article/10.3847/1538-4365/ab1005) on a public-facing Google Cloud bucket, stored in Zarr format! Appreciate the work here, this is fantastic! |
Hi @alimanfoo and Zarr developers, First of all, thank you for developing, maintaining and consistently improving this incredible software. We have created a memory efficient single-cell genomics data processing toolkit, called Scarf, using Zarr. We chose Zarr over HDF5 because:
|
Using it to enable country scale machine learning geology modelling from Sentinel data. |
We're using Zarr for a new visualization application, rendering chunks of data directly from cloud object store using WebGL and Mapbox. A full writeup of our approach is here: https://carbonplan.org/blog/maps-library-release |
Hi there! I am using Zarr to convert several HDF5 files coming from NASA P-3 aircraft (planes that fly inside clouds, including hurricanes). This conversion allows us to manage and access large data sets for machine learning applied to cloud microphysics. |
Hey, we've added Zarr support to HyperSpy via the new zspy format. Kudos to @CSSFrancis for adding this: hyperspy/hyperspy#2825 https://hyperspy.org/hyperspy-doc/current/user_guide/io.html#zspy-format |
Hi! At Sandoz Pharmaceuticals we use Zarr. Multi-threaded reads and writes using Dask have saved us hours of waiting! Unfortunately I can't share any further details due to trade secrets, but I am really thankful for your great work on this storage format 😄 |
Hi, zarr has become a cornerstone of all my data acquisition pipelines. The main “selling point“ for me is how easy it is to build very robust systems, with limited data loss in case of hardware/software failures. |
Hello everyone, I am a Developer Advocate from Radiant Earth, and I love seeing all of this Zarr work. If any of you are interested in hosting your Zarr data on Source Cooperative (currently in public beta at beta.source.coop), please email me at [email protected]. We would love to increase exposure to any Zarr projects that want to be seen and shared. We're also interested in hosting other cloud-optimized data formats on Source. To read more about Source, please see this blog post. |
I've used Zarr to store large-scale genetic variation data efficiently in tsinfer. I'm also involved in the sgkit project, which uses Zarr to work with genomics data more generally (and am currently writing a paper about it). Zarr is awesome 👍 |
If you are using Zarr for whatever purpose, we would love to hear about it. If you can, please leave a couple of sentences in a comment below describing who you are and what you are using Zarr for. This information helps us to get a sense of the different use cases that are important for Zarr users, and helps core developers to justify the time they spend on the project.