
Add variable chunk size #35

Merged: 7 commits merged into legend-exp:dev on Feb 19, 2024
Conversation

@apmypb (Contributor) commented Sep 14, 2023

We can now specify the sizes of chunks to write to the file.
I set the default to contiguous data, meaning that if someone wants the data to be chunked, they have to write the data with an additional chunk_size argument:

data = rand(100)
chunk_size = 100
lhd["data", chunk_size] = data

where lhd is an LHDataStore.

codecov bot commented Sep 14, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (e2db37b) 52.23% compared to head (73265f6) 54.41%.

❗ Current head 73265f6 differs from pull request most recent head 0985046. Consider uploading reports for the commit 0985046 to get more accurate results.

Files          Patch %   Lines
src/types.jl   97.40%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #35      +/-   ##
==========================================
+ Coverage   52.23%   54.41%   +2.18%     
==========================================
  Files           6        6              
  Lines         605      634      +29     
==========================================
+ Hits          316      345      +29     
  Misses        289      289              


@apmypb apmypb changed the title Dev Add variable chunk size Sep 14, 2023
@fhagemann (Contributor)

Is it possible to read in old files with this, or might this result in errors?

@fhagemann (Contributor) commented Sep 15, 2023

I tested your changes using this script:

using LegendHDF5IO
using Test

using ArraysOfArrays
using DataFrames
using RadiationDetectorSignals
using StatsBase
using TypedTables
using Unitful


filename = "LegendHDF5IO_test.lh5"

A = randn(1000)
B = fill(true, 100)
F = Float64(6.5)
W = RDWaveform(range(0,100,length=100), rand(100))
U = fill(1.0u"mm", 100)
V = 1.0u"pF"
AW = ArrayOfRDWaveforms([W])
AS = ArrayOfSimilarArrays([deepcopy(U) for _ in 1:5])
VV = VectorOfVectors([fill(1.0u"mV", i) for i in 1:5])
hist = fit(Histogram, A)
nt = (A = A, B = B, F = F)
S = "TestString"
SYM = :Symbol

# Create the LH5 file
LHDataStore(filename, "w") do h
    h["Array"] = A
    h["BoolArray"] = B
    h["Histogram"] = hist
    h["Float"] = F
    h["ArrayOfWaveforms"] = AW
    h["ArrayWithUnits"] = U
    h["ArrayOfSimilarArrays"] = AS
    h["VectorOfVectors"] = VV
    h["Value"] = V
    h["NamedTuple"] = nt
    h["String"] = S
    # h["Waveform"] = W
    # h["Symbol"] = SYM
end

# read the data back in
@testset "LHDataStore read-in" begin
    LHDataStore(filename, "r") do h_in
        @test h_in["Array"] == A
        @test h_in["Array"][:] isa typeof(A)
        @test h_in["BoolArray"] == B
        @test_broken h_in["BoolArray"][:] isa typeof(B) # BitVector != Vector{Bool}
        @test h_in["Histogram"] == hist
        @test h_in["Histogram"] isa typeof(hist)
        @test h_in["Float"] == F
        @test h_in["Float"] isa typeof(F)
        @test_broken h_in["ArrayOfWaveforms"] == AW # requires a Colon somehow
        @test h_in["ArrayOfWaveforms"][:] == AW
        @test h_in["ArrayOfWaveforms"][:] isa typeof(AW)
        @test h_in["ArrayWithUnits"] == U
        @test h_in["ArrayWithUnits"][:] isa typeof(U)
        @test h_in["ArrayOfSimilarArrays"] == AS
        @test h_in["ArrayOfSimilarArrays"][:] isa typeof(AS)
        @test h_in["VectorOfVectors"] == VV
        @test h_in["VectorOfVectors"][:] isa typeof(VV)
        @test h_in["Value"] == V
        @test h_in["Value"] isa typeof(V)
        @test h_in["NamedTuple"] == nt
        @test_broken h_in["NamedTuple"] isa typeof(nt) # What's the command to read the Arrays into memory?
        @test h_in["String"] == S
        @test h_in["String"] isa typeof(S)
        # @test h["Waveform"] == W
        # @test h_in["Symbol"] == SYM
    end
end;


# tests for add_column
t = Table(a = rand(100), b = rand(100)*u"pF")
c = rand(100)*u"mm"
t_new = let df = DataFrame(t); df.c = c; Table(df) end # add the column c to t

LHDataStore(filename, "cw") do h
    h["Table"] = t
    LegendHDF5IO.add_column(h, "Table", (c = c,))
    @test h["Table"] != t_new # probably because it's cached ?
end
    
LHDataStore(filename) do h
    @test h["Table"] == t_new
end

Two things to note:

  • LHDataStore is not able to write Symbols (needed to save SSD simulations) or single RDWaveforms. This is something that we might want to add later.
  • When adding a column, one needs to close the file and re-open it in order to get the new table. It seems like there is some caching going on. This should be written down somewhere (maybe in the docstring of add_column?).

Maybe add (part of) these tests to the test scripts, so that new functionality will always be tested before being merged.

@oschulz (Contributor) commented Sep 15, 2023

> LHDataStore is not able to write Symbols (needed to save SSD simulations) or single RDWaveforms. This is something that we might want to add later.

@apmypb can you add this quickly?

@fhagemann (Contributor)

> LHDataStore is not able to write Symbols (needed to save SSD simulations) or single RDWaveforms. This is something that we might want to add later.
>
> @apmypb can you add this quickly?

Never mind, the methods for writing Symbols are already in there (I had tested the tagged version of LegendHDF5IO, not the newest commit); everything with Symbols works fine.

For single RDWaveforms, I don't see an urgent need to include this in this PR.

@oschulz (Contributor) commented Sep 19, 2023

I took another look at this: we shouldn't set the chunk size as an argument of setindex!. It breaks the typical setindex! API, and users will often write a whole table at once rather than each dataset separately anyway, so they won't be able to exert fine-grained per-column control that way.

Chunk size selection should become a property of LHDataStore instead, e.g. LHDataStore(filename, "w"; chunk_size = SOME_DEFAULT_SIZE). If users set chunk_size to nothing, the data should be written unchunked (appending is then not possible, of course, but on the other hand reading can use memory mapping, which would be useful for ML datasets and the like).
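For illustration, a minimal sketch of that proposed interface (hypothetical, not the code in this PR; chunk_size as a store-level keyword):

using LegendHDF5IO

# hypothetical store-level chunk_size, as proposed above
LHDataStore("chunked.lh5", "w"; chunk_size = 65536) do lhd
    lhd["data"] = rand(10^6)    # written chunked, can be appended to later
end

LHDataStore("contiguous.lh5", "w"; chunk_size = nothing) do lhd
    lhd["data"] = rand(10^6)    # unchunked: no appending, but memory-mappable reads
end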

@apmypb (Contributor, Author) commented Sep 20, 2023

Should I set the default to nothing, or to some value like 10_000?
I would set it to nothing, since we don't necessarily know the structure of our users' data, so in the end only the user knows the chunk_size suitable for them.
Also, the way I changed the setindex! functions, you can also specify a chunk_size for a Table such that all columns inherit the given chunk_size; see the sketch below. Should I still make chunk_size a property of LHDataStore?
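A minimal sketch of that Table behavior with the setindex! variant currently in this PR (assuming a TypedTables.Table):

using LegendHDF5IO, TypedTables

t = Table(a = rand(1000), b = rand(1000))

LHDataStore("table.lh5", "w") do lhd
    lhd["tbl", 100] = t          # every column of t inherits chunk size 100
    lhd["plain"] = rand(1000)    # no chunk_size given: contiguous (the default)
end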

@oschulz (Contributor) commented Sep 25, 2023

Hm, let's set the default chunk size to 65536 bytes for now.

@oschulz (Contributor) commented Dec 13, 2023

I think the best approach is to determine the chunk size based on the "first write", i.e. set the chunk size to the size of the arrays, columns, etc. passed to the setindex! that ultimately creates the datasets. It'll be the responsibility of the user to write the data in suitable chunks.

LHDataStore can have a simple property usechunks::Bool to control whether to use chunking or not. Chunking should be disabled by default, since we'll only use it in certain cases, e.g. when writing large tables that contain waveforms.
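A minimal sketch of this proposal (the usechunks keyword and the first-write rule are hypothetical here, not yet the code in this PR):

using LegendHDF5IO

lh5open("grow.lh5", "w"; usechunks = true) do lhd
    lhd["data"] = rand(1000)          # first write: chunk size taken from the array size
    append!(lhd["data"], rand(500))   # later writes append along the last dimension
end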

@apmypb (Contributor, Author) commented Jan 18, 2024

I don't think that LHDataStore should have this property, since it only manages LH5Arrays. Having this property would also imply that we expect every HDF5.Dataset inside this LHDataStore to be saved in chunks, which might not be the case for simple text info, for example. Keeping the current state would give users the freedom to chunk only the datasets they really need.

My question would be whether we should give the user the option to choose the size of chunks and the dimension along which the dataset can be extended, or whether only the last dimension of an array gets chunked (current state). Since the chunking feature only really exists for the purpose of being able to append additional data, we might change the current append! implementations to something like cat, with a dims argument determining the dimension along which the dataset can be extended.

@oschulz (Contributor) commented Jan 18, 2024

> I don't think that LHDataStore should have this property, since it only manages LH5Arrays. Having this property would also imply that we expect every HDF5.Dataset inside this LHDataStore to be saved in chunks, which might not be the case for simple text info, for example.

That's no problem, we'd only apply chunking where it makes sense anyway.

> My question would be whether we should give the user the option to choose the size of chunks and the dimension along which the dataset can be extended, or whether only the last dimension of an array gets chunked (current state).

Growing only along the last dimension should be enough even long-term. So we should also stick with append!.
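For example, growing a dataset along its last dimension (a sketch using this PR's setindex! syntax, assuming append! on the returned LH5Array extends the on-disk dataset as discussed):

using LegendHDF5IO

lh5open("append.lh5", "cw") do lhd
    lhd["data", 100] = rand(100)     # chunked: extendable along the last axis
    append!(lhd["data"], rand(50))   # the on-disk dataset grows to length 150
end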

added `extend_datastore` and `reduce_datastore` functions

In addition to adding and removing columns, it adapts the datatype of the HDF5 group accordingly.
@apmypb (Contributor, Author) commented Jan 19, 2024

The chunking feature should work now. The usechunks attribute, however, is not used at all.
I also changed the add_column function to extend_datastore and added reduce_datastore; these add and remove columns from tables, or elements from NamedTuples, on disk, changing the datatype accordingly.
I don't understand why some tests seem to fail on Windows, though...

-LHDataStore(path, "cw") do f
-    f["tmp"] = nt
+lh5open(path, "cw") do f
+    f["tmp", 50] = nt
Contributor:

This is the syntax to write with a chunk size of 50?

Contributor (Author):

The integer value in setindex! determines the chunk size of the last dimension of this array, or if it's a NamedTuple, of each of the arrays inside it.
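For example (a minimal sketch of the syntax under discussion):

using LegendHDF5IO

nt = (a = rand(200), b = rand(200))

lh5open("nt.lh5", "cw") do f
    f["tmp", 50] = nt   # each array in nt gets chunk size 50 along its last dimension
end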

@@ -305,6 +303,7 @@ julia> lhf["new"] = x
"""
mutable struct LHDataStore <: AbstractDict{String,Any}
data_store::HDF5.H5DataStore
usechunks::Bool
Contributor:

Is this ever used...

src/types.jl Outdated
@@ -345,131 +344,129 @@ end
Base.show(io::IO, m::MIME"text/plain", lh::LHDataStore) = HDF5.show_tree(io, lh.data_store)
Base.show(io::IO, lh::LHDataStore) = show(io, MIME"text/plain"(), lh)

Base.setindex!(output::LHDataStore, v, i, chunk_size=nothing) = begin
output.usechunks = !isnothing(chunk_size)
Contributor:

... other than here (where it is set)?

@oschulz (Contributor) commented Jan 29, 2024

@apmypb is this good to merge from your side?

CC @theHenks

@fhagemann (Contributor)

Do we care about the failing windows tests though?

@oschulz (Contributor) commented Jan 29, 2024

The Windows tests should definitely be fixed; they're passing on main, so there shouldn't be any problem with HDF5.jl itself.

@fhagemann (Contributor)

The problem with Windows was that functions like joinpath and splitdir are defined with a path separator of \\ (see here), whereas the paths in an HDF5 file are always defined with /, no matter the operating system.
I have modified the code such that LegendHDF5IO.jl explicitly always uses /.
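A sketch of the kind of change (h5joinpath and h5splitdir are hypothetical helper names; the actual fix in the PR may differ):

# joinpath/splitdir use the OS path separator ("\\" on Windows), but paths
# inside an HDF5 file always use "/". OS-independent helpers could look like:
h5joinpath(parts::AbstractString...) = join(parts, "/")

function h5splitdir(path::AbstractString)
    i = findlast('/', path)
    isnothing(i) ? ("", path) : (path[1:prevind(path, i)], path[nextind(path, i):end])
end

h5joinpath("segBEGe", "samples")   # "segBEGe/samples" on any OS
h5splitdir("segBEGe/samples")      # ("segBEGe", "samples")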

@fhagemann fhagemann added the enhancement New feature or request label Jan 31, 2024
@@ -239,12 +239,10 @@ end

function setdatatype!(output::Union{HDF5.Dataset, HDF5.H5DataStore}, datatype::Type)
dtstr = datatype_to_string(datatype)
# @debug "setdatatype!($(_infostring(output)), \"$dtstr\")"
hasattribute(output, :datatype) && HDF5.delete_attribute(output, "datatype")
Contributor:

@apmypb in what situations would we overwrite a datatype attribute?

src/types.jl Outdated
@@ -345,131 +344,129 @@ end
Base.show(io::IO, m::MIME"text/plain", lh::LHDataStore) = HDF5.show_tree(io, lh.data_store)
Base.show(io::IO, lh::LHDataStore) = show(io, MIME"text/plain"(), lh)

Base.setindex!(output::LHDataStore, v, i, chunk_size=nothing) = begin
Contributor:

@apmypb, we don't want to use chunk_size attributes in getindex and setindex!.

src/types.jl Outdated

# write <:Real
Base.setindex!(output::LHDataStore, v::T, i::AbstractString,
DT::DataType=typeof(v)) where {T<:Real} = begin
_setindex!(output::LHDataStore, v::T, i::AbstractString, args...
Contributor:

Why the args...?

src/types.jl Outdated
output.data_store[i] = v
DT != Nothing && setdatatype!(output.data_store[i], DT)
nothing
setdatatype!(output.data_store[i], T)
Contributor:

Probably safest to return nothing explicitly at the end.

src/types.jl Outdated
dspace = (size(v), (evntsize..., -1))
chunk = (evntsize..., CHUNK_SIZE)
_setindex!(output::LHDataStore, v::AbstractArray{T}, i::AbstractString,
chunk_size::Union{Nothing, Int}=nothing) where {T<:Real} = begin
Contributor:

Let's make this

create_entry!(output::LHDataStore, label::AbstractString, contents::AbstractArray{T}; chunk_size::Union{Nothing, Int}=nothing)

instead, with the same order of arguments used in write in HDF5.jl. chunk_size should be a keyword argument.

src/types.jl Outdated
@@ -478,7 +475,7 @@ Open a LEGEND HDF5 file and return an `LHDataStore` object.
LEGEND HDF5 files typically use the file extention ".lh5".
"""
function lh5open(filename::AbstractString, access::AbstractString = "r")
Contributor:

Let's use

function lh5open(filename::AbstractString, access::AbstractString = "r"; usechunks::Bool = false)

and

LHDataStore(HDF5.h5open(filename, access), usechunks)

src/types.jl Outdated
Currently supported are elements of `NamedTuple`, `TypedTable.Table` or
`HDF5.Group`.
"""
function reduce_datastore(lhd::LHDataStore, i::AbstractString)
Contributor:

Let's call this delete_entry!(lhd::LHDataStore, label::AbstractString)

src/types.jl Outdated

extend the Table `dest` at `lhd[i]` with columns from `src`.
"""
function extend_datastore(lhd::LHDataStore, i::AbstractString,
Contributor:

Do we need this function? If so, since it is purely targeted at tables, maybe call it add_columns! or so?

Passing both lhd and i on the one hand, and dest on the other hand seems redundant?

Contributor (Author):

extend_datastore also supports NamedTuples. I found it very convenient to not have to load the whole table just to add a column, or another entry to the NamedTuple.

Contributor (Author):

Having dest implicitly checks whether the data structure we want to alter is actually a NamedTuple or a Table, but I guess I can also check that inside the function itself.

Contributor:

> I found it very convenient to not have to load the whole table just to add a column, or another entry to the NamedTuple.

When do we add columns, actually?

Contributor:

I do this a lot when I have raw data and determine measured pulse amplitudes that I want to add to the Table. I find this very convenient.

Contributor:

That's actually kind of an anti-pattern, we do this via "horizontal" views across files instead (#37).

In principle, scientific data should not be modified/amended in the same file, once written and closed. But we can leave the functionality in if it's currently in active use.

@fhagemann (Contributor) commented Feb 1, 2024:

I don't seem to understand what you mean.

An example LH5 display of a saved Table called "segBEGe" looks like this:

🗂️ HDF5.File: (read-only)
└─ 📂 segBEGe
   ├─ 🏷️ datatype
   ├─ 🔢 Phi
   │  ├─ 🏷️ datatype
   │  └─ 🏷️ units
   ├─ 🔢 R
   │  ├─ 🏷️ datatype
   │  └─ 🏷️ units
   ├─ 🔢 Z
   │  ├─ 🏷️ datatype
   │  └─ 🏷️ units
   ├─ 🔢 chid
   │  └─ 🏷️ datatype
   ├─ 🔢 file_idx
   │  └─ 🏷️ datatype
   ├─ 🔢 samples
   │  └─ 🏷️ datatype
   └─ 🔢 z
      ├─ 🏷️ datatype
      └─ 🏷️ units

Right now, each COLUMN of a Table is saved as a (separate) HDF5 object in the file,
so why would adding a column break the pattern?

Contributor:

No, it wouldn't break the pattern. But from a scientific workflow point of view we should avoid amending tables during the course of data taking and analysis. But it's not a problem to offer the functionality.

src/types.jl Outdated
end
export reduce_datastore

function _reduce_datastore(lhd::LHDataStore, nt::NamedTuple,
Contributor:

Better call this _delete_entries or so? "reduce" has a different meaning in Julia.

@apmypb (Contributor, Author) commented Feb 17, 2024

I've changed reduce_datastore to delete_entry! and extend_datastore to add_entries!.
Also, chunking is disabled by default. If the usechunks variable is set to true, the chunk size is set to the length of the last axis of the corresponding array. However, create_entry also allows for more fine-grained control, by setting the chunk_size keyword to some positive integer value.
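A usage sketch of the resulting API, based on the description above (function names as given in this thread; exact signatures and export status may differ):

using LegendHDF5IO, TypedTables

lh5open("out.lh5", "w"; usechunks = true) do lhd
    lhd["data"] = rand(1000)    # chunk size defaults to the length of the last axis
    LegendHDF5IO.create_entry(lhd, "fine", rand(1000); chunk_size = 100)   # explicit control
    lhd["tbl"] = Table(a = rand(10), b = rand(10))
    LegendHDF5IO.add_entries!(lhd, "tbl", (c = rand(10),))   # add a column on disk
    LegendHDF5IO.delete_entry!(lhd, "tbl/b")                 # remove an entry on disk
end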

@fhagemann fhagemann requested a review from oschulz February 17, 2024 19:41
@fhagemann fhagemann changed the base branch from main to dev February 19, 2024 17:50
@fhagemann (Contributor)

In order to get this going, I'm merging this into a dev branch, which can then go to main if approved by @oschulz.
@theHenks: feel free to try it out

@fhagemann fhagemann merged commit a81ec28 into legend-exp:dev Feb 19, 2024
6 checks passed
@oschulz (Contributor) commented Feb 19, 2024

Thanks @fhagemann - yes, let's use this on dev in the field for a few days before merging into main.

@theHenks

I will move on with data production using dev. I can provide feedback on whether it works! Thanks @fhagemann

Labels: enhancement (New feature or request)
4 participants