
Add variable chunk size #35

Merged: 7 commits merged into legend-exp:dev on Feb 19, 2024
Conversation

@apmypb (Contributor) commented Sep 14, 2023

We can now specify the sizes of chunks to write to the file.
I set the default to contiguous data, meaning that if someone wants the data to be chunked, they have to write the data with an additional chunk_size argument:

data = rand(100)
chunk_size = 100
lhd["data", chunk_size] = data

where lhd is an LHDataStore.

codecov bot commented Sep 14, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (e2db37b) 52.23% compared to head (73265f6) 54.41%.

❗ Current head 73265f6 differs from pull request most recent head 0985046. Consider uploading reports for the commit 0985046 to get more accurate results.

Files          Patch %   Lines
src/types.jl   97.40%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #35      +/-   ##
==========================================
+ Coverage   52.23%   54.41%   +2.18%     
==========================================
  Files           6        6              
  Lines         605      634      +29     
==========================================
+ Hits          316      345      +29     
  Misses        289      289              


@apmypb apmypb changed the title Dev Add variable chunk size Sep 14, 2023
@fhagemann (Contributor)

Is it possible to read in old files with this, or might this result in errors?

@fhagemann (Contributor) commented Sep 15, 2023

I tested your changes using this script:

using LegendHDF5IO
using Test

using ArraysOfArrays
using DataFrames
using RadiationDetectorSignals
using StatsBase
using TypedTables
using Unitful


filename = "LegendHDF5IO_test.lh5"

A = randn(1000)
B = fill(true, 100)
F = Float64(6.5)
W = RDWaveform(range(0,100,length=100), rand(100))
U = fill(1.0u"mm", 100)
V = 1.0u"pF"
AW = ArrayOfRDWaveforms([W])
AS = ArrayOfSimilarArrays([deepcopy(U) for _ in 1:5])
VV = VectorOfVectors([fill(1.0u"mV", i) for i in 1:5])
hist = fit(Histogram, A)
nt = (A = A, B = B, F = F)
S = "TestString"
SYM = :Symbol

# Create the LH5 file
LHDataStore(filename, "w") do h
    h["Array"] = A
    h["BoolArray"] = B
    h["Histogram"] = hist
    h["Float"] = F
    h["ArrayOfWaveforms"] = AW
    h["ArrayWithUnits"] = U
    h["ArrayOfSimilarArrays"] = AS
    h["VectorOfVectors"] = VV
    h["Value"] = V
    h["NamedTuple"] = nt
    h["String"] = S
    # h["Waveform"] = W
    # h["Symbol"] = SYM
end

# read the data back in
@testset "LHDataStore read-in" begin
    LHDataStore(filename, "r") do h_in
        @test h_in["Array"] == A
        @test h_in["Array"][:] isa typeof(A)
        @test h_in["BoolArray"] == B
        @test_broken h_in["BoolArray"][:] isa typeof(B) # BitVector != Vector{Bool}
        @test h_in["Histogram"] == hist
        @test h_in["Histogram"] isa typeof(hist)
        @test h_in["Float"] == F
        @test h_in["Float"] isa typeof(F)
        @test_broken h_in["ArrayOfWaveforms"] == AW # requires a Colon somehow
        @test h_in["ArrayOfWaveforms"][:] == AW
        @test h_in["ArrayOfWaveforms"][:] isa typeof(AW)
        @test h_in["ArrayWithUnits"] == U
        @test h_in["ArrayWithUnits"][:] isa typeof(U)
        @test h_in["ArrayOfSimilarArrays"] == AS
        @test h_in["ArrayOfSimilarArrays"][:] isa typeof(AS)
        @test h_in["VectorOfVectors"] == VV
        @test h_in["VectorOfVectors"][:] isa typeof(VV)
        @test h_in["Value"] == V
        @test h_in["Value"] isa typeof(V)
        @test h_in["NamedTuple"] == nt
        @test_broken h_in["NamedTuple"] isa typeof(nt) # What's the command to read the Arrays into memory?
        @test h_in["String"] == S
        @test h_in["String"] isa typeof(S)
        # @test h["Waveform"] == W
        # @test h_in["Symbol"] == SYM
    end
end;


# tests for add_column
t = Table(a = rand(100), b = rand(100)*u"pF")
c = rand(100)*u"mm"
t_new = let df = DataFrame(t); df.c = c; Table(df) end # add the column c to t

LHDataStore(filename, "cw") do h
    h["Table"] = t
    LegendHDF5IO.add_column(h, "Table", (c = c,))
    @test h["Table"] != t_new # probably because it's cached ?
end
    
LHDataStore(filename) do h
    @test h["Table"] == t_new
end

Two things to note:

  • LHDataStore is not able to write Symbols (needed to save SSD simulations) or single RDWaveforms. This is something that we might want to add later.
  • When adding a column, one needs to close the file and re-open it in order to get the new table. It seems like there is some caching going on. This should be written down somewhere (maybe in the docstring of add_column?).

Maybe add (part of) these tests to the test scripts, so that new functionality will always be tested before being merged.

@oschulz (Contributor) commented Sep 15, 2023

> LHDataStore is not able to write Symbols (needed to save SSD simulations) or single RDWaveforms. This is something that we might want to add later.

@apmypb can you add this quickly?

@fhagemann (Contributor)

> LHDataStore is not able to write Symbols (needed to save SSD simulations) or single RDWaveforms. This is something that we might want to add later.
>
> @apmypb can you add this quickly?

Never mind, the methods for writing Symbols are already in there (I had tested the tagged version of LegendHDF5IO, not the newest commit); everything with Symbols works fine.

For single RDWaveforms, I don't see an urgent need to include this in this PR.

@oschulz (Contributor) commented Sep 19, 2023

I took another look at this: we shouldn't set the chunk size as an argument of setindex!. It breaks the typical setindex! API, and users will often write a whole table at once rather than each dataset separately anyway, so they won't be able to exert fine-grained per-column control that way.

Chunk size selection should become a property of LHDataStore instead, e.g. LHDataStore(filename, "w"; chunk_size = SOME_DEFAULT_SIZE). If users set chunk_size to nothing, the data should be written unchunked (appending is then not possible, of course, but on the other hand reading can use memory mapping, which would be useful for ML datasets and the like).
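For illustration, a minimal sketch of that proposed interface (hypothetical, not the code in this PR; chunk_size as a store-level keyword):

using LegendHDF5IO

# hypothetical store-level chunk_size, as proposed above
LHDataStore("chunked.lh5", "w"; chunk_size = 65536) do lhd
    lhd["data"] = rand(10^6)    # written chunked, can be appended to later
end

LHDataStore("contiguous.lh5", "w"; chunk_size = nothing) do lhd
    lhd["data"] = rand(10^6)    # unchunked: no appending, but memory-mappable reads
end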

@apmypb (Contributor, Author) commented Sep 20, 2023

Should I set the default to nothing, or to some value like 10_000?
I would set it to nothing, since we don't necessarily know the structure of our users' data, so in the end only the user knows the chunk_size suitable for them.
Also, the way I changed the setindex! functions, you can also specify a chunk_size for a Table such that all columns inherit the given chunk_size; see the sketch below. Should I still make chunk_size a property of LHDataStore?
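A minimal sketch of that Table behavior with the setindex! variant currently in this PR (assuming a TypedTables.Table):

using LegendHDF5IO, TypedTables

t = Table(a = rand(1000), b = rand(1000))

LHDataStore("table.lh5", "w") do lhd
    lhd["tbl", 100] = t          # every column of t inherits chunk size 100
    lhd["plain"] = rand(1000)    # no chunk_size given: contiguous (the default)
end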

@oschulz (Contributor) commented Sep 25, 2023

Hm, let's set the default chunk size to 65536 bytes for now.

@oschulz (Contributor) commented Dec 13, 2023

I think the best approach is to determine the chunk size based on the "first write", i.e. set the chunk size to the size of the arrays, columns, etc. passed to the setindex! that ultimately creates the datasets. It'll be the responsibility of the user to write the data in suitable chunks.

LHDataStore can have a simple property usechunks::Bool to control whether to use chunking or not. Chunking should be disabled by default, since we'll only use it in certain cases, e.g. when writing large tables that contain waveforms.
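A minimal sketch of this proposal (the usechunks keyword and the first-write rule are hypothetical here, not yet the code in this PR):

using LegendHDF5IO

lh5open("grow.lh5", "w"; usechunks = true) do lhd
    lhd["data"] = rand(1000)          # first write: chunk size taken from the array size
    append!(lhd["data"], rand(500))   # later writes append along the last dimension
end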

@apmypb (Contributor, Author) commented Jan 18, 2024

I don't think that LHDataStore should have this property, since it only manages LH5Arrays. Having this property would also imply that we expect every HDF5.Dataset inside this LHDataStore to be saved in chunks, which might not be the case for simple text info, for example. Keeping the current state would give users the freedom to chunk only the datasets they really need.

My question would be whether we should give the user the option to choose the size of chunks and the dimension along which the dataset can be extended, or whether only the last dimension of an array gets chunked (current state). Since the chunking feature only really exists for the purpose of being able to append additional data, we might change the current append! implementations to something like cat, with a dims argument determining the dimension along which the dataset can be extended.

@oschulz (Contributor) commented Jan 18, 2024

> I don't think that LHDataStore should have this property, since it only manages LH5Arrays. Having this property would also imply that we expect every HDF5.Dataset inside this LHDataStore to be saved in chunks, which might not be the case for simple text info, for example.

That's no problem, we'd only apply chunking where it makes sense anyway.

> My question would be whether we should give the user the option to choose the size of chunks and the dimension along which the dataset can be extended, or whether only the last dimension of an array gets chunked (current state).

Growing only along the last dimension should be enough even long-term. So we should also stick with append!.
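For example, growing a dataset along its last dimension (a sketch using this PR's setindex! syntax, assuming append! on the returned LH5Array extends the on-disk dataset as discussed):

using LegendHDF5IO

lh5open("append.lh5", "cw") do lhd
    lhd["data", 100] = rand(100)     # chunked: extendable along the last axis
    append!(lhd["data"], rand(50))   # the on-disk dataset grows to length 150
end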

added `extend_datastore` and `reduce_datastore` functions

In addition to adding and removing columns, it adapts the datatype of the HDF5 group accordingly.
@apmypb (Contributor, Author) commented Jan 19, 2024

The chunking feature should work now. The usechunks attribute, however, is not used at all.
I also changed the add_column function to extend_datastore and added reduce_datastore; these add and remove columns from tables, or elements from NamedTuples, on disk, changing the datatype accordingly.
I don't understand why some tests seem to fail on Windows, though...

-LHDataStore(path, "cw") do f
-    f["tmp"] = nt
+lh5open(path, "cw") do f
+    f["tmp", 50] = nt
Contributor:

This is the syntax to write with a chunk size of 50?

Contributor (Author):

The integer value in setindex! determines the chunk size of the last dimension of this array, or if it's a NamedTuple, of each of the arrays inside it.
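For example (a minimal sketch of the syntax under discussion):

using LegendHDF5IO

nt = (a = rand(200), b = rand(200))

lh5open("nt.lh5", "cw") do f
    f["tmp", 50] = nt   # each array in nt gets chunk size 50 along its last dimension
end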

@@ -305,6 +303,7 @@ julia> lhf["new"] = x
"""
mutable struct LHDataStore <: AbstractDict{String,Any}
data_store::HDF5.H5DataStore
usechunks::Bool
Contributor:

Is this ever used...

src/types.jl Outdated
@@ -345,131 +344,129 @@ end
Base.show(io::IO, m::MIME"text/plain", lh::LHDataStore) = HDF5.show_tree(io, lh.data_store)
Base.show(io::IO, lh::LHDataStore) = show(io, MIME"text/plain"(), lh)

Base.setindex!(output::LHDataStore, v, i, chunk_size=nothing) = begin
output.usechunks = !isnothing(chunk_size)
Contributor:

... other than here (where it is set)?

@oschulz (Contributor) commented Jan 29, 2024

@apmypb is this good to merge from your side?

CC @theHenks

@fhagemann (Contributor)

Do we care about the failing windows tests though?

@oschulz (Contributor) commented Jan 29, 2024

The Windows tests should definitely be fixed; they're passing on main, so there shouldn't be any problem with HDF5.jl itself.

@fhagemann (Contributor)

The problem with Windows was that functions like joinpath and splitdir are defined with a path separator of \\ (see here), whereas the paths in an HDF5 file are always defined with /, no matter the operating system.
I have modified the code such that LegendHDF5IO.jl explicitly always uses /.
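A sketch of the kind of change (h5joinpath and h5splitdir are hypothetical helper names; the actual fix in the PR may differ):

# joinpath/splitdir use the OS path separator ("\\" on Windows), but paths
# inside an HDF5 file always use "/". OS-independent helpers could look like:
h5joinpath(parts::AbstractString...) = join(parts, "/")

function h5splitdir(path::AbstractString)
    i = findlast('/', path)
    isnothing(i) ? ("", path) : (path[1:prevind(path, i)], path[nextind(path, i):end])
end

h5joinpath("segBEGe", "samples")   # "segBEGe/samples" on any OS
h5splitdir("segBEGe/samples")      # ("segBEGe", "samples")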

@fhagemann fhagemann added the enhancement New feature or request label Jan 31, 2024
@@ -239,12 +239,10 @@ end

function setdatatype!(output::Union{HDF5.Dataset, HDF5.H5DataStore}, datatype::Type)
dtstr = datatype_to_string(datatype)
# @debug "setdatatype!($(_infostring(output)), \"$dtstr\")"
hasattribute(output, :datatype) && HDF5.delete_attribute(output, "datatype")
Contributor:

@apmypb in what situations would we overwrite a datatype attribute?

src/types.jl Outdated
@@ -345,131 +344,129 @@ end
Base.show(io::IO, m::MIME"text/plain", lh::LHDataStore) = HDF5.show_tree(io, lh.data_store)
Base.show(io::IO, lh::LHDataStore) = show(io, MIME"text/plain"(), lh)

Base.setindex!(output::LHDataStore, v, i, chunk_size=nothing) = begin
Contributor:

@apmypb, we don't want to use chunk_size attributes in getindex and setindex!.

src/types.jl Outdated

# write <:Real
Base.setindex!(output::LHDataStore, v::T, i::AbstractString,
DT::DataType=typeof(v)) where {T<:Real} = begin
_setindex!(output::LHDataStore, v::T, i::AbstractString, args...
Contributor:

Why the args...?

src/types.jl Outdated
output.data_store[i] = v
DT != Nothing && setdatatype!(output.data_store[i], DT)
nothing
setdatatype!(output.data_store[i], T)
Contributor:

Probably safest to return nothing explicitly at the end.

src/types.jl Outdated
dspace = (size(v), (evntsize..., -1))
chunk = (evntsize..., CHUNK_SIZE)
_setindex!(output::LHDataStore, v::AbstractArray{T}, i::AbstractString,
chunk_size::Union{Nothing, Int}=nothing) where {T<:Real} = begin
Contributor:

Let's make this

create_entry!(output::LHDataStore, label::AbstractString, contents::AbstractArray{T}; chunk_size::Union{Nothing, Int}=nothing)

instead, with the same order of arguments used in write in HDF5.jl. chunk_size should be a keyword argument.

src/types.jl Outdated
@@ -478,7 +475,7 @@ Open a LEGEND HDF5 file and return an `LHDataStore` object.
LEGEND HDF5 files typically use the file extention ".lh5".
"""
function lh5open(filename::AbstractString, access::AbstractString = "r")
Contributor:

Let's use

function lh5open(filename::AbstractString, access::AbstractString = "r"; usechunks::Bool = false)

and

LHDataStore(HDF5.h5open(filename, access), usechunks)

src/types.jl Outdated
Currently supported are elements of `NamedTuple`, `TypedTable.Table` or
`HDF5.Group`.
"""
function reduce_datastore(lhd::LHDataStore, i::AbstractString)
Contributor:

Let's call this delete_entry!(lhd::LHDataStore, label::AbstractString)

src/types.jl Outdated

extend the Table `dest` at `lhd[i]` with columns from `src`.
"""
function extend_datastore(lhd::LHDataStore, i::AbstractString,
Contributor:

Do we need this function? If so, since it is purely targeted at tables, maybe call it add_columns! or so?

Passing both lhd and i on the one hand, and dest on the other hand seems redundant?

Contributor (Author):

extend_datastore also supports NamedTuples. I found it very convenient to not have to load the whole table just to add a column, or another entry to the NamedTuple.

Contributor (Author):

Having dest implicitly checks whether the data structure we want to alter is actually a NamedTuple or a Table, but I guess I can also check that inside the function itself.

Contributor:

> I found it very convenient to not have to load the whole table just to add a column, or another entry to the NamedTuple.

When do we add columns, actually?

Contributor:

I do this a lot when I have raw data and determine measured pulse amplitudes that I want to add to the Table. I find this very convenient.

Contributor:

That's actually kind of an anti-pattern, we do this via "horizontal" views across files instead (#37).

In principle, scientific data should not be modified/amended in the same file, once written and closed. But we can leave the functionality in if it's currently in active use.

@fhagemann (Contributor) commented Feb 1, 2024:

I don't seem to understand what you mean.

An example LH5 display of a saved Table called "segBEGe" looks like this:

🗂️ HDF5.File: (read-only)
└─ 📂 segBEGe
   ├─ 🏷️ datatype
   ├─ 🔢 Phi
   │  ├─ 🏷️ datatype
   │  └─ 🏷️ units
   ├─ 🔢 R
   │  ├─ 🏷️ datatype
   │  └─ 🏷️ units
   ├─ 🔢 Z
   │  ├─ 🏷️ datatype
   │  └─ 🏷️ units
   ├─ 🔢 chid
   │  └─ 🏷️ datatype
   ├─ 🔢 file_idx
   │  └─ 🏷️ datatype
   ├─ 🔢 samples
   │  └─ 🏷️ datatype
   └─ 🔢 z
      ├─ 🏷️ datatype
      └─ 🏷️ units

Right now, each COLUMN of a Table is saved as a (separate) HDF5 object in the file,
so why would adding a column break the pattern?

Contributor:

No, it wouldn't break the pattern. But from a scientific workflow point of view we should avoid amending tables during the course of data taking and analysis. But it's not a problem to offer the functionality.

src/types.jl Outdated
end
export reduce_datastore

function _reduce_datastore(lhd::LHDataStore, nt::NamedTuple,
Contributor:

Better call this _delete_entries or so? "reduce" has a different meaning in Julia.

@apmypb (Contributor, Author) commented Feb 17, 2024

I've changed reduce_datastore to delete_entry! and extend_datastore to add_entries!.
Also, chunking is disabled by default. If the usechunks variable is set to true, the chunk size is set to the length of the last axis of the corresponding array. However, create_entry also allows for more fine-grained control, by setting the chunk_size keyword to some positive integer value.
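A usage sketch of the resulting API, based on the description above (function names as given in this thread; exact signatures and export status may differ):

using LegendHDF5IO, TypedTables

lh5open("out.lh5", "w"; usechunks = true) do lhd
    lhd["data"] = rand(1000)    # chunk size defaults to the length of the last axis
    LegendHDF5IO.create_entry(lhd, "fine", rand(1000); chunk_size = 100)   # explicit control
    lhd["tbl"] = Table(a = rand(10), b = rand(10))
    LegendHDF5IO.add_entries!(lhd, "tbl", (c = rand(10),))   # add a column on disk
    LegendHDF5IO.delete_entry!(lhd, "tbl/b")                 # remove an entry on disk
end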

@fhagemann fhagemann requested a review from oschulz February 17, 2024 19:41
@fhagemann fhagemann changed the base branch from main to dev February 19, 2024 17:50
@fhagemann (Contributor)

In order to get this going, I'm merging this into a dev branch, which can then go to main if approved by @oschulz.
@theHenks: feel free to try it out

@fhagemann fhagemann merged commit a81ec28 into legend-exp:dev Feb 19, 2024
6 checks passed
@oschulz (Contributor) commented Feb 19, 2024

Thanks @fhagemann - yes, let's use this on dev in the field for a few days before merging into main.

@theHenks

I will move on with data production using dev. I can provide feedback on whether it works! Thanks @fhagemann

Labels: enhancement (New feature or request)
4 participants