Add variable chunk size #35
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main      #35      +/-   ##
==========================================
+ Coverage   52.23%   54.41%   +2.18%
==========================================
  Files           6        6
  Lines         605      634      +29
==========================================
+ Hits          316      345      +29
  Misses        289      289
Is it possible to read old files in with this, or might this result in errors?
I tested your changes using this script:

using LegendHDF5IO
using Test
using ArraysOfArrays
using DataFrames
using RadiationDetectorSignals
using StatsBase
using TypedTables
using Unitful
filename = "LegendHDF5IO_test.lh5"
A = randn(1000)
B = fill(true, 100)
F = Float64(6.5)
W = RDWaveform(range(0,100,length=100), rand(100))
U = fill(1.0u"mm", 100)
V = 1.0u"pF"
AW = ArrayOfRDWaveforms([W])
AS = ArrayOfSimilarArrays([deepcopy(U) for _ in 1:5])
VV = VectorOfVectors([fill(1.0u"mV", i) for i in 1:5])
hist = fit(Histogram, A)
nt = (A = A, B = B, F = F)
S = "TestString"
SYM = :Symbol
# Create the LH5 file
LHDataStore(filename, "w") do h
h["Array"] = A
h["BoolArray"] = B
h["Histogram"] = hist
h["Float"] = F
h["ArrayOfWaveforms"] = AW
h["ArrayWithUnits"] = U
h["ArrayOfSimilarArrays"] = AS
h["VectorOfVectors"] = VV
h["Value"] = V
h["NamedTuple"] = nt
h["String"] = S
# h["Waveform"] = W
# h["Symbol"] = SYM
end
# read the data back in
@testset "LHDataStore read-in" begin
LHDataStore(filename, "r") do h_in
@test h_in["Array"] == A
@test h_in["Array"][:] isa typeof(A)
@test h_in["BoolArray"] == B
@test_broken h_in["BoolArray"][:] isa typeof(B) # BitVector != Vector{Bool}
@test h_in["Histogram"] == hist
@test h_in["Histogram"] isa typeof(hist)
@test h_in["Float"] == F
@test h_in["Float"] isa typeof(F)
@test_broken h_in["ArrayOfWaveforms"] == AW # requires a Colon somehow
@test h_in["ArrayOfWaveforms"][:] == AW
@test h_in["ArrayOfWaveforms"][:] isa typeof(AW)
@test h_in["ArrayWithUnits"] == U
@test h_in["ArrayWithUnits"][:] isa typeof(U)
@test h_in["ArrayOfSimilarArrays"] == AS
@test h_in["ArrayOfSimilarArrays"][:] isa typeof(AS)
@test h_in["VectorOfVectors"] == VV
@test h_in["VectorOfVectors"][:] isa typeof(VV)
@test h_in["Value"] == V
@test h_in["Value"] isa typeof(V)
@test h_in["NamedTuple"] == nt
@test_broken h_in["NamedTuple"] isa typeof(nt) # What's the command to read the Arrays into memory?
@test h_in["String"] == S
@test h_in["String"] isa typeof(S)
# @test h["Waveform"] == W
# @test h_in["Symbol"] == SYM
end
end;
# tests for add_column
t = Table(a = rand(100), b = rand(100)*u"pF")
c = rand(100)*u"mm"
t_new = let df = DataFrame(t); df.c = c; Table(df) end # add the column c to t
LHDataStore(filename, "cw") do h
h["Table"] = t
LegendHDF5IO.add_column(h, "Table", (c = c,))
@test h["Table"] != t_new # probably because it's cached ?
end
LHDataStore(filename) do h
@test h["Table"] == t_new
end

Two things to note:
Maybe add (part of) these tests to the test scripts, such that new functionalities will always be tested before being merged.
@apmypb can you add this quickly?
Never mind, the methods for writing … For single …
I took another look at this - we shouldn't set chunk size as an argument of setindex!. Chunk size selection should become a property of LHDataStore instead, e.g. a usechunks flag.
Should I set the default to …?
Hm, let's set the default chunk size to 65536 bytes for now.
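For a sense of scale, a 64 KiB default would translate into a chunk length along the last dimension roughly like this (a hypothetical sketch; default_chunk_length is not part of the package, and evntsize just mirrors the variable name visible in the diff below):

```julia
# Hypothetical helper (not in LegendHDF5IO): derive the chunk length along
# the last dimension from a default chunk size given in bytes.
const DEFAULT_CHUNK_BYTES = 65536  # the 64 KiB default discussed here

function default_chunk_length(::Type{T}, evntsize::Tuple) where {T<:Real}
    bytes_per_event = sizeof(T) * prod(evntsize)  # one slice along the last dim
    max(1, DEFAULT_CHUNK_BYTES ÷ bytes_per_event)
end

default_chunk_length(Float64, ())      # 8192 for a plain Vector{Float64}
default_chunk_length(Float64, (100,))  # 81 for, e.g., 100-sample waveforms
```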
I think the best approach is to determine the chunk size based on the "first write", i.e. set the chunk size to the size of the arrays, columns, etc. passed to the …
I don't think that … My question would be whether we should give the user the option to choose the size of chunks and the dimension along which the dataset can be extended, or whether only the last dimension of an array gets chunked (current state). Since the chunking feature only really exists for the purpose of being able to append additional data, we might change the current …
That's no problem, we'd only apply chunking where it makes sense anyway.
Growing only along the last dimension should be enough even long-term. So we should also stick with …
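For context, growing a dataset only along its last dimension is directly supported by plain HDF5.jl; a minimal sketch, independent of this package:

```julia
using HDF5

h5open("chunked_demo.h5", "w") do f
    # 100 samples per event; -1 marks the last dimension as unlimited
    dset = create_dataset(f, "samples", Float64, ((100, 0), (100, -1)),
                          chunk = (100, 512))
    HDF5.set_extent_dims(dset, (100, 10))  # grow the last dimension to 10 events
    dset[:, 1:10] = randn(100, 10)         # write the appended slab
end
```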
Added `extend_datastore` and `reduce_datastore` functions. In addition to adding and removing columns, they adapt the datatype of the HDF5 group accordingly.
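A usage sketch, with the argument order and column path guessed from the partial signatures and docstrings further down in this thread (both are assumptions, not the confirmed API):

```julia
using LegendHDF5IO, Unitful

lh5open("LegendHDF5IO_test.lh5", "cw") do lhd
    dest = lhd["Table"][:]   # the Table currently stored at "Table"
    # add a column `c` (exact argument order is an assumption)
    extend_datastore(lhd, "Table", dest, (c = rand(100) * u"mm",))
    # remove the column again; addressing it by path is an assumption
    reduce_datastore(lhd, "Table/c")
end
```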
The chunking feature should work now. The …
test/test_wrappers.jl
Outdated
-LHDataStore(path, "cw") do f
-    f["tmp"] = nt
+lh5open(path, "cw") do f
+    f["tmp", 50] = nt
This is the syntax to write with a chunk size of 50?
The integer value in setindex! determines the chunk size of the last dimension of this array, or, if it's a NamedTuple, of each of the arrays inside it.
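In other words, a short sketch of the reviewed syntax:

```julia
lh5open("data.lh5", "cw") do f
    f["A", 50] = rand(1000)                         # chunk length 50 on the last dimension
    f["nt", 50] = (a = rand(1000), b = rand(1000))  # applies to each array inside
end
```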
@@ -305,6 +303,7 @@ julia> lhf["new"] = x
 """
 mutable struct LHDataStore <: AbstractDict{String,Any}
     data_store::HDF5.H5DataStore
+    usechunks::Bool
Is this ever used...
src/types.jl
Outdated
@@ -345,131 +344,129 @@ end
 Base.show(io::IO, m::MIME"text/plain", lh::LHDataStore) = HDF5.show_tree(io, lh.data_store)
 Base.show(io::IO, lh::LHDataStore) = show(io, MIME"text/plain"(), lh)

+Base.setindex!(output::LHDataStore, v, i, chunk_size=nothing) = begin
+    output.usechunks = !isnothing(chunk_size)
... other than here (where it is set)?
Do we care about the failing Windows tests though?
The Windows tests should definitely be fixed; they're passing on main, so there shouldn't be any problem with HDF5.jl itself.
The problem with Windows was that functions like …
@@ -239,12 +239,10 @@ end

 function setdatatype!(output::Union{HDF5.Dataset, HDF5.H5DataStore}, datatype::Type)
     dtstr = datatype_to_string(datatype)
     # @debug "setdatatype!($(_infostring(output)), \"$dtstr\")"
+    hasattribute(output, :datatype) && HDF5.delete_attribute(output, "datatype")
@apmypb in what situations would we overwrite a datatype attribute?
src/types.jl
Outdated
@@ -345,131 +344,129 @@ end
 Base.show(io::IO, m::MIME"text/plain", lh::LHDataStore) = HDF5.show_tree(io, lh.data_store)
 Base.show(io::IO, lh::LHDataStore) = show(io, MIME"text/plain"(), lh)

+Base.setindex!(output::LHDataStore, v, i, chunk_size=nothing) = begin
@apmypb, we don't want to use chunk_size attributes in getindex and setindex!.
src/types.jl
Outdated
 # write <:Real
-Base.setindex!(output::LHDataStore, v::T, i::AbstractString,
-    DT::DataType=typeof(v)) where {T<:Real} = begin
+_setindex!(output::LHDataStore, v::T, i::AbstractString, args...
Why the args...?
src/types.jl
Outdated
 output.data_store[i] = v
-DT != Nothing && setdatatype!(output.data_store[i], DT)
-nothing
+setdatatype!(output.data_store[i], T)
Probably safest to return nothing explicitly at the end.
src/types.jl
Outdated
-    dspace = (size(v), (evntsize..., -1))
-    chunk = (evntsize..., CHUNK_SIZE)
+_setindex!(output::LHDataStore, v::AbstractArray{T}, i::AbstractString,
+    chunk_size::Union{Nothing, Int}=nothing) where {T<:Real} = begin
Let's make this

create_entry!(output::LHDataStore, label::AbstractString, contents::AbstractArray{T}; chunk_size::Union{Nothing, Int}=nothing)

instead, with the same order of arguments used in write in HDF5.jl. chunk_size should be a keyword argument.
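A call would then look roughly like this (a sketch of the proposed, not yet merged, API; the label and data are made up):

```julia
create_entry!(lhd, "energies", rand(10_000); chunk_size = 1024)
```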
src/types.jl
Outdated
@@ -478,7 +475,7 @@ Open a LEGEND HDF5 file and return an `LHDataStore` object.
 LEGEND HDF5 files typically use the file extention ".lh5".
 """
 function lh5open(filename::AbstractString, access::AbstractString = "r")
Let's use

function lh5open(filename::AbstractString, access::AbstractString = "r"; usechunks::Bool = false)

and

LHDataStore(HDF5.h5open(filename, access), usechunks)
src/types.jl
Outdated
 Currently supported are elements of `NamedTuple`, `TypedTable.Table` or
 `HDF5.Group`.
 """
 function reduce_datastore(lhd::LHDataStore, i::AbstractString)
Let's call this delete_entry!(lhd::LHDataStore, label::AbstractString)
src/types.jl
Outdated
 extend the Table `dest` at `lhd[i]` with columns from `src`.
 """
 function extend_datastore(lhd::LHDataStore, i::AbstractString,
Do we need this function? If so, since it is purely targeted at tables, maybe call it add_columns! or so? Passing both lhd and i on the one hand, and dest on the other hand, seems redundant?
extend_datastore also supports NamedTuples. I found it very convenient to not load the whole table just to add a column or another entry to the NamedTuple.
Having dest implicitly checks whether the data structure we want to alter is actually a NamedTuple or a Table, but I guess I can also check that inside the function itself.
"I found it very convenient, to not load in the whole table in order to add just a column or another entry in the NamedTuple."

When do we add columns, actually?
I do this a lot when I have raw data, determine measured pulse amplitudes, and want to add them to the Table. I find this very convenient.
That's actually kind of an anti-pattern; we do this via "horizontal" views across files instead (#37). In principle, scientific data should not be modified or amended in the same file once written and closed. But we can leave the functionality in if it's currently in active use.
I don't seem to understand what you mean.
An example LH5 display of a saved Table called "segBEGe" looks like this:
🗂️ HDF5.File: (read-only)
└─ 📂 segBEGe
├─ 🏷️ datatype
├─ 🔢 Phi
│ ├─ 🏷️ datatype
│ └─ 🏷️ units
├─ 🔢 R
│ ├─ 🏷️ datatype
│ └─ 🏷️ units
├─ 🔢 Z
│ ├─ 🏷️ datatype
│ └─ 🏷️ units
├─ 🔢 chid
│ └─ 🏷️ datatype
├─ 🔢 file_idx
│ └─ 🏷️ datatype
├─ 🔢 samples
│ └─ 🏷️ datatype
└─ 🔢 z
├─ 🏷️ datatype
└─ 🏷️ units
Right now, each COLUMN of a Table is saved as a (separate) HDF5 object in the file,
so why would adding a column break the pattern?
No, it wouldn't break the pattern. But from a scientific-workflow point of view, we should avoid amending tables during the course of data taking and analysis. It's not a problem to offer the functionality, though.
src/types.jl
Outdated
 end
 export reduce_datastore

 function _reduce_datastore(lhd::LHDataStore, nt::NamedTuple,
Better call this _delete_entries or so? "reduce" has a different meaning in Julia.
I've changed …
Thanks @fhagemann - yes, let's use this on …
I will move on with data production with …
We can now specify the sizes of chunks to write to the file. I set the default to contiguous data. Meaning, if someone wants the data to be chunked, they have to write the data with an additional chunk_size argument: …, where lhd is an LHDataStore.
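A minimal sketch of the two write modes, assuming the indexing syntax reviewed above (the exact snippet from this comment was lost from the page):

```julia
lh5open("data.lh5", "cw") do lhd
    lhd["contiguous"] = rand(1000)    # default: contiguous layout
    lhd["chunked", 100] = rand(1000)  # chunked, chunk length 100 on the last dimension
end
```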