metadata method #22
Comments
@pdeffebach had some good design ideas about it in DataFrames.jl in the past. As this is raised on a higher level let me give the API I envision for DataFrames.jl for now:
If we agree to this design then I can implement it. The key challenge is the rules for propagating metadata, but this is not a DataAPI.jl-related matter, so I leave this discussion for later. |
See JuliaData/DataFrames.jl#1458 for the last attempt at implementing this in DataFrames. Two points:
In general a choice has to be made between having 1) a single function in the API which would return a metadata dict which would have to implement specific methods ( |
Metadata could technically be stored at any level of something like a table. For example, each column could be a MetadataArray (i.e. from MetadataArrays.jl) and the table itself could have metadata. I worry that if we started trying to design this around column based indexing it would needlessly complicate and potentially limit its wider usability. Even the definition of what "metadata" is to different people is likely to vary so I'm not sure we should even guarantee it returns a certain type. |
Initially I wanted to write that So personally I would prefer the "single function that returns a metadata dict" approach and later the user can just work on the Ah - and now I see we could support I agree with @Tokazama that different people will want different things from metadata therefore I believe the API we provide should be maximally simple and flexible. |
Allowing the user to decide what to do with whatever Would a simple PR to DataAPI.jl on this be a good next step right now? |
One will probably need to reserve key names anyway. In particular I do not think that I think this is such a major thing that we should wait for other JuliaData members to comment before moving forward. |
Maybe we can say that (In practice for DataFrames we would probably store column metadata internally as vectors with one entry per column as JuliaData/DataFrames.jl#1458 does but that can be exposed to users via a lazy |
Agreed
We can discuss what is best in the PR for DataFrames.jl when it is done (essentially we have two options: dict of vectors or vector of dicts). |
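To make the two storage options concrete, here is a minimal sketch of both layouts. The column names and metadata keys are hypothetical, and this is not taken from JuliaData/DataFrames.jl#1458.

```julia
# Hypothetical column names and metadata keys, for illustration only.
colnames = [:age, :income]

# Option 1: vector of dicts, one Dict per column, stored in column order.
vec_of_dicts = [
    Dict{Symbol,Any}(:label => "Age in years"),
    Dict{Symbol,Any}(:label => "Household income", :unit => "USD"),
]

# Option 2: dict of vectors, one vector per metadata key with one entry per column.
# `nothing` marks a column that has no value for that key.
dict_of_vecs = Dict{Symbol,Vector{Any}}(
    :label => Any["Age in years", "Household income"],
    :unit  => Any[nothing, "USD"],
)

# Look up the same value under each layout.
col = findfirst(==(:income), colnames)
unit_1 = get(vec_of_dicts[col], :unit, nothing)
unit_2 = dict_of_vecs[:unit][col]
```

The vector of dicts keeps each column's metadata together, which makes it easy to move along with the column. The dict of vectors makes it cheap to ask for one key across all columns, but needs a placeholder for columns that lack the key.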
As I have been thinking about this issue and #1458 I came to the conclusion that we should go back to the fundamentals. And the core issue is:
What I mean is that while we all seem to agree that adding metadata to tables is needed, I would first discuss what kind of metadata we really think people would store in practice. This is a relevant question, as I think we should not create functionality that would later be very rarely used. Conversely - if we know exactly what we actually want to use, we can design an API that supports the required use-cases cleanly. My two concerns are:
So now let me go down to the starting question - what metadata we see that would be actually used (this is not a comprehensive list - please comment what you think would be really used - not just potentially used):
|
In addition to what you've mentioned here are some types of metadata that I think would be useful for me personally to be able to store:
I think it depends on how much you care to take ownership of handling all metadata. I would prefer handling metadata be given a minimal interface. It could potentially have methods for things like joins so that something like I also think that I/O on metadata should be entirely dependent on the package supporting I/O. There aren't many file types equipped to flexibly handle metadata and it seems like the best thing for DataAPI.jl is to just make it simple to extract metadata. |
I agree with all that is said here. As the author of one of the previous attempts I think that meta-data is important and people coming from R and Python often don't fully appreciate how useful metadata is for Stata users and how it has hurt the adoption of R in applied economics, especially household surveys.
My use of metadata in Stata was twofold
you would see a note that said something along the lines of
Stata also has metadata about a table, which is often used to denote a source or author. I never used that feature. With regards to IO, I don't see a huge problem with saving a data frame to two CSVs and providing a convenience method for adding metadata to a DataFrame when the metadata is stored as a Table. Maybe it's a bit heavy handed but it's robust. |
I think these are all great use cases that I've wanted at some point. As someone who deals with lots of different types of metadata I'd really like to emphasize that less is more as this is implemented. It's easy to get stuck in the weeds on every little implementation detail because you have the combination of situations that arise from row specific, column specific, and general table metadata and all the different types of metadata. This is loosely the kind of structure I'm considering using...
struct Table{T<:AbstractVector,M<:Union{Nothing,AbstractDict{Symbol,Any}}}
    data::Vector{T}
    index::Dict{Symbol,Int}
    meta::M
end
metadata(x::Table) = getfield(x, :meta)
Users don't have to ever worry about metadata unless they decide they want it and developers can create whatever type of fancy metadata that changes dispatch as long as it is a subtype of I would think column specific metadata would be easiest if implemented as a column vector with metadata so I could just do |
I agree with most of what has been said. Just one point:
I'm afraid this wouldn't be workable, as it would require users to deal with another new kind of vector just to store metadata. That would force recompiling all functions for that type, and it wouldn't be easy to deal with e.g. We can say the table type is responsible for preserving metadata across concatenations/joins. DataAPI itself doesn't have to know anything about that. |
With regards to spatial data, which is a natural use case of this, is there anyone in the Julia Data community who has a really detailed knowledge of R's It's the best thing ever, being able to use all of Perhaps someone who has worked on that project could provide some insights. |
The project is going to be done during JSoC this year. And one of the reasons I am pressing to decide on metadata now is to have a clear guidance how this extra package should integrate with DataFrames.jl. |
I'm a little slow/late to the discussion here, but have thought a bit about this. I agree with the idea that this is a way that Julia/DataFrames can really stand apart/improve on the situation from R/pandas; having useful metadata integrated w/ a DataFrame could be really powerful when used in the right contexts. That said, I worry about some of the suggestions around metadata use because they start to become so fundamental or logic-driven. IMO, if some kind of data starts to become so critical we're changing how things are computed/etc. then it probably deserves a more structured solution than just a metadata entry in a DataFrame. IMO, metadata should be primarily "descriptive" about the object; give context, explain values and cardinality thereof; tweaking printing/showing seems fine to me. I just worry about packages starting to abuse My other thought is that while I agree that DataFrames can do a tight integration w/ metadata, I do think we should allow/encourage |
I have discussed:
with @visr in the context of geospatial data (temporal data is the same I think) and we came to the same conclusion. The logic in packages using tables should primarily be based either on the type or on a trait of a column (a trait is probably preferable as currently Julia does not allow for multiple inheritance), but not on metadata attached to it. So given this - are there any more comments on what the reference API should look like? |
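As a rough illustration of basing logic on a trait of a column rather than on its metadata, the following sketch uses the usual trait-dispatch pattern. All names here (`ColumnKind`, `Spatial`, `Plain`, `PointColumn`, `columnkind`, `describe`) are hypothetical and do not come from any existing package.

```julia
# Hypothetical trait types.
abstract type ColumnKind end
struct Spatial <: ColumnKind end
struct Plain <: ColumnKind end

# Hypothetical column type holding coordinate pairs.
struct PointColumn <: AbstractVector{Tuple{Float64,Float64}}
    data::Vector{Tuple{Float64,Float64}}
end
Base.size(c::PointColumn) = size(c.data)
Base.getindex(c::PointColumn, i::Int) = c.data[i]

# Trait function: every vector is Plain unless its type opts in to something else.
columnkind(::AbstractVector) = Plain()
columnkind(::PointColumn) = Spatial()

# Program logic dispatches on the trait, never on a metadata entry.
describe(col) = describe(columnkind(col), col)
describe(::Spatial, col) = "spatial column with $(length(col)) points"
describe(::Plain, col) = "plain column with $(length(col)) values"
```

With this layout, metadata can stay purely descriptive, while behaviour that depends on the kind of column is selected at compile time through dispatch.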
So I'm not sure what exactly the proposed API is? Is it just that
|
Actually I would prefer this idea as it would be much more composable. The consequence for DataFrames.jl users would be:
So: |
Ah yes that's interesting. Indeed it's quite convenient in R to be able to attach metadata to any object, and yet in Julia we don't want to have to wrap any object in a special type just to add metadata. Though losing the metadata on copy would be annoying. That could easily be fixed in DataFrames by ensuring we copy/readd the metadata when copying the columns (this would be needed for important cases like (Otherwise, returning an empty |
Personally I would feel safer if we worked this way. I would prefer to have a function that copies metadata explicitly that can be called if someone needs it. |
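A minimal sketch of what "copy does not propagate metadata, but an explicit function does" could look like. `MetaTable` and `copymetadata!` are hypothetical names used only for illustration; this is not DataFrames.jl's layout or API.

```julia
# Hypothetical table type with table-level metadata.
struct MetaTable
    columns::Dict{Symbol,Vector}
    meta::Dict{Symbol,Any}
end
MetaTable(columns) = MetaTable(columns, Dict{Symbol,Any}())

# Copying the data deliberately starts with empty metadata.
Base.copy(t::MetaTable) =
    MetaTable(Dict{Symbol,Vector}(k => copy(v) for (k, v) in t.columns))

# Hypothetical explicit helper: the caller decides when metadata is carried over.
function copymetadata!(dst::MetaTable, src::MetaTable)
    merge!(dst.meta, deepcopy(src.meta))
    return dst
end

t = MetaTable(Dict{Symbol,Vector}(:a => [1, 2]))
t.meta[:source] = "survey"
u = copy(t)            # data only, no metadata
copymetadata!(u, t)    # metadata copied on request
```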
There are a lot of packages that use the term "metadata" (e.g, ImageMetadata.jl, MetadataArrays.jl, MetaGraph.jl, FieldMetadata.jl, FieldProperties.jl, etc.). I don't think an interface like |
@Tokazama can you explain a little more why you think the |
It wouldn't carry any type information so if someone did use something like a |
Sorry, I'm still not following the concern. Why/where would type information be important? The discussion has revolved around
meta = metadata(x)
# see metadata keys
keys(meta)
# iterate over metadata key-value pairs
for (k, v) in meta
end
# check if a specific metadata key is present
haskey(meta, :specific_key)
So depending on whether We should probably require that the object returned be |
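For illustration, here is a sketch of a custom metadata container that supports exactly the operations used above without being a `Dict`. `FrozenMeta` is a hypothetical read-only type; subtyping `AbstractDict` and defining `length`, `iterate`, and `get` is enough for `keys`, `haskey`, indexing, and iteration to work through Base's generic fallbacks.

```julia
# Hypothetical read-only metadata container.
struct FrozenMeta <: AbstractDict{Symbol,Any}
    entries::Vector{Pair{Symbol,Any}}
end

Base.length(m::FrozenMeta) = length(m.entries)
Base.iterate(m::FrozenMeta, state...) = iterate(m.entries, state...)

# Linear search is fine for the handful of entries metadata usually has.
function Base.get(m::FrozenMeta, key, default)
    for (k, v) in m.entries
        isequal(k, key) && return v
    end
    return default
end

meta = FrozenMeta(Pair{Symbol,Any}[:source => "survey", :year => 2020])

# The same generic code as in the comment above works unchanged.
found = haskey(meta, :source)
names = collect(keys(meta))
```

Code written against this small dictionary-like surface does not need to know which concrete type a given table returns.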
Actually I prefer Apart from convenience it is a clear signal for the developers not to use metadata to encode program logic - Julia provides other means to do this efficiently. Metadata, as I think about it now (but my opinions evolve based on the comments we get here as the design here is not an easy decision) should be for lightweight things like descriptive strings or maybe some hints how output should be formatted (as working with |
I'm not against this being the case for specific implementations like what might be done in DataFrames but I don't think it should be the only option. |
I agree. You don't want too many interfaces relying on specifically named
I don't fully understand this. IMO metadata should be attached to a data frame and I agree with @nalimilan, it would be annoying to have metadata disappear with copying, consider something as simple as
This sort of global |
@pdeffebach, I think there's a lot more flexibility in the As I've played around with ideas/implementations, I just don't see a realistic way to make a system that is general enough to be widely used that relies on either wrapper objects or requiring metadata fields. It just doesn't scale. The doc system, however, is extremely rich and accomplishes its goal/job very well, IMO; attaching extra information to types, variables, fields, etc. |
@pdeffebach What kind of operations would be performed within Or maybe in your example you meant that assigning a new vector to an existing column via |
@bkamins That's a minor point I'd say. We should be consistent across Tables.jl, so better discuss
@Tokazama Note that in my proposal
@quinnj The reason is that we need to define an API to access column-level metadata. I agree something like Metadata.jl is enough if we decide that column-level metadata should be attached to vector objects themselves rather than stored in the table (my second proposal). But I think @pdeffebach had arguments against it. |
Having metadata be persistent across joins and reassignment is a crucial feature. If there was a Tables.jl level API for could make assurances about the persistence of metadata. @quinnj if you aren't too familiar with Stata, this is basically the model for the behavior I would like metadata to have. In Stata, it's all about persistence. My argument against having metadata attached to vector objects is that
I'm going to cc @matthieugomez here, since he is someone familiar with Stata who has probably also thought about this in Julia. |
Similar to |
@pdeffebach With
We could copy metadata in |
Let's say I have
with the metadata in a global What happens when the columns are copied inside the |
Yes, GC of this dict is an issue I think. |
Yes, using a global dict will certainly be slower and less memory-efficient than storing column-level metadata in the table (especially since in that case we can store metadata using a vector with one entry for each column, and use the data frame index to map names to positions, like at JuliaData/DataFrames.jl#1458). But I wonder whether it really matters in practice: if you need to copy the column vector anyway, copying the metadata should be cheap in comparison. Maybe we could also add a finalizer to column vectors when adding metadata, so that we can delete the entry from the global dict when the object is destroyed? @Tokazama Have you considered that? |
I don't think that's possible without type piracy. There is no unique type provided by "Metadata" that wraps an instance that is attached to global metadata when using If you want to do what @quinnj is suggesting, something like this should work...
function Metadata.global_metadata(tbl::MyTableType, column_name, module_name)
    return Metadata.global_metadata(getproperty(tbl, column_name), module_name)
end
...redirecting Similarly, you can do this if you expect your columns to wrap metadata.
Metadata.metadata(tbl::MyTableType, column_name) = metadata(getproperty(tbl, column_name), m)
Handling persistence of metadata without a wrapper type (i.e. global metadata) would just require actively using |
Adding a finalizer doesn't require having a special type AFAICT. You can just call The concern about performance is that in DataFrames |
Well, that's extremely good to know. I'm trying to add that now but it has the caveat that it only works with mutable structs. Any suggestions on getting this to work with something like |
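The finalizer idea discussed above can be sketched as follows. This is an assumption-laden illustration, not Metadata.jl's implementation: `GLOBAL_META`, `attach_metadata!`, and `global_metadata` are hypothetical names, and the store is keyed by `objectid` so that it does not keep the object alive.

```julia
# Hypothetical global store keyed by object identity.
const GLOBAL_META = Dict{UInt,Dict{Symbol,Any}}()

function attach_metadata!(x, meta::Dict{Symbol,Any})
    # `finalizer` only accepts mutable objects, which is the caveat mentioned above.
    ismutable(x) || throw(ArgumentError("finalizers require a mutable object"))
    id = objectid(x)
    GLOBAL_META[id] = meta
    # The closure captures only `id`, so the entry is removed once `x` is collected.
    finalizer(_ -> delete!(GLOBAL_META, id), x)
    return x
end

global_metadata(x) = get(GLOBAL_META, objectid(x), nothing)

v = [1, 2, 3]
attach_metadata!(v, Dict{Symbol,Any}(:unit => "cm"))
```

A finalizer can run whenever the garbage collector does, so a robust version would have to guard the shared dict against concurrent modification. Immutable values such as tuples or isbits structs cannot be handled this way at all.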
I think this discussion has surpassed my technical knowledge but as the co-author of JuliaData/DataFrames.jl#1458 (Milan made the design), I like its implementation. It's super transparent, I could make PRs to it, and users can understand it with a conceptual model. It's just a bit scary for me to have metadata be implemented by a global dict that is invisible to the user, but that could just be my lack of technical knowledge. |
I do not think in DataFrames.jl we have to use a default metadata mechanism - we can do whatever we like. That is why we are discussing it in DataAPI.jl as I would like first to agree on the API of getting metadata and, if possible for setting metadata (but this is less crucial as I believe different table types might provide custom mechanisms for setting the metadata). I think it is important to keep the API and the implementation separate, as otherwise we might run into problems in the future that might be hard to envision currently. Metadata.jl is very nice but it should be an opt-in I think, i.e. if some table type likes Metadata.jl it can start depending on it; but it should not be enforced. |
Yes, keeping API and implementation separate is usually a good thing. But a difficulty here is that if we don't add |
I think having The default implementations could be:
which would also cover the case of a default table-level metadata. Now In particular:
can use a completely different code path. The only problem to solve is if both vector and table define metadata for column which should take the precedence, but this should be solved at |
In my opinion, metadata only makes sense at the level of table. Arrays should not have metadata themselves. i.e.
|
I agree that it is also my use case. However, we should design a flexible system that would fit different use cases. I can imagine that people might want to attach metadata to anything in general (this is what Metadata.jl provides now). Note that in order to have column-level metadata you would have to opt-in for this (normal Recently we had a similar discussion related to |
This is definitely the way to go. I'm currently using this for graphs, tables, and arrays. Performance and storage needs are different for each of these, but it's nice to be able to use a predictable interface for accomplishing this. For example, you could do this in DataFrames.jl
struct DataFrameColumnMetadata{T<:AbstractDataFrame} <: AbstractDict{Symbol,Any}
    tbl::T
end
function metadata(x::DataFrameColumnMetadata, k)
    c = getcolumn(x.tbl, k)
    if !has_metadata(c)
        # indicates the metadata was not found without throwing an error or interfering
        # with metadata that may use `nothing` or `missing` as a meaningful value.
        return Metadata.no_metadata
    else
        return metadata(c)
    end
end
function metadata(tbl::AbstractDataFrame)
    if has_metadata(tbl)
        return metadata(tbl)
    else
        return DataFrameColumnMetadata(tbl)
    end
end |
But now every call to |
There's a lot of flexibility here so that this doesn't need to be decided here.
struct DataFrame <: AbstractDataFrame
    columns::Vector{AbstractVector}
    colindex::Index
end
Metadata.metadata(df::DataFrame) = Metadata.global_metadata(df, Main)
struct MetaDataFrame <: AbstractDataFrame
    columns::Vector{AbstractVector}
    colindex::Index
    metadata::Dict{Symbol,Any}
end
Metadata.metadata(df::MetaDataFrame) = getfield(df, :metadata) |
@Tokazama I don't understand how your last proposal stored column-level metadata. That's the main decision to make when designing a general API I think. Attaching metadata to the data frame itself is quite easy (either using Metadata.jl or a custom field in the struct). |
It wasn't intended to illustrate any more than that you could store metadata in an instance or global metadata. In reality you would want to ensure that the keys in the metadata correspond to columns (e.g., |
@Tokazama I just saw you recently removed support for global metadata in Metadata.jl (Tokazama/Metadata.jl@e88941c). AFAICT this means there's no way to attach metadata to arbitrary objects without wrapping them in a new type. Is that right? This would be unfortunate as it was one of the main features we discussed above. |
I can add it back in. I'm still ironing out some details before releasing the next version. The new ability to set variables in modules makes it easier to do this sort of thing without macros. |
We can usually correspond metadata to the data's self, values, indices/axes, or dimensions.
This can make the ever branching set of possibilities with metadata far more manageable.
function index_metadata(m, inds...)
    if should_drop_meta(m)
        return nothing
    else
        f = should_copy_meta(m) ? copy : identity
        if is_axesmeta(m)
            f(map(getindex, m, inds))
        elseif is_dimmeta(m)
            f(dropints(m, inds))
        elseif is_valmeta(m)
            f(m[inds...])
        else
            f(m)
        end
    end
end
There are certainly plenty of details that remain to make this into a robust generic interface, but I thought it might at least provide some helpful thoughts on how to proceed. |
I pulled out the globally stored metadata stuff into a new package and I'm registering it now JuliaRegistries/General#63519. |
It is often the case that one wants to attach metadata of some sort to an array/graph/etc. How do people feel about adding something basic like
metadata(x) = nothing
that can then be extended by other packages?
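A minimal sketch of how such a fallback could be extended. `LabeledVector` is a hypothetical downstream type; in a real setup the generic function would be owned by the API package and downstream packages would add methods to it rather than define their own function.

```julia
# Generic fallback: objects carry no metadata unless their type says otherwise.
metadata(x) = nothing

# Hypothetical downstream type that stores metadata in a field.
struct LabeledVector{T}
    data::Vector{T}
    meta::Dict{Symbol,Any}
end

# The downstream package extends the generic function for its own type.
metadata(v::LabeledVector) = v.meta

lv = LabeledVector([1.0, 2.0], Dict{Symbol,Any}(:unit => "m"))
```

Generic code can then call `metadata(x)` on anything and check for `nothing` before using the result.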