Proposed types for Scientific Metadata #984
Comments
Sounds good to me and should cover a good portion of our use cases. |
Hi @nitrosx, |
@jkotan unit support will be available for the following types:
Regarding your examples of scan and measurement datasets, could you please provide an example of each? I'm not sure I understand their differences and how relevant they are to allowing multi-dimensional quantities in the metadata. Regarding your last two points, my main concern (and the concern of most of the collaborators) is that by allowing n-dimensional quantities in the metadata, we will exhaust the available space in the document, and there will be a sprawl of time series in the metadata, or data slowly seeping into the metadata. |
Hello @nitrosx, for quantities which are constant over a measurement we store a single value:

```yaml
scientificMetadata:
  DOOR_proposalId: "99991173"
```

For number quantities which are almost constant we store an average, min, max and std:

```yaml
scientificMetadata:
  source_current:
    counts: 3
    max: 0.02578197419643402
    min: 0.025225341320037842
    std: 0.0002973251885217321
    unit: "mA"
    value: 0.02556405154367288
    valueSI: 0.00002556405154367288
    unitSI: "A"
```

However, important scan quantities which are different for each scan, e.g. the scan command, we need to store as a list of strings:

```yaml
scientificMetadata:
  ScanCommand:
    - "ascan exp_mot02 0.0 6.0 6 0.1"
    - "ascan exp_mot01 0.1 6.0 6 0.1"
    - "ascan exp_mot02 0.0 5.0 6 0.1"
```

Similarly, our users/beamline scientists have requested that we also aggregate numerical physical quantities which change from scan to scan, i.e. store them in a list with one value for each scan in a measurement. For such quantities an average is not useful. The number of scans in a measurement can differ, e.g. 1 or 1000. The aim is to reduce storage size, i.e. not to store duplicated metadata in a series of similar scans. Of course, all our solutions are still under discussion, so we don't know what the structure of our final production datasets will be. |
@jkotan thank you so much for the explanation and the examples. I think that a possible solution for you would be the following flow:
Producing the metadata for the measurement datasets, as you say in your post, implies that some metadata entries will go from a single value to a list or time series. At this point, I would start to ask myself whether the resulting list or time series is still just metadata or whether it has become data. If we apply this to your third example:

```yaml
scientificMetadata:
  ScanCommand:
    - "ascan exp_mot02 0.0 6.0 6 0.1"
    - "ascan exp_mot01 0.1 6.0 6 0.1"
    - "ascan exp_mot02 0.0 5.0 6 0.1"
```

the measurement dataset will have an additional data file containing the full list of scan commands, like:

```yaml
- scan: 1
  command: "ascan exp_mot02 0.0 6.0 6 0.1"
- scan: 2
  command: "ascan exp_mot01 0.1 6.0 6 0.1"
- scan: 3
  command: "ascan exp_mot02 0.0 5.0 6 0.1"
- ...
```

while in the metadata, we would insert a summary of those. The metadata fields of the summary will depend on what is important for the users when they are searching for such datasets:

```yaml
scientificMetadata:
  scan_command_main: "ascan"
  scan_command_motors: ["exp_mot01", "exp_mot02"]
  scan_command_parameter_1: 0.0
  scan_command_parameter_2_min: 5.0
  scan_command_parameter_2_max: 6.0
  scan_command_parameter_2_number_of_values: 2
  scan_command_parameter_3: 6
  scan_command_parameter_4: 0.1
``` |
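The summarization step in this flow is straightforward to sketch. The snippet below is a hypothetical helper (not SciCat code) that reduces the scan-command list above to a few of the searchable summary fields; the field names follow the example, and the positional parsing of the `ascan` command is an assumption.

```python
# Hypothetical sketch: reduce a list of scan commands to a few
# searchable metadata fields, as in the summary example above.
commands = [
    "ascan exp_mot02 0.0 6.0 6 0.1",
    "ascan exp_mot01 0.1 6.0 6 0.1",
    "ascan exp_mot02 0.0 5.0 6 0.1",
]

# Assumed command layout: <main> <motor> <p1> <p2> <p3> <p4>
parsed = [cmd.split() for cmd in commands]
p2_values = sorted({float(p[3]) for p in parsed})

summary = {
    "scan_command_main": parsed[0][0],
    "scan_command_motors": sorted({p[1] for p in parsed}),
    "scan_command_parameter_2_min": p2_values[0],
    "scan_command_parameter_2_max": p2_values[-1],
    "scan_command_parameter_2_number_of_values": len(p2_values),
}
```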
This seems like a particular schema that should be validated under #966 |
A couple of notes. As @sbliven pointed out in the regular developer meeting, a lot of what you're trying to accomplish is probably search and not storage/tagging. Second, Mongo is not a great engine for storing very large numerical data. Serializing large 3D data into JSON so that it can go into Mongo is very inefficient, and searching over it would likely be just as inefficient. Another point to note is that your definition of data type mixes the concepts of dimension, shape, and datatype. FWIW, there are widely used frameworks out there that already have conventions for this. If your users use scientific Python, take a look at NumPy and its arrays. dtype doesn't give you everything you've specified (definitely not ISO 8601), but it gives you a lot. We plan to use tiled for serving arrays and tables from source data, and it will sit next to SciCat. This does not address your search issue, however. |
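To illustrate the distinction between dimension, shape, and datatype mentioned above, here is a minimal NumPy sketch (illustrative only, not part of any SciCat API):

```python
import numpy as np

# NumPy separates the element type (dtype) from the array's
# dimensionality (ndim) and extent along each axis (shape).
a = np.zeros((3, 4), dtype=np.float64)

assert a.dtype == np.float64  # element type
assert a.ndim == 2            # number of dimensions
assert a.shape == (3, 4)      # extent along each axis
```

A metadata schema that wants to describe arrays would likewise need all three pieces of information, which is why conflating them in a single "type" field loses expressiveness.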
Could these be interesting as a way to delegate complex/custom search needs to Elasticsearch, and thus to the adopting facility? From what I understood from a quick read, one can create custom pipelines ("analyzers") in Elasticsearch which are executed before index creation. These analyzers sequentially apply a set of "filters", which can be user-defined. One application could be this issue, as the unit conversion on arrays and the sequential search could all be covered by Elasticsearch with a custom filter and analyzer. |
General principle
Allow me to repeat and expand on some general comments from the verbal discussion: I think we should commit to not enforcing any particular structure on the scientificMetadata field. The major features which depend on metadata structure are:
Any I missed?
Search
Getting back to the issue at hand, which I think relates only to the two search features. @nitrosx has a good summary of the data types/shapes we would like to search by, as well as some of the operators we want for each type. I think the next step is to look at available search technologies (loopback, Elasticsearch, GraphQL, etc.) and see whether they could support these. If so, they likely already have some preferred syntax for specifying the data types (e.g. JSON-LD). |
I spent some time reading about dimensionality in NumPy.
This helped me better clarify the |
PSI probably wants support for entries like in https://discovery.psi.ch/datasets/20.500.11935%2Fc5bce731-55fc-4c57-b049-6c32ad6601c4 |
Great to see "dummy" test data in our production instance 🙄 PSI certainly has use cases where we might include vectors. However I think it's an ongoing discussion whether this should be metadata in SciCat. @bpedersen2 How would you suggest querying this data? Is there a preexisting query language that would support vectors, ranges, quantities, etc? I tried following your code for SI quantities but didn't grasp how it gets integrated into search or the frontend. |
I would argue that vectors this long are not really metadata. They should be in data and have a summary property in the metadata. Something like:
|
currently this is not supported. How searching currently works:
FE: There are fixed terms defined for the possible relations (https://github.com/SciCatProject/frontend/blob/3e0aee212c953c56511d46307ccf30862d06162f/src/app/state-management/models/index.ts#L81). These are passed together with a field spec (lhs) and the user-supplied value (rhs) to the BE.
BE: These strings are used in a switch to generate a suitable Mongo query.
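That switch can be sketched roughly as follows. This is a hypothetical illustration, not the actual backend code: the relation names and the `scientificMetadata.<field>.value` path are assumptions based on the examples in this thread.

```python
# Hypothetical sketch: map a (field, relation, value) triple from the
# frontend onto a MongoDB query fragment. Relation names illustrative.
RELATION_TO_MONGO = {
    "GREATER_THAN": "$gt",
    "LESS_THAN": "$lt",
    "EQUAL_TO": "$eq",
}

def build_query(field, relation, value):
    path = f"scientificMetadata.{field}.value"
    return {path: {RELATION_TO_MONGO[relation]: value}}

q = build_query("source_current", "GREATER_THAN", 0.02)
# q == {"scientificMetadata.source_current.value": {"$gt": 0.02}}
```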
This issue is an attempt to summarize the issues and discussions following PR #925 and issues #924 and #940.
In the biweekly meeting, the community expressed support for having multi-dimensional quantities in the metadata, although there is concern that, with no limit on the dimensionality of the array, a user could save an entire time series as a metadata value.
To address the use case presented in #924 while avoiding lengthy time series in metadata, I proposed allowing the following metadata types (italics indicate new types):
Search on the types number, string, and quantity will continue to behave as it currently does.
I proposed the following search syntax and behavior for the new types:
It is expanded to an or of the same condition over all the elements of the quantity.
It will be true if the condition is true for at least one of the elements of the quantity.
Example:
This applies the condition directly to a single element of the quantity.
Example:
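The "or over all elements" expansion described above can be sketched in a few lines. This is only an illustration of the proposed semantics, not an implementation; the function name and data are made up.

```python
# Illustrative semantics: a condition on an array-valued quantity
# matches if at least one element satisfies it ("or" over elements).
def matches_any(elements, condition):
    return any(condition(e) for e in elements)

values_mA = [0.0252, 0.0256, 0.0258]

assert matches_any(values_mA, lambda v: v > 0.0257)   # one element qualifies
assert not matches_any(values_mA, lambda v: v > 0.03) # no element qualifies
```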
All the date/time operators will be supported.
Does the range r contain the value v?
Example:
Is range r greater than v?
Example:
Is range r less than v?
Example:
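The three range operators above could be read as follows. This is a sketch under assumed semantics ("greater than" meaning the whole range lies above v, and symmetrically for "less than"); the actual operator definitions are part of the proposal under discussion.

```python
# Assumed semantics for a range r = (lo, hi):
#   contains v:      lo <= v <= hi
#   greater than v:  the whole range lies above v  (lo > v)
#   less than v:     the whole range lies below v  (hi < v)
def range_contains(r, v):
    lo, hi = r
    return lo <= v <= hi

def range_gt(r, v):
    return r[0] > v

def range_lt(r, v):
    return r[1] < v

r = (5.0, 6.0)
assert range_contains(r, 5.5)
assert range_gt(r, 4.0)
assert not range_lt(r, 5.5)
```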
All the relevant tests must be added to the BE and FE.
This is just a proposal. Please comment below and elaborate further the ideas.