
Proposed types for Scientific Metadata #984

Open
nitrosx opened this issue Jan 2, 2024 · 14 comments
nitrosx commented Jan 2, 2024

This issue is an attempt to summarize the issues and discussions following PR #925 and issues #924 and #940.

In the biweekly meeting, the community expressed support for having multi-dimensional quantities in the metadata, although there is concern that, with no limit on the dimensionality of the array, a user could save an entire time series as a metadata value.

To address the use case presented in #924 while avoiding lengthy time series in metadata, I propose to allow the following metadata types (italics indicate new types):

  • number: plain unit-less number
  • string: plain string
  • [1d] quantity: a one-dimensional quantity with associated unit. Example: length of 1m
  • 2d quantity: a two-dimensional quantity with associated unit. Example: a point on a plane, [1,2]m
  • 3d quantity: a three-dimensional quantity with associated unit. Example: a point in space, [1,2,3]m
  • quantity range: two values with an associated unit defining an interval. Example: a frequency range like [1,10]Hz
  • datetime: ISO 8601 string indicating a date and time. Example: 2024-01-02T17:06:04+07:00
  • datetime range: 2 ISO 8601 strings indicating a date and time range. Example: [2024-01-02T13:06:04+07:00, 2024-01-03T19:06:04+07:00]
  • date: ISO 8601 string indicating a date. Example: 2024-01-02
  • date range: 2 ISO 8601 strings indicating a date range. Example: [2024-01-01, 2024-01-03]
  • time: ISO 8601 string indicating a time. Example: T17:06:04+07:00
  • time range: 2 ISO 8601 strings indicating a time range. Example: [T13:06:04+07:00, T19:06:04+07:00]
  • object: container for nested metadata fields.
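
As a concrete illustration, a scientificMetadata document using these types might look like the sketch below. The encoding with "type", "value", and "unit" keys is a hypothetical assumption for illustration, not a settled schema:

```python
# Hypothetical encoding of the proposed metadata types; the "type"/"value"/
# "unit" field names are illustrative assumptions, not a confirmed schema.
scientific_metadata = {
    "temperature": {"type": "quantity", "value": [293.15], "unit": "K"},
    "position_2d": {"type": "2d quantity", "value": [1, 2], "unit": "m"},
    "position_3d": {"type": "3d quantity", "value": [1, 2, 3], "unit": "m"},
    "frequency_range": {"type": "quantity range", "value": [1, 10], "unit": "Hz"},
    "acquired_at": {"type": "datetime", "value": "2024-01-02T17:06:04+07:00"},
    "acquisition_window": {
        "type": "datetime range",
        "value": ["2024-01-02T13:06:04+07:00", "2024-01-03T19:06:04+07:00"],
    },
    "sample": {  # "object": container for nested metadata fields
        "type": "object",
        "value": {"mass": {"type": "quantity", "value": [1.5], "unit": "g"}},
    },
}

# The dimensionality of a quantity is just the length of its value list.
assert len(scientific_metadata["position_3d"]["value"]) == 3
```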

Searches on the types number, string, and quantity will continue to behave as they do today.
I propose the following search syntax and behavior for the new types:

  • 2d quantity and 3d quantity:
    • q op v.
      The condition is expanded into an or of the same condition applied to each element of the quantity.
      It is true if it holds for at least one element.
      Example:
      • Value: position = [1,2,3]m
      • Condition: position = 3m
      • Expansion: position[0] = 3m or position[1] = 3m or position[2] = 3m (False or False or True)
      • Result: True
    • q[i] op v
      This applies the condition directly to a single element of the quantity.
      Example:
      • Value: position = [1,2,3]m
      • Condition: position[2] = 3m
      • Expansion: position[2] = 3m (True)
      • Result: True
  • datetime, date, and time
    All the date/time operators will be supported.
  • range (all range types)
    • r = v, v in r
      Does the range r contain the value v?
      Example:
      • Value: frequency = [1,10]Hz
      • Condition: frequency = 3Hz
      • Expansion: frequency[0]<3Hz and frequency[1]>3Hz (True and True)
      • Result: True
    • r > v
      Is the range r greater than v?
      Example:
      • Value: frequency = [1,10]Hz
      • Condition: frequency > 0.5Hz
      • Expansion: frequency[0]>0.5Hz (True)
      • Result: True
    • r < v
      Is the range r less than v?
      Example:
      • Value: frequency = [1,10]Hz
      • Condition: frequency < 11Hz
      • Expansion: frequency[1]<11Hz (True)
      • Result: True
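
For clarity, the expansions above can be sketched in Python roughly as follows. This is a sketch only; it assumes units have already been normalized before the numeric comparison:

```python
# Sketch of the proposed search expansions; unit normalization is assumed
# to have happened before the values are compared.
OPS = {"=": lambda a, b: a == b, ">": lambda a, b: a > b, "<": lambda a, b: a < b}

def match_quantity(elements, op, v):
    """q op v: expand to an `or` over the elements; true if any element matches."""
    return any(OPS[op](e, v) for e in elements)

def match_range(bounds, op, v):
    """Range semantics: '=' means containment; '>' and '<' compare the bounds."""
    lo, hi = bounds
    if op == "=":
        return lo < v and hi > v  # strict bounds, as in the expansion above
    if op == ">":
        return lo > v  # the whole range lies above v
    if op == "<":
        return hi < v  # the whole range lies below v
    raise ValueError(f"unsupported operator: {op}")

# Examples from the proposal (units elided):
assert match_quantity([1, 2, 3], "=", 3)  # position = 3m
assert match_range((1, 10), "=", 3)       # frequency = 3Hz
assert match_range((1, 10), ">", 0.5)     # frequency > 0.5Hz
assert match_range((1, 10), "<", 11)      # frequency < 11Hz
```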

All the relevant tests must be added to the BE and FE.

This is just a proposal. Please comment below and elaborate further on these ideas.

@nitrosx nitrosx added enhancement New feature or request feature New feature labels Jan 2, 2024
@nitrosx nitrosx changed the title from "Proposed type of Scientific Metadata" to "Proposed types for Scientific Metadata" Jan 2, 2024
bpedersen2 commented:
Sounds good to me and should cover a good portion of our use cases.

jkotan commented Jan 8, 2024

Hi @nitrosx,
I'm not sure I understand the idea. Would you like to add unit support only for the types mentioned above, or to forbid all other types in scientific metadata (I hope not)?
At DESY we have two types of datasets: scan datasets and measurement datasets. The latter group metadata from the scans. In the measurement metadata we often use lists to aggregate metadata from different scans, e.g. a list of ScanCommand or inputDatasets (in the case where measurement datasets are raw-like).
We have also discussed with our beamline scientists how to store a 3x3 hkl matrix. Currently we store it encoded in a string, but it is often stored as a list of 9 numbers.
I could also imagine that someone would like to store 4-vectors, which are much more natural in theoretical or particle physics.

nitrosx commented Jan 8, 2024

@jkotan unit support will be available for the following types:

  • quantity
  • 2d quantity
  • 3d quantity
  • quantity range

Regarding your examples of scan and measurement datasets, could you please provide an example of each? I'm not sure I understand their differences, or how relevant they are to allowing multi-dimensional quantities in the metadata.

Regarding your last two points, my main concern (and the concern of most of the collaborators) is that allowing n-dimensional quantities in the metadata could exhaust the available space in the document and lead to a sprawl of time series in the metadata, with data slowly seeping into the metadata.
I will be happy to discuss!

jkotan commented Jan 9, 2024

Hello @nitrosx,
At DESY we started by creating a dataset for each scan (a scan dataset). However, we perform a lot of scans, so our IT reported that our scan datasets use a lot of DB resources. Therefore, we have started to consider creating a dataset not for a single scan but for a group of related scans, i.e. some kind of measurement, e.g. a 'calibration'.
For such a measurement, we create a measurement dataset where we group the most important scientificMetadata from our scans. For string quantities which are constant, it is easy:

scientificMetadata:
   DOOR_proposalId: "99991173"

For number quantities which are almost constant, we store the average, min, max, and std.

scientificMetadata:
  source_current: 
    counts: 3
    max: 0.02578197419643402
    min: 0.025225341320037842
    std: 0.0002973251885217321
    unit: "mA"
    value: 0.02556405154367288
    valueSI: 0.00002556405154367288
    unitSI: "A"

However, important scan quantities that differ for each scan, e.g. the scan command, need to be stored as a list of strings.

scientificMetadata:
  ScanCommand: 
    - "ascan exp_mot02 0.0 6.0 6 0.1"
    - "ascan exp_mot01 0.1 6.0 6 0.1"
    - "ascan exp_mot02 0.0 5.0 6 0.1"

Similarly, our users/beamline scientists have requested that we also aggregate numerical physical quantities which change from scan to scan, i.e. store them in a list with one value for each scan in a measurement. For such quantities an average is not useful, e.g. for some motor positions, whose values follow neither a Poisson nor a Gaussian distribution.

The number of scans per measurement can vary, e.g. from 1 to 1000. The aim is to reduce storage size, i.e. to avoid storing duplicated metadata across a series of similar scans.

Of course, all our solutions are still under discussion, so we don't yet know what the structure of our final production datasets will be.

nitrosx commented Jan 9, 2024

@jkotan thank you so much for the explanation and the examples.

I think that a possible solution for you would be the following flow:

  • create a raw dataset for each acquisition
  • post-process the raw datasets and create a collection dataset (which you call a measurement dataset)
  • combine all the metadata from the raw datasets grouped in the measurement datasets (some massaging needs to happen)
  • reduce metadata in raw dataset
  • add generated metadata to measurement dataset

Producing the metadata for the measurement datasets, as you say in your post, implies that some metadata entries will go from a single value to a list or time series. At this point, I would start to ask myself whether the resulting list or time series is still metadata, or whether it has become data.
One possible solution is for the resulting measurement dataset to have an additional data file with the list / time series, while in the metadata we insert an entry with the summary information, like min, max, and the number of values or the delta.

If we apply this to your third example:

scientificMetadata:
  ScanCommand: 
    - "ascan exp_mot02 0.0 6.0 6 0.1"
    - "ascan exp_mot01 0.1 6.0 6 0.1"
    - "ascan exp_mot02 0.0 5.0 6 0.1"

the measurement dataset will have an additional data file containing the full list of scan commands, like:

- scan: 1
  command: "ascan exp_mot02 0.0 6.0 6 0.1"
- scan: 2
  command: "ascan exp_mot01 0.1 6.0 6 0.1"
- scan: 3
  command: "ascan exp_mot02 0.0 5.0 6 0.1"
- ...

while in the metadata, we would insert a summary of those. The metadata fields of the summary will depend on what is important to users when they search for such datasets.
Possible metadata entries could be:

scientificMetadata:
  scan_command_main: "ascan"
  scan_command_motors: ["exp_mot01", "exp_mot02"]
  scan_command_parameter_1: 0.0
  scan_command_parameter_2_min: 5.0
  scan_command_parameter_2_max: 6.0
  scan_command_parameter_2_number_of_values: 2
  scan_command_parameter_3: 6
  scan_command_parameter_4: 0.1
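
One way such a summary could be derived is sketched below. It is purely illustrative: the parsing assumes the fixed `ascan <motor> <start> <stop> <points> <time>` shape of the example commands, and the field names follow the hypothetical entries above:

```python
# Illustrative summarization of scan commands into searchable metadata fields,
# assuming the fixed "ascan <motor> <start> <stop> <points> <time>" shape.
def summarize_scan_commands(commands):
    parsed = [c.split() for c in commands]
    mains = {p[0] for p in parsed}
    motors = sorted({p[1] for p in parsed})
    # parameter 2 is the <stop> argument (index 3 in the split command)
    param2 = [float(p[3]) for p in parsed]
    return {
        "scan_command_main": mains.pop() if len(mains) == 1 else sorted(mains),
        "scan_command_motors": motors,
        "scan_command_parameter_2_min": min(param2),
        "scan_command_parameter_2_max": max(param2),
        "scan_command_parameter_2_number_of_values": len(set(param2)),
    }

summary = summarize_scan_commands([
    "ascan exp_mot02 0.0 6.0 6 0.1",
    "ascan exp_mot01 0.1 6.0 6 0.1",
    "ascan exp_mot02 0.0 5.0 6 0.1",
])
assert summary["scan_command_main"] == "ascan"
assert summary["scan_command_motors"] == ["exp_mot01", "exp_mot02"]
```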

sbliven commented Jan 9, 2024

This seems like a particular schema that should be validated under #966

dylanmcreynolds commented:
A couple of notes. As @sbliven pointed out in the regular developer meeting, a lot of what you're trying to accomplish is probably search, not storage/tagging.

Second, Mongo is not a great engine for storing very large numerical data. Serializing large 3d data into JSON so that it can go into Mongo is very inefficient, and searching it would likely be just as inefficient.

Another point to note is that your definition of data type mixes the concepts of dimension, shape, and datatype. FWIW, there are widely used frameworks out there that already have conventions for this. If your users work in scientific Python, take a look at NumPy and its arrays. dtype doesn't give you everything you've specified (definitely not ISO 8601), but it gives you a lot.

We plan to use tiled for serving arrays and tables from source data, and it will sit next to SciCat. This does not address your search issue, however.
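
For reference, NumPy keeps the three concepts separate as array attributes (a small illustration, assuming NumPy is available):

```python
import numpy as np

# NumPy separates the three concepts mixed in the proposed type list:
a = np.array([1.0, 2.0, 3.0])      # a "3d quantity" in the proposal's sense
assert a.ndim == 1                 # dimension: one axis
assert a.shape == (3,)             # shape: three elements along that axis
assert a.dtype == np.float64       # datatype: 64-bit float

m = np.arange(9).reshape(3, 3)     # e.g. a 3x3 hkl matrix
assert m.ndim == 2 and m.shape == (3, 3)
```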

minottic commented:
Could these be interesting as a way to delegate complex/custom search needs to Elasticsearch, and thus to the adopting facility?

From what I understood from a quick read, one can create custom Elasticsearch pipelines ("analysers") which are executed before index creation. These analysers sequentially apply a set of "filters", which can be user-defined. One application could be this issue, since the unit conversion on arrays and the sequential search could all be covered by Elasticsearch with a custom filter and analyser.

sbliven commented Jan 10, 2024

General principle

Allow me to repeat and expand on some general comments from the verbal discussion:

I think we should commit to not enforcing any particular structure on the scientificMetadata for a generic SciCat instance. Instead, we should think of features depending on metadata structure as "progressive enhancements," where SciCat can provide additional functionality for datasets that do follow some standard structure. I would suggest organizing our issues not around the metadata structure but rather around what features we would like to implement.

The major features which depend on metadata structure are:

  • Searching
    1. Unit-aware searching. Any fields consisting of value+unit attributes are converted to standard SI units before applying search operations.
    2. Searches of non-scalar data. Filter array values by min/max/contains operators. Search for ranges spanning a value ([idea] Handling of scan vars in scientific metadata #940).
  • Visualization
    1. Tabular UI for scientificMetadata. Display flat scientificMetadata as a table of key/value pairs
    2. Customizing the scientificMetadata UI. For instance, human readable keys (Metadata keys incongruency #939) or nice formatting of quantity units.
  • Validation
    1. Enforcement of site-specific schemas. Validate scientificMetadata against one or more schemas, which may be configured based on local conventions, certain data types, etc (Validation of scientificMetadata #966)

Any I missed?

Search

Getting back to the issue at hand, which I think relates only to the two search features. @nitrosx has a good summary of the data types/shapes we would like to search by, as well as some of the operators we want for each type. I think the next step is to look at the available search technologies (LoopBack, Elasticsearch, GraphQL, etc.) and see whether they could support these. If so, then they likely already have some preferred syntax for specifying the data types (e.g. JSON-LD).

nitrosx commented Jan 19, 2024

I spent some time reading about dimensionality in NumPy.
Here are the two resources that I read:

This helped me better clarify the quantity case reported in the original post above.
I would like to clarify what I meant by _x_d-quantity in metadata.
The goal is to allow users to create metadata entries of type quantity (aka a measurement with units) with dimensionality 1 and size 1, 2, or 3. Translated into NumPy terminology, a quantity is an array with one dimension and an allowed size of 1, 2, or 3.
This will require specifying the query syntax for quantity. I agree with @sbliven that we should do some research to find out whether there is a best practice or standard, and adopt it.
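
In NumPy terms, the proposed constraint could be checked like this (the helper name is hypothetical):

```python
import numpy as np

# Sketch of the proposed constraint: a quantity is a 1-dimensional
# array of size 1, 2, or 3 (the function name is hypothetical).
def is_valid_quantity(value):
    arr = np.asarray(value)
    return arr.ndim == 1 and arr.size in (1, 2, 3)

assert is_valid_quantity([1.0])                 # scalar quantity, e.g. 1 m
assert is_valid_quantity([1, 2, 3])             # 3d quantity, e.g. a point in space
assert not is_valid_quantity(5)                 # 0-d scalar: ndim == 0
assert not is_valid_quantity([1, 2, 3, 4])      # size 4: too long
assert not is_valid_quantity([[1, 2], [3, 4]])  # 2 axes: a matrix, not a quantity
```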

bpedersen2 commented:
PSI probably wants support for entries like in https://discovery.psi.ch/datasets/20.500.11935%2Fc5bce731-55fc-4c57-b049-6c32ad6601c4

sbliven commented Jan 23, 2024

PSI probably wants support for entries like in https://discovery.psi.ch/datasets/20.500.11935%2Fc5bce731-55fc-4c57-b049-6c32ad6601c4

Great to see "dummy" test data in our production instance 🙄

PSI certainly has use cases where we might include vectors. However, I think it's an ongoing discussion whether this should be metadata in SciCat.

@bpedersen2 How would you suggest querying this data? Is there a preexisting query language that would support vectors, ranges, quantities, etc? I tried following your code for SI quantities but didn't grasp how it gets integrated into search or the frontend.

nitrosx commented Feb 21, 2024

I would argue that vectors this long are not really metadata. They should live in the data, with a summary property in the metadata. Something like:

  • length/dimensions/size
  • average value
  • other aggregated values

bpedersen2 commented Feb 21, 2024

@bpedersen2 How would you suggest querying this data? Is there a preexisting query language that would support vectors, ranges, quantities, etc? I tried following your code for SI quantities but didn't grasp how it gets integrated into search or the frontend.

Currently, this is not supported.

How searching currently works:

FE: There are fixed terms defined for possible relations: (https://github.com/SciCatProject/frontend/blob/3e0aee212c953c56511d46307ccf30862d06162f/src/app/state-management/models/index.ts#L81)

type ScientificConditionRelation =
  | "EQUAL_TO_NUMERIC"
  | "EQUAL_TO_STRING"
  | "GREATER_THAN"
  | "LESS_THAN";

These are passed, together with a field spec (lhs) and the user-supplied value (rhs), to the BE.

BE: These strings are used in a switch to generate a suitable Mongo query.
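
The switch can be sketched as follows. This is a simplified Python sketch of the described behavior only; the actual BE is TypeScript, and the Mongo field path used here is an assumption for illustration:

```python
# Simplified sketch of mapping a relation string to a Mongo filter document.
# The real BE is TypeScript; the "scientificMetadata.<field>.value" path is
# an illustrative assumption.
def build_condition(lhs, relation, rhs):
    field = f"scientificMetadata.{lhs}.value"
    if relation == "EQUAL_TO_NUMERIC":
        return {field: {"$eq": float(rhs)}}
    if relation == "EQUAL_TO_STRING":
        return {field: {"$eq": str(rhs)}}
    if relation == "GREATER_THAN":
        return {field: {"$gt": float(rhs)}}
    if relation == "LESS_THAN":
        return {field: {"$lt": float(rhs)}}
    raise ValueError(f"unknown relation: {relation}")

assert build_condition("frequency", "GREATER_THAN", "0.5") == {
    "scientificMetadata.frequency.value": {"$gt": 0.5}
}
```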
