How to search through NWB metadata (for the IBL dataset) #100

GaelleChapuis · 2023-07-03T12:56:49Z

Hello,

When working with the IBL dataset, there are a couple of queries we routinely run via ONE (typically using one.alyx.rest) to search for datasets according to their metadata, and I was wondering how to do the same using NWB. Our dandiset is: https://dandiarchive.org/dandiset/000409

For example:

If I wanted to get all the sessions acquired in a given lab, how can I do it (and I mean that without looking at the folder naming)?
If I wanted to know all the insertions acquired at X planned coordinates (e.g. what we call the “repeated site”), how can I get the session folder + know which ephys data to use within it?
How can I get sessions associated with a particular publication tag (there are multiple ones already, associated with different papers)?

Thank you for your help!

bendichter · 2023-07-03T13:09:38Z

bendichter
Jul 3, 2023
Maintainer

@GaelleChapuis These types of detailed, metadata-based, asset-level searches can be done using the Python API. Some of this metadata is already extracted into asset-level metadata, and the rest will require opening the NWB file. See here for existing examples. I'll see if I can put together scripts that do the types of queries you are asking for.

1 reply

GaelleChapuis Jul 3, 2023
Author

Thank you! If you are going to write a script, I have one more example question:

How can I find sessions that have a particular dataset, for example the raw data for the left camera?

I will also look through the documentation; thank you for the pointer.

bendichter · 2023-07-03T15:02:49Z

bendichter
Jul 3, 2023
Maintainer

In many cases you can query the asset-level metadata through the DANDI API (i.e. the result of metadata = asset.get_raw_metadata()). In this case, however, many of the attributes you are looking for live inside the NWB file and have not been extracted, so you'll need to open each file in streaming mode. This query will take a while, partly because the IBL dataset is an outlier that contains many sessions, and partly because the data you are asking for has not been extracted into the asset metadata, where it could be accessed much more quickly. Thanks for these example queries: they give us a good target to optimize for, and we will certainly use them as user needs as we improve our asset-level search functionality. For now, here's how you would do it, ideally on DANDI Hub for speed and a pre-configured environment:

from dandi.dandiapi import DandiAPIClient
from tqdm.notebook import tqdm
from pynwb import NWBHDF5IO
import h5py
import fsspec

# HTTP filesystem used to stream files from S3 without downloading them
fs = fsspec.filesystem("http")


def parse_metadata(path, s3_url):
    """Open an NWB file in streaming mode and parse the desired metadata."""
    with fs.open(s3_url, "rb") as f:
        with h5py.File(f, "r") as file:
            with NWBHDF5IO(file=file, mode="r", load_namespaces=True) as io:
                nwbfile = io.read()

                # Sessions without a left camera video yield None instead of a KeyError
                left_video = nwbfile.acquisition.get("OriginalVideoLeftCamera")

                return dict(
                    path=path,
                    institution=nwbfile.institution,
                    lab=nwbfile.lab,
                    left_video_path=left_video.external_file[0] if left_video is not None else None,
                    related_publications=nwbfile.related_publications,
                )


# iterate over all assets. If it is an NWB file, run `parse_metadata` and accumulate results
client = DandiAPIClient()
dandiset = client.get_dandiset("000409")
assets = list(dandiset.get_assets())

results = []
for asset in tqdm(assets[:20]):
    metadata = asset.get_raw_metadata()
    if metadata["encodingFormat"] == "application/x-nwb":
        # the second contentUrl entry is taken to be the direct S3 URL
        s3_url = metadata["contentUrl"][1]
        # pass the asset path explicitly rather than relying on a global variable
        results.append(parse_metadata(metadata["path"], s3_url))

results

I've added the [:20] slice so you can see it run in a reasonable amount of time; remove it to query the entire dandiset. I am not sure whether we stored information about the insertion coordinates. @CodyCBakerPhD, did we cover this metadata in the conversion?
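
Once the loop has finished, the accumulated `results` list can be filtered in plain Python to answer the lab, publication, and left-camera questions. The following is only a sketch: it assumes each entry has the keys returned by `parse_metadata` above, the sample records (paths, lab names, DOIs) are invented so that it runs offline, and the helper names are hypothetical.

```python
# Sample records shaped like the output of parse_metadata (all values invented).
results = [
    dict(
        path="sub-A/sub-A_ses-1.nwb",
        institution="Institution 1",
        lab="lab_one",
        left_video_path="video_left_1.mp4",
        related_publications=("https://doi.org/10.0000/example-1",),
    ),
    dict(
        path="sub-B/sub-B_ses-2.nwb",
        institution="Institution 2",
        lab="lab_two",
        left_video_path=None,
        related_publications=("https://doi.org/10.0000/example-2",),
    ),
]


def sessions_in_lab(records, lab):
    """Return the asset paths of all sessions recorded in the given lab."""
    return [r["path"] for r in records if r["lab"] == lab]


def sessions_citing(records, doi_fragment):
    """Return paths of sessions whose related publications contain doi_fragment."""
    return [
        r["path"]
        for r in records
        if any(doi_fragment in pub for pub in (r["related_publications"] or ()))
    ]


def sessions_with_left_video(records):
    """Return paths of sessions that have a left camera video entry."""
    return [r["path"] for r in records if r.get("left_video_path")]


lab_one_sessions = sessions_in_lab(results, "lab_one")
```

Because the filtering happens after the slow streaming step, the same `results` list can be reused for several questions without reopening any file.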

CodyCBakerPhD · 2023-07-05T06:06:44Z

@GaelleChapuis Thanks for getting in touch! As you look over the NWB versions of the IBL data, let me know of any other little details like this that I can look into including, modifying, or otherwise fixing.

If I wanted to know all the insertions acquired at X planned coordinates (e.g. what we call the “repeated site”), how can I get the session folder + know which ephys data to use within it?

@CodyCBakerPhD, did we cover this metadata in the conversion?

Note that the 'repeated site' experiment has not been specifically converted yet - the previous conversion was focused specifically on the brain wide map, though as memory serves there were a fair number of sessions that overlapped the two

Though also, as I recall from my notes on the matter, there may be some additional variability in the trial structures for the sessions outside of the BWM which I'd need to dig deeper into in order to resolve the mapping properly

Anyway, the field I believe you're asking about is the 'trajectory_estimate' property of a particular probe ID (pid) in a given session (eid), correct?

I did not include that information in the first round of NWB mapping since all electrodes/sorted units have precise CCF (and the other two atlases as well) coordinates, which seemed to be more informative overall

But if you're saying this is intended to be used as a summary value, search field, or other high level filter of session metadata, then I'd be happy to take a look at including it when I do a reconversion to include the passive data (which now appears to have been released, is that correct?)

How can I get sessions associated with a particular publication tag (there are multiple ones already, associated with different papers)?

I will note that an eventual goal to help manage this navigation specific to IBL, though not yet achieved, was to have separate DANDI sets corresponding to each major segment of the data release (behavior - brain wide map - repeated site - spike sorting benchmark)

The plan was also to link these via DANDI's 'associated projects' metadata feature

GaelleChapuis · 2023-08-14T11:44:56Z

GaelleChapuis
Aug 14, 2023
Author

Hello, thank you both for your answers, and sorry for the late reply - I was away.

To get a feel for the user experience, I installed the DANDI API on my machine and ran the lines of code you provided above, @bendichter, adding a break so that it stops after the first matching asset.
This already takes a while, i.e. several minutes. I don't think we can rely on this for users to make their searches.
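
The early exit described above can be sketched as follows. This is a minimal offline illustration, not the exact code that was run: `first_match`, `fake_parse`, and the sample records are hypothetical stand-ins for the DANDI assets and the parser that streams each NWB file.

```python
def first_match(assets, parse, predicate):
    """Return the parsed metadata of the first asset satisfying predicate, else None.

    `assets` is any iterable, `parse` turns one asset into a dict of metadata
    (the slow step when it streams an NWB file), and `predicate` tests that dict.
    """
    for asset in assets:
        record = parse(asset)
        if predicate(record):
            return record  # stop here: the remaining assets are never opened
    return None


# Stand-in data so the sketch runs offline (values invented).
fake_assets = [{"lab": "lab_one"}, {"lab": "lab_two"}, {"lab": "lab_two"}]
opened = []


def fake_parse(asset):
    """Pretend to open an asset, recording that it was touched."""
    opened.append(asset)
    return asset


hit = first_match(fake_assets, fake_parse, lambda r: r["lab"] == "lab_two")
```

Stopping at the first match avoids opening the remaining files, but every file visited before the match still has to be streamed.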

Here is what is returned from the metadata:

How feasible is it to add fields that we know external users will be looking for?
For example, users will surely want to search for recordings made in a given brain region, recordings that have a certain kind of dataset (left video), or recordings associated with a release tag. We can provide a list of important search queries if helpful? Also, we could help modify these metadata files accordingly (tagging Olivier Winter @oliche here).

To answer Cody:
Anyway, the field I believe you're asking about is the 'trajectory_estimate' property of a particular probe ID (pid) in a given session (eid), correct?
Yes, but this might not be the metadata most needed externally; it might be worth first putting effort into the metadata we assume most external users will search by.

GaelleChapuis · 2023-08-14T12:00:17Z

GaelleChapuis
Aug 14, 2023
Author

Follow-up question: I cannot seem to find the equivalent of the release tag, for example the tag 2022_Q4_IBL_et_al_BWM that we have associated with each dataset of the Brainwide map release.
The related_publications field in the example above returns the actual publication DOI. Where could I find this information?

Thank you for your help!

1 reply

CodyCBakerPhD Aug 14, 2023

The BWM part of the release tag is the DANDI set ID itself

Getting something closer to a Q4/Q3 tag on top of that would be a matter of including it in the release notes of publications and re-publications, and it would be controlled/specified by the version of the DANDI set (usage across the board defaults to the most recently published version)

The first publication of the DANDI set is awaiting approval from your team w.r.t. the new 'processed only' files, with follow-up cleaning of the attached raw companions (see email chain for specifics)

CodyCBakerPhD · 2023-08-14T17:16:25Z

We can provide a list of important search queries if helpful?

Yes, that would be helpful

GaelleChapuis · 2023-08-17T12:36:03Z

GaelleChapuis
Aug 17, 2023
Author

Hello,

It is important to distinguish between searching for a "session" and searching for an "insertion".
Some information is specific to an insertion (such as the example query on brain location below).
Other information pertains to a session (and therefore propagates to all insertions within it), such as the task protocol. Nonetheless, it can be useful to query insertions by an attribute of their session (the task protocol example is given below, but you can well imagine wanting to know quickly how many insertions are available for a subject, etc.).
The release_tag is a special case, as it pertains to a dataset. Here the search on insertions is important, as each paper uses only certain insertions (strictly speaking, certain datasets). For example, the set of Reproducible Site insertions (N=~70) is a subset of the Brainwide map, and we need to identify which ones were used for a particular publication.

I hope this makes sense, let me know if not.

Here are useful queries to begin with; for each one, the ONE documentation is linked below:

Search sessions with a given:

dataset, date_range, laboratory, project, subject, task_protocol (ONE documentation)
release tag (ONE documentation)

Search insertions with a given:

brain region acronym(s) (ONE documentation and ONE documentation)
release tag (ONE documentation)
task_protocol (ONE documentation)
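
To make the session/insertion distinction concrete, here is a minimal sketch of how session-level attributes could propagate to insertion-level searches. The record layout, the field names (`eid`, `pid`, `task_protocol`, `regions`, `release_tags`), and all values are assumptions chosen for illustration; they are not the actual NWB, DANDI, or ONE schema.

```python
# Hypothetical nested index: each session holds its own insertions.
sessions = [
    {
        "eid": "session-1",
        "task_protocol": "protocol_a",
        "insertions": [
            {"pid": "probe-1", "regions": {"CA1", "VISp"},
             "release_tags": {"tag_bwm", "tag_repro"}},
            {"pid": "probe-2", "regions": {"MOs"},
             "release_tags": {"tag_bwm"}},
        ],
    },
    {
        "eid": "session-2",
        "task_protocol": "protocol_b",
        "insertions": [
            {"pid": "probe-3", "regions": {"CA1"},
             "release_tags": {"tag_bwm"}},
        ],
    },
]


def flatten_insertions(sessions):
    """Yield one record per insertion, copying session-level fields onto it."""
    for session in sessions:
        for insertion in session["insertions"]:
            yield {
                "eid": session["eid"],
                "task_protocol": session["task_protocol"],  # inherited from the session
                **insertion,
            }


def search_insertions(sessions, region=None, release_tag=None, task_protocol=None):
    """Return (eid, pid) pairs for insertions matching every filter that is set."""
    hits = []
    for rec in flatten_insertions(sessions):
        if region is not None and region not in rec["regions"]:
            continue
        if release_tag is not None and release_tag not in rec["release_tags"]:
            continue
        if task_protocol is not None and rec["task_protocol"] != task_protocol:
            continue
        hits.append((rec["eid"], rec["pid"]))
    return hits


ca1_insertions = search_insertions(sessions, region="CA1")
```

Returning the (eid, pid) pair answers both halves of the question at once: which session to open, and which probe's data to use within it.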
