From eb0ab9826f134643ee6f2bd5832a2c4cff0f162b Mon Sep 17 00:00:00 2001 From: floradanna Date: Mon, 21 Jun 2021 15:18:27 +0200 Subject: [PATCH 1/5] existing data page --- _data/sidebars/main.yml | 2 + _data/tags.yml | 2 + pages/your_problem/existing_data.md | 84 +++++++++++++++++++++++++++++ 3 files changed, 88 insertions(+) create mode 100644 pages/your_problem/existing_data.md diff --git a/_data/sidebars/main.yml b/_data/sidebars/main.yml index d0d372f1b..566a7022b 100644 --- a/_data/sidebars/main.yml +++ b/_data/sidebars/main.yml @@ -52,6 +52,8 @@ subitems: url: /storage.html - title: Data transfer url: /data_transfer.html + - title: Existing data + url: /existing_data.html - title: Identifiers url: /identifiers.html - title: Licensing diff --git a/_data/tags.yml b/_data/tags.yml index 62b748373..669f4dfdf 100644 --- a/_data/tags.yml +++ b/_data/tags.yml @@ -66,6 +66,8 @@ data analysis: url: data_analysis.html identifiers: url: identifiers.html +existing data: + url: existing_data.html # Assembly tags nels: url: nels_assembly.html diff --git a/pages/your_problem/existing_data.md b/pages/your_problem/existing_data.md new file mode 100644 index 000000000..5744e8c92 --- /dev/null +++ b/pages/your_problem/existing_data.md @@ -0,0 +1,84 @@ +--- +title: Existing data +keywords: +contributors: [Rob Hooft, Flora D’Anna, Pinar Alper, Yvonne Kallberg, Karel Berka, Marko Vidak, Olivier Collin, Ulrike Wittig] +tags: [collect, reuse, researcher] +description: how to find and reuse existing data. +--- + + +## How can you find existing data? + +### Description +Many datasets may already exist that you can reuse for your project. Even if you know the literature very well, you cannot assume that you know everything that is available. The datasets you are looking for may have been collected for the same purpose in an earlier project, or for a completely different purpose, and still serve your goals. 
+ +### Considerations +* Creation of scientific data can be a costly process. To receive funding, a research project needs to justify, in its data management plan, the need for data creation and why reuse is not possible. Therefore, always check first whether suitable data already exists that you can reuse for your project. + +* When the outputs of a project are to be published, the methodology of selecting a source dataset will be subjected to peer review. Following community best practice for data discovery and documenting your method will help you later in reviews. + +* List the characteristics of the datasets you are looking for, e.g. format, availability, coverage, etc. This enables you to formulate the search terms. Please see [Gregory K. et al. Eleven quick tips for finding research data. PLoS Comput Biol 14(4): e1006038 (2018)](https://doi.org/10.1371/journal.pcbi.1006038) for more information. + + +### Solutions +* Locate the repositories relevant to your field. + * Check the bibliography of relevant publications, and check where the authors of those papers have stored their data. Note those repositories. If papers don’t provide data, contact the authors. + * Data papers provide peer-reviewed descriptions of publicly available datasets or databases and link to the data source in repositories. Data papers can be published in dedicated journals, such as [Scientific Data](https://www.nature.com/sdata/), or be a specific article type in conventional journals. + * Search for research communities in the field, and find out whether they have policies for data submission that mention data repositories. For instance, see the [ELIXIR communities in Life Sciences](https://elixir-europe.org/communities). + +* Locate the primary journals in the field, and find out what data repositories they endorse. + * Journal websites usually have a “Submitter Guide”, where you’ll find lists of recommended deposition databases per discipline, or generalist repositories. 
For instance, see [Scientific Data's Recommended Repositories](https://www.nature.com/sdata/policies/repositories). + * You can also find the databases supported by a journal through the policy interface of [FAIRsharing](https://fairsharing.org/policies/). + +* Search registries for suitable data repositories. + * [FAIRsharing](https://fairsharing.org) is an ELIXIR resource listing repositories. + * [Re3data](https://www.re3data.org) lists repositories from all fields of science. + * [Google Dataset Search](https://datasetsearch.research.google.com) and [DataCite](https://search.datacite.org) help you locate datasets. + * The [Omics Discovery Index (OmicsDI)](https://www.omicsdi.org) provides a knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics and metabolomics). + +* Search through all repositories you found to identify what you could use. Give priority to curated repositories. + + +## How can you reuse existing data? + +### Description +When you find data of interest, you should first check if the quality is good and if you are allowed to use the data for your purpose. This process might be difficult, so you can find guidelines and tools below. + +### Considerations +* Before reusing the data, make sure to check if a licence is attached and that it allows your intended use of the data. + +* Check if metadata or documentation are provided with the data. Metadata and documentation should provide enough information for a correct interpretation and reuse of the data. The use of standard metadata schemes and ontologies increases the reusability of the data. + +* Quality of the data is of utmost importance. You should check whether there is a data curation process on the repository (automatic, manual, community). This information should be available on the repository’s website. Check if the repository provides a quality status of each dataset (e.g. star rating system or quality indicators). 
+ +* The data you choose to reuse may be versioned. Before you start to reuse it, you should decide which version of the dataset you will use. + +### Solutions +* Verify that the data is suitable for reuse. + * Check the [licences](licensing) or repository policy for data usage. + * Data from publications can generally be used, but make sure that you cite the publication as the reference. + * If you cannot find the licence of the data, contact the authors. No licence means no reuse allowed. + * If you are reusing personal (identifiable) or even sensitive data, some extra care needs to be taken (see [Human data](human_data) and [Sensitive data](Sensitive_data) pages): + * Make sure you select a data repository that has a clear, published data access/use policy. You do not want to be liable for improper reuse of personal information. For instance, if you’re downloading human data from some lab’s website, make sure there is a statement/confirmation that the data was collected with ethical and legal considerations in place. + * Sensitive data is often shared under restrictions. Check in the description of the access conditions whether these match with your project (i.e. whether you would be able to successfully request access to the data). For instance, certain datasets can only be accessed by projects with Ethics/Institutional Review Board approval, and some can only be used within a specific research field. + +* Verify the quality of the data. Some repositories have quality indicators, such as: + * Star system indicating level of curation, e.g. for manually curated/non-curated entries. + * [Evidence ontology](https://evidenceontology.org). + * Detailed quality assessment methods. For instance, PDBe has several [structure quality assessment metrics](https://www.ebi.ac.uk/pdbe/about/news/assessing-pdb-structure-quality). + +* If metadata is available, check the quality of metadata. 
For instance, information about experimental setup, sample preparation, and data analysis/processing can be necessary to reuse the data and reproduce the experiments. + +* Decide which version (if present) of the data you will use. + * You can decide to always use the version that is available at the start of the project. You would do this if switching to the new versions would not be very beneficial to the project or would require major changes. In this case, you need to make sure that you and others who want to reproduce your results can access the old version at a later stage too. + * You can update to the latest versions if new ones come out during your project. You would do this if the new version does not require major changes in your project workflow, and/or if the updates could improve your project. In this case, consider that you may need to redo all your calculations based on a new version of the dataset and make sure that everything stays consistent. + + + + + + + +## Relevant tools and resources + +{% include toollist.html tag="transfer" %} From 9c3c0c886abff9aa8d7a4f6e889e9ff0f705bda6 Mon Sep 17 00:00:00 2001 From: floradanna Date: Mon, 21 Jun 2021 15:23:08 +0200 Subject: [PATCH 2/5] fix tag --- pages/your_problem/existing_data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pages/your_problem/existing_data.md b/pages/your_problem/existing_data.md index 5744e8c92..c4765cb10 100644 --- a/pages/your_problem/existing_data.md +++ b/pages/your_problem/existing_data.md @@ -81,4 +81,4 @@ When you find data of interest, you should first check if the quality is good an ## Relevant tools and resources -{% include toollist.html tag="transfer" %} +{% include toollist.html tag="existing data" %} From 510bfca9346f3bfacb4a83791b04e14eda3510c3 Mon Sep 17 00:00:00 2001 From: Bert Droesbeke <44875756+bedroesb@users.noreply.github.com> Date: Wed, 23 Jun 2021 10:28:07 +0200 Subject: [PATCH 3/5] Update existing_data.md --- 
pages/your_problem/existing_data.md | 1 - 1 file changed, 1 deletion(-) diff --git a/pages/your_problem/existing_data.md b/pages/your_problem/existing_data.md index c4765cb10..3f86a8436 100644 --- a/pages/your_problem/existing_data.md +++ b/pages/your_problem/existing_data.md @@ -1,6 +1,5 @@ --- title: Existing data -keywords: contributors: [Rob Hooft, Flora D’Anna, Pinar Alper, Yvonne Kallberg, Karel Berka, Marko Vidak, Olivier Collin, Ulrike Wittig] tags: [collect, reuse, researcher] description: how to find and reuse existing data. From 94d71239d907002f3dc068437f63d2b8e3065414 Mon Sep 17 00:00:00 2001 From: Bert Droesbeke <44875756+bedroesb@users.noreply.github.com> Date: Thu, 24 Jun 2021 09:35:31 +0200 Subject: [PATCH 4/5] Tag is already added --- _data/tags.yml | 2 -- 1 file changed, 2 deletions(-) diff --git a/_data/tags.yml b/_data/tags.yml index 669f4dfdf..62b748373 100644 --- a/_data/tags.yml +++ b/_data/tags.yml @@ -66,8 +66,6 @@ data analysis: url: data_analysis.html identifiers: url: identifiers.html -existing data: - url: existing_data.html # Assembly tags nels: url: nels_assembly.html From e268aca875f07b78bbedac6d8a239d7fa27c026c Mon Sep 17 00:00:00 2001 From: Bert Droesbeke <44875756+bedroesb@users.noreply.github.com> Date: Wed, 30 Jun 2021 21:29:33 +0200 Subject: [PATCH 5/5] Update existing_data.md --- pages/your_problem/existing_data.md | 1 + 1 file changed, 1 insertion(+) diff --git a/pages/your_problem/existing_data.md b/pages/your_problem/existing_data.md index 3f86a8436..5457509d6 100644 --- a/pages/your_problem/existing_data.md +++ b/pages/your_problem/existing_data.md @@ -2,6 +2,7 @@ title: Existing data contributors: [Rob Hooft, Flora D’Anna, Pinar Alper, Yvonne Kallberg, Karel Berka, Marko Vidak, Olivier Collin, Ulrike Wittig] tags: [collect, reuse, researcher] +page_tag: existing data description: how to find and reuse existing data. ---