Merge pull request #448 from elixir-europe/floradanna-patch-2
Update analysing.md
floradanna authored Mar 15, 2021
2 parents 688d2fb + ec5c39e commit d34cb06
32 changes: 14 additions & 18 deletions pages/rdm_cycle/analysing.md
---
title: Analysing
keywords: [Data analysis, Computing, Collaboration]
contributors: [Rob Hooft, Olivier Collin, Munazah Andrabi, Flora D'Anna]
---

## What is data analysis?

Data analysis encompasses all the different data manipulation and transformations that will help scientists to discover information or generate new knowledge.
This is the step where the actual work on the data towards the goal of a research project takes place.
Data analysis consists of exploring the collected data to begin understanding the messages contained in a dataset, and/or of applying mathematical formulas (or models) to identify relationships between variables.
The steps of the workflow in the analysis phase will often be repeated several times to explore the data as well as to optimize the workflow itself.
The data analysis methods will differ according to the type of data (quantitative or qualitative). Data analysis follows the (often automated, batch) data processing stage.

## Why is data analysis important?

Since data analysis is the stage where new knowledge and information are generated, it can be considered central to the research process. Because of the importance of the data analysis stage for research findings, it is essential that the analysis workflow applied to a dataset complies with the FAIR principles. Moreover, it is extremely important that the analysis workflow is reproducible by other researchers and scientists.
With many disciplines becoming data-oriented, more and more data-intensive projects will occur and will require experts from many thematic fields.

## What should be considered for data analysis?
Because of the diversity of domains and technologies in the Life Sciences, data can be either "small data" or "big data". As a consequence, the methods and technical solutions used for data analysis might differ. The characteristics of "big data" are often summarized by a growing list of ["V" properties: Volume, Velocity, Variety, Variability, Veracity, Visualization and Value](https://bigdatapath.wordpress.com/2019/11/13/understanding-the-7-vs-of-big-data/).

* The data analysis stage relies on the previous stages (collection, processing), which lay the foundations for the generation of new knowledge by providing accurate and trustworthy data.
* The variety of the data poses an integration challenge that can only be addressed with the help of best practices that make data interoperable and reusable during the collection and/or processing stage.
* The location of your data is important because of the need for proximity to computing resources. This can impact data transfer across the different infrastructures. It is worthwhile to compare the cost of transferring massive amounts of data with the cost of transferring virtual machine images for the analysis.
* For the analysis of data, you will first have to consider the computing environment and choose between several computing infrastructure types, e.g. cluster, cloud. You will also need to select the appropriate work environment according to your needs and expertise (command line, web portal).
* You will have to select the tools best suited for the analysis of your data. Resources such as [bio.tools](https://bio.tools) can be very helpful.
* It is important to document the exact steps used for data analysis. This includes the version of the software used, the parameters applied, and the computing environment. Manual "manipulation" of the data may complicate this documentation process.
* In the case of collaborative data analysis, you will have to ensure access to the data and tools for all collaborators. This can be achieved by setting up virtual research environments.
* Consider publishing your analysis workflow, as well as your datasets, according to the FAIR principles.
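
The documentation point above can be sketched in code. The following is a minimal, hypothetical illustration (the function name `record_provenance` and the example parameters are not from any specific tool or standard) of capturing software versions, analysis parameters, and the computing environment in a machine-readable file:

```python
# Hypothetical sketch: record the software versions, parameters, and
# computing environment of an analysis run in a JSON provenance file.
# Names and parameters here are illustrative, not a real tool's API.
import json
import platform
import sys


def record_provenance(params, out_path="provenance.json"):
    """Write analysis parameters and environment details to a JSON file."""
    record = {
        "python_version": sys.version,   # interpreter version used for the run
        "platform": platform.platform(), # OS / machine description
        "parameters": params,            # the exact parameters of this analysis
    }
    with open(out_path, "w") as fh:
        json.dump(record, fh, indent=2)
    return record


# Example: parameters for a (hypothetical) alignment step
prov = record_provenance({"aligner": "bwa-mem", "min_quality": 30, "threads": 8})
```

A record like this can be stored next to the results and later attached to the published dataset or workflow, making the analysis easier for others to reproduce.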


## Problems to be addressed at this stage
