
Map out the workflow from data collection to interactive and reproducible data publication #26

Closed
npscience opened this issue Feb 27, 2017 · 30 comments

@npscience
Contributor

I'd like to offer a lightning talk on an idea for how a journal could display open research data in a more engaging and useful manner for researchers, including a showcase of a few known tools already available to bring this idea to light. The aim is to inspire attendees to answer/address the following questions, perhaps collating resources in a wiki:

  • what tools do we need to enable researchers to prepare their data for open, interactive, and reproducible publication, from day one?
  • what tools do researchers already use to collect, analyse, store, and document their data? do they export in standard formats for open data?
  • where are the blockers in the flow of information? what tools are needed to break these barriers?
  • what features do readers of research want to see on the journal page? how would they like to interact with the figure and underlying data?
  • what will help researchers to adopt the open and reproducible data workflow?
@Daniel-Mietchen
Collaborator

Thanks — I've added the lightning talks tag. Since this does go deeper and has potential for becoming the focus of a team at the event, I also added the idea tag.

@bmkramer

bmkramer commented Mar 4, 2017

Sounds good! Some resources that might be useful for work on this:

  • our database of approx. 600 scholcomm tools across the research workflow: http://bit.ly/innoscholcomm-list (mostly online tools, additions welcome!)
  • survey results showing what (combination of) tools researchers actually use for various aspects of their workflow:

-- raw dataset Zenodo or Kaggle
-- dashboard
-- tool combinations (interactive Google sheet)

More general info on the project these are all results from: 101 Innovations in Scholarly Communication

Can't really contribute today/tomorrow due to other commitments, but will try to check in at some point!

@npscience
Contributor Author

Thanks Bianca - super helpful!

@goodwingibbins

I'm really interested in this! I've been wanting to make some pathways for the openly available climate data (https://www.esrl.noaa.gov/gmd/ccgg/trends/) to be turned into takeaway arguments/points about climate change, hopefully to remove the "us versus them" mentality of "scientists say this, so you should/shouldn't trust it blindly".

One issue that comes up with things like Jupyter etc. is differentiating between the way computers need to be spoken to and what the audience needs for transparency.

One possible connection is the ideas here: http://worrydream.com/#!/LearnableProgramming

@edsaperia

I am currently collecting data that I want to publish effectively, so I guess I'm a user of this project? Is that helpful?

@npscience
Contributor Author

@edsaperia Yes, user insight will be crucial.

Do you know of any tools already that you would use to illustrate your data collection and analysis steps, so that when you come to publish, readers can see what you've done and perhaps give it a go themselves?

Alternatively, would you like to play with any of the tools listed at https://github.com/sparcopen/open-research-doathon/blob/master/reproducible_open_data_resources.md and see what you think of them? Would they work for you? Why (not)?

@npscience
Contributor Author

@goodwingibbins This is totally on point. The trouble with lots of these tools is that they are geared towards the programmer-user. If there's a way to adapt them for people less comfortable with the lingo, that would be very useful.

To start:

  • what are the difficulties?
  • how would we translate them?
  • is it possible to make a friendlier version or add-on for current tools? Or do we need new ones specifically for the traditional tool user (I'm thinking Excel)?

@npscience
Contributor Author

npscience commented Mar 4, 2017

@edsaperia's resources:

  1. WikiMedia's data visualisation framework is Vega: http://vega.github.io/ and https://vega.github.io/vega/
  • open source, maintained by the Vega community, supported by Wikipedia
  • data input as JSON; can create visualisations, change colours, etc.
  • an ideal workflow might be to offer support to translate from the dataset to JSON (if not already), produce a graph to show as static (using a form interface), and also submit the code for the graph so that readers can modify it as they wish. The author doesn't need to see code.
  • d3.js is more powerful
  2. Is there a standard taxonomy for methodology in the life sciences? i.e. for a reproducible, auditable document from data --> analysis --> visualisation.

For example, there are academics who produce/research these methodologies, e.g. LSE's department of methodology (for social sciences) http://www.lse.ac.uk/methodology/Home.aspx
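The dataset-to-JSON-to-graph workflow described above can be illustrated with a minimal Vega-Lite specification built as a Python dict (Vega-Lite is Vega's higher-level companion; the records and field names here are made up for illustration):

```python
import json

# Illustrative records, e.g. rows exported from a spreadsheet
records = [
    {"tool": "R", "count": 120},
    {"tool": "SPSS", "count": 95},
    {"tool": "MS Excel", "count": 80},
]

# Minimal Vega-Lite bar-chart spec: data supplied inline as JSON,
# "encoding" maps fields to axes. Readers could edit the mark type
# or colours here without ever touching the underlying data.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": records},
    "mark": "bar",
    "encoding": {
        "x": {"field": "count", "type": "quantitative"},
        "y": {"field": "tool", "type": "nominal"},
    },
}

vega_json = json.dumps(spec, indent=2)
```

Handing `vega_json` to any Vega-Lite renderer (e.g. the online editor) draws the static graph, while the spec itself is the modifiable "code for the graph" that readers could play with.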

@gpa-smith

One issue that we have identified that distances the final data visualisation from the 'core' dataset is the retrospective way data management may kick in at the publication stage: attempting to make sense of essentially unstructured data at the end of the process.

  1. Is this a dominant issue for researchers?
  2. Would better data curation tools and resources form a useful solution? e.g. services or platforms offered earlier in the process by institutions, repositories or journals?
  3. Do technical solutions along the lines of Jupyter notebooks help to mitigate these issues, allowing more interactive links from data to final visualisation? Is widening their appeal and use beyond computational science thus a priority?

@bmkramer

bmkramer commented Mar 4, 2017

Thanks @npscience for pointing out Vega, didn't know that one! @Daniel-Mietchen: with the Wikimedia Graph extension, another one to add to the Wikimedia workflow? (referencing a separate discussion yesterday)

Further responding to/building on @npscience comment above:

  1. Two other tools for possibly bridging the gap between 'clicking & coding':
  • Plot.ly - import data as a spreadsheet, graphical interface, web-based, export graphs as images or code; free basic version & paid plans, but open source here: https://github.com/plotly
  • The Gamma - works spreadsheet-based, generates code. Open source (MIT license). Disclaimer: I have not yet really looked at/tested/worked with this.
  2. Two systems for a standard taxonomy of biomed/life science workflows (but mostly aimed at the experimental stage of the workflow):
  • Autoprotocol - open standard for specifying experimental protocols
  • ISA Framework - open source; "helps you to provide rich description of the experimental metadata (i.e. sample characteristics, technology and measurement types, sample-to-data relationships) so that the resulting data and discoveries are reproducible and reusable" (quoted from site)
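Since Plotly figures are plain JSON under the hood, the "import spreadsheet, export images or code" model boils down to round-tripping a structure like the following sketch (built with plain dicts, no plotly installation assumed; the counts are made up):

```python
import json

# A Plotly figure is a JSON object: "data" is a list of traces,
# "layout" holds titles etc. A graphical editor and exported code
# both end up producing a structure of this shape.
figure = {
    "data": [
        {"type": "bar",
         "x": ["R", "SPSS", "MS Excel"],
         "y": [120, 95, 80]},
    ],
    "layout": {"title": {"text": "Analysis tools (illustrative)"}},
}

figure_json = json.dumps(figure)
```

This JSON-at-the-core design is what lets such tools sit between clicking and coding: the same figure can be edited graphically or programmatically.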

@Daniel-Mietchen
Collaborator

Daniel-Mietchen commented Mar 4, 2017

@bmkramer In terms of workflows, I would not add individual MediaWiki extensions, just mention that there are thousands of them that together cover all aspects of many research cycles. I had briefly mentioned (but not shown) one of them yesterday: https://www.mediawiki.org/wiki/Extension:Jmol .

@bmkramer

bmkramer commented Mar 4, 2017

@gpa-smith Some thoughts on this, following your 3 questions:

  • Agree that this is a general data-management issue, and as such, should indeed be part of the research workflow from the beginning, not as an add-on at publication (as with data management in general!)
  • Which is why I also think solutions in the form of tools/platforms etc should be journal- and publisher independent (but journals/publishers should accommodate (and ideally require) e.g. code-based viz at submission).
  • And yes, I think it would be great if executable visualizations / generating code-based viz were part of workflows beyond computational science. More researchers learning to code would help in general (looking sternly at myself, too...), but for more widespread adoption, tools/platforms that make this possible for non-coders would be a big help in moving this practice forward.

@npscience
Contributor Author

npscience commented Mar 4, 2017

@bmkramer - Thank you, I'll explore these new tools and standards.
(Edited March 8 to remove mention of bkramer, incorrect handle)

All - is it worth creating a map of this space? Or an ideal workflow to see where the gaps remain?

@gpa-smith

@bmkramer - ISA framework is an interesting one; the journal Scientific Data uses ISA-Tab to generate the structured side of metadata for its Data Descriptor articles, which are focused around datasets as opposed to traditional articles that have data submitted as supporting material.

The ability to create machine readable metadata for other article types at earlier stages, or at least to feed into something like the ISA-framework at an end point would be beneficial. We have talked about a similar process for integration between something simple like an excel spreadsheet feeding into a JSON solution like Vega.

The early collaborative working space is a useful area to look at developing; for example, https://github.com/jupyter/colaboratory offers Google Drive integration.
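The Excel-to-Vega integration discussed above is essentially a matter of turning spreadsheet rows into the list of objects that Vega accepts as inline data; a minimal stdlib sketch (the column names are illustrative):

```python
import csv
import io
import json

# Stand-in for a sheet exported as CSV; the header row becomes the
# JSON field names (column names here are illustrative)
sheet = "sample,measurement\nA,1.2\nB,3.4\n"

rows = list(csv.DictReader(io.StringIO(sheet)))

# Vega/Vega-Lite accept inline data as a list of objects under
# data.values, so the spreadsheet-to-Vega bridge is just this:
data_block = {"values": rows}
data_json = json.dumps(data_block)
```

Note that `csv.DictReader` keeps every value as a string; a real bridge would also need to coerce numeric columns before handing `data_json` to a Vega spec.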

@npscience
Contributor Author

Ok. Tomorrow's tasks, for me at least (feel free to add):

  • find out more about all of the above.
  • map out the flow, what is already being done, what is needed, what are the opportunities for each of us to take these forward.

@fionabradley

Does NLTK (http://www.nltk.org/) fit in here?

I agree with Bianca that easy tools for the non-coder are essential. Tableau Community is nice, but a suite of open source tools is ideal. I'm just learning Python because it's popular in the humanities and social sciences (along with R), but it will be a long time before I'm able to do anything useful with it. :)

@bmkramer

bmkramer commented Mar 5, 2017

One other aspect of this as a workflow is integration in the writing process. Overleaf and Authorea both (in varying aspects) integrate with Jupyter notebooks, for example, and Authorea works with git-based versioning.

Integration with such a workflow would also allow publishers to stimulate/facilitate reproducible reporting, while not tying that aspect of manuscript preparation/submission to a locked-in, proprietary system*. With preprint services offering similar integrations, focus could be more on publications themselves than on publication venue.

Back to workflows, I also like Kieran Healy's take on the difference between the 'office based' and the 'engineering model' http://plain-text.co

*Elsevier at some point piloted executable papers (again, for computer science only), but then dropped the pilot: https://www.elsevier.com/physical-sciences/computer-science/executable-papers-improving-the-article-format-in-computer-science

@npscience
Contributor Author

npscience commented Mar 5, 2017

Check out the Data Stack at https://blog.liip.ch/archive/2017/02/13/data-stack.html

Tools to consider:

@npscience
Contributor Author

Tasks:

- [ ] map out the basic workflow for a researcher, from data collection to publication, including steps for creating figures from data that are both interactive and reproducible
- [ ] populate the workflow with current tools

^^ this requires:

- [ ] knowledge of tools used by life scientists (analyze 101innovations data) @npscience doing this
- [ ] understand the input/output file types of each tool
- [ ] is the tool non-proprietary? at least: can you output data and analysis scripts in open standards?

@bkramer

bkramer commented Mar 5, 2017 via email

@npscience
Contributor Author

npscience commented Mar 5, 2017

What data analysis tools do life scientists use?

Source data: 101innovations DOI
Authors: B. Kramer, J. Bosman
What is this? 2015-2016 survey of tools used in scholarly workflow

Aim:
Identify the common tools used for data analysis by life scientists, to inform the workflow map (mentioned above).

Methods:

  • Uploaded cleaned-data-innovations-in-scholarly-communication-survey_def.csv to OpenRefine. Filtered for LIFESCIENCES responders only. Downloaded: 101innovations-data-npscience.csv
  • Made array into single column, kept ROLE field associated. Removed all other fields (note to self: next time, keep the ID field to serve as a lookup function).
  • Added in the free-text tools: for those who listed other tools, analysed the 'ANALYZSPECCL' column data --> included ROLE field with tool specified in free text. Cleaned the free text ('Other') into defined tools ('Other-clean') where possible (see 'tool cleaning categories.csv').
  • EXCLUSIONS: Some of the free text tools specified tools included as predefined tools in the survey. Checked back into raw data to see where entrant had selected tool and specified it in free text too, to exclude these duplications. These were:
    --- 7718acf7db511b2d56378405e1c41d34,PhD Student,iPython
    --- 901081a86fa72f183d9c5c1dea44d339,Professor/...,MS Excel
    --- 0650e0e9e8c109f7137f9efdd706ed2c,Professor/...,MS Excel
    --- 31f20ad6aff37b44918bd66ba57eb27a,Professor/...,MS Excel
    --- 7c63d1938b3a772d53f592f510e3b60a,Professor /...,R
    --- 0e7f8fc2486739f81b6ac760c6c15cdc,PhD student,R statistics (=R)
    --- 91c960039bdb23cc2c71b1f11f9c009f,Librarian,SPSS
    --- 31f20ad6aff37b44918bd66ba57eb27a,Professor/...,SPSS
    --- 96582f92afa4d69ce1e636d51a927c79,Professor/...,SPSS
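The exclusion step above (dropping free-text mentions of tools the respondent had already ticked among the predefined options) can be sketched in pure Python; the IDs and tools below are illustrative stand-ins, not the actual survey records:

```python
# (respondent_id, tool) pairs the respondent already ticked among the
# survey's predefined tools (IDs here are illustrative, not real hashes)
predefined = {
    ("id1", "MS Excel"),
    ("id2", "R"),
}

# Cleaned free-text ('Other-clean') responses: (id, role, tool)
free_text = [
    ("id1", "Professor", "MS Excel"),    # also ticked predefined -> exclude
    ("id2", "PhD Student", "R"),         # also ticked predefined -> exclude
    ("id3", "Librarian", "OpenRefine"),  # free text only -> keep
]

# Keep only free-text tools the respondent had not already selected,
# so each respondent counts once per tool
kept = [r for r in free_text if (r[0], r[2]) not in predefined]
```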

Results:

@npscience
Contributor Author

Side comments: Plot.ly is not easy to use. And slow.

@HKLondon

HKLondon commented Mar 5, 2017

6 high level workflows for publishing data in more traditional journals:
https://docs.google.com/spreadsheets/d/1A_cFRnN6_j5bpUAFqv6Fb14xyJAtjKSg_jf-qCfriiY/edit?usp=sharing

@npscience
Contributor Author

@HKLondon Great! Here's the url for the workflow diagram [WIP] https://www.draw.io/#G0B_a2JekZMrW8Z0xRc3hoR1Baczg

@HKLondon

HKLondon commented Mar 5, 2017

Whilst most academic publishers can link to data, it seems very few can (easily) publish interactive datasets like the OECD: http://stats.oecd.org/index.aspx?DataSetCode=PDB_LV or integrate interactive figures within the HTML versions of articles (for many more examples, including interactive PDFs, see https://peerj.com/preprints/1594.pdf).

Some examples:

3D visualization (Elsevier)
Animated figures (Interactions)
Interactive graphic (F1000)
Interactive figure (Nature Chemistry)
Publisher produced interactive infographics (BMJ)
Crystallography figures (Journal of Applied Crystallography)
Interactive plots (Elsevier)

Might be interesting to survey publishers to find out what the stumbling blocks are: publishers slow to change, few researchers wanting to publish interactive items, complexity of managing these items through the submission process, tagging issues in article XML/JATS files, problems with platform integrations, long-term archiving issues (including problems with submission of files to PubMed Central), etc.

@pherterich

Remembered this a bit too late, but there was an RDA working group on publishing workflows; it might be a bit too generic compared to what you're interested in: http://doi.org/10.5281/zenodo.20308

@npscience
Contributor Author

npscience commented Mar 5, 2017

Outstanding:

- [x] create interactive visualisation of the 'analysis tools that life scientists use' data

Files needed are at https://github.com/npscience/open-research-doathon

  • ...count.csv
  • master.R --> RShiny app (thanks to @bjw49 :D)

Remaining tweaks for the Shiny app:

- [x] order bars by highest count first (native order in the csv); currently alphabetical
- [ ] transpose x<-->y so that tool names are on the y-axis
- [ ] select roles to display as grouped bars (e.g. show all, PhD and professor bars)

--> this is happening in my repo at: npscience#2

Notes: really difficult for a novice to start using any of the above tools for visualisation....
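For the record, the ordering tweak above is just a sort on the counts before plotting; a pure-Python analogue of the logic (the actual app is R/Shiny, and these values are made up):

```python
# Illustrative (tool, count) pairs, standing in for the count csv
counts = [("MS Excel", 80), ("R", 120), ("SPSS", 95)]

# Tweak 1: order bars by highest count first instead of alphabetically
ordered = sorted(counts, key=lambda tc: tc[1], reverse=True)

# Tweak 2 (transposing x <--> y so tool names sit on the y-axis) is
# then just drawing these as horizontal bars in the plotting layer.
```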

@rossmounce

@npscience your comments about plotly befuddle me!

In my experience plot.ly was great for going from data to interactive, configurable visualisations quickly, especially for data layouts I wasn't familiar with, e.g. choropleth maps. Ultimately I didn't find it quite had the capability to do all the complicated fiddling necessary for "publication quality" figures - I had to dive back into R and do it 'the hard way'.

But for quick, interactive, exploratory data analysis I still find plotly very easy to use - definitely here to stay in my playbook.

@npscience
Contributor Author

@rossmounce noted, the more opinions the better, so thanks for chiming in. I think there's a huge gap in our literacy here; but I'm on the upward learning curve.

@rossmounce

@npscience having said all that, I haven't tried Tableau, so maybe Tableau or other such services are even better than plot.ly. But from the standpoint of a user with experience of spreadsheet software, R, and plot.ly (admittedly limited experience of the wide breadth of available options!), I can definitely see that plot.ly & web services like it have a niche/use-case. If R is one's base reference (as is the case for many biologists?), almost anything else is going to be "easier" & "quicker"!
