
Where to save results/processed data? Or is that not expected to be saved because it can be reproduced? #8

Open
daattali opened this issue Jun 7, 2015 · 9 comments

Comments

@daattali
Contributor

daattali commented Jun 7, 2015

(Is this the proper place to ask such questions or is there anywhere else I should be voicing my thoughts?)

What is the suggested solution for dealing with intermediate useful output/results data? It might be argued that results should not be included, since the whole point is that you can technically reproduce them from the repository. But in many cases it can be beneficial to 1. have some results in the repo that can be easily shown, and 2. save some important intermediate outputs if they can be regarded as interesting results on their own. You might disagree, and I'm fine with that, but in either case I feel it would be helpful to at least include a small blurb on the topic - either include them, or don't include them, or list the pros/cons.

From my own personal (and limited) experience, I like having not only the raw data but also the clean version if it takes a long time to produce it and is more useful.

@cboettig
Member

Great question. I agree that intermediate results should be included. Where they live depends on what kind of results we're talking about though -- tidy versions of data, figures, etc.

For instance, @hadley has introduced the convention of a data-raw/ directory in the devtools package: the scripts that tidy the raw data live in data-raw/, and the tidied data they produce goes in data/. Is that appropriate for this context, or should data/ be the raw data?
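
A minimal sketch of that convention, with purely illustrative file and column names, as it might look inside a data-raw/ script:

```r
# data-raw/tidy_mydata.R -- kept in the repo, but not installed with the package
library(readr)
library(dplyr)

raw <- read_csv("data-raw/mydata_raw.csv")

mydata <- raw %>%
  filter(!is.na(value)) %>%
  mutate(date = as.Date(date))

# writes data/mydata.rda, the tidy version that ships in data/
devtools::use_data(mydata, overwrite = TRUE)
```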

Figures are another intermediate output (kinda -- as they are embedded in pdf output, but might be part of the final html output). I think a sub-directory in analysis called figures or something is appropriate here (and plays nicely with things like knitr).

I like to save the data frames corresponding to each figure (it's nice that ggplot means I'm always plotting some kind of data frame) as a useful intermediate output. I have no idea where these things should live. (I tend to put them in a subdirectory of analysis, like I do with figures, but perhaps that's confusing.) Like you say, this should be well documented at any rate, which I haven't usually done.
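
A small sketch of that habit, with purely illustrative paths under analysis/:

```r
library(ggplot2)

# the data frame that feeds the figure, saved alongside it so it can be inspected
fig1_df <- data.frame(x = 1:10, y = cumsum(rnorm(10)))
saveRDS(fig1_df, "analysis/figures/fig1_data.rds")

p <- ggplot(fig1_df, aes(x, y)) + geom_line()
ggsave("analysis/figures/fig1.pdf", p, width = 6, height = 4)
```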

For a more thorough/general example of preserving intermediate states, I highly recommend looking at @richfitz's remake tool, e.g. as used by his team in their baad project workflow, described very nicely in this blog post: https://ropensci.org/blog/2015/06/03/baad/. The key point here is that it saves all intermediate objects, which works as a kind of caching mechanism (based on object hashes, rather than just modification times as make does) as well as providing intermediate results for scrutiny.

@daattali
Contributor Author

I haven't read that blog post yet but it looks really good; I'll take a look tomorrow when I get to a desktop, thanks for the link. Re: saving outputs: I don't like to save output files/figures inside the analysis folder. IMO the analysis folder then just gets treated as a large container for all files that are anything other than data and helper functions; it becomes too generic, without a clear purpose. I like to have an "outputs" or "results" folder or something similar. Although maybe then we're starting to have too many folders and overcomplicating this? Just my personal preference.



@cboettig
Member

@daattali Yup, I see your point and I don't have a good answer for the outputs issue.

For example, a user cannot just open the .Rmd file and hit the knit PDF button in RStudio and get the output PDF (or HTML, or whatever) anywhere other than the same working directory. (Actually there are fancy hacks to get around this, which I link in the other thread, but I'm not sure I'd recommend them.)
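
For reference, a minimal sketch of redirecting the rendered output by calling rmarkdown::render() directly rather than using the knit button (paths are illustrative):

```r
# render the document but write the result into analysis/output/
# instead of the .Rmd's own directory
rmarkdown::render(
  input       = "analysis/paper.Rmd",
  output_file = "paper.pdf",
  output_dir  = "analysis/output"
)
```

This handles the rendered document itself, but knitr still writes intermediate figure files relative to the .Rmd's directory unless told otherwise, which is part of why the knit-button workflow is hard to redirect.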

@daattali
Contributor Author

You're right, I forgot that many (most?) people use the knit button instead of running the function manually, which means specifying output directories or using a wrapper such as my package will be very hard to adopt. Good point.



@karthik
Member

karthik commented Jun 10, 2015

Sorry if I missed this, but did anyone bring up John Myles White's ProjectTemplate (which now has a new maintainer)? It does quite a bit of what this compendium hopes to achieve.

@benmarwick
Contributor

Hi @karthik, there are a few opinions on ProjectTemplate shared over here swcarpentry/DEPRECATED-site#806, and @rmflight has some interesting notes on a comparison of R packages and ProjectTemplate. For me, ProjectTemplate is too complicated for my typical use cases, and it lacks the ubiquity and recognition that the R package structure has. But maybe others find the ProjectTemplate approach suits their research better; I'd be interested to see examples of research compendia using ProjectTemplate (do you know of any?).
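
For anyone who hasn't used it, a minimal sketch of the ProjectTemplate workflow (the project name is just an example):

```r
library(ProjectTemplate)

# creates the standard skeleton (data/, munge/, src/, cache/, reports/, ...)
create.project("my-analysis")

# from inside the project, this reads the config, loads packages and data,
# and runs the preprocessing scripts in munge/
setwd("my-analysis")
load.project()
```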

@daattali, yes, I think there is an ethic that results (i.e. output artefacts) should be treated as disposable or overwritable (cf. @richfitz's post http://nicercode.github.io/blog/2013-04-05-projects/), in a similar way that Docker containers are encouraged to be treated as disposable. The most appropriate way to manage this might end up varying considerably across different research communities; for my work, the most sensible approach (to me, at the moment) seems to be a sub-directory in analysis called figures or output, as @cboettig described above. I suppose a more generic version of that might be:

analysis/
  ├─ input/
  |   └─ my_rmd.Rmd
  └─ output/
    ├─ data-generated/
    └─ figures/
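
As a purely illustrative sketch, a chunk inside input/my_rmd.Rmd could write into that layout like this (assuming the project root is the working directory and the output sub-directories already exist):

```r
# file and variable names are placeholders
clean <- subset(read.csv("data/raw_measurements.csv"), !is.na(value))

write.csv(clean, "analysis/output/data-generated/measurements_clean.csv",
          row.names = FALSE)

pdf("analysis/output/figures/measurements_hist.pdf")
hist(clean$value, main = "Measurement values")
dev.off()
```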

But I'm keen to see more real-world examples of what emerges from the 'mangle of practice' when others organise their research to accompany publication. I think that's where we'll get a lot of useful insight into some of these questions. I've been bothered by knitr's caching recently, so I'm giving remake a try on a current project as an alternative, though I'm wary of adding yet another dependency to my workflow...

@richfitz
Member

Perhaps this belongs better in #3, not sure. But given that I'm being pinged in here I'll post this here :)

We tend not to commit any generated products, even those that take hours to generate. Instead we've started using GitHub releases as a way of storing those (which helps keep the git repos slim). Following from @jennybc's comments at rrhack, we try to be willing to run "make clean" at will and treat generated products as totally disposable. In another project we have a separate (fairly disposable) git repo for generated data, but that's as much about making sure we're looking at the same copy of generated data in multiple places -- that's a much longer-running project than I think any framework works well for (multiple CPU years).
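
(For anyone wanting to script that approach, here is a hedged sketch using the piggyback package, which is not mentioned in this thread, to push generated files to a GitHub release rather than committing them; the repo name, tag, and file paths are placeholders:)

```r
library(piggyback)

# upload a large generated file as a release asset instead of committing it
pb_upload("analysis/output/data-generated/big_result.rds",
          repo = "myorg/myproject",
          tag  = "v0.1.0")

# collaborators can pull the same copy back down with
pb_download("big_result.rds", repo = "myorg/myproject", tag = "v0.1.0")
```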

All our recent projects use remake, which, despite being fairly raw to use, has worked well in practice as a more reliable caching layer than knitr offers. With a distributed team, multiple people push code and can reliably regenerate each other's outputs locally.

Examples, in reverse chronological order:

We have two more like the plant_paper one linked above that should be public in the next couple of weeks - this is a format that seems to work well for us. The idea is that our projects should be runnable with:

```r
remake::install_missing_packages()
remake::make()
```

(which will be a lot easier once remake is on CRAN but that's not going to happen for a while).

I'm generally not a fan of the idea of R packages as research units (I don't want to install an analysis), so the closest these get to R packages is that almost all of them contain a directory of functions (R/, as in a package) and all contain some metadata describing dependencies (DESCRIPTION for wood, remake.yml for everything else).

In general: don't commit outputs, but be inventive with other places to store them.

@jimbrig

jimbrig commented Mar 19, 2020

Any new opinions or updates on this? I have turned many of my "data processing" projects/pipelines into R packages and want to show an overview of the process as vignettes. My current approach is to make separate folders within data-raw for each phase, along with a cache folder (data-raw/cache, data-raw/raw-data, data-raw/processed-data, data-raw/clean-data, data-raw/support-data, etc.), but I do not really like this as it is neither transparent nor sustainable.

@benmarwick
Contributor

@jimbrig2011 There are some new options provided by the drake package, which appeared since the last post on this thread. Take a look; it might give you some good ideas on how to organise your workflow.
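
For instance, a minimal and purely illustrative drake plan that keeps each stage of a data-raw/ pipeline as a tracked target (the cleaning functions are placeholders you would write yourself):

```r
library(drake)

plan <- drake_plan(
  raw       = read.csv(file_in("data-raw/raw-data/claims.csv")),
  processed = clean_claims(raw),          # placeholder cleaning function
  clean     = finalize_claims(processed), # placeholder finalizing function
  export    = saveRDS(clean, file_out("data/claims_clean.rds"))
)

make(plan)  # rebuilds only the targets whose inputs have changed
```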
