Where to save results/processed data? Or is that not expected to be saved because it can be reproduced? #8
Comments
Great question. I agree that intermediate results should be included. Where they live depends on what kind of results we're talking about though -- tidy versions of data, figures, etc. For instance, @hadley has introduced the convention of a

Figures are another intermediate output (kinda -- as they are embedded in pdf output, but might be part of the final html output). I think a sub-directory in

I like to save the data frames corresponding to each figure (it's nice that ggplot means I'm always plotting some kind of data frame) as a useful intermediate output. I have no idea where these things should live. (I tend to put them in a subdirectory of

For a more thorough/general example of preserving intermediate states, I highly recommend looking at @richfitz's remake tool, e.g. as used by his team in their baad project workflow, described very nicely in this blog post: https://ropensci.org/blog/2015/06/03/baad/. The key point here is that remake saves all intermediate objects, which works as a kind of caching mechanism (based on object hashes, instead of just the modification times like
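The hash-based caching idea above (invalidate when the *content* changes, not when a timestamp changes) can be sketched in shell. This is an illustrative sketch only, not remake's actual mechanism; all file names and the stamp-file scheme are assumptions:

```shell
# Sketch: rebuild an output only when the input's content hash changes,
# rather than its modification time (paths here are hypothetical).
mkdir -p data cache outputs
printf 'x,y\n1,2\n' > data/raw.csv

hash=$(sha256sum data/raw.csv | cut -d' ' -f1)
stamp="cache/raw.csv.$hash"

if [ ! -f "$stamp" ]; then
  # ...expensive processing would run here; we just copy as a stand-in...
  cp data/raw.csv outputs/clean.csv
  touch "$stamp"
fi
```

Touching the input file without changing its contents leaves the hash, and therefore the cache, intact, which is exactly what modification-time-based tools like make cannot guarantee.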
I haven't read that blog post yet but it looks really good, I'll take a look.
@daattali Yup, I see your point and I don't have a good answer for the outputs issue. For example, a user cannot just open the
You're right, I forgot that many (most?) people use the knit button instead.
Sorry if I missed this, but did anyone bring up John Myles White's ProjectTemplate (now with a new maintainer)?
Hi @karthik, there are a few opinions on ProjectTemplate shared over here swcarpentry/DEPRECATED-site#806, and @rmflight has some interesting notes on a comparison of R packages and ProjectTemplate. For me, ProjectTemplate is too complicated for my typical use cases, and it lacks the ubiquity and recognition that the R package structure has. But maybe others find the ProjectTemplate approach suits their research more; I'd be interested to see examples of research compendia using ProjectTemplate (do you know of any?)

@daattali, yes, I think there is an ethic that results (i.e. output artefacts) should be treated as disposable or overwritable (cf. @richfitz's post http://nicercode.github.io/blog/2013-04-05-projects/), in a similar way that Docker containers are encouraged to be considered disposable. The most appropriate management of this might end up varying considerably across different research communities; for my work the most sensible approach (to me at the moment) seems to be a sub-directory in
But I'm keen to see more real-world examples of what emerges from the 'mangle of practice' when others organise their research to accompany publication. I think that's where we'll get a lot of useful insight into some of these questions. I've been bothered by knitr's caching recently, so I'm giving
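As one concrete illustration of the "disposable outputs sub-directory" idea discussed above, a layout like the following is a common pattern. The directory names here are assumptions for the sketch, not a convention this thread settled on:

```shell
# Hypothetical compendium layout: everything under analysis/output/
# can be deleted and regenerated at will; data-raw/ is the committed input.
mkdir -p analysis/data-raw analysis/output/figures analysis/output/tables
touch analysis/data-raw/survey.csv        # raw data: committed, treated read-only
touch analysis/output/figures/fig1.pdf    # generated: disposable
touch analysis/output/tables/table1.csv   # generated: disposable
```

The point of the split is that `rm -rf analysis/output` should always be safe, which is the "treat results as overwritable" ethic in directory form.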
Perhaps this belongs better in #3, not sure. But given that I'm being pinged in here I'll post this here :)

We tend not to commit any generated products, even those that take hours to generate. Instead we've started using GitHub releases as a way of storing those (which helps keep the git repos slim). Following from @jennybc's comments at rrhack, we try to be willing to run "make clean" at will and treat generated products as totally disposable. In another project we have a separate (fairly disposable) git repo for generated data, but that's as much about making sure we're looking at the same copy of generated data in multiple places -- that's a much longer-running project than I think any framework works well for (multiple CPU years).

All our recent cases use remake, which, despite being fairly raw to use, has worked well in practice as a more reliable caching layer than knitr offers. With distributed teams we have multiple people pushing code and can reliably regenerate each other's outputs locally. Examples, in reverse chronological order:
We have two more like the
(which will be a lot easier once remake is on CRAN, but that's not going to happen for a while). I'm generally not a fan of the idea of R packages as research units (I don't want to install an analysis), so the closest these get to R packages is that almost all contain a directory with functions in it (R, like in a package) and all contain something with metadata describing dependencies (

In general: don't commit outputs, but be inventive with other places to store them.
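One way to act on "don't commit outputs, store them elsewhere" is to bundle generated products into an archive that can be attached to a release (e.g. a GitHub release) instead of being committed. This is a minimal sketch under assumed paths, not the workflow the commenter actually uses:

```shell
# Sketch: bundle generated products as a single archive suitable for
# uploading as a release asset, keeping the git repo itself slim.
mkdir -p output
echo "model results" > output/results.txt
tar czf release-assets.tar.gz output/
# release-assets.tar.gz can then be attached to a tagged release by hand
# or with a CLI tool; the repo never has to track the contents of output/.
```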
Any new opinions or updates on this? I have turned many of my "data processing" projects/pipelines into R packages and want to show an overview of the process as vignettes. My current approach is to make separate folders within data-raw for each phase, along with a cache folder (data-raw/cache, data-raw/raw-data, data-raw/processed-data, data-raw/clean-data, data-raw/support-data, etc.), but I don't really like this, as it is neither transparent nor sustainable.
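For reference, the phased layout described above looks roughly like this, recreating the commenter's folder names (whether such a layout scales is exactly the open question):

```shell
# One folder per processing phase under data-raw/, plus a cache folder,
# mirroring the layout described in the comment above.
mkdir -p data-raw/cache data-raw/raw-data data-raw/processed-data \
         data-raw/clean-data data-raw/support-data
ls data-raw
```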
@jimbrig2011 There are some new options provided by the drake package, which appeared since the last post on this thread. Take a look; it might give you some good ideas on how to organise your workflow.
(Is this the proper place to ask such questions or is there anywhere else I should be voicing my thoughts?)
What is the suggested solution for dealing with intermediate useful output/results data? It might be argued that results should not be included, since the whole point is that you can technically reproduce them from the repository, but in many cases it can be beneficial to (1) have some results in the repo that can be easily shown, and (2) save some important intermediate outputs if they can be regarded as interesting results in their own right. You might disagree, and I'm fine with that, but in either case I feel it would be helpful to at least include a small blurb on the topic: either include them or don't, or list the pros/cons.
From my own personal (and limited) experience, I like having not only the raw data but also the clean version, if it takes a long time to produce and is more useful.