---
title: "recordr Package Introduction"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{recordr Package Introduction}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
library(recordr)
knitr::opts_chunk$set(fig.dpi = 96)
```
## Overview
The *recordr* package collects information about R script executions (also referred to as "runs"). The recorded information
includes the files that were read and written by the R script, and details of the execution
environment, such as the operating system version and the R packages loaded.
The recorded information constitutes data provenance for the data products and analysis outputs (graphs, .csv files, etc.) generated by a script execution: it describes how those data products were created.
## Using recordr
### Recording a Script Execution
The *record()* method takes an R script filename as an argument and sources it, recording files that
were read and written by R functions that are registered with *recordr*. It is not necessary to
modify an R script in order to use *record*.
The following example runs a sample script that is included with the *recordr* package:
```{r echo=F,warning=F,message=F,eval=F}
# For recordr package maintainer only
library(recordr)
rc <- new("Recordr")
deleteRuns(rc, seq=1:1000)
```
```{r record,warning=F,message=F}
library(recordr)
if (require("ggplot2")) {
  rc <- new("Recordr")
  sampleScript <- system.file("extdata/EmCoverage.R", package="recordr")
  firstRunId <- record(rc, sampleScript, tag="first recordr run")
}
```
Information about the script execution is stored in the *recordr* cache (~/.recordr). *recordr* provides methods to
search and view items stored in the cache. It is not recommended that files or directories be manually edited or deleted from the cache directories, with the exception of the items mentioned in this document.
### Listing Script Executions
Script runs that have been recorded can be listed using the *listRuns()* method. The
listing can be filtered by the tag value specified when a run was recorded.
Runs can also be filtered by run start time, run end time, the text of error messages for a run and by a sequence number, which is an integer
value assigned to each run to assist in easily specifying a particular run for listing or viewing.
In this example, all runs with a tag containing the string "first" are listed. Because
recordr has only run once in this demo, only one run is listed:
```{r listRuns,warning=F,message=F}
listRuns(rc, tag="first")
```
If no search parameters are specified to *listRuns*, then all recorded runs are listed.
### Preparing Metadata for a Run
The first time that record() is run, an initial metadata template file is copied to
the file "~/.recordr/package_metadata_template.R". Each time that record() is called, a
metadata file is created for the current run by using the template file as a starting
point and generating an EML document, with EML elements created for the items in
the template file. In addition, an 'otherEntity' element is created for each data
object that is created by the run and for the R script that was run.
The metadata template file can be edited before a run; the values you specify
are then used in the generated EML document.
If you are using RStudio, click on File->Open File (Ctrl-O)
and open ~/.recordr, then click on "package_metadata_template.R" in the File pane.
Currently only the items that are already in the template file can be updated, and new elements
cannot be added. For example, the 'title', 'abstract' and 'creators' can be edited.
### Recording An R Console Session
*recordr* can also collect information during an R console session using the
*startRecord()* and *endRecord()* methods. When *startRecord()* is typed in
the R console, information capture begins. Information will be captured for any
function registered with *recordr*, while all other console input will not cause
any information capture. Information capture is terminated when *endRecord()* is
entered in the console, and execution information is written to the *recordr*
cache.
```{r console session,warning=F,message=F,eval=F}
startRecord(rc, tag="first console run")
df <- read.csv(file = system.file("./extdata/coverages_2001-2010.csv", package="recordr"))
endocladia_coverage <- df[df$final_classification=="endocladia muricata",]
myDir <- tempdir()
csvOutFile <- sprintf("%s/Endocladia_muricata.csv", myDir)
write.csv(endocladia_coverage, file = csvOutFile)
endRecord(rc)
```
The history of all statements typed during this recorded console session is
saved in the *recordr* cache and will be included in the data package uploaded
to a data repository when publishRun() is called.
### Viewing Script Executions
More detailed information can be retrieved and viewed for a run or set of runs
using the *viewRuns()* method, for example:
```{r viewRuns,warning=F,message=F,eval=T}
viewRuns(rc, id=firstRunId, sections=c("details", "used", "generated"))
```
Information for all matching runs is retrieved and displayed.
The output displayed by *viewRuns* is divided into the sections "details", "used"
and "generated", which can be selectively displayed using the _sections_ parameter.
## Information Collected By The *recordr* package
The *record()* method will currently record information for the following methods:
package | function
---------- | --------------
dataone | getObject
dataone | create
dataone | updateObject
utils | read.csv
utils | write.csv
ggplot2 | ggsave
base | readLines
base | writeLines
png | readPNG
png | writePNG
base | scan
Other information about the execution environment is also recorded, such as the R packages that were
loaded, the operating system, and the system user name.
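To give a sense of the kind of environment details mentioned above, the following sketch gathers similar information using base R only. It is not how *recordr* collects or stores this information; the list `env_info` and its field names are made up for illustration.

```r
# Illustrative only: these names are hypothetical, not part of recordr.
sys <- Sys.info()

env_info <- list(
  os         = sys[["sysname"]],          # operating system name
  os_release = sys[["release"]],          # operating system release
  user       = sys[["user"]],             # system user name
  r_version  = R.version.string,          # version of R in use
  packages   = sort(loadedNamespaces())   # namespaces loaded in this session
)

str(env_info)
```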
## Disk usage
Recordr can save copies of files that were read and written by R scripts that are run with `record`. In
addition, the R script that was run is also retained.
You may wish to do this so that you have copies of the files as they existed when the script was run.
This provides reproducibility, so that your scripts can be rerun with the same inputs. Or you may wish
to create a package of the set of files that were read or written by a particular script run, and archive
the package locally, or publish it to a data repository.
By default, Recordr does not archive copies of files that were read or written by the R scripts that
are run with `record`.
To enable archiving, set the R option `max_archive_file_size` to the maximum size, in bytes, of a file that can be copied
to the Recordr archive. If this option is unset or set to `0` then no files will be copied to the
archive. If files are not copied to the archive, then recordr will try to access them in the disk
location that they were in when `record()` ran.
Setting `max_archive_file_size`:

    # Max file size to archive, in bytes
    #options(recordr_max_archive_file_size=1000000.0)
    #options(blocked_replica_node_list = TRUE)
    #options(capture_dataone_reads = TRUE)
    #options(capture_dataone_writes = TRUE)
    #options(capture_file_reads = TRUE)
    #options(capture_file_writes = TRUE)
    #options(certificate_path = "")
    #options(dataone_env = "DEV")
    #options(dataone_env = "DEV2")
    options(dataone_env = "SANDBOX2")
    #options(dataone_env = "STAGING")
    #options(dataone_env = "STAGING2") # mnTestKNB
    ##options(foaf_name = as.character(NA))
    #options(number_of_replicas = 3)
    ##options(orcid_identifier = "orcid.org/0000-0002-2192-403X")
    ##options(package_metadata_template_path = "~/.recordr/package_metadata_template.R")
    #options(preferred_replica_node_list = list())
    ##options(provenance_storage_directory = "~/.recordr")
    #options(public_read_allowed = TRUE)
    #options(replication_allowed = TRUE)
    ##options(rights_holder = "CN=Peter Slaughter A34456,O=Google,C=US,DC=cilogon,DC=org")
    #options(source_member_node_id = "urn:node:KNB")
    ##options(submitter = "CN=Peter Slaughter A34456,O=Google,C=US,DC=cilogon,DC=org")
    #options(target_member_node_id = "urn:node:mnDevUCSB2")
    #options(target_member_node_id = "urn:node:mnTestKNB")
    #options(target_member_node_id = "urn:node:mnStageUCSB2")
    options(target_member_node_id = "urn:node:mnDemo2")
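The size rule described in this section can be sketched with base R as follows. This is an assumption-laden illustration, not *recordr*'s own code: the helper `would_archive()` is hypothetical, the inclusive comparison is a guess, and the option name is the one used in the prose (the option listing spells it `recordr_max_archive_file_size`, so check which name your installed version reads).

```r
# Assumed option name, taken from the prose of this section.
old <- options(max_archive_file_size = 1000000)   # 1,000,000 bytes

# Hypothetical helper: TRUE when a file is small enough to be archived.
# An unset option or a value of 0 means that nothing is archived.
would_archive <- function(path) {
  limit <- getOption("max_archive_file_size", default = 0)
  limit > 0 && file.size(path) <= limit
}

small <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), small)

with_limit <- would_archive(small)      # file is well under the limit

options(max_archive_file_size = 0)
without_limit <- would_archive(small)   # archiving switched off

options(old)                            # restore the previous setting
```

Saving the return value of `options()` and passing it back at the end restores whatever value was in effect before the experiment.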
## *recordr* Internals
The following description is provided for informational purposes only and is not required to
use the *recordr* package.
The *recordr* package can record execution information for the commonly used R functions listed in the
section "Information Collected By The *recordr* package" by using *wrapper* functions that are called before a requested function is called. This overriding
of functions is only in effect while the *record()* function is running. The overriding is accomplished by
temporarily adding an entry to the R search path so that the *recordr* wrapper functions are first in the
search path. For example, if a script that is run with the *record()* function makes the following call:
```{r, eval=F}
df <- read.csv(file = "/usr/smith/data/coverages_2001-2010.csv")
```
then the wrapper function *recordr_read.csv* is called first because *record()* has temporarily bound
*recordr_read.csv* to the function name *read.csv* in the temporary environment named ".recordr" that is
attached to the search path, so that the overridden function appears first in the search path, regardless
of package load order. The function *recordr_read.csv* records that the file
`/usr/smith/data/coverages_2001-2010.csv` was read. Then this wrapper function searches for the next function
`read.csv` in the search path, which is the function that would have been run if *record()* were not active.
When the script has completed, the *record()* function detaches the ".recordr" environment from the
search path, thereby restoring the R search path to its previous state, as it was before *record()* was
called.
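The attach, delegate and detach sequence described above can be imitated with base R alone. The sketch below is a simplified stand-in, not *recordr*'s implementation: the environment name ".demo_wrappers", the `reads` vector and the wrapper body are all invented for illustration.

```r
# Illustrative only: a hand-rolled wrapper, not recordr's own code.
reads <- character(0)                     # files seen by the wrapper

wrappers <- new.env()
wrappers$read.csv <- function(file, ...) {
  reads <<- c(reads, file)                # record the file that was read
  # Find the next read.csv further along the search path and call it.
  here <- match(".demo_wrappers", search())
  real <- get("read.csv", pos = here + 1L, mode = "function")
  real(file, ...)
}

tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), tmp, row.names = FALSE)

# Attaching places the wrapper ahead of package:utils on the search path.
attach(wrappers, name = ".demo_wrappers", warn.conflicts = FALSE)
df1 <- read.csv(tmp)                      # resolves to the wrapper
detach(".demo_wrappers", character.only = TRUE)

df2 <- read.csv(tmp)                      # resolves to utils::read.csv again
```

After the `detach()` call the search path is as it was before, so the second read is not added to `reads`.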
Note that the mechanism that *record()* uses to override functions doesn't work for function calls
that are fully qualified, i.e. calls that include the package name. For example, the following
function call would not be recorded:
```{r, eval=F}
df <- utils::read.csv(file = "/usr/smith/data/coverages_2001-2010.csv")
```
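The reason is that a `pkg::name` call is looked up directly in the named package and never consults the search path. A minimal base R sketch of the effect, using an invented wrapper environment and counter rather than anything from *recordr*:

```r
# Illustrative only: hypothetical names, not recordr's own code.
calls <- 0L                               # how often the wrapper ran

wrappers <- new.env()
wrappers$read.csv <- function(file, ...) {
  calls <<- calls + 1L
  utils::read.csv(file, ...)
}

tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), tmp, row.names = FALSE)

attach(wrappers, name = ".demo_wrappers", warn.conflicts = FALSE)
a <- read.csv(tmp)          # unqualified: found in the attached environment
b <- utils::read.csv(tmp)   # qualified: taken straight from package utils
detach(".demo_wrappers", character.only = TRUE)

calls                       # only the unqualified call was counted
```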
Also, the *record()* function currently cannot record information for input or output files that are opened as a connection. For
example, the following call to `writeLines` would not be recorded:
```
# Write out to a file using a connection
sbuf <- paste(LETTERS, collapse="")
tfile <- sprintf("%s/letters.dat", tempdir())
fcon <- file(description=tfile, open="w")
writeLines(sbuf, fcon)
close(fcon)
```
This problem will be addressed in the next feature release of *recordr*.
```{r cleanup, echo=F,warning=F,message=F,eval=T}
# Remove the runs created by the examples.
rc <- new("Recordr")
deleteRuns(rc, id=firstRunId, quiet=T)
```
## Customizing Recordr
In demo mode, *recordr* stores information in the R temp directory, so any information recorded
will be lost when the current R session ends.
To retain information permanently, use:

    recordrConfig(rc, "homedir")

or

    recordrConfig(rc, "homedir", "/Users/smith/recordr")