
Chunk Inputs and Globals

Ben Bond-Lamberty edited this page Sep 14, 2017 · 7 revisions

Most chunks require inputs of various sorts, and the driver will not run your chunk until these are all available.

Chunk inputs can be global constants, inputs generated by another chunk, or inputs read in from a saved file. These are discussed in turn below.

Global constants

If your code chunk references a global constant, you should (i) check to see whether it's already been defined by someone else, and if necessary (ii) define it.

Package-wide constants are defined in R/constants.R. You'll see that the constants in this file are separated into sections; put yours in a sensible place. Constant names need to be ALL CAPS with, optionally, an initial lowercase qualifier (note that this is tested). See the many examples in the file.
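As a rough illustration of the naming rule, here is a minimal sketch. The two constants and the regular expression are hypothetical: the pattern is only a guess at what the package test enforces, and the "." separator after the qualifier is inferred from names such as driver.MAKE used elsewhere on this page.

```r
# Hypothetical constants, written in the style described above
CONV_EXAMPLE_FACTOR <- 1000         # ALL CAPS, no qualifier
energy.EXAMPLE_THRESHOLD <- 0.5     # lowercase qualifier, then ALL CAPS

# Assumed pattern (not the package's actual test): an optional lowercase
# qualifier followed by ".", then capitals, digits and underscores
constant_pattern <- "^([a-z]+\\.)?[A-Z][A-Z0-9_]*$"

candidates <- c("CONV_EXAMPLE_FACTOR", "energy.EXAMPLE_THRESHOLD",
                "driver.MAKE", "myConstant")
follows_convention <- grepl(constant_pattern, candidates)
names(follows_convention) <- candidates
print(follows_convention)
```

Under this assumed pattern the first three names pass and myConstant does not. The test in the package remains the authority on what is accepted.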

Input generated by another chunk

This is easy: just tell the driver what inputs your chunk needs:

  } else if(command == driver.DECLARE_INPUTS) {
    return(c("L100.gdp_mil90usd_ctry_Yh"))
  } else if(command == driver.MAKE) {

This chunk requires a single input, named L100.gdp_mil90usd_ctry_Yh. If your chunk doesn't require any inputs at all, just return NULL in this line.
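To show where that fragment sits, here is a minimal, self-contained sketch of the command dispatch. The function name and the values assigned to the driver constants are stand-ins chosen for illustration; the real constants are defined by the package. The fragment above opens with "} else if", so a real chunk has at least one earlier branch, which is omitted here.

```r
# Stand-in values for the driver's command constants (assumed for this sketch)
driver.DECLARE_INPUTS <- "DECLARE_INPUTS"
driver.MAKE <- "MAKE"

# Hypothetical chunk showing only the command dispatch
module_example_chunk <- function(command, ...) {
  if(command == driver.DECLARE_INPUTS) {
    # Names of the inputs this chunk needs; return(NULL) if there are none
    return(c("L100.gdp_mil90usd_ctry_Yh"))
  } else if(command == driver.MAKE) {
    # A real chunk does its data processing here
    return(invisible(NULL))
  } else {
    stop("Unknown command: ", command)
  }
}

inputs <- module_example_chunk(driver.DECLARE_INPUTS)
print(inputs)
```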

Input read in from a file

This extends the example above:

  } else if(command == driver.DECLARE_INPUTS) {
    return(c("L100.gdp_mil90usd_ctry_Yh",
             FILE = "energy/A13.MSW_curves",
             FILE = "common/iso_GCAM_regID"))
  } else if(command == driver.MAKE) {

In this case, the chunk needs three inputs before it can run: one (L100.gdp_mil90usd_ctry_Yh) that will be generated by some other chunk, and two that are read from disk (and so must be marked using FILE = as shown). Note that if no extension is given with a filename, as in the example above, the driver will look for files ending in .csv, .csv.gz, or .csv.zip.

Data files should be put into inst/extdata/, following the existing folder structure there. Any file larger than ~1 MB should be compressed. More information on input files and their documentation can be found on this page.
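The FILE = marker works because c() keeps element names, and unnamed elements get an empty name. The sketch below shows how file inputs could be told apart from chunk inputs, and which file names would be tried for an extension-less entry. The helper candidate_files() is hypothetical and the order in which the extensions are tried is an assumption; this is not the driver's actual code.

```r
inputs <- c("L100.gdp_mil90usd_ctry_Yh",
            FILE = "energy/A13.MSW_curves",
            FILE = "common/iso_GCAM_regID")

# Unnamed elements have the name "", so the FILE entries are easy to pick out
is_file <- names(inputs) == "FILE"
chunk_inputs <- unname(inputs[!is_file])
file_inputs <- unname(inputs[is_file])

# Hypothetical helper: the file names to try when no extension is given
candidate_files <- function(name, extensions = c(".csv", ".csv.gz", ".csv.zip")) {
  paste0(name, extensions)
}

print(chunk_inputs)
print(file_inputs)
print(candidate_files(file_inputs[1]))
```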

Temporary 'injection' data

Because data system chunks frequently depend on data produced by other chunks, there's a dependency constraint (see graphic on the main page of the repo). To work around this problem, i.e. to allow folks to translate chunks in any order, we have a 'data injection' mechanism. This simply means that what will eventually be an input from another chunk is temporarily treated as a file input, i.e. we use data from the old data system. In this case, we frequently need to apply a small transformation to the temporary data; that transformation will be taken out later, when the chunk is 'stitched' together with the upstream chunk. Most frequently, data in wide-X format (one column per year, with an X in front of each year) needs to be reshaped to long format. For example:

# We see from the 'temp-data-inject/' prefix that this is a temporary data injection
get_data(all_data, "temp-data-inject/L122.LC_bm2_R_HarvCropLand_Yh_GLU") %>%
      # The following two lines of code will be removed later, when we're using 'real' data
      gather(year, value, -GCAM_region_ID, -Land_Type, -GLU) %>%   # reshape
      mutate(year = as.integer(substr(year, 2, 5))) ->   # change Xyear to year
      L122.LC_bm2_R_HarvCropLand_Yh_GLU

# From here, the code proceeds as normal.

Note that temporary inputs are files. In the if(command == driver.DECLARE_INPUTS) section of the chunk, they therefore need to be declared as FILE = "temp-data-inject/L122.LC_bm2_R_HarvCropLand_Yh_GLU", or else the system will throw an error. Both the FILE = marker and the temp-data-inject/ prefix need to be removed once this input is being generated by the appropriate chunk.
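To see what the reshaping step in the example above does, here is the same gather/mutate pair applied to a tiny made-up table. The column names follow the example; the values and the GLU codes are invented purely for illustration, and the sketch assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Made-up wide-X data: one column per year, each year prefixed with "X"
wide <- data.frame(GCAM_region_ID = c(1L, 2L),
                   Land_Type = c("HarvCropLand", "HarvCropLand"),
                   GLU = c("GLU001", "GLU002"),
                   X1990 = c(10.5, 20.5),
                   X2000 = c(11.5, 21.5),
                   stringsAsFactors = FALSE)

wide %>%
  gather(year, value, -GCAM_region_ID, -Land_Type, -GLU) %>%   # reshape to long
  mutate(year = as.integer(substr(year, 2, 5))) ->   # "X1990" becomes 1990L
  long

print(long)
```

Each region/year combination ends up on its own row, with year stored as an integer rather than as a column name.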

Silencing package check notes

One consequence of using dplyr is that the R CMD check process can't 'resolve' the non-standard evaluation variable names used in the pipelines, and thus will issue NOTEs during the package check that these appear to be undefined. (I.e., dplyr functions allow you to use bare symbols in expressions like select(somedata, iso, GCAM_region_ID), but R CMD check doesn't 'know' about these pipeline semantics.)

The workaround for this is to set any such variables to NULL at the beginning of the chunk, and each developer is responsible for suppressing those messages in any chunks that they develop. In the example above, it would look like iso <- GCAM_region_ID <- NULL. Consider this the last step in your pre-push checks. Picking up your litter helps keep the place clean for everyone.
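A minimal sketch of the pattern, assuming dplyr is installed. The function name and the toy data are hypothetical; only the NULL assignment and the select() call come from the text above.

```r
library(dplyr)

# Hypothetical helper standing in for part of a chunk
select_region_ids <- function(somedata) {
  # Silence package check notes about undefined variables
  iso <- GCAM_region_ID <- NULL

  # select() looks for columns with these names in somedata first,
  # so the NULL placeholders do not change the result
  select(somedata, iso, GCAM_region_ID)
}

toy <- data.frame(iso = c("usa", "can"),
                  GCAM_region_ID = c(1L, 2L),
                  extra = c(TRUE, FALSE),
                  stringsAsFactors = FALSE)
result <- select_region_ids(toy)
print(names(result))
```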