
Beyond Single-Core R 
========================================================
author: Jonathan Dursi
date: http://github.com/ljdursi/beyond-single-core-R
font-import: http://fonts.googleapis.com/css?family=Average
font-import: http://fonts.googleapis.com/css?family=Oswald
font-family: 'Average'
autosize: false
transition-speed: fast
css: assets/oicr.css

```{r setup, include=FALSE}
library(knitr)
set.seed(1)
opts_knit$set(root.dir='.')
trunc.gc <- function() { gc(TRUE)[,c(1:2,5:6)] }
```

Today's Outline
========================================================

Today will look something like this:

- How to think about scaling 
- Parallel Package
    - Multicore
        - mcparallel/mccollect/mclapply
        - parallel RNG
        - load balancing, chunking
        - pvec
    - Snow
        - makeCluster/stopCluster/clusterExport
        - clusterSplit
- Foreach
    - chunking, iterators
- Scalable Data Analysis Best Practices

Extra material online
========================================================
- R and Memory
- Data file formats
- BigMemory 
- Rdsm 
- pbdR 
  
Not Covered
========================================================
- R in other frameworks
  - SparkR (R + Apache Spark)
  - RHadoop
  - Cool but more about the other framework than R

Thinking about scaling
============================================
type: sub-section

Some hardware terms
============================================

Hardware:

- Node: A single motherboard, with possibly multiple sockets
- Processor/Socket: the silicon containing likely multiple cores
- Core: the unit of computation; often has hardware support for pseudo-cores (hardware threads)
- Pseudo-cores: appear to the OS as multiple cores, but share much functionality with the other pseudo-cores on the same core

***

![Sockets, Cores, and Hardware threads](images/sockets-cores.png)

Some software terms
============================================

Processes and threads:

- Process: Data and code in memory
- One or more **threads** of execution within a process
- Threads in the same process can see most of the same memory
- Processes generally cannot peer into another process's memory

Interpreted languages: generally you can only directly work with processes

Can call libraries that invoke threads (BLAS/LAPACK)

***

![Processes vs threads](images/process-threads.png)

Parallel computing: faster, bigger, more
============================================
One turns to parallel computing to solve one of three problems:

My program is too **slow**.

Perhaps using more cores --- *e.g.*, all cores on my desktop --- will make things  **faster**.
- Compute bound.
- Tools:
  - parallel/multicore
  - Rdsm
  - GPUs
    
***

![Rack of Computers](images/rack.png)

Parallel computing: faster, bigger, more
============================================
My problem is too **big**.

Perhaps splitting the problem up onto multiple computers in a cluster will give it access to enough memory to run effectively.
- Memory bound
- Tools: 
  - parallel/snow
  - pbdR
    
***

![Rack of Computers](images/rack.png)

Parallel computing: faster, bigger, more
============================================
There are too **many** computations to do - one task runs in a reasonable amount of time, but I have to run thousands!

Perhaps running the tasks on many cores, or on multiple computers in a cluster, will get them all done in a reasonable amount of time.
    
- Tools: 
  - gnu-parallel
  - parallel
  - job queues...
    
***

![Rack of Computers](images/rack.png)


Concurrency: Multiple Independent Computations
========================================================

For more cores/nodes to help, there has to be something for them to do.

Find largely independent computations to occupy them.

Classic example of this is a parameter study, or set of simulations with
different seeds:

"More" case

*** 
![Parameter Study](images/paramstudy.png)

Scaling of parameter study: Throughput
========================================================
In this example, no individual task runs any faster with more processors,
but the workload as a whole can.

How long it takes to process the N tasks you want done

Throughput: how many tasks/time

For completely independent tasks, P processors can increase throughput by factor P!
 
***
![Throughput](images/throughput-sm.png)


Scaling of parameter study with number of processors
========================================================
How a problem scales: how throughput behaves as processor number increases.

In this case, the throughput scales linearly with the number of processors.

This is the best case: "Perfect scaling"

*** 
```{r echo=FALSE}
time.per.task <- 2.
p <- 1:12
throughput <- p/time.per.task
plot(p, throughput, pch=16, xlab="Number of Processors", ylab="Throughput (tasks/time)")
```


Scaling of parameter study with number of processors
========================================================
Another way to look at it: time it takes to get some fixed amount of work done

More usual (and more important!)

Perfect scaling: time to completion ~ 1/P

P processors - P times faster

*** 
```{r echo=FALSE}
time.per.task <- 2.
ntasks <- 12
total.time <- (time.per.task*ntasks)/p
plot(p, total.time, pch=16, xlab="Number of Processors", ylab="Total Time Required")
```

Scaling of parameter study with number of processors
========================================================
Another way to look at it: time it takes to get some fixed amount of work done

More usual (and more important!)

Perfect scaling: time to completion ~ 1/P

P processors - P times faster

*** 
```{r echo=FALSE}
plot(p, total.time, log="xy", pch=16, xlab="Number of Processors", ylab="Total Time Required")
```

Scaling of parameter study with number of processors
========================================================
Finally, how efficient is the scaling - when you throw 10 processors at the problem,
are you getting 10 times the processing?

Perfect scaling - efficiency = 100%

*** 
```{r echo=FALSE}
ideal.time <- (time.per.task*ntasks)/p
efficiency <- ideal.time/total.time
plot(p, efficiency, pch=16, xlab="Number of Processors", ylab="Parallel Efficiency")
```
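
Scaling of parameter study with number of processors
========================================================

As a rough illustration, speedup and efficiency can be computed directly from measured run times.  The timings below are made up for the example (they are not measurements from this talk), and the names `procs`, `run.time`, `speedup` and `par.eff` are only for illustration:

```{r eval=FALSE}
# Hypothetical wall-clock times (seconds) for the same workload
procs    <- c(1, 2, 4, 8)
run.time <- c(24, 12.5, 6.8, 4.1)

speedup <- run.time[1] / run.time   # how many times faster than 1 processor
par.eff <- speedup / procs          # fraction of ideal (perfect scaling = 1)

data.frame(procs, run.time, speedup, par.eff)
```

An efficiency that falls steadily as processors are added is a sign that overhead or serial work is starting to dominate.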


Finding concurrency: Split, Apply, Combine
========================================================

Popularized by Hadley Wickham, this has
become a model for thinking about data analysis in R in the tidyverse.

Split the data set up into relevant sub-sets; apply some analysis to it; combine the
results.

This is exactly the way to think about scalable data analysis.  Split the data - 
or tasks on that data - up between computing elements; do the analyses; then combine
the results somehow.

The details depend a great deal on the analyses (and the nature of the data.)

*** 

![Split, Apply, Combine](images/split-apply-combine.png)
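
Finding concurrency: Split, Apply, Combine
========================================================

A minimal sketch of the pattern on toy data (the departure hours below are invented for illustration, and `count.toy` is a hypothetical helper, not part of the later examples):

```{r eval=FALSE}
# Toy data: departure hours for eight flights (made up)
toy.hours <- c(0, 1, 1, 5, 23, 23, 23, 5)

# Apply step: count flights in each of the 24 hours of the day
count.toy <- function(h) tabulate(h + 1, nbins = 24)

pieces   <- split(toy.hours, rep(1:2, each = 4))  # split into two halves
partial  <- lapply(pieces, count.toy)             # apply to each piece
combined <- Reduce("+", partial)                  # combine by adding counts

# The combined result matches counting everything in one go
identical(combined, count.toy(toy.hours))
```

Because each piece is processed independently, the `lapply` in the middle is the step that a parallel tool can take over.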

Imperfect parallelism
========================================================

"Split" and "Combine" aren't free!

Partitioning the work, and assembling the final results from the partial results,
represents some overhead - some fraction of the work that must be done in serial.

Splitting the work and distributing it over a network takes even more time.

Amdahl's Law, for a serial fraction $f$ of the work run on $P$ processors:
$$
T_P \approx T_1 \left ( f + \frac{1 - f}{P} \right )
$$

*** 
```{r echo=FALSE}
# Parallel efficiency T_1 / (P * T_P) predicted by Amdahl's law
amdahl.efficiency <- function(f,p) (1./p) / (f + (1.-f)/p)
serial.fracs <- seq(0.,.75,by=.15)
eff <- matrix(nrow=length(serial.fracs), ncol=length(p))
for (i in seq_along(serial.fracs)) {
  eff[i,] <- sapply(p, function(np) amdahl.efficiency(serial.fracs[i], np))
}
matplot(t(eff), type = c("b"), pch=16, xlab="Number of Processors", ylab="Efficiency", col=1:6)
legend("topright", legend = serial.fracs,  pch=16, col=1:6, title="Serial Fraction")
```
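
Imperfect parallelism
========================================================

To make the formula concrete, here is a small sketch that evaluates the speedup Amdahl's law predicts.  The serial fraction of 0.1 and the helper name `amdahl.speedup` are assumptions chosen for illustration:

```{r eval=FALSE}
# Speedup predicted by Amdahl's law for serial fraction f on np processors
amdahl.speedup <- function(f, np) 1 / (f + (1 - f) / np)

nprocs <- c(1, 2, 4, 8, 16, 1024)
round(amdahl.speedup(0.1, nprocs), 2)

# No matter how many processors, the speedup can never exceed 1/f
1 / 0.1
```

With 10% of the work serial, 16 processors give a speedup of only 6.4, and no number of processors can do better than 10.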

Task vs data parallelism
========================================================

What we've described has been finding concurrency by analyzing
different chunks of data the same way

- seeds for simulation
- parameters for parameter sweep
- subsets of data 

Also possible is to identify different tasks that must be done and perform those in parallel:
multiple fits, summary analysis + generating several plots, etc.

Generally more manual but can work very well.

Dependencies limit parallelism

***

![Diagram showing dependencies](images/gantt.png)

Existing parallelism (BLAS, package support)
========================================================
type: sub-section

Existing parallelism
========================================================

It's important to realize that many fundamental routines as well as higher-level packages come with some degree of scalability and parallelism "baked in".  

Try running `top` (or `glances`, or...) while executing the following in R:

```{r eval=FALSE}
n <- 4*1024
A <- matrix( rnorm(n*n), ncol=n, nrow=n )
B <- matrix( rnorm(n*n), ncol=n, nrow=n )
C <- A %*% B
```

Existing parallelism
========================================================

![Top while running matrix mult](images/BLAS-parallel-sm.png)

One R process using 458% of a processor.

R can be built using high performance threaded libraries for math in general, and linear algebra
--- which underlies *many* data analysis algorithms --- in particular.

Here the single R process has launched several threads of execution -- all of which are part of the same process, and so can see the same memory, e.g., the large matrices.
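
Existing parallelism
========================================================

Whether you see this behaviour depends on which BLAS/LAPACK your R was built or linked against.  A quick way to check, assuming R 3.4.0 or later (where these functions report library paths):

```{r eval=FALSE}
# Which linear algebra libraries is this R session using?
La_version()               # LAPACK version string
La_library()               # path to the LAPACK library in use
extSoftVersion()["BLAS"]   # path to the BLAS library (may be empty on some builds)
```

`sessionInfo()` prints the same BLAS and LAPACK paths along with the rest of the session details.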


Packages that explicitly use parallelism
========================================================

For a complete list, see 

http://cran.r-project.org/web/views/HighPerformanceComputing.html .

- Biopara
- BiocParallel for Bioconductor
- bigrf - Random Forests
- caret - cross-validation, bootstrap characterization of predictive models
- GAMBoost - boosting glms

Plus packages that use linear algebra or other expensive math operations which 
can be implicitly multithreaded.

When at all possible, don't do the hard work yourself --- look to see if a package already
exists which will do your analysis at scale.

Caret
========================================================

Caret is a widely-used machine learning package, that uses `foreach` (which we'll learn about)
to parallelize things like CV-folds, etc:

```{r caret}
data(iris)
library(caret)

control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"

system.time(fit.lda.ser <- train(Species~., data=iris, method="svmRadial", metric=metric, trControl=control)
)
```

Caret
========================================================
```{r caretparallel}
library(doParallel)
registerDoParallel(4)

system.time(fit.lda.par <- train(Species~., data=iris, method="svmRadial", metric=metric, trControl=control)
)

stopImplicitCluster()
```

Packages for Implementing Parallel Workflows
========================================================
type: sub-section

The Parallel Package
========================================================

Since R 2.14.0 (late 2011), the `parallel` package has been part of core R. 

Incorporates - and mostly supersedes - two other packages:

- `multicore`: for using all processors on a single computer.  Not on Windows.
- `snow`: for using any group of processors, possibly across a cluster.

Many packages which use parallelism use one of these two, so worth understanding.

Both create new *processes* (not threads) to run on different processors; but in importantly
different ways.

Multicore - forking
========================================================
    
Multicore creates new processes by forking --- cloning --- the original process.  

That means the new processes start off seeing a copy of exactly the same data as the original.
*E.g.*, the first process can read a file, and then fork two new processes - each will see a copy of the
file's data.

*Not* shared memory; changes in one process will not be reflected in others.

Windows doesn't have fork(), so Windows can't use these routines.

*** 

![Multicore: fork()](images/fork-sm.png)
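
Multicore - forking
========================================================

A small sketch of what "a copy, not shared memory" means in practice, assuming a Unix-like system where forking is available.  The variable `fork.demo` is just an illustrative name:

```{r eval=FALSE}
library(parallel)

fork.demo <- 1

# The child gets its own copy of fork.demo; it changes and returns that copy
child.job   <- mcparallel({ fork.demo <- fork.demo + 1; fork.demo })
child.value <- mccollect(child.job)[[1]]

child.value   # value as seen inside the child
fork.demo     # the parent's copy is untouched
```

The only way to get results back from a forked child is through its return value.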


Multicore - forking
========================================================
    
**Performance Tip**: Modern OSs are lazy - the copy of memory isn't made unless it has to be, and
it doesn't have to be until one process or the other writes to the memory.  

That copy is slow, and takes new memory.  

So in `multicore`, don't overwrite old variables if possible.

***

![Multicore: fork()](images/fork-sm.png)

Snow - Spawning
========================================================
Snow creates entirely new R processes to run the jobs.

A downside is that you need to explicitly copy over any needed data and functions.

But the upside is that a new process can be spawned on a remote machine, 
not just the current machine.  So you can in principle use entire clusters.

In addition, the flipside of the downside: new processes don't carry any unneeded data,
so the total memory footprint is smaller.

***

![SNOW: spawn()](images/spawn-sm.png)


mcparallel/mccollect
========================================================

The simplest use of the `multicore` package is the pair of functions `mcparallel()` and 
`mccollect()`.

`mcparallel()` forks a task to run a given function; it then runs in the background.  
`mccollect()` waits for and gets the result.

Let's pick an example: reading the airlines data set, we want --- for a particular month --- to 
know both the total number of planes in the data (by tail number) and the median elapsed flight
time.  These are two independent calculations, and so can be done independently.

mcparallel/mccollect
========================================================

We start the two tasks with `mcparallel`, and collect the answers with `mccollect`:
```{r}
library(parallel, quietly=TRUE)
source("data/airline/read_airline.R")
jan2010 <- read.airline("data/airline/airOT201001.csv.gz")
unique.planes <- mcparallel( length( unique( sort(jan2010$TAIL_NUM) ) ) ) 
median.elapsed <- mcparallel( median( jan2010$ACTUAL_ELAPSED_TIME, na.rm=TRUE ) )
ans <- mccollect( list(unique.planes, median.elapsed) )
ans
```
We get a list of answers, with each element "named" by the process ID that ran the job.  We find that there are 4555 planes in the data set, and the median flight in the data set is 110 minutes 
in the air.
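
mcparallel/mccollect
========================================================

Process IDs make awkward labels.  As a small sketch on toy calculations (again assuming a Unix-like system), `mcparallel()` accepts an optional `name` argument, and `mccollect()` uses those names for the results when several jobs are collected:

```{r eval=FALSE}
library(parallel)

job.total   <- mcparallel( sum(1:100),       name="total" )
job.average <- mcparallel( mean(c(2, 4, 6)), name="average" )

toy.ans <- mccollect( list(job.total, job.average) )
toy.ans$total
toy.ans$average
```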

mcparallel/mccollect
========================================================

Does this save any time?  Let's do some independent fits to the data.  Let's try to see what
the average in-flight speed is by fitting time in the air to distance flown; and let's see how the 
arrival delay correlates with the departure delay.  (Do planes, on average, make up some time
in the air, or do delays compound?)

```{r}
system.time(fit1 <-  lm(DISTANCE ~ AIR_TIME, data=jan2010))
system.time(fit2 <-  lm(ARR_DELAY ~ DEP_DELAY, data=jan2010))
```

mcparallel/mccollect
========================================================

So the time to beat is about 0.7s:
```{r}
parfits <- function() {
  pfit1 <- mcparallel(lm(DISTANCE ~ AIR_TIME, data=jan2010))
  pfit2 <- mcparallel(lm(ARR_DELAY ~ DEP_DELAY, data=jan2010))
  mccollect( list(pfit1, pfit2) )
}
system.time( parfits() )
```
Clearly, forking the processes and waiting for them to rejoin takes some time in itself.  

This overhead means that we want to launch jobs that take a significant length of
time to run - much longer than the overhead (hundredths to tenths of seconds for fork().)

Clustering
========================================================

Typically we want to do more than an itemized list of independent tasks - we have a list
of similar tasks we want to perform.

`mclapply` is the multicore equivalent of `lapply` - apply a function to a list, get 
a list back.

Let's say we want to see what similarities there are between delays at O'Hare airport 
in Chicago in 2010.  Clustering methods attempt to uncover "similar" rows in a dataset
by finding points that are near each other in some $p$-dimensional space, where $p$ is the
number of columns.

$k$-Means is a particularly simple, randomized, method; it picks $k$ cluster centre-points
at random, finds the rows closest to them, assigns them to the cluster, then moves the
cluster centres towards the centre of mass of their cluster, and repeats.

Quality of result depends on number of random trials.
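
Clustering
========================================================

To make the description concrete, here is a bare-bones sketch of a single random trial.  It is not how `kmeans()` in R is implemented (its default is the Hartigan-Wong algorithm); `simple.kmeans` and the toy points are hypothetical, and each centre is moved all the way to the centre of mass of its cluster:

```{r eval=FALSE}
simple.kmeans <- function(x, k, iters = 10) {
  x <- as.matrix(x)
  # pick k distinct rows at random as the starting centres
  centres <- x[sample(nrow(x), k), , drop = FALSE]
  for (it in seq_len(iters)) {
    # squared distance from every row to each centre (n x k matrix)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centres[j, ])^2))
    # assign each row to its nearest centre
    cluster <- max.col(-d2, ties.method = "first")
    # move each centre to the centre of mass of its cluster
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centres[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centres = centres)
}

# Two toy blobs of 20 points each in 2 dimensions
set.seed(1)
toy.pts <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
                 matrix(rnorm(40, mean = 10), ncol = 2))
toy.fit <- simple.kmeans(toy.pts, k = 2)
table(toy.fit$cluster)
```

Each call is one random trial; the `nstart` argument of `kmeans()` repeats the trial and keeps the best result, which is what makes the work easy to divide up.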

Clustering
========================================================

Let's try that with our subset of data:

```{r}
# columns listing various delay measures
delaycols <- c(18, 28, 40, 41, 42, 43, 44)
air2010 <- readRDS("data/airline/airOT2010.RDS")
ord.delays <- air2010[(air2010$ORIGIN=="ORD"), delaycols]
rm(air2010)
ord.delays <- ord.delays[(ord.delays$ARR_DELAY_NEW > 0),]
ord.delays <- ord.delays[complete.cases(ord.delays),]

system.time( serial.res   <- kmeans(ord.delays, centers=2, nstart=40) )
serial.res$withinss
``` 

Clustering with lapply
========================================================

Running 40 random trials is the same as running 10 random trials 4 times.  Let's try that approach with `lapply`:

```{r}
do.n.kmeans <- function(n) { kmeans(ord.delays, centers=2, nstart=n) }
system.time( list.res <- lapply( rep(10,4), do.n.kmeans ) )

res <- sapply( list.res, function(x) x$tot.withinss )
lapply.res <- list.res[[which.min(res)]]
lapply.res$withinss
```
We get the same answer, but it takes a little longer - there's a bit of overhead from splitting the work up and starting the
process four times.  We could make the overhead less important by using more trials, which would be 
better anyway.

Clustering with mclapply
========================================================

`mclapply` works the same way as `lapply`, but forks off processes (as with
`mcparallel`) to do the work:

```{r}
system.time( list.res <- mclapply( rep(10,4), do.n.kmeans, mc.cores=4 ) )

res <- sapply( list.res, function(x) x$tot.withinss )
mclapply.res <- list.res[[which.min(res)]]
mclapply.res$tot.withinss
```

Clustering with mclapply
========================================================

Note what the output of top looks like when this is running:

![top with mclapply](images/top-mclapply.png)

There are four separate processes running - not one process using multiple CPUs via threads.

Clustering with mclapply
========================================================

Looks good!  Let's take a look at the list of results:
```{r}
res
```
What happened here?

Parallel RNG 
========================================================

Depending on what you are doing, it may be very important to have different (or the same!)
random numbers generated in each process.

Here, we definitely want them different - the whole point is to generate different
random realizations.

`parallel` has a good RNG suitable for parallel work based on the work of Pierre L'Ecuyer 
in Montr&eacute;al:
```{r}
RNGkind("L'Ecuyer-CMRG")
mclapply( rep(1,4), rnorm, mc.cores=4, mc.set.seed=TRUE)
```
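
Parallel RNG 
========================================================

For the "or the same!" case: with the L'Ecuyer generator selected, setting a seed before each `mclapply` call should make the parallel draws reproducible while still giving each process its own stream.  This is a sketch assuming the default `mc.preschedule=TRUE` and a Unix-like system; the seed value and the names `draw1` and `draw2` are arbitrary:

```{r eval=FALSE}
library(parallel)
RNGkind("L'Ecuyer-CMRG")

set.seed(42)
draw1 <- mclapply(1:4, function(i) rnorm(1), mc.cores=2)

set.seed(42)
draw2 <- mclapply(1:4, function(i) rnorm(1), mc.cores=2)

identical(draw1, draw2)   # same seed, same streams, same draws
unlist(draw1)
```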

Load balancing
========================================================

Let's say that, instead of running multiple random trials to find the best clustering for a given number of clusters,
we were unsure of how many clusters we wanted to use:
```{r}
do.kmeans.nclusters <- function(n) { kmeans(ord.delays, centers=n, nstart=10) }
time.it <- function(n) { system.time( res <- do.kmeans.nclusters(n)) }
lapply(1:4, time.it)
```

Load balancing
========================================================

More clusters take longer.  If we were to `mclapply` these four tasks on 2 CPUs, the tasks would be
dealt out ahead of time - tasks 1 and 3 to the first CPU, tasks 2 and 4 to the second - so the second
CPU gets the longer task of each pair and finishes well after the first: bad _load_ _balance_.  

Normally, we want to hand multiple tasks of work off to each processor and only hear back when
they're completely done - minimal overhead.  But that works best when all tasks have similar 
lengths of time.

If you don't know that this is true, you can do _dynamic_ scheduling - give each processor one
task, and when they're done they can ask for another task.

More overhead, but better distribution of work.

Load balancing
========================================================

```{r}
system.time( res <- mclapply(1:4, time.it, mc.cores=2) )
system.time( res <- mclapply(1:4, time.it, mc.cores=2, mc.preschedule=FALSE) )
```


Splitting the data set
========================================================

So far we've seen splitting the tasks; let's consider splitting the dataset.

Let's make a histogram of the times flights took off - say, binned by the hour.

```{r}
get.hour <- function(timeInt) timeInt %/% 100
count.hours <- function(range) {
  counts <- rep(0,24)
  hours <- sapply(jan2010$DEP_TIME[range], get.hour)
  hist <- rle( sort(hours) )   # sort() also drops missing departure times
  for (i in seq_along(hist$values)) {
    j <- hist$values[i] + 1
    if (j == 25) j <- 1        # times recorded as 2400 count as hour 0
    counts[j] <- counts[j] + hist$lengths[i]   # add, so 0000 and 2400 both count
  }
  counts
}
```


Splitting the data set
========================================================

We can count up *all* flight hours like so:
```{r}
system.time(scounts <- count.hours(1:nrow(jan2010)))
scounts
```

Splitting the data set
========================================================

Can we split this up between tasks?  Let's try this:
```{r}
nr <- nrow(jan2010)
ncores <- 4
# contiguous, near-equal chunks; works even when nr is not a multiple of ncores
chunks <- split(1:nr, sort(rep_len(1:ncores, nr)))
system.time(counts <- mclapply( chunks, count.hours, mc.cores=ncores) )
```

Splitting the data set
========================================================

That was definitely faster - how do the answers look?

```{r}
str(counts)
Reduce("+", counts)
```

To be fair, we'd have to include the Reduction time in the total time - but that's just the sum of 
four short vectors, probably not a big deal.

pvec - simplified mclapply
========================================================

For the simple and common case of applying a vectorized function to a long vector and returning a
vector, the parallel package has a simplified version of mclapply called `pvec`.  It hands each process
a contiguous chunk of the vector rather than one element at a time.
```{r}
fx <- function(x) x^5-x^3+x^2-1
maxn <- 1e6
system.time( res <- sapply(1:maxn, fx) )
system.time( res <- vapply(1:maxn, fx, 0.) )
```

pvec - simplified mclapply
========================================================

```{r}
system.time( res <- pvec(1:maxn, fx, mc.cores=2) )
system.time( res <- pvec(1:maxn, fx, mc.cores=4) )
system.time( res <- mclapply(1:maxn, fx, mc.cores=4) )
```
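
pvec - simplified mclapply
========================================================

A tiny sketch of what `pvec` does, on a toy vector (assuming a Unix-like system; `sq` and `par.sq` are illustrative names).  The function is called once per chunk, so it has to be vectorized:

```{r eval=FALSE}
library(parallel)

sq <- function(v) v^2      # vectorized: works on a whole chunk at once
par.sq <- pvec(1:10, sq, mc.cores=2)

identical(par.sq, (1:10)^2)
```

That per-chunk call is why `pvec` can be so much cheaper than `mclapply` over individual elements.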

Test your skills: parallel/multicore
========================================================

Using the entire 2010 dataset, and the examples above, examine one of the following questions:

- In 2010, what airport (with more than say 10 outgoing flights) had the largest fraction of 
outgoing flights delayed?
- For some given airport - what hour of the day had the highest relative fraction of delayed flights?
- For all airports?
- What is the effect of including the `split()` and the `Reduce()` on the serial-vs-parallel timings for this histogram?  Is there a better way of doing the splitting?

Summary: parallel/multicore 
========================================================

The `mc*` routines in parallel work particularly well when:

- You want to make full use of the processors on a single computer
- Each task only reads from some big common data structure and produces modest-sized results
- `mcparallel` works very well for task parallelism; `mclapply` for data parallelism.

Things to watch for:

- Modifying the big common data structure:
    - Won't be seen by other processes,
    - But will blow up the memory requirements
- You can only use one machine's processors
- Won't work on Windows (but what does?)
- `mc.cores` is a lie.  It's the number of  _tasks_, not _cores_.  Can easily oversubscribe cores explicitly or implicitly
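
Summary: parallel/multicore 
========================================================

One hedged way to pick a value for `mc.cores` rather than hard-coding it.  Leaving one core free is a common convention, not a rule, and `n.avail` and `n.use` are illustrative names:

```{r eval=FALSE}
library(parallel)

n.avail <- detectCores()   # may count hardware threads; can be NA
n.use   <- if (is.na(n.avail)) 1 else max(1, n.avail - 1)   # leave one core free
n.use
```

If the tasks themselves call threaded libraries (such as a parallel BLAS), a smaller number may be needed to avoid oversubscribing the machine.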

Multiple computers with parallel/snow
========================================================
type: sub-section

```{r include=FALSE}
# Get rid of the data up to this point
rm(list=ls())
```

Multiple computers with parallel/snow
========================================================

The other half of parallel, routines that were in the still-active `snow` package, allow you
to again launch new R processes --- by default, on the current computer, but also on any computer
you have access to.   (SNOW stands for "Simple Network of Workstations", which was the original 
use case).

The recipe for doing computations with snow looks something like:

```{r eval=FALSE}
library(parallel)
cl <- makeCluster(nworkers,...)
results1 <- clusterApply(cl, ...)
results2 <- clusterApply(cl, ...)
stopCluster(cl)
```

Other than the `makeCluster()`/`stopCluster()`, it looks very much like multicore and `mclapply`.

Hello world
========================================================

Let's try starting up a "cluster" (i.e., a set of workers) and generating some random numbers 
from each:

```{r}
library(parallel)
cl <- makeCluster(4)
clusterCall(cl, rnorm, 5)
stopCluster(cl)
```

Hello world
========================================================

`clusterCall()` runs the same function (here, `rnorm`, with argument `5`) on all workers in
the cluster.  A related helper function is `clusterEvalQ()`, which is handier to use for some
setup tasks, e.g.:
```{r eval=FALSE}
clusterEvalQ(cl, {library(parallel); library(foreach); NULL} )
```

Clustering on Clusters
========================================================

Emboldened by our success so far, let's try re-doing our $k$-means calculations:

```{r}
delaycols <- c(18, 28, 40, 41, 42, 43, 44)

source("data/airline/read_airline.R")
jan2010 <- read.airline("data/airline/airOT201001.csv.gz")
jan2010 <- jan2010[,delaycols]
jan2010 <- jan2010[complete.cases(jan2010),]
do.n.kmeans <- function(n) { kmeans(jan2010, centers=4, nstart=n) }
```
```{r eval=FALSE}
library(parallel)
cl <- makeCluster(4)
res <- clusterApply(cl, rep(5,4), do.n.kmeans)
stopCluster(cl)
```
```{r eval=FALSE}
 Error in checkForRemoteErrors(val) : 
-----------------------------
   4 nodes produced errors; first error: object 'jan2010' not found
-----------------------------
```
Ah!  Failure.

Clustering on Clusters
========================================================

Recall that we aren't forking here; we are creating processes from scratch.  These processes,
new to this world, are not familiar with our ways, customs, or datasets.  We actually have to 
ship the data out to the workers:

```{r}
cl <- makeCluster(4)
system.time(clusterExport(cl, "jan2010"))
system.time(cares <- clusterApply(cl, rep(5,4), do.n.kmeans))
stopCluster(cl)
system.time( mcres <- mclapply(rep(5,4), do.n.kmeans, mc.cores=4) )
```

Clustering on Clusters
========================================================

Note that shipping data back and forth, and creating the processes from
scratch, is relatively costly - but this is the price we pay for being able to spawn the processes
anywhere.  

(And if our computations take hours to run, we don't really care about several-second delays.)

Running across machines
========================================================

The default cluster is a sockets-based cluster; you can run on multiple machines by 
listing the hosts (one entry per worker) in the call to `makePSOCKcluster`:

```{r eval=FALSE}
hosts <- c( rep("localhost",8), rep("192.168.0.10", 8) )
cl <- makePSOCKcluster(names=hosts)
clusterCall(cl, rnorm, 5)
clusterCall(cl, system, "hostname")
stopCluster(cl)
```

Once it is done, you have successfully run random number
generators across multiple hosts.

Cluster types
========================================================

`parallel` has several different cluster types:

- PSOCK (Posix sockets): the default type
- Fork workers: but if you're going to use this, you may as well just use multicore.
- MPI: this is similar in a way to PSOCK clusters, but startup and communications can be much faster once you start going to large numbers (say >64) of hosts.  We won't cover this today; using the MPI cluster type is conceptually identical to PSOCK clusters.

Work distribution and Load Balancing
========================================================

Because of the need to send (possibly large) data to the workers, the scheduling of
workers is even more important than with multicore.

The `snow` library has very nice visualization tools for timing that are inexplicably absent
from `parallel`; so let's temporarily use snow:

```{r}
library(snow, quietly=TRUE)
```

Work distribution and Load Balancing
========================================================

```{r}
do.kmeans.nclusters <- function(n) { kmeans(jan2010, centers=n, nstart=10) }

cl <- makeCluster(2)
clusterExport(cl,"jan2010")
tm <- snow.time( clusterApply(cl, 1:6, do.kmeans.nclusters) )
```

Work distribution and Load Balancing
========================================================

```{r}
plot(tm)
```

Work distribution and Load Balancing
========================================================

```{r}
tm.lb <- snow.time(clusterApplyLB(cl, 1:6, do.kmeans.nclusters))
plot(tm.lb)
stopCluster(cl)
```

Work distribution and Load Balancing
========================================================

The default `clusterApply` sends off one task to each worker, waits until they're both done, then
sends off another.  (Question: why?)

`clusterApplyLB` does something more like `mc.preschedule=FALSE`; it fires off tasks to each worker
as needed.  

Sending off one task at a time can be inefficient if there is a lot of communication involved. But it allows
flexibility in scheduling, which is vitally important if the tasks are of widely varying durations.

clusterSplit and Hour Histogram
========================================================

Of course, for some applications, we don't need to send the entire data structure across.  Let's consider
the departure-time histogram again.  This time, we're only going to send across the data that's going to
be computed on:

```{r}
jan2010 <- read.airline("data/airline/airOT201001.csv.gz")
jan2010 <- jan2010[complete.cases(jan2010),]

get.hour <- function(timeInt) timeInt %/% 100
count.hours <- function(timesInt) {
  counts <- rep(0,24)
  hours <- sapply(timesInt, get.hour)
  hist <- rle( sort(hours) )   # sort() also drops missing departure times
  for (i in seq_along(hist$values)) {
    j <- hist$values[i] + 1
    if (j == 25) j <- 1        # times recorded as 2400 count as hour 0
    counts[j] <- counts[j] + hist$lengths[i]   # add, so 0000 and 2400 both count
  }
  counts
}
```

clusterSplit and Hour Histogram 
========================================================

This time, rather than exporting the entire data set, we'll just send across the bits we need:
```{r}
cl <- makeCluster(2)
clusterExport(cl,"get.hour")  # have to export _functions_, too.
datapieces <- clusterSplit(cl,jan2010$DEP_TIME)
str(datapieces)
ans <- clusterApply(cl, datapieces, count.hours)
Reduce("+", ans)
```
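
clusterSplit and Hour Histogram 
========================================================

To see what `clusterSplit` hands to each worker, here is a toy sketch on the integers 1 to 10, using a separate two-worker cluster (`cl.demo`, `toy.pieces` and `toy.sums` are illustrative names):

```{r eval=FALSE}
library(parallel)

cl.demo <- makeCluster(2)
toy.pieces <- clusterSplit(cl.demo, 1:10)   # one contiguous piece per worker
str(toy.pieces)

toy.sums <- clusterApply(cl.demo, toy.pieces, sum)
stopCluster(cl.demo)

Reduce("+", toy.sums)    # same as sum(1:10)
```

Each worker receives only its own piece, which is the whole point when the full data set is large.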

clusterSplit and Hour Histogram
========================================================

To look a little more closely at some communications and load balance issues, I'm going to 
split the data up into more pieces than workers, and distribute them:
```{r}
stopCluster(cl)
cl <- makeCluster(6)
datapieces <- clusterSplit(cl,jan2010$DEP_TIME)
stopCluster(cl)

cl <- makeCluster(2)
clusterExport(cl,"get.hour")  # have to export _functions_, too.
str(datapieces)
```

clusterSplit and Hour Histogram
========================================================

```{r}
tm <- snow.time( ans <- clusterApply(cl, datapieces, count.hours) )
plot(tm)
```

clusterSplit and Hour Histogram
========================================================
```{r}
tm <- snow.time( ans <- parLapply(cl, datapieces, count.hours) )
plot(tm)
stopCluster(cl)
```

clusterSplit and Hour Histogram
========================================================

If the list you are operating on consists of big chunks of data, the relevant piece is sent to the
worker for its task.

Sometimes that's exactly what you want:

- The chunks nearly fill up memory
- You don't know which task will do which chunk (`clusterApplyLB`)

But if it's not necessary, it adds a delay to the task.  If you know ahead of time the tasks are
of similar duration:

- clusterExport the whole data set (if everyone needs the whole data set)
- Use clusterSplit to split the data set into exactly what each worker needs
- or use `parLapply` to chunk up the data for you and send all the data for one task all at once.
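
clusterSplit and Hour Histogram
========================================================

Whichever scheduling routine you choose, the results come back in the same order as the inputs; only the way the work is handed out differs.  A toy sketch, with `cl.demo`, `toy.tasks` and `square` as illustrative names:

```{r eval=FALSE}
library(parallel)

cl.demo   <- makeCluster(2)
toy.tasks <- 1:8
square    <- function(i) i^2

res.chunked <- parLapply(cl.demo, toy.tasks, square)        # one chunk of tasks per worker
res.dynamic <- clusterApplyLB(cl.demo, toy.tasks, square)   # tasks handed out one at a time
stopCluster(cl.demo)

identical(res.chunked, res.dynamic)
```

That makes it cheap to switch between them while tuning: change the call, compare timings, and keep the same downstream code.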

Back to parallel
========================================================
```{r}
detach("package:snow", unload=TRUE)
```

Summary: parallel/snow
========================================================

The `cluster` routines in `parallel` are good if you know you will eventually have to move to using
multiple computers (nodes in a cluster, or desktops in a lab) for a single computation. 

- Use `clusterExport` for functions and data that will be needed by everyone.
- Communicating data is slow, but *much* faster than having every worker read the same data from a file.
- Use `clusterApplyLB` if the tasks vary greatly in runtime.
- Use `clusterApply` if each task requires an enormous amount of data.
- Use `parLapply` if tasks are similar duration and data from multiple tasks will fit in memory.
- `snow::snow.time` is great for understanding performance.
- Use `makePSOCKcluster` for small clusters; consider `makeMPIcluster` for larger (but see `pbdR` section online)

Test your skills: parallel/snow (1/2)
========================================================

On a set of workstations (or AWS instances) you have access to, try a session with two nodes,
and set up a PSOCK cluster across both nodes.
Call `unlist(clusterCall(cl, system, "hostname"))` to make sure that you have workers on both nodes.

Load the 2010 data and break it up by month (look up the `split` command) and see which month had the
highest fraction of cancelled flights.  

Then split the data up by airline and see which airline had the highest fraction of cancelled flights.

Test your skills: parallel/snow (2/2)
========================================================

There are two big downsides with how we are doing this: the master is doing a huge amount of the work by
doing the pre-splitting, and the whole data set has to be in memory.   Tackle one or the other of them:

- Master doing too much work: Just partition the data into chunks, and let each worker do the
splitting up and counting itself.  For the combined results to be meaningful, the worker will
need to know the full set of airlines (or the full set of months, which is somewhat easier.)  
How to do that?
- Master having whole problem in memory: use bigmemory along with parallel.

foreach and doParallel
========================================================
type: sub-section

foreach and doParallel
========================================================

The "master/worker" approach that `parallel` enables works extremely well for moderately sized problems,
and isn't that difficult to use.  It is all based on one form of R iteration, apply, which is well understood.  

However, going from serial to parallel requires some re-writing, and even going from one method of
parallelism to another (eg, `multicore`-style to `snow`-style) requires some modification of code.

The `foreach` package is based on another style of iterating through data - a for loop - and is designed
so that one can go from serial to several forms of parallel relatively easily.  There are then a number of tools one can use in the library to improve performance.

foreach - serial
========================================================

The standard R for loop looks like this:
```{r}
for (i in 1:3) print(sqrt(i))
```

The foreach operator looks similar, but returns a list of the iterations:
```{r}
library(foreach)
foreach (i=1:3) %do% sqrt(i)
```

foreach - serial
========================================================

```{r eval=FALSE}
library(foreach)
foreach (i=1:3) %do% sqrt(i)
```
The foreach function creates an object, and the `%do%` operator operates on the code (here just one statement,
but it can be multiple lines between braces, as with a for loop) and the foreach object.
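
A minimal sketch of the multi-statement form (the variable name `root` is just for illustration): every statement in the braces runs once per iteration, and the value of the last expression is what gets collected.

```{r}
library(foreach)

res <- foreach(i = 1:3) %do% {
  root <- sqrt(i)     # intermediate value, local to this iteration
  round(root, 2)      # last expression: this is what foreach collects
}
str(res)
```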

foreach + doParallel
========================================================

Foreach works with a variety of backends to distribute computation - `doParallel`, which allows snow- and
multicore-style parallelism, and `doMPI` (not covered here).   

Switching the above loop to parallel just requires registering a backend and using `%dopar%` rather than `%do%`:
```{r}
library(doParallel)
registerDoParallel(3)  # use multicore-style forking
foreach (i=1:3) %dopar% sqrt(i)
stopImplicitCluster()
```
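
If no backend is registered, `%dopar%` quietly falls back to running sequentially (with a warning), so it can be worth checking what foreach thinks it has. A small sketch using the `getDoParName()` and `getDoParWorkers()` helpers from `foreach`; the worker count of 2 is an arbitrary choice:

```{r}
library(doParallel)

registerDoParallel(2)            # implicit backend with two workers
getDoParName()                   # name of the registered backend
nworkers <- getDoParWorkers()    # number of workers foreach will use
res <- foreach(i = 1:4, .combine = c) %dopar% i^2
stopImplicitCluster()
res
```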

foreach + doParallel
========================================================

One can also use a PSOCK cluster:
```{r}
cl <- makePSOCKcluster(3)
registerDoParallel(cl)  # use the just-made PSOCK cluster
foreach (i=1:3) %dopar% sqrt(i)
stopCluster(cl)
```

Combining results
========================================================

While returning a list is the default, `foreach` has a number of ways to combine the individual results:
```{r}
foreach (i=1:3, .combine=c) %do% sqrt(i)
foreach (i=1:3, .combine=cbind) %do% sqrt(i)
foreach (i=1:3, .combine="+") %do% sqrt(i)
foreach (i=1:3, .multicombine=TRUE, .combine="sum") %do% sqrt(i)
```

Combining results
========================================================

Most of these are self explanatory.

`.multicombine` is worth mentioning: by default, `foreach` calls an arbitrary `.combine` function
with two arguments at a time, folding each new item into the final result one-at-a-time.

If `.multicombine=TRUE`, then you are saying that you're passing a function
which will do the right thing even if foreach hands it a whole whack of new results at once, as separate arguments -
*e.g.*, a whole chunk at a time.
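
One way to see the difference is to pass a combine function that records how many arguments it receives on each call. This is only a sketch; `make.counter` is a hypothetical helper written for this slide, not part of `foreach`:

```{r}
library(foreach)

# Hypothetical helper: a sum-like combine function that logs its argument count
make.counter <- function() {
  nargs.seen <- integer(0)
  list(combine = function(...) { nargs.seen <<- c(nargs.seen, nargs()); sum(...) },
       seen    = function() nargs.seen)
}

pairwise <- make.counter()
tot.pair <- foreach(i = 1:5, .combine = pairwise$combine) %do% sqrt(i)

batched <- make.counter()
tot.multi <- foreach(i = 1:5, .combine = batched$combine,
                     .multicombine = TRUE) %do% sqrt(i)

pairwise$seen()   # argument counts per call without .multicombine
batched$seen()    # argument counts per call with .multicombine=TRUE
```

Both versions return the same total; fewer, larger calls just mean less overhead when the combine step is expensive.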

Composing foreach Objects
========================================================

There's one more operator: `%:%`.  This lets you compose or nest foreach objects:
```{r}
foreach (i=1:3, .combine="c") %:% 
  foreach (j=1:3, .combine="c") %do% {
    i*j
  }
```

Filtering Items
========================================================

And you can also filter items, using `when`:
```{r}
foreach (a=rnorm(25), .combine="c") %:%
  when(a >= 0) %do%
    sqrt(a)
```

Histogram
========================================================

Let's consider our hour histogram again:
```{r}
system.time(
  foreach (i=1:2000, .combine="+") %do% {
    hrs <- rep(0,24)
    hr <- get.hour(jan2010$DEP_TIME[i])
    hrs[hr+1] = hrs[hr+1] + 1
    hrs
  }
)
```
Note: as with a function, we have to make sure the value we want to return is the last expression (or is explicitly
returned).

Parallel Histogram
========================================================

What's more, this automatically works in parallel:
```{r}
cl <- makePSOCKcluster(3)
registerDoParallel(cl,cores=3)
system.time(
  foreach (i=1:2000, .combine="+") %dopar% {
    hrs <- rep(0,24)
    hr <- get.hour(jan2010$DEP_TIME[i])
    hrs[hr+1] = hrs[hr+1] + 1
    hrs
  }
)
stopCluster(cl)
```
Which is actually sort of magic; PSOCK clusters don't share memory!  `foreach` does a good job of
exporting necessary variables; if something isn't automatically exported, it can be exported explicitly
in the foreach line with, eg, `foreach(..., .export=c("jan2010"))`.

Histogram Performance
========================================================

But this is incredibly slow:
```{r}
system.time(
  ans <- foreach (i=1:2000, .combine="+") %do% {
    hrs <- rep(0,24)
    hr <- get.hour(jan2010$DEP_TIME[i])
    hrs[hr+1] = hrs[hr+1] + 1
    hrs
  }
)
system.time(ans <- count.hours(jan2010$DEP_TIME[1:2000]))
```
Mainly because it's not vectorized; by looping over the data one item at a time we've avoided using our
lovely fast vector routines.  Plus allocating a 24-hour-long vector per item!

Histogram Performance
========================================================

Another problem - we've created a vector `1:2000` which in general is the same size as the data set
we're working on.  For large data sets, big memory.

Foreach has _iterators_ that can iterate through an object without creating something the size of the
object.  For instance, `icount(n)` is to `1:n` what `xrange` is to `range` in Python 2.x:
```{r eval=FALSE}
cl <- makePSOCKcluster(3)
registerDoParallel(cl,cores=3)
system.time(
  ans <- foreach (i=icount(2000), .combine="+") %do% {
    hrs <- rep(0,24)
    hr <- get.hour(jan2010$DEP_TIME[i])
    hrs[hr+1] = hrs[hr+1] + 1
    hrs
  }
)
```
But that doesn't help with the performance issue here.
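
To see what an iterator actually does, you can drive one by hand with `nextElem()` from the `iterators` package (which `foreach` loads for you). A small sketch:

```{r}
library(iterators)

it <- icount(3)            # counts 1, 2, 3, one value per request
first  <- nextElem(it)
second <- nextElem(it)
third  <- nextElem(it)

# A fourth request signals a "StopIteration" error; that is how foreach
# knows the loop is finished
done <- tryCatch({ nextElem(it); FALSE },
                 error = function(e) conditionMessage(e) == "StopIteration")
c(first, second, third)
done
```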

Histogram Performance
========================================================

We do a little bit better by avoiding the intermediate index; we don't care about $i$ at all, all
we care about is the data.  We can implicitly create an iterator on the object with
```{r eval=FALSE}
 foreach (time=jan2010$DEP_TIME[1:2000],...
```
 
or explicitly, setting the chunk size to distribute between tasks:
```{r}
system.time(
  ans <- foreach (time=iter(jan2010$DEP_TIME[1:2000],chunksize=500), .combine="+") %do% {
    hrs <- rep(0,24)
    hr <- get.hour(time)
    hrs[hr+1] = hrs[hr+1] + 1
    hrs
  }
)
ans
```

Histogram Performance
========================================================

As you can tell by the chunking, foreach can adjust the iteration scheduling in a number of ways.  Chunking
is one of them.

The underlying back-end obviously has a lot to do with the scheduling.  For multicore, for instance, one
can pass familiar options to multicore if we are using a multicore "cluster":

```{r eval=FALSE}
foreach( ..., .options.multicore=list(preschedule=FALSE,set.seed=TRUE))
```

**Performance Tip:** If you don't care about the order that the results come back in, specifying 
`.inorder=FALSE` gives the scheduler more flexibility in sending out tasks.  Otherwise, you're guaranteed
that the first result back is from the first iteration, etc.

Histogram Performance
========================================================

But really, we want to work on entire slices of the data at once.  For objects like matrices or data frames,
you can send out a row, column, etc at a time; we can re-cast the data as a matrix and send it out one row
at a time:
```{r}  
jan.matrix = matrix(jan2010$DEP_TIME[1:2000], ncol=500)
system.time(
  ans <- foreach (times=iter(jan.matrix,by="row"), .combine="+") %do% {
    count.hours(times)
  }
)
ans
```

Histogram Performance
========================================================

And this works in parallel, as well
```{r}  
cl <- makePSOCKcluster(3)
registerDoParallel(cl,cores=3)
jan.matrix = matrix(jan2010$DEP_TIME[1:2000], ncol=500)
system.time(
  ans <- foreach (times=iter(jan.matrix,by="row"), .combine="+") %dopar% {
    count.hours(times)
  }
)
stopCluster(cl)
ans
```

isplit
========================================================

If we want each task to only work on some subset of the data, the `isplit` iterator will split the
data at the master, and send off the partitioned data to workers: 
```{r}  
ans <- foreach (byAirline=isplit(jan2010$DEP_TIME, jan2010$UNIQUE_CARRIER), 
                .combine=cbind) %do% {
  df <- data.frame(count.hours(byAirline$value)); colnames(df) <- byAirline$key; df
}
ans$UA
ans$OH
```
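
Each item that `isplit` hands out is a list with a `value` (the data for that group) and a `key` (which group it is). A sketch on made-up departure times and carrier codes, pulling out one piece by hand:

```{r}
library(iterators)

dep.time <- c(600, 1830, 745, 2215, 915, 1200)     # made-up departure times
carrier  <- c("UA", "OH", "UA", "OH", "UA", "OH")  # made-up carrier codes

it <- isplit(dep.time, carrier)
piece <- nextElem(it)
piece$key      # which group this piece belongs to
piece$value    # the departure times for that group
```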

Stock prices example
========================================================

In `data/stocks/stocks.csv`, we have daily closing prices for 419 stocks going back to 2000 (3654 prices each). For stocks, it's often useful to deal with "log returns", rather than absolute price numbers.
We use:
```{r}
stocks <- read.csv("data/stocks/stocks.csv")
# log return for day t is log(price[t]/price[t-1]); note the parentheses in 1:(nv-1)
log.returns <- function(values) { nv <- length(values); log(values[2:nv]/values[1:(nv-1)]) }
```
How would we parallelize this with `foreach`?  (Imagine we had thousands of stocks and decades of data,
which isn't implausible.)
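
As a quick sanity check of what a log return is, here is the same calculation on three made-up prices that rise 10% a day; the result should just be `log(1.1)` twice, and it agrees with writing the same thing as a difference of logs:

```{r}
log.returns <- function(values) { nv <- length(values); log(values[2:nv]/values[1:(nv-1)]) }

prices <- c(100, 110, 121)          # made-up prices, +10% each day
lr <- log.returns(prices)
lr
all.equal(lr, diff(log(prices)))    # same quantity, written as a difference of logs
```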

Stock Prices Example
========================================================

```{r}
registerDoParallel(4)
mat.log <- 
  foreach(col=iter(stocks[,-c(1,2)],by="col"), .combine="cbind")  %dopar% 
      log.returns(col)
stopImplicitCluster()

stocks.log <- as.data.frame(mat.log)
colnames(stocks.log) <- colnames(stocks)[-c(1,2)] 
stocks.log$date <- stocks$date[-1]   # get rid of the first day; no "return" for then
```

Stock Correlations
========================================================

A quantity we might be interested in is the correlation between the log returns of various stocks:
we can use R's `cor()` function to do this.
```{r}
nstocks <- 419
cors <- matrix(rep(0,nstocks*nstocks), nrow=nstocks, ncol=nstocks)
system.time(
for (i in 1:nstocks) {
  for (j in 1:nstocks) {
    cors[i,j] <- cor(stocks.log[[i]],stocks.log[[j]])    
  }
}
)
```

Summary: foreach
========================================================

Foreach is a wrapper for the other parallel methods we've seen, so it inherits some of the advantages 
and drawbacks of each.  

Use `foreach` if:
- Your code already relies on `for`-style iteration; transition is easy
- You don't know if you want multicore vs. snow style `parallel` use (or other kinds, like batch jobs): you can switch just by registering a different backend!
- You want to be able to incrementally improve the performance of your code.

Note that you can mix approaches: some portions of your analysis code can use `foreach`, and other portions can
use the `parallel` backend directly with apply-style parallelism; it doesn't have to be all one or the other.

Test your skills: Stock Correlations
========================================================

Parallelize the stock correlation matrix calculation with `foreach`.  You should get a proper speedup here.
Try working on just the first 10 stocks until you get things working.

Note: you can nest `foreach()` loops using the '%:%' operator:
```{r eval=FALSE}
foreach(...) %:%
  foreach(...)
```

When you're done that, take a look at a random year's airline data.  Of the flights that have a
departure delay, is the arrival delay (on average) less than or greater than the departure delay?
Is: "This is the captain: Sorry for the delay folks, but we'll make it up in the air" a lie?

How would you use foreach to loop over the various years' data?

Scalable Data Analysis: Best Practices
============================================
type: sub-section

Best Practices: Don't reinvent wheels
============================================

There's no one-size-fits-all-analysis approach to taking advantage of multiple processors or nodes
(or accelerators).

Check to see if packages already exist for doing your analysis in parallel.

Best Practices: Big chunks better than little chunks
============================================

It's almost always easier to efficiently take advantage of **coarse-grained** parallelism.

_E.g._ running each cross-validation fold on a different processor will almost certainly work 
better than running them sequentially but with parts of the training method parallelized.  

Amdahl's Law!
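
As a reminder, Amdahl's law says that if a fraction $p$ of the runtime can be spread perfectly over $n$ workers and the rest stays serial, the best possible speedup is $$S(n) = \frac{1}{(1-p) + p/n}$$
A small sketch, with made-up values of $p$ chosen only to illustrate the point:

```{r}
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

amdahl(0.95, 8)      # almost everything parallel: a decent speedup on 8 cores
amdahl(0.50, 8)      # half the work still serial: less than 2x
amdahl(0.50, Inf)    # the limit no matter how many cores you add
```

Parallelizing only pieces of a method tends to leave a large serial fraction, which is why whole independent tasks (like folds) usually scale better.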

Best Practices: Parallelism gives you more _compute_
============================================

For throwing more cores at the problem to help, the calculation has to have been limited by
compute power.

I/O bound analyses, or network-bound analyses, or even memory-bandwidth-constrained analyses
can easily have _worse_ performance with more cores, not _better_.

Best Practices: One task per core
============================================

Compute bound calculations slow down if something else is sharing the processor with them.

Except for debugging, you generally don't want to run (say) doParallel with more cores than you
actually have on your computer.

Sometimes running one task per pseudo-core/hyperthread/hardware thread helps if memory bandwidth
is a limiting factor; if it isn't, it generally doesn't. You can test.
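
A minimal sketch for picking a worker count with `parallel::detectCores()`. Whether the physical count is available (or differs from the logical one) depends on the platform, so the fallbacks here are a cautious assumption rather than a rule:

```{r}
library(parallel)

logical.cores  <- detectCores()                  # counts hardware threads where the OS reports them
physical.cores <- detectCores(logical = FALSE)   # may be NA, or the same number, on some platforms

# Prefer the physical count; fall back to the logical count, then to 1
ncores <- if (is.na(physical.cores)) logical.cores else physical.cores
if (is.na(ncores)) ncores <- 1L
ncores
```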

Best Practices: Don't trip over your own feet
============================================

Because you only want one task per core, be careful about accidentally using parallelism!

Eg, a threaded BLAS/LAPACK is great for interactive use, _but_:

If you're 
- doing four tasks at once with multicore on your laptop, and
- Each of those tasks calls a matrix solve which uses a threaded BLAS to launch 4 threads

Now you're competing with yourself for your own laptop's cores.
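
One common way to avoid this is to cap the library's thread count before starting the workers. The sketch below sets the usual environment variables and checks that PSOCK workers started afterwards see them. Which variable your BLAS actually honours (if any) depends on how R and its BLAS were built, so treat this as an assumption to verify on your own system:

```{r}
library(parallel)

# Assumption: the BLAS in use reads one of these; that depends on the build
Sys.setenv(OMP_NUM_THREADS = "1",
           OPENBLAS_NUM_THREADS = "1",
           MKL_NUM_THREADS = "1")

cl <- makeCluster(2)     # workers started now inherit the environment
worker.threads <- unlist(clusterEvalQ(cl, Sys.getenv("OMP_NUM_THREADS")))
stopCluster(cl)
worker.threads
```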

Summary
============================================

R comes with an increasingly rich set of tools for taking advantage of more compute power:
- parallel
- foreach/doParallel

Keep in mind what we talked about in terms of overhead, and:
- Don't reinvent wheels
- Big chunks are better than little chunks
- Parallelism gives you more _compute_, not I/O
- One task per core
- Don't trip over your own feet


Extra material
============================================
type: sub-section

A Few Words on R and Memory
============================================
type: sub-section

A Few Words on R and Memory
============================================

Because R frequently needs to make temporary copies (R works best as a functional programming language),
hitting the memory limit frequently becomes a problem.

Avoiding hitting that limit too early requires some caution.

Need to know how R handles variables and memory.

Garbage Collection
============================================
left: 66%
Like a lot of dynamic languages, R relies on _garbage_ _collection_ to limit memory usage.

"Every so often", a garbage collection task runs and deletes variables that won't be used any more.

You can force the garbage collection to run at any given time by calling `gc()`, but this almost never fixes anything significant.

How can the gc know that you're not going to use that big variable in the next line?

Gc needs your help to be effective.
***
![Image](images/GCNeedsYou-75.png)

Useful commands for memory management
============================================

- `gc(verbose=TRUE)`, or just `gc(TRUE)`.
    - gc() alone probably won't help anything.  This calls gc() - likely not very useful - but gives verbose output, returning current memory usage as a matrix.
- `ls()`
    - Lists current variables
- `object.size()`
    - Pass it a variable, it prints out its size.  Pass it `get("variablename")` (*eg*, quoted) and it will get that variable and print its size.
- `rm()`
    - Deletes a variable you're not going to use.  Lets gc() go to work.
- Fun little one-liner which prints out all variables by size in bytes:
```{r eval=FALSE}
sort( sapply( ls(), function(x) { object.size(get(x))} ),decreasing=TRUE )
```

Object.size() and gc(TRUE)
============================================

Let's play with object.size() and gc(TRUE):
```{r}
gc(TRUE)
old.mem <- gc(TRUE)[,c(1:2,5:6)]
x <- rep(0.,(16*1024)**2)
xsize <- object.size(x)
xsize
```

Object.size() and gc(TRUE)
============================================

Let's play with object.size() and gc(TRUE):
```{r}
xsize
print(xsize,units="MB")
new.mem <- gc(TRUE)[,c(1:2,5:6)]
new.mem-old.mem
```

Object.size() and gc(TRUE)
============================================

Now let's delete the object and see how system memory behaves:
```{r}
rm(x)
final.mem <- gc(TRUE)[,c(1:2,5:6)]
final.mem
final.mem-old.mem
```


Better to Use Functions
============================================

... but if you use functions:

- Variables deleted as they fall out of scope
- Code is more readable, maintainable, reusable


```{r}
trunc.gc <- function() { gc(TRUE)[,c(1:2,5:6)] }
rnorm.sum <- function(n) {
  x <- rnorm(n)
  sum(x)
}
orig.gc <- trunc.gc()
rnorm.sum(16*1024*1024)
after.gc <- trunc.gc()
after.gc - orig.gc
```


Out Of Core / External Memory Computation
========================================================
type: sub-section

Out Of Core / External Memory Computation
========================================================
Some problems require doing fairly simple analysis on a data set too large to fit into memory.

- Min/mean/max
- Data cleaning
- Even linear fitting is pretty simple

In that case, one processor may be enough; you just want a way to not overrun memory.

_Out of Core_, or _External Memory_ computation leaves the data on disk, bringing 
in to memory only what is needed/fits at any given point.

For some computations, this works well (but note: disk access is always ***much***
slower than memory access.)
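
The idea can be sketched in base R alone: read a file through a connection a chunk at a time, keeping only running totals in memory. The file here is a small made-up one written to a temporary location, and the chunk size of 1000 lines is arbitrary:

```{r}
# Write a made-up single-column data file to work on
datafile <- tempfile(fileext = ".txt")
writeLines(as.character(1:10000), datafile)

con <- file(datafile, open = "r")
total <- 0; n <- 0
repeat {
  chunk <- readLines(con, n = 1000)     # only 1000 lines in memory at a time
  if (length(chunk) == 0) break         # end of file
  vals  <- as.numeric(chunk)
  total <- total + sum(vals)
  n     <- n + length(vals)
}
close(con)
total / n     # mean over the whole file
```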

Bigmemory Package
========================================================

The `bigmemory` package defines a generalization of a matrix class, `big.matrix`, which can be _file-backed_ - that is, can exist primarily on disk, with parts brought into memory when necessary.

This approach works fairly well when one's data access involves passing through the
entire data set once or a very small number of times, either combining data or 
extracting a subset.

Packages like `bigalgebra` or `biganalytics` (not covered here) build on `bigmemory`.

*** 

![External Memory](images/external-memory-sm.png)

Ideal gas Data Set
========================================================

In `data/idealgas`, we have a set of synthetic data files describing an ideal gas experiment - setting temperature, amount of material, and volume, and measuring the pressure.  

Simple data sets:
```{r}
small.data <- read.csv("data/idealgas/ideal-gas-fixedT-small.csv")
small.data[1:2,]
```

Row name, pressure (Pa), volume (m^3^), N (moles), and temperature (K).

A larger data set consisting of 124M rows, 5.8GB, is sitting in ideal-gas-fixedT-large.csv, and we'd like to do some analysis of this data set.  But the size is a problem.

A Note on File Formats
========================================================

Let's consider the humble .csv file:
```
$ ls -sh1 airOT2010.*
151M airOT2010.RDS
151M airOT2010.Rdata
1.4G airOT2010.csv

$ Rscript  timeexamples.R
[1] "Reading Rdata file"
   user  system elapsed
 11.697   0.616  12.319
[1] "Reading RDS file"
   user  system elapsed
 11.041   0.644  11.694
[1] "Reading CSV file"
   user  system elapsed
140.640   3.352 144.142
```

A Note on File Formats
========================================================

CSV --- or really, any text-based format --- is the worst possible format for quantitative data. It manages 
the trifecta of being:

- Slow to read
- Huge
- Inaccurate

Converting floating point numbers back and forth between internal representations and strings is slow and
prone to truncation.  

Use binary formats whenever possible.  .Rdata is a bit prone to change; .RDS is modestly better.  Portable
file formats like HDF5 (for data frames) or NetCDF4 (for matrices and arrays) are compact, accurate, fast
(not as fast as .Rdata/.RDS), and can be read by tools other than R.
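
A small round-trip sketch of the accuracy point, using three made-up values with no short decimal representation. `write.csv` writes numbers with a limited number of significant digits (15, per the `write.table` documentation), so what comes back can differ from what went in, while the binary round trip is exact:

```{r}
x <- c(1/3, pi, exp(1))          # values with no short decimal representation

csv.file <- tempfile(fileext = ".csv")
rds.file <- tempfile(fileext = ".rds")
write.csv(data.frame(x = x), csv.file, row.names = FALSE)
saveRDS(x, rds.file)

from.csv <- read.csv(csv.file)$x
from.rds <- readRDS(rds.file)

identical(from.rds, x)           # binary round trip
identical(from.csv, x)           # text round trip
max(abs(from.csv - x))           # size of any truncation error
```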


Creating a file-backed big matrix
========================================================

We've already created a big.matrix file from this data set, using
```{r eval=FALSE}
data <- read.big.matrix("data/idealgas/ideal-gas-fixedT-large.csv", 
                        header=TRUE,  
                        backingfile="data/idealgas/ideal-gas-fixedT-large.bin", 
                        descriptorfile="ideal-gas-fixedT-large.desc")
```

This reads in the .csv file and outputs a binary equivalent (the "backingfile") and a descriptor (in the "descriptorfile") which contains all of the information which describes the binary blob.

You can read the descriptorfile: `more ideal-gas-fixedT-large.desc`

Done for you since initial conversion takes 12 minutes for this set - kind of boring.

**Note**: converts into a _matrix_, which is a less flexible data type than a data frame; homogeneous type.   Here, we'll use all numeric.

Using a big.matrix
========================================================

Let's do some simple analysis on the data set and see how memory behaves.
```{r}
library(bigmemory, quietly=TRUE)
orig.gc <- trunc.gc()
data <- attach.big.matrix("data/idealgas/ideal-gas-fixedT-large.desc")
trunc.gc()-orig.gc
```

Using a big.matrix
========================================================

Let's do some simple analysis on the data set and see how memory behaves.
```{r}
data[1:2,]
system.time(min.p <- min(data[,"pres"]))
trunc.gc()-orig.gc
```

Using a big.matrix
========================================================

That only took ~7 seconds to scan through 124M records to find a minimum.  Let's try a few other calculations:
```{r}
min.p
system.time(max.p <- max(data[,"pres"]))
system.time(mean.t <- mean(data[,"temp"]))
```

Using a big.matrix
========================================================

Going through the same column a second time was faster, because some of the data was cached; going through a new column was about the same speed as the first.  What has that done to memory?
```{r}
trunc.gc()-orig.gc
```

Using a big.matrix
========================================================

Let's try something more complicated: we know that averaged over our data, we should have $$p V = n R T$$.  Let's try to infer the gas constant $R$:

```{r}
system.time(sum.pv <- sum(data[,"pres"]*data[,"vol"]))
system.time(sum.nt <- sum(data[,"n"]*data[,"temp"]))
sum.pv/sum.nt
```

Using a big.matrix
========================================================

And we're still not using that much memory.

```{r}
trunc.gc()-orig.gc
```

Using a big.matrix
========================================================

Let's extract a subset of the data and analyze it.  

The `mwhich` command in `bigmemory` lets us search through the data for **m**ultiple conditions, and extract that data:

```{r}
system.time(subset.data <- data[mwhich(data, cols=c("n", "pres"), 
                                       vals=list(c(1,1.1), c(99000,101000)),
                                       comps=list(c("ge","le"),c("ge","le")), 
                                       op="AND"),])
class(subset.data)
fit <- lm(vol ~ temp, data=as.data.frame(subset.data))
```

Using a big.matrix
========================================================

```{r}
summary(fit)
```

Using a big.matrix
========================================================

```{r}
object.size(subset.data)
trunc.gc()-orig.gc
```

Using a big.matrix
========================================================

Other options:

- `morder` or `mpermute` allow you to sort the data in memory or on disk
- `head` and `tail` let you get the start/end rows
- `mwhich` allows all sorts of slicing and dicing
- `sub.big.matrix` lets you extract contiguous regions of the matrix

Summary: bigmemory
========================================================

If you just have a data file much larger than memory that you have to crunch and the amount of actual computation is not a bottleneck, the `bigmemory` and related packages may be all you need.

Works best if:

- Data is of homogeneous type - eg, all integer, all numeric, all string
- Just need to work on a subset of data at a time, or
- Just need to make one or two passes through the data to complete analysis

Test your skills: Bigmemory
========================================================

`lm()` doesn't work natively on a `big.matrix` - but we can write our own.

If we have an OLS model $$y_i = a x_i + b + \epsilon_i$$, we can fit it with 

$$b = \bar{y} - a \bar{x}$$

$$a = \frac{\sum_i{ x_i y_i } - n \bar{x} \bar{y}}{\sum_i{x_i^2} - n \bar{x} \bar{x}}$$

Using the examples above, fit a couple of columns of the ideal gas data set.  Do the results make
sense?  (Once it's working, try fitting $pV \propto nT$.)  How much memory is used?

(Note: there is a `biglm` package)
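
Before pointing this at the big matrix, it can help to check the formulas on made-up in-memory data where `lm()` is available for comparison. The sketch below accumulates the sums chunk by chunk, the way one would when passing over data that doesn't fit in memory; the true slope and intercept (3 and 2) and the chunking are arbitrary choices:

```{r}
set.seed(42)
x <- runif(1000)
y <- 3 * x + 2 + rnorm(1000, sd = 0.1)    # made-up data with a = 3, b = 2

# Accumulate the sums one chunk at a time
chunks <- split(seq_along(x), rep(1:10, each = 100))
n <- 0; sx <- 0; sy <- 0; sxy <- 0; sxx <- 0
for (idx in chunks) {
  n   <- n   + length(idx)
  sx  <- sx  + sum(x[idx]);           sy  <- sy  + sum(y[idx])
  sxy <- sxy + sum(x[idx] * y[idx]);  sxx <- sxx + sum(x[idx]^2)
}
xbar <- sx / n; ybar <- sy / n
a <- (sxy - n * xbar * ybar) / (sxx - n * xbar^2)
b <- ybar - a * xbar

c(b = b, a = a)
coef(lm(y ~ x))      # intercept and slope from lm(), for comparison
```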


Advanced R: Rdsm, pbdR
========================================================
type: sub-section

Advanced R: Rdsm, pbdR
========================================================

We've looked at some of the standard scalable computing packages for R.

Here are two somewhat more advanced packages, that solve very different problems.

- **Rdsm**: Get the most (performance, memory) out of a single-computer computation by using shared memory.
- **pbdR**: Get the most (performance, scale) out of a cluster computation by ditching master-worker, and using very large-scale distributed routines.

Rdsm
========================================================

While it's generally true that processes can't peer into each other's memory, there
is an exception.

Processes can explicitly make a window of memory shared - visible to other processes.

This isn't necessary for threads within a process; but it *is* necessary for multiple
processes working on the same data.

Only works on-node; can't share memory across a network.

***

![Shared Memory](images/shared-mem-sm.png)

Rdsm
========================================================

Rdsm allows you to share a matrix across processes on a node - for reading *and* for
writing.

Normally, when we split a data structure up across tasks, we make copies (PSOCK), or
we use read-only (multicore/fork).

If output is also going to be large, we now have 2-3 copies of the data structure floating
around.

Rdsm allows (on-node) cluster tasks to collaboratively make a large output without copies.


Rdsm
========================================================

Simple example - let's create a shared matrix, and have everyone fill it.

Create PSOCK cluster, an Rdsm instance, shared matrix, and a barrier:

```{r}
library(parallel)
library(Rdsm)

nrows <- 7

cl <- makePSOCKcluster(3)       # form 3-process PSOCK (share-nothing) cluster
init <- mgrinit(cl)             # initialize Rdsm
mgrmakevar(cl,"m",nrows,nrows)  # make a 7x7 shared matrix
bar <- makebarr(cl)
```

Rdsm
========================================================

Everyone gets their task id, and which rows are "theirs";

```{r}
# at each thread, set id to Rdsm built-in ID variable for that thread
clusterEvalQ(cl,myid <- myinfo$id)
clusterExport(cl,c("nrows"))
dmy <- clusterEvalQ(cl,myidxs <- getidxs(nrows))
dmy <- clusterEvalQ(cl, m[myidxs,1:nrows] <- myid)
dmy <- clusterEvalQ(cl, barr())  # barrier: wait until every worker has written its rows
```
...then fills it with their id.

Rdsm
========================================================
Now, print the results.
```{r}
print(m[,])
stoprdsm(cl)  # stops cluster
```

Summary: Rdsm
========================================================

Allows collaborative use of a single pool of memory.

Avoids performance and memory problems of making copies to send back and forth.

Works well when:

- Outputs are as large/larger than inputs.  (Correlation matrix of stocks).
- Inputs are very large, and want to do transformation in-place (values to log-returns).

pbdR
========================================================

The master-worker approach taken by all the methods we've used so far works very well for interactive work,
is easy to load-balance, and is easy to understand.

But there's a fairly narrow range of number of workers where master-worker works well.  

For a small number of total processors (2-4, say), it really hurts to have one processor
doing nothing except some small amount of coordination.

For a very large number of processors (hundreds or more, depending on the size of each
task), the workers can easily overwhelm the master, meaning all of the workers are 
sitting around waiting while the master catches up.
***
![Master Worker Imbalances](images/master-worker-sm.png)


pbdR
========================================================

At scale, idea of a single master isn't helpful.

Better: Coordinating peers. 

Rather than a single master parcelling out work, the workers themselves decide which part of the problem
they should be working on, and combine their results cooperatively.

More efficient and can scale better; Downsides:

- Dynamic load-balancing is substantially trickier (but doable)
- Can't really do this interactively; need to write a script

![Workers control the means of data production](images/peers-sm.png)

Departure Hour Histogram Example
========================================================

In `pbd/mpi-histogram.R` we have a script that does hour histogram for eight full years of data, sifting through
40 million flights, in about a minute:
```
$ time mpirun -np 8 Rscript mpi-histogram.R
COMM.RANK = 0
 [1]    4081  118767   27633    7194    9141  194613 2235007 2902703 3003510
[10] 2649823 2373934 2473105 2757256 2772498 2362334 2485699 2503423 2794298
[19] 2626931 2282125 2074739 1386485  649392  344257
COMM.RANK = 0
[1] 41038948

real  1m15.357s
user	9m39.943s
sys	0m10.910s
```

Departure Hour Histogram Example
========================================================

What sorcery is this?

```{r eval=FALSE}
# count.hours and get.hour definitions...
start.year <- 1990

init()
rank <- comm.rank()
my.year <- start.year + rank

myfile <- paste0("data/airline/airOT",as.character(my.year),".RDS")
data <- readRDS(myfile); data <- data$DEP_TIME
myhrs <- count.hours(data)

hrs <- allreduce( myhrs, op="sum" )
comm.print( hrs )
comm.print( sum(hrs) )

finalize()
```

Departure Hour Histogram Example
========================================================

Let's take a look at the first few lines:
```{r eval=FALSE}
# count.hours and get.hour definitions...
start.year <- 1990

init()
rank <- comm.rank()
my.year <- start.year + rank

myfile <- paste0("data/airline/airOT",as.character(my.year),".RDS")
data <- readRDS(myfile); data <- data$DEP_TIME
```
In this case, each task decides which year's data to work on.  The first ("zero^th^") task works on 1990, the next on
1991, etc.

Every task has to call the `init()` routine when starting, and `finalize()` routine when done.

Each task then reads in its own file.
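
To see the rank-to-file mapping without starting MPI, here is a small sketch in plain R. The vector `ranks` stands in for the value `comm.rank()` would return on each of 8 tasks; that stand-in is our assumption, the file naming is copied from the script.

```r
start.year <- 1990
ranks <- 0:7   # pretend values of comm.rank() on 8 tasks

# Same file-name construction as the script, done for all ranks at once.
files <- paste0("data/airline/airOT", start.year + ranks, ".RDS")
data.frame(rank = ranks, file = files)
```

No task ever tells another one what to read; the rank alone determines the file.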

Departure Hour Histogram Example
========================================================

```{r eval=FALSE}
data <- readRDS(myfile); data <- data$DEP_TIME
myhrs <- count.hours(data)

hrs <- allreduce( myhrs, op="sum" )
comm.print( hrs )
comm.print( sum(hrs) )

finalize()
```
Once the file is read, we use our trusty `count.hours` routine again to work on the entire vector.

Then the `allreduce` function sums each worker's hours and returns the total to every task.  We then
print it out.

Rather than only the master running the main program and handing off bits to workers, every task runs 
this identical program; the only difference is the value of `comm.rank()`.
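
What the sum reduction does can be mimicked in a single R session. This is a sketch only: `count.hours.sketch` is a hypothetical stand-in for the talk's `count.hours`, the departure hours are made up, and the list `per.rank.hours` plays the role of three separate tasks.

```r
# Hypothetical stand-in for count.hours(): tabulate hours 0..23.
count.hours.sketch <- function(hours) tabulate(hours + 1, nbins = 24)

# Made-up departure hours held by three pretend tasks (ranks 0, 1, 2).
per.rank.hours <- list(c(6, 7, 7, 23), c(0, 7, 12), c(7, 12, 12, 18, 23))

# Step 1: every task computes its own partial histogram.
partial <- lapply(per.rank.hours, count.hours.sketch)

# Step 2: a sum reduction adds the partial results elementwise...
total <- Reduce(`+`, partial)
# ...and the "all" in allreduce means every task gets the same total back.
every.rank.gets <- rep(list(total), length(partial))

total[c(1, 8, 13)]   # counts for hours 0, 7 and 12
sum(total)           # number of flights across all tasks
```

In the real script the elementwise sum happens inside MPI, so no task ever needs to hold another task's raw data.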

Reductions
========================================================

Reductions are one way of combining results, and they're very powerful:

```{r eval=FALSE}
init()
rank <- comm.rank()
my.year <- start.year + rank

myfile <- paste0("../data/airline/airOT",as.character(my.year),".RDS")
data <- readRDS(myfile); data <- data$CRS_ELAPSED_TIME
data <- data[!is.na(data)]

data.median <- pbd.quantile(data,0.5)
data.min <- allreduce(min(data), op="min")
data.max <- allreduce(max(data), op="max")
data.N <- allreduce(length(data), op="sum")
data.mean <- allreduce(sum(data), op="sum")/data.N

comm.print(data.min)
comm.print(data.median)
comm.print(data.mean)
comm.print(data.max)

finalize()
```
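
Note why the mean is built from two sum reductions rather than by averaging per-task means. A tiny serial illustration with made-up, unequally sized chunks (the list `chunks` stands in for the data held by three tasks):

```r
chunks <- list(c(1, 2, 3), c(10, 20), c(100))   # made-up data on 3 tasks

# Reductions as in the script: sum of lengths, sum of sums, min of mins.
global.N    <- sum(sapply(chunks, length))
global.mean <- sum(sapply(chunks, sum)) / global.N
global.min  <- min(sapply(chunks, min))

# The tempting shortcut gives a different answer when chunk sizes differ.
mean.of.means <- mean(sapply(chunks, mean))

c(global.mean = global.mean, mean.of.means = mean.of.means)
```

The reduced version agrees with `mean(unlist(chunks))`; the mean of means does not.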

Reductions
========================================================

```
$ mpirun -np 4 Rscript ./min-median-mean-max.R
COMM.RANK = 0
[1] -70
COMM.RANK = 0
[1] 93.00004
COMM.RANK = 0
[1] 112.8207
COMM.RANK = 0
[1] 1613
```

Median finding:
========================================================

R's higher-level functions plus reductions make otherwise tricky distributed problems
tractable, such as finding the median of distributed data:
```{r eval=FALSE}
pbd.quantile <- function( data, q=0.5 ) {
    if (q < 0 || q > 1) {
        stop("q should be between 0 and 1.")
    }

    N <- allreduce(length(data), op="sum")
    data.max <- allreduce(max(data), op="max")
    data.min <- allreduce(min(data), op="min")

    f.quantile <- function(x, prob=0.5) {
        allreduce(sum(data <= x), op="sum" )/N - prob
    }

    uniroot(f.quantile, c(data.min, data.max), prob=q)$root
}
```
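
The same root-finding idea can be tried serially. In this sketch `sim.allreduce` and `sim.quantile` are hypothetical names of ours, and a list of chunks stands in for the data held by four tasks; only `uniroot` is the real thing.

```r
# 1001 distinct values (1, 3, ..., 2001) dealt out to 4 pretend tasks.
chunks <- split(seq(1, 2001, by = 2), rep(1:4, length.out = 1001))

# Serial stand-in for allreduce: combine one value per task.
sim.allreduce <- function(per.rank, op = sum) op(unlist(per.rank))

sim.quantile <- function(chunks, q = 0.5) {
    N  <- sim.allreduce(lapply(chunks, length))
    lo <- sim.allreduce(lapply(chunks, min), op = min)
    hi <- sim.allreduce(lapply(chunks, max), op = max)
    # Global fraction of values <= x, minus the target quantile.
    f <- function(x) {
        sim.allreduce(lapply(chunks, function(d) sum(d <= x))) / N - q
    }
    uniroot(f, c(lo, hi))$root
}

est   <- sim.quantile(chunks)
exact <- median(unlist(chunks))
c(estimate = est, exact = exact)
```

Because `uniroot` stops once it is within its tolerance of the jump in the counting function, the estimate is close to, but usually not exactly, the true median. That is consistent with the non-integer median printed by the MPI run.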


pbd*apply
========================================================

pbdR also has its own parallel apply functions, but it's important to realize that these aren't being
farmed out by some master task; the tasks themselves decide which items in the list are "theirs".

`pbd/histogram-pbdsapply.R`
```{r eval=FALSE}
year.hours <- function(my.year) {
    myfile <- paste0("data/airline/airOT",as.character(my.year),".RDS")
    data <- readRDS(myfile)$DEP_TIME
    count.hours(data)
}

init()
years <- 1990:1993
all.hours.list <- pbdLapply(years, year.hours)
all.hours <- Reduce("+", all.hours.list)

comm.print( all.hours )
comm.print( sum(all.hours) )

finalize()
```
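
How can tasks agree on who does what without a master? One simple rule is round-robin by rank. The sketch below is only an illustration of that idea; `my.items` is a hypothetical helper, and the way `pbdLapply` actually partitions the list may well differ.

```r
# Hypothetical rule: item i belongs to rank (i - 1) mod size.
my.items <- function(items, rank, size) {
    owner <- (seq_along(items) - 1) %% size
    items[owner == rank]
}

years <- 1990:1997
size  <- 3   # pretend number of tasks

# Each task evaluates the same rule with its own rank, no messages needed.
assignment <- lapply(0:(size - 1), function(r) my.items(years, r, size))
assignment
```

Every year ends up with exactly one owner, because each task applies the same deterministic rule to the same list.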

pbd Data Distributions
========================================================

pbd has a couple of ways of distributing data.  

What we've used so far is their so-called "GBD" distribution - globally distributed data.
It's split up by rows.

However, for linear algebra computations, a block-cyclic distribution is much more
useful.
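
A block-cyclic layout deals out fixed-size blocks of rows and columns to a grid of processes, wrapping around when it runs out of processes. Here is a small sketch of the ownership rule; the 8x8 matrix, 2x2 blocks, 2x2 process grid and row-major rank numbering are all assumptions chosen for illustration.

```r
# Which process row (or column) owns each of n indices, for a given block size.
block.cyclic.owner <- function(n, block, nproc) {
    ((seq_len(n) - 1) %/% block) %% nproc
}

row.owner <- block.cyclic.owner(8, block = 2, nproc = 2)   # 0 0 1 1 0 0 1 1
col.owner <- block.cyclic.owner(8, block = 2, nproc = 2)

# Combine into a rank per matrix element (row-major numbering of the grid).
owner <- outer(row.owner, col.owner, function(pr, pc) pr * 2 + pc)
owner
table(owner)
```

Each of the 4 processes owns 16 elements scattered across the whole matrix, which keeps all of them busy as a factorization works its way down the diagonal.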

***

![pbd Data Distributions](images/pbd-distributions-sm.png)


Reading a pbdR ddmatrix
========================================================

pbdR comes with several functions for reading a data file and distributing it as a
ddmatrix:

- `read.csv.ddmatrix()` for reading from csv
- `nc_get_dmat()` to read from a NetCDF4 file
- `gbd2dmat()` for conversions from row-oriented to a ddmatrix.


pbd lm
========================================================

The reason you'd use a ddmatrix is that several operations defined
on regular R matrices also work transparently on a ddmatrix: `lm`, `solve`, `chol`.

`pbd-lm.R`:
```{r eval=FALSE}
init.grid()
rank <- comm.rank()
my.year <- start.year + rank

data <- cleandata(my.year)
Y <- data[[1]]
X <- as.matrix(data[,-1])

X.dm <- gbd2dmat(X)
Y.dm <- gbd2dmat(Y)

fit <- lm(Y ~ X)
comm.print(summary(fit))

finalize()
```
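
To get a feel for why a regression can be distributed at all, here is a serial sketch that fits least squares from per-task pieces combined with sum reductions. It uses the normal equations on simulated data, purely as an illustration; it is not a description of how the distributed matrix routines work internally, and all names in it are ours.

```r
set.seed(1)
X <- cbind(1, matrix(rnorm(400 * 2), ncol = 2))   # intercept + 2 predictors
y <- drop(X %*% c(1, 2, -3)) + rnorm(400)
rank.of.row <- rep(0:3, each = 100)               # rows held by 4 pretend tasks

# Each task computes small cross-products from its own rows only...
local.xtx <- lapply(0:3, function(r)
    crossprod(X[rank.of.row == r, , drop = FALSE]))
local.xty <- lapply(0:3, function(r)
    crossprod(X[rank.of.row == r, , drop = FALSE], y[rank.of.row == r]))

# ...and a sum reduction combines them into the global 3x3 and 3x1 pieces.
xtx <- Reduce(`+`, local.xtx)
xty <- Reduce(`+`, local.xty)

beta <- drop(solve(xtx, xty))
rbind(distributed = beta, serial = unname(coef(lm.fit(X, y))))
```

Only the small cross-product matrices are ever combined, never the rows themselves. The normal equations are numerically less robust than a QR-based fit, so this is a teaching sketch rather than a recipe.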

pbd lm
========================================================
```
$ mpirun -np 4 Rscript pbd-lm.R
Using 2x2 for the default grid size

COMM.RANK = 0

Call:
lm(formula = Y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max
-1307.62    -6.03    -2.29     3.53  1431.70

Coefficients: (6 not defined because of singularities)
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)           1.152e+01  9.616e-02  119.77   <2e-16 ***
XORIGIN_AIRPORT_ID   -1.895e-04  5.193e-06  -36.50   <2e-16 ***
XDEST_AIRPORT_ID     -2.257e-04  5.213e-06  -43.29   <2e-16 ***
XDEP_TIME            -3.382e-04  1.724e-05  -19.61   <2e-16 ***
XDEP_DELAY_NEW        1.426e+00  9.594e-03  148.68   <2e-16 ***
...
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.43 on 2741063 degrees of freedom
Multiple R-squared:  0.7809,  Adjusted R-squared:  0.7809
F-statistic: 1.628e+06 on 6 and 2741063 DF,  p-value: < 2.2e-16
```