We consider the following problem: given two d-by-n matrices A and B that are too large to fit in main memory, find a rank-r approximation of their product A^TB using only a few passes over the data. A naive approach is to compute A^TB explicitly and then perform a truncated SVD. This requires O(n^2 d) time and O(n^2) memory, and is hard to scale to large problems. An alternative is to run the power method directly, without explicitly forming the product. This, however, accesses the data matrices O(r) times and can incur a huge disk I/O overhead.
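For intuition, here is a small-scale reference implementation of the naive approach in Breeze (a sketch only, assuming the matrices fit in memory, which is exactly the situation the algorithms below are designed to avoid):

```scala
import breeze.linalg.{DenseMatrix, diag, svd}

// Naive rank-r approximation of A^T B: form the n x n product explicitly,
// then keep the top r singular triplets. O(n^2 d) time, O(n^2) memory.
def naiveLowRank(a: DenseMatrix[Double], b: DenseMatrix[Double], r: Int): DenseMatrix[Double] = {
  val m = a.t * b
  val svd.SVD(u, s, vt) = svd(m)
  u(::, 0 until r) * diag(s(0 until r)) * vt(0 until r, ::)
}
```

Both algorithms below produce factors U and V such that UV^T approximates this truncated SVD without ever materializing the n x n product.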
For this problem we present Spark implementations of two pass-efficient algorithms: LELA and OnePassPCA. LELA is a two-pass algorithm proposed by S. Bhojanapalli et al. in their paper Tighter low-rank approximation via sampling the leveraged elements. OnePassPCA, as its name suggests, is a one-pass algorithm and can therefore be used when the matrices arrive as live data streams. It implements the algorithm SMP-PCA (Streaming Matrix Product PCA) proposed in our paper Single Pass PCA of Matrix Products.
Note: OnePassPCA and SMP-PCA refer to the same single-pass algorithm; OnePassPCA is the old name.
The directory follows a typical layout: the source code is located under /src/main/scala.
Here are two flow diagrams illustrating LELA (upper) and OnePassPCA (lower).
Note: 1) For ease of computation, A and B are stored as a single RDD[index,(blockMatrixOfA, blockMatrixOfB)]. The two algorithms lay out this RDD slightly differently: LELA stores row blocks, while OnePassPCA stores column blocks. 2) The sketching step in OnePassPCA is implemented with the Subsampled Randomized Hadamard Transform (SRHT), which costs O(nd log d), independent of the sketch size (see the illustration after this note).
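As a rough illustration of the sketching step, the snippet below sketches a single d-dimensional column with SRHT (a minimal sketch, not the code under /src/main/scala; it assumes d is padded to a power of two and uses one common normalization convention). The fast Walsh-Hadamard transform gives the O(d log d) per-column cost, hence O(nd log d) overall, independent of the sketch size m.

```scala
// Fast Walsh-Hadamard transform, O(d log d); requires d to be a power of two.
def fwht(x: Array[Double]): Array[Double] = {
  val d = x.length
  val y = x.clone()
  var len = 1
  while (len < d) {
    var i = 0
    while (i < d) {
      var j = i
      while (j < i + len) {
        val a = y(j); val b = y(j + len)
        y(j) = a + b
        y(j + len) = a - b
        j += 1
      }
      i += 2 * len
    }
    len *= 2
  }
  y
}

// SRHT sketch of one column: flip random signs, Hadamard-transform,
// then keep m sampled coordinates, scaled to preserve norms in expectation.
def srhtColumn(x: Array[Double], signs: Array[Int], sampledRows: Array[Int]): Array[Double] = {
  val flipped = Array.tabulate(x.length)(i => signs(i) * x(i))
  val hx = fwht(flipped)
  val scale = 1.0 / math.sqrt(sampledRows.length.toDouble)
  sampledRows.map(i => scale * hx(i))
}

// The randomness is drawn once and reused for every column, e.g.:
//   val rnd = new scala.util.Random(0)
//   val signs = Array.fill(d)(if (rnd.nextBoolean()) 1 else -1)
//   val sampledRows = Array.fill(m)(rnd.nextInt(d))
```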
Current version: Aug 10, 2016.
We now present simulation results on a 150GB synthetic example: let n = d = 100000 and r = 5, and generate the matrices A and B as DG, where G has entries drawn independently from the standard Gaussian distribution and D is a diagonal matrix with D_ii = 1/i. The other parameters are #RDD partitions = 80, #samples = 2nrlogn, sketching size = 2000, and #ALS iterations = 10.
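A small-scale version of this data model can be generated locally as follows (a Breeze sketch for checking the setup; the experiment above uses n = d = 100000, which is far too large to materialize on a single machine):

```scala
import breeze.linalg.{diag, DenseMatrix, DenseVector}
import breeze.stats.distributions.Rand

// Synthetic model used above: M = D * G, with G i.i.d. standard Gaussian
// and D diagonal with D_ii = 1/i (1-indexed), so the singular values decay.
def syntheticMatrix(d: Int, n: Int): DenseMatrix[Double] = {
  val g = DenseMatrix.rand(d, n, Rand.gaussian)
  val dDiag = DenseVector.tabulate(d)(i => 1.0 / (i + 1))
  diag(dDiag) * g
}
```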
We run Spark-1.6.2 on an Amazon EC2 cluster with two m3.2xlarge instances, launched with the spark-ec2 script. The results are shown in the following table. We compare the relative spectral norm error, calculated as ||A^TB - UV^T|| / ||A^TB|| (see the sketch after the table).
Method | Error | Runtime |
---|---|---|
Exact SVD | 0.0271 | 23 hrs |
LELA | 0.0274 | 56 mins |
OnePassPCA | 0.0280 | 34 mins |
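For reference, the error metric can be computed at small scale as follows (a Breeze sketch, where U and V denote the n-by-r factors returned by either algorithm, and the spectral norm is taken as the largest singular value):

```scala
import breeze.linalg.{DenseMatrix, max, svd}

// Relative spectral-norm error ||A^T B - U V^T|| / ||A^T B||.
def relativeError(a: DenseMatrix[Double], b: DenseMatrix[Double],
                  u: DenseMatrix[Double], v: DenseMatrix[Double]): Double = {
  val m = a.t * b
  max(svd(m - u * v.t).singularValues) / max(svd(m).singularValues)
}
```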
Note: 1) For Exact SVD, we adapt the source code of Spark's private object EigenValueDecomposition to our setting: we compute B^TAA^TBv in a distributed fashion and pass the result to ARPACK's dsaupd routine to compute the top eigenvalues and eigenvectors, with convergence tolerance tol = 1e-5. 2) We occasionally encounter the error "Remote RPC client disassociated"; the exact cause is still unknown.
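The distributed matrix-vector product at the core of this baseline can be sketched as follows (an illustrative sketch, not the adapted EigenValueDecomposition code itself; it assumes A and B are available as matching row blocks A_i, B_i, so that A^TB = Σ_i A_i^T B_i):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

// One application of v -> B^T A A^T B v, the operator whose top eigenpairs
// ARPACK's dsaupd iterates on. rowBlocks holds matching row blocks (A_i, B_i),
// each of size d_i x n.
def applyBtAAtB(rowBlocks: RDD[(DenseMatrix[Double], DenseMatrix[Double])],
                v: DenseVector[Double]): DenseVector[Double] = {
  // w = A^T (B v) = sum_i A_i^T (B_i v), an n-vector
  val w = rowBlocks.map { case (ai, bi) => ai.t * (bi * v) }.treeReduce(_ + _)
  // B^T (A w) = sum_i B_i^T (A_i w), also an n-vector
  rowBlocks.map { case (ai, bi) => bi.t * (ai * w) }.treeReduce(_ + _)
}
```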
The following figure shows the runtime breakdown on clusters of 2, 5, and 10 nodes, where each node is an m3.2xlarge instance. The speedup achieved by OnePassPCA is more prominent on small clusters, possibly due to increasing Spark overheads on larger clusters (see this paper for more discussion of Spark overheads).
There are two ways to run in Spark: spark-shell or spark-submit.
Launch the spark-shell (make sure your Spark is [configured][sparkConfig] properly), and then copy and paste our source code in the terminal.

[sparkConfig]: http://spark.apache.org/docs/latest/configuration.html
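For example, after pasting the source you can invoke an algorithm's entry point directly. The call below assumes the pasted object exposes the same command-line interface as the spark-submit examples further down; depending on how it creates its SparkContext, you may need to adapt it to reuse the shell's built-in sc.

```scala
// Hypothetical spark-shell usage, mirroring the spark-submit arguments shown below
// (argument order inferred from the spark-submit example).
OnePassPCA.main(Array("5000", "5000", "5", "2", "1000", "10", "0", "2"))
```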
Build the JAR package using sbt package. The generated JAR package is located under /target/scala-2.10. Then copy the JAR package to YOUR_SPARK_HOME and run the application using spark-submit.
For example, the following commands run Spark locally on 2 cores with 2g of driver memory, with parameters #columns=5000, #rows=5000, rank=5, partitions=2, sketching size=1000, ALS iterations=10, ALS lambda=0, and #samples=2nrlogn.
$bin/spark-submit --class "LELA_ATB" --master local[2] --driver-memory 2g lela_2.10-1.0.jar 5000 5000 5 2 10 0 2
$bin/spark-submit --class "OnePassPCA" --master local[2] --driver-memory 2g onepasspca_2.10-1.0.jar 5000 5000 5 2 1000 10 0 2
We also provide Matlab code for the two algorithms. Note that the Matlab code uses a standard JL sketch (a Gaussian random matrix) for the sketching step, while our Spark code implements the Hadamard sketch (SRHT), which runs faster.
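For reference, the standard JL sketch amounts to multiplying by a scaled random Gaussian matrix, which costs O(mnd) for a sketch size m, versus O(nd log d) for SRHT; a Breeze sketch (names are illustrative):

```scala
import breeze.linalg.DenseMatrix
import breeze.stats.distributions.Rand

// Gaussian JL sketch: compress the d rows of X down to m rows with a random
// m x d Gaussian map, scaled by 1/sqrt(m) so norms are preserved in expectation.
def gaussianSketch(x: DenseMatrix[Double], m: Int): DenseMatrix[Double] = {
  val s = DenseMatrix.rand(m, x.rows, Rand.gaussian) * (1.0 / math.sqrt(m.toDouble))
  s * x
}
```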