Merge pull request #259 from shunanzhang/shunan-doc-1.3.0-new

[DOCS]Add ftrl_lr_spark_en.md (#245)
Angel-ML · Nov 25, 2017 · 22caa5e
2 parents f10864d + 08e34be
commit 22caa5e

Showing 2 changed files with 107 additions and 0 deletions.
diff --git a/docs/algo/ftrl_lr_spark_en.md b/docs/algo/ftrl_lr_spark_en.md
@@ -0,0 +1,107 @@
+# [Spark Streaming on Angel] FTRL
+
+>FTRL (Follow-the-Regularized-Leader) is a common online-learning optimization method that has shown good results in practice. Traditionally implemented on Storm, FTRL is widely used for online CTR-prediction models. When the data are high-dimensional and a sparse solution is desired, implementing FTRL with Spark Streaming on Angel can achieve better results with robust performance in just a few lines of code.
+
+
+## 1. Introduction to FTRL
+
+`FTRL` blends the benefits of `FOBOS` and `RDA`: it achieves comparatively high precision, as FOBOS does, and yields a sparser solution, as RDA does, at a reasonable cost in precision.
+
+FTRL updates the feature weights with the following equation:
+
+![](../img/ftrl_lr_w.png)
+
+where
+
+* The G function is the gradient of the loss function:
+
+	![](../img/ftrl_lr_g_en.png)
+
+* The update of w separates into N independent scalar minimization problems, one per dimension:
+
+	![](../img/ftrl_lr_w_update.png)
+
+* With an individual learning rate for each dimension, the update equation for w becomes:
+
+	![](../img/ftrl_lr_d_t.png)
+
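+The equations above are embedded as images, so as a reading aid here is the per-coordinate update in the form given by the published FTRL-Proximal algorithm for logistic regression. This is only a sketch under the assumption that the images follow that published form: the accumulators `z_i` and `n_i` and the step term `\sigma_i` are notation borrowed from that algorithm, not from this document, and `\alpha`, `\beta`, `\lambda_1`, `\lambda_2` are assumed to map to the `alpha`, `beta`, `lambda1`, `lambda2` parameters listed in section 3.
+
+```latex
+% Assumed notation: x_t is the feature vector of sample t, y_t \in \{0,1\} its label,
+% and z_i, n_i are per-coordinate accumulators initialized to 0.
+\begin{aligned}
+p_t &= \frac{1}{1 + e^{-x_t \cdot w_t}} \\
+g_i &= (p_t - y_t)\, x_{t,i} \\
+\sigma_i &= \frac{1}{\alpha}\left(\sqrt{n_i + g_i^2} - \sqrt{n_i}\right) \\
+z_i &\leftarrow z_i + g_i - \sigma_i\, w_{t,i} \\
+n_i &\leftarrow n_i + g_i^2 \\
+w_{t+1,i} &=
+\begin{cases}
+0 & \text{if } |z_i| \le \lambda_1 \\
+-\left(\dfrac{\beta + \sqrt{n_i}}{\alpha} + \lambda_2\right)^{-1}\left(z_i - \operatorname{sgn}(z_i)\,\lambda_1\right) & \text{otherwise}
+\end{cases}
+\end{aligned}
+```
+
+In this form the thresholding case is what produces sparsity: any coordinate whose accumulated `z_i` stays within `\lambda_1` in magnitude gets a weight of exactly zero.
+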
+
+
+## 2. Distributed Implementation
+
+Google has provided an implementation of Logistic Regression with L1/L2 regularization terms using FTRL:
+
+![](../img/ftrl_lr_project.png)
+
+Combining the characteristics of Spark Streaming and Angel with the above reference, the distributed implementation has the following framework:
+
+![](../img/ftrl_lr_framework.png)
+
+
+## 3. Execution & Performance
+
+
+
+### **Input Format**
+* dim: dimension of the input data
+* Only the standard ["libsvm"](./data_format_en.md) format is supported for messages
+* Messages are delivered through Kafka, so Kafka needs to be configured
+
+### **Parameters**
+
+* **Algorithm Parameters**
+	* alpha: alpha in the update equation for w
+	* beta: beta in the update equation for w
+	* lambda1: lambda1 (L1 regularization coefficient) in the update equation for w
+	* lambda2: lambda2 (L2 regularization coefficient) in the update equation for w
+
+* **I/O Parameters**
+	 * checkPointPath: checkpoint path for the data stream
+	 * modelPath: save path for the model (trained on each batch)
+	 * actionType: "train" or "predict"
+	 * sampleRate: sampling rate of the input samples used for "predict"
+	 * zkQuorum: ZooKeeper configuration, in the format "hostname:port"
+	 * topic: Kafka topic
+	 * group: Kafka group
+	 * streamingWindow: size of the Spark Streaming batch
+
+* **Resource Parameters**
+	* num-executors: number of executors
+	* executor-cores: number of executor cores
+	* executor-memory: executor memory
+	* driver-memory: driver memory
+	* spark.ps.instances: number of Angel PS nodes
+	* spark.ps.cores: number of cores in each PS node
+	* spark.ps.memory: PS node memory
+
+### **Submission Command**
+
+Submit the FTRL_SparseLR training job to YARN using the following sample command:
+
+```shell
+./bin/spark-submit \
+--master yarn-cluster \
+--conf spark.yarn.allocation.am.maxMemory=55g \
+--conf spark.yarn.allocation.executor.maxMemory=55g \
+--conf spark.ps.jars=$SONA_ANGEL_JARS \
+--conf spark.ps.instances=2 \
+--conf spark.ps.cores=2 \
+--conf spark.ps.memory=10g \
+--jars $SONA_SPARK_JARS \
+--name "spark-on-angel-sparse-ftrl" \
+--driver-memory 1g \
+--num-executors 5 \
+--executor-cores 2 \
+--executor-memory 2g \
+--class com.tencent.angel.spark.ml.classification.SparseLRWithFTRL \
+spark-on-angel-mllib-2.1.0.jar \
+partitionNum:3 \
+actionType:train \
+sampleRate:1.0 \
+modelPath:$modelPath \
+checkPointPath:$checkpoint \
+group:$group \
+zkquorum:$zkquorum \
+topic:$topic
+```
+
diff --git a/docs/img/ftrl_lr_g_en.png b/docs/img/ftrl_lr_g_en.png