Merge pull request rapidsai#230 from tfeher/fea-ext-svm-notebook

Add SVM demo notebook
Salonijain27 · Oct 4, 2019 · 665acc5 · 665acc5
2 parents 882ee2c + 982e243
commit 665acc5
Show file tree

Hide file tree

Showing 2 changed files with 205 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -20,7 +20,8 @@
 | cuML      | [PCA Demo](cuml/pca_demo.ipynb)               | This notebook showcases principal component analysis (PCA) algorithm where the model can be used for prediction (using `fit_transform`) as well as converting the transformed data into the original dataset (using `inverse_transform`).                                                                                                                |
 | cuML      | [Random Forest Multi-node / Multi-GPU](cuml/random_forest_mnmg_demo.ipynb)   | Demonstrates how to fit Random Forest models using multiple GPUs via Dask. |
 | cuML      | [Ridge Regression Demo](cuml/ridge_regression_demo.ipynb)  | This notebook includes code examples of ridge regression and it showcases the `fit` and `predict` functions.                                                                                                                                          |
-| cuML      | [SGD_Demo](cuml/sgd_demo.ipynb)               | The stochastic gradient descent algorithm is demostrated in the notebook using `fit` and `predict` functions                                                                        |
+| cuML      | [SGD_Demo](cuml/sgd_demo.ipynb)               | The stochastic gradient descent algorithm is demonstrated in the notebook using `fit` and `predict` functions                                                                        |
+| cuML      | [SVM_Demo](cuml/svm_demo.ipynb)               | Binary Support Vector Machine classification is demonstrated in this notebook using `fit` and `predict` functions. |
 | cuML      | [TSNE_Demo](cuml/tsne_demo.ipynb)               | In this notebook, T-Distributed Stochastic Neighborhood Embedding is demonstrated applying the Barnes Hut method on the Fashion MNIST dataset using our `fit_transform` function                                                                      |
 | cuML      | [TSVD_Demo](cuml/tsvd_demo.ipynb	)              | This notebook showcases truncated singular value decomposition (tsvd) algorithm which like PCA performs both prediction and transformation of the converted dataset into the original data using `fit_transform` and `inverse_transform` functions respectively                                                                                                     |
 | cuML      | [UMAP_Demo](cuml/umap_demo.ipynb)              | The uniform manifold approximation & projection algorithm is compared with the original author's equivalent non-GPU Python implementation using `fit` and `transform` functions                       |

diff --git a/cuml/svm_demo.ipynb b/cuml/svm_demo.ipynb
@@ -0,0 +1,203 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Support Vector Machine\n",
+    "Support vector machines are supervised machine learning methods that can be used for classification and regression. Currently cuML supports binary classification.\n",
+    "\n",
+    "The SVC classifier can take array-like objects, either in host as NumPy arrays or in device (as Numba or cuda_array_interface-compliant), as well as cuDF DataFrames/Series as the input.\n",
+    "\n",
+    "For information on converting your dataset to cuDF documentation: https://docs.rapids.ai/api/cudf/stable/\n",
+    "\n",
+    "For more information about cuML's Support Vector Classifier: https://docs.rapids.ai/api/cuml/stable/"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import cuml.svm \n",
+    "import sklearn.svm\n",
+    "\n",
+    "from sklearn.datasets.samples_generator import make_gaussian_quantiles\n",
+    "from sklearn.model_selection import train_test_split"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Generate data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "n_samples = 20000\n",
+    "n_features = 200\n",
+    "\n",
+    "X, y = make_gaussian_quantiles(n_samples=n_samples, n_features=n_features, n_classes=2)\n",
+    "\n",
+    "#X, y = make_classification(\n",
+    "#   n_rows, n_cols, n_informative=2, n_redundant=0,\n",
+    "#   n_classes=n_classes, n_clusters_per_class=2, shuffle=False)\n",
+    "\n",
+    "X_train, X_test, y_train, y_test = train_test_split(X, y)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Define parameters"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "C = 1\n",
+    "tol = 1e-3\n",
+    "kernel = 'rbf'\n",
+    "gamma = 'scale'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## cuML Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "cumlSVC = cuml.svm.SVC(kernel=kernel, C=C, tol=tol, gamma=gamma)\n",
+    "cumlSVC.fit(X_train, y_train)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Scikit-learn Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "sklSVC = sklearn.svm.SVC(kernel=kernel, C=C, tol=tol, gamma=gamma)\n",
+    "sklSVC.fit(X_train, y_train)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prediction"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "cuml_pred = cumlSVC.predict(X_test)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "skl_pred = sklSVC.predict(X_test)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Compare Accuracy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cuml_accuracy = np.sum(cuml_pred.to_array()==y_test) / y_test.shape[0] * 100\n",
+    "skl_accuracy = np.sum(skl_pred==y_test) / y_test.shape[0] * 100\n",
+    "print(\"Accuracy: cumlSVC {}%, sklSVC {}%\".format(cuml_accuracy, skl_accuracy))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- The time measurements will be inaccurate for the first run. You can re-run the cells to get a better estimate of the execution time.\n",
+    "\n",
+    "- Currently the output of the prediction is a cuDF Series object. You can use the `to_array()` method to create a numpy array.\n",
+    "\n",
+    "- The training algorithm uses a cache in GPU memory to accelerate training. You can specify the size (in MiB) using the cache_size argument. This is more relevant for training with larger input size.\n",
+    "\n",
+    "- Similar to other cuML algorithms, cuML SVC is optimized both for single and double precision input data. If your problem allows it, then using single precision input can improve the execution time"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "cumlSVC = cuml.svm.SVC(kernel=kernel, C=C, tol=tol, gamma='scale', cache_size=2000)\n",
+    "cumlSVC.fit(X_train.astype(np.float32), y_train)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}