Merge pull request rapidsai#230 from tfeher/fea-ext-svm-notebook
Add SVM demo notebook
taureandyernv authored Oct 4, 2019
2 parents 882ee2c + 982e243 commit 665acc5
Showing 2 changed files with 205 additions and 1 deletion.
3 changes: 2 additions & 1 deletion README.md
@@ -20,7 +20,8 @@
| cuML | [PCA Demo](cuml/pca_demo.ipynb) | This notebook showcases principal component analysis (PCA) algorithm where the model can be used for prediction (using `fit_transform`) as well as converting the transformed data into the original dataset (using `inverse_transform`). |
| cuML | [Random Forest Multi-node / Multi-GPU](cuml/random_forest_mnmg_demo.ipynb) | Demonstrates how to fit Random Forest models using multiple GPUs via Dask. |
| cuML | [Ridge Regression Demo](cuml/ridge_regression_demo.ipynb) | This notebook includes code examples of ridge regression and it showcases the `fit` and `predict` functions. |
-| cuML | [SGD_Demo](cuml/sgd_demo.ipynb) | The stochastic gradient descent algorithm is demostrated in the notebook using `fit` and `predict` functions |
+| cuML | [SGD_Demo](cuml/sgd_demo.ipynb) | The stochastic gradient descent algorithm is demonstrated in the notebook using `fit` and `predict` functions |
+| cuML | [SVM_Demo](cuml/svm_demo.ipynb) | Binary Support Vector Machine classification is demonstrated in this notebook using `fit` and `predict` functions. |
| cuML | [TSNE_Demo](cuml/tsne_demo.ipynb) | In this notebook, T-Distributed Stochastic Neighborhood Embedding is demonstrated applying the Barnes Hut method on the Fashion MNIST dataset using our `fit_transform` function |
| cuML | [TSVD_Demo](cuml/tsvd_demo.ipynb ) | This notebook showcases truncated singular value decomposition (tsvd) algorithm which like PCA performs both prediction and transformation of the converted dataset into the original data using `fit_transform` and `inverse_transform` functions respectively |
| cuML | [UMAP_Demo](cuml/umap_demo.ipynb) | The uniform manifold approximation & projection algorithm is compared with the original author's equivalent non-GPU Python implementation using `fit` and `transform` functions |
203 changes: 203 additions & 0 deletions cuml/svm_demo.ipynb
@@ -0,0 +1,203 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Support Vector Machine\n",
"Support vector machines are supervised machine learning methods that can be used for classification and regression. Currently cuML supports binary classification.\n",
"\n",
"The SVC classifier accepts array-like input: host NumPy arrays, device arrays (Numba arrays or any object that implements `__cuda_array_interface__`), and cuDF DataFrames/Series.\n",
"\n",
"For information on converting your dataset to cuDF format, see the cuDF documentation: https://docs.rapids.ai/api/cudf/stable/\n",
"\n",
"For more information about cuML's Support Vector Classifier, see the cuML documentation: https://docs.rapids.ai/api/cuml/stable/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import cuml.svm\n",
"import sklearn.svm\n",
"\n",
"from sklearn.datasets import make_gaussian_quantiles\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n_samples = 20000\n",
"n_features = 200\n",
"\n",
"X, y = make_gaussian_quantiles(n_samples=n_samples, n_features=n_features, n_classes=2)\n",
"\n",
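"# Alternative dataset (also requires importing make_classification from sklearn.datasets):\n",
"# X, y = make_classification(\n",
"#     n_samples=n_samples, n_features=n_features, n_informative=2, n_redundant=0,\n",
"#     n_classes=2, n_clusters_per_class=2, shuffle=False)\n",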
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"C = 1\n",
"tol = 1e-3\n",
"kernel = 'rbf'\n",
"gamma = 'scale'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## cuML Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"cumlSVC = cuml.svm.SVC(kernel=kernel, C=C, tol=tol, gamma=gamma)\n",
"cumlSVC.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scikit-learn Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"sklSVC = sklearn.svm.SVC(kernel=kernel, C=C, tol=tol, gamma=gamma)\n",
"sklSVC.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prediction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"cuml_pred = cumlSVC.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"skl_pred = sklSVC.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compare Accuracy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cuml_accuracy = np.sum(cuml_pred.to_array()==y_test) / y_test.shape[0] * 100\n",
"skl_accuracy = np.sum(skl_pred==y_test) / y_test.shape[0] * 100\n",
"print(\"Accuracy: cumlSVC {}%, sklSVC {}%\".format(cuml_accuracy, skl_accuracy))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notes\n",
"- The time measurements for the first run are inflated by one-time initialization overhead (such as CUDA context creation). Re-run the cells to get a better estimate of the execution time.\n",
"\n",
"- Currently the output of the prediction is a cuDF Series object. You can use the `to_array()` method to create a NumPy array.\n",
"\n",
"- The training algorithm uses a cache in GPU memory to accelerate training. You can specify its size (in MiB) using the `cache_size` argument. This matters most when training on larger inputs.\n",
"\n",
"- Like other cuML algorithms, cuML SVC is optimized for both single- and double-precision input data. If your problem allows it, using single-precision input can reduce the execution time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"cumlSVC = cuml.svm.SVC(kernel=kernel, C=C, tol=tol, gamma=gamma, cache_size=2000)\n",
"cumlSVC.fit(X_train.astype(np.float32), y_train)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
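
The "Compare Accuracy" cell in the notebook above computes the percentage of test labels that each model predicts correctly. Below is a minimal sketch of that calculation in plain Python, so it can be tried without a GPU. The helper name `accuracy_percent` is hypothetical (it is not part of cuML or scikit-learn), and the labels are made up for illustration rather than taken from a notebook run.

```python
def accuracy_percent(predicted, expected):
    """Return the percentage of positions where predicted equals expected.

    Hypothetical helper for illustration; accepts any two equal-length
    sequences of labels (lists, tuples, or NumPy arrays).
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if len(expected) == 0:
        raise ValueError("cannot compute accuracy of an empty label set")
    # Count element-wise matches, then convert the fraction to a percentage.
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return 100 * matches / len(expected)


# Made-up binary labels, not output from the notebook.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print("Accuracy: {}%".format(accuracy_percent(y_pred, y_true)))
```

In the notebook itself, the cuML prediction would first be converted with `to_array()` as described in the Notes cell, after which the same comparison applies to both models.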
