slides/spark-mllib.html

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

    <title>Apache Spark™ Workshop | Machine Learning with Spark MLlib</title>

    <meta name="description" content="Apache Spark™ Workshop | Machine Learning with Spark MLlib">
    <meta name="author" content="Jacek Laskowski">

    <link rel="stylesheet" href="reveal.js/css/reveal.css">
    <link rel="stylesheet" href="reveal.js/css/theme/beige.css" id="theme">

    <!-- Theme used for syntax highlighting of code -->
    <link rel="stylesheet" href="reveal.js/lib/css/zenburn.css">

    <!-- Jacek: custom formatting -->
    <link rel="stylesheet" href="revealjs-css/jacek.css">

    <!-- Printing and PDF exports -->
    <script>
      var link = document.createElement('link');
      link.rel = 'stylesheet';
      link.type = 'text/css';
      link.href = window.location.search.match(/print-pdf/gi) ? 'reveal.js/css/print/pdf.css' : 'reveal.js/css/print/paper.css';
      document.getElementsByTagName('head')[0].appendChild(link);
    </script>
  </head>

  <body>
    <div class="reveal">

      <div class="footer">
        <footer style="font-size: small;">&copy;<a href="https://medium.com/@jaceklaskowski">Jacek Laskowski</a> 2018 / <a href="https://twitter.com/jaceklaskowski">@JacekLaskowski</a> / jacek@japila.pl</footer>
      </div>

      <div class="slides">

        <section class="intro" data-transition="zoom" id="home">
          <p>
            <img width="12%" style="background:none; border:none; box-shadow:none;" data-src="images/spark-logo.png">
            <img width="6%" src="images/jacek_laskowski_20141201_512px.png" style="border: 0">
          </p>
          <h1 style="font-size: 2.99em;">Machine Learning with Spark MLlib</h1>
          <h3>Apache Spark 2.4</h3>

          <h4 style="font-size: smaller;">
            <a href="https://twitter.com/jaceklaskowski">@jaceklaskowski</a> / <a href="http://stackoverflow.com/users/1305344/jacek-laskowski">StackOverflow</a> / <a href="https://github.com/jaceklaskowski">GitHub</a>
            <br>
            Books: <a href="http://bit.ly/mastering-apache-spark">Mastering Apache Spark</a> / <a href="http://bit.ly/mastering-spark-sql">Mastering Spark SQL</a> / <a href="http://bit.ly/spark-structured-streaming">Spark Structured Streaming</a>
          </h4>
        </section>

        <section id="agenda" data-markdown>
          <textarea data-template>
            ## Agenda

            1. [Spark MLlib](#/intro)
            1. [Motivation](#/motivation)
            1. [ML Pipeline](#/ml-pipeline)
            1. [ML Pipeline Design](#/ml-pipeline-design)
            1. [Demo: Email Classification](#/demo)
          </textarea>
        </section>

        <section>
          <section id="intro" style="font-size: 90%">
            <h2>Spark MLlib</h2>
            <ol>
              <li>Spark library for <b>distributed machine learning</b></li>
              <li>Simplifies the development and usage of large-scale machine learning</li>
              <li>Uses Spark SQL for data access
                <ul>
                  <li><b>org.apache.spark.ml</b> is the primary API</li>
                  <li><a href="http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api">The MLlib RDD-based API is now in maintenance mode</a></li>
                  <li>Avoid the RDD-based MLlib API in <b>org.apache.spark.mllib</b> package</li>
                </ul>
              </li>
            </ol>
          </section>
          <section id="features" style="font-size: 85%">
            <h2>Features of Spark MLlib</h2>
            <ol>
              <li><b>Machine learning algorithms</b>
                <ul>
                  <li>Classification, regression, clustering, collaborative filtering</li>
                </ul>
              </li>
              <li><b>Featurization</b>
                <ul>
                  <li>Feature extraction, transformation, dimensionality reduction, selection</li>
                </ul>
              </li>
              <li><b>Pipelines</b>
                <ul>
                  <li>Constructing, evaluating, and tuning machine learning pipelines</li>
                </ul>
              </li>
              <li><b>Persistence</b>
                <ul>
                  <li>Saving and loading algorithms, models, and pipelines</li>
                </ul>
              </li>
              <li><b>Utilities</b>
                <ul>
                  <li>Linear algebra, statistics, data handling</li>
                </ul>
              </li>
            </ol>
          </section>
        </section>

        <section>
          <section id="motivation">
            <h1>Motivation</h1>
          </section>
          <section id="predictive-analytic-workflow">
            <h2>Predictive analytic workflow</h2>
            <ol>
              <li>Use of a machine learning algorithm is only one component of a <b>predictive analytic workflow</b></li>
              <li>There may also be <b>pre-processing steps</b> for the machine learning algorithm to work</li>
              <li>Expectations of <b>data scientists</b> and <b>data engineers</b></li>
            </ol>
          </section>
          <section>
            <h2>Typical machine learning workflow</h2>
            <ol>
              <li>Loading data (aka <b>data ingestion</b>)</li>
              <li>Preparing data (aka <b>data cleanup</b>)</li>
              <li>Extracting features (aka <b>feature extraction</b>)</li>
              <li>Fitting model (aka <b>model training</b>)</li>
              <li>Scoring (or <b>predictionize</b>)</li>
            </ol>
          </section>
          <section>
            <h2>Before Going To Production</h2>
            <ol>
              <li>Testing model (aka <b>model testing</b>)</li>
              <li>Selecting the best model (aka <b>model selection</b> or <b>model tuning</b>)</li>
              <li>Deploying model (aka <b>model deployment</b> and <b>integration</b>)</li>
            </ol>
          </section>
        </section>

        <section>
          <section id="ml-pipeline">
            <h1>ML Pipeline</h1>
          </section>
          <section id="ml-pipeline-goal">
            <h2>Goal of ML Pipeline</h2>
            Assemble and configure <br>
            practical distributed machine learning pipelines <br>
            as easy-to-use pieces <br>
            to compose more complex ones with ease (like Lego™ blocks)
          </section>
          <section id="ml-pipeline-features">
            <h2>Features of ML Pipeline</h2>
            <ol>
              <li>DataFrame as a dataset format</li>
              <li>ML Pipelines API is similar to scikit-learn</li>
              <li>Easy debugging (via inspecting columns added during execution)</li>
              <li>Parameter tuning</li>
              <li>Compositions (to build more complex pipelines out of existing ones)</li>
            </ol>
          </section>
          <section id="ml-pipeline-components">
            <h2>Components of ML Pipeline</h2>
            <ol>
              <li>Pipelines and PipelineStages</li>
              <li>Transformers</li>
              <li>Estimators</li>
              <li>Models</li>
              <li>Evaluators</li>
              <li>Cross Validators</li>
              <li>Params and ParamMaps</li>
            </ol>
          </section>
        </section>

        <section>
          Pipeline with Transformers, Estimator, and Model
          <img width="100%" src="images/spark-mllib-pipeline.png" style="border: 0">
        </section>

        <section>
          <section id="ml-pipeline-design">
            <h2>ML Pipeline Design</h2>
            <ol>
              <li>Choose <b>Transformers</b></li>
              <li>Select <b>Estimator</b> (to produce a <b>Model</b>)</li>
              <li>Create <b>Pipeline</b></li>
              <li><b>Fit</b> the pipeline to a training <b>Dataset</b></li>
              <li>Use the Model</li>
            </ol>
          </section>
          <section id="ml-pipeline-design-estimator">
            <h2>ML Pipeline Design Applied - Step 1</h2>
            <ul>
              <li>Selecting <a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Estimator">Estimator</a> (to produce a <b>Model</b>)</li>
              <ol>
                <li><a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression">LogisticRegression</a></li>
                <li><a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.LDA">LDA</a> (with <a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer">CountVectorizer</a>)</li>
                <li><a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS">ALS</a></li>
                <li><a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.KMeans">KMeans</a></li>
                <li><a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.classification.NaiveBayes">NaiveBayes</a></li>
                <li><a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.Word2Vec">Word2Vec</a></li>
              </ol>
            </ul>
          </section>
          <section id="ml-pipeline-design-transformers">
            <h2>ML Pipeline Design Applied - Step 2</h2>
            <ul>
              <li>Selecting <a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Transformer">Transformers</a> (for feature selection)</li>
              <ol>
                <li><a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.Tokenizer">Tokenizer</a> (or <a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer">RegexTokenizer</a>)</li>
                <li><a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.HashingTF">HashingTF</a></li>
              </ol>
            </ul>
          </section>
          <section id="ml-pipeline-design-pipeline">
            <h2>ML Pipeline Design Applied - Step 3</h2>
            <ul>
              <li>Creating <a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline">Pipeline</a></li>
              <ol>
                <li><b>new Pipeline()</b></li>
                <li><b>setStages</b></li>
              </ol>
            </ul>
          </section>
          <section id="ml-pipeline-design-fitting-model">
            <h2>ML Pipeline Design Applied - Step 4</h2>
            <ul>
              <li><b>Fitting</b> a model</li>
              <ol>
                <li>Requires a <b>training Dataset</b></li>
                <li><a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline">Pipeline.fit</a></li>
              </ol>
            </ul>
          </section>
          <section id="ml-pipeline-design-using-model">
            <h2>ML Pipeline Design Applied - Step 5</h2>
            <ul>
              <li>Using trained <b>Model</b> (to generate predictions)</li>
              <ol>
                <li>Requires a <b>real Dataset</b></li>
                <li><a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline">Pipeline.transform</a></li>
              </ol>
            </ul>
          </section>
        </section>

        <section id="demo">
          <h1>Demo</h1>
          <h1>Email Classification</h1>
          <h3>Using Logistic Regression</h3>
        </section>

        <section id="recap" data-markdown>
          <textarea data-template>
            ## Recap

            1. [Spark MLlib](#/intro)
            1. [Motivation](#/motivation)
            1. [ML Pipeline](#/ml-pipeline)
            1. [ML Pipeline Design](#/ml-pipeline-design)
            1. [Demo: Email Classification](#/demo)
          </textarea>
        </section>

        <section style="text-align: left" data-markdown id="questions">
          <textarea data-template>
            # Questions?

            * Read [Mastering Apache Spark](http://bit.ly/mastering-apache-spark)
            * Read [The Internals of Spark SQL](https://bit.ly/mastering-spark-sql)
            * Read [The Internals of Spark Structured Streaming](http://bit.ly/spark-structured-streaming)
            * Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on twitter
            * Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)
          </textarea>
        </section>

      </div>
    </div>

    <script src="reveal.js/lib/js/head.min.js"></script>
    <script src="reveal.js/js/reveal.js"></script>

    <script>
      // More info about config & dependencies:
      // - https://github.com/hakimel/reveal.js#configuration
      // - https://github.com/hakimel/reveal.js#dependencies
      Reveal.initialize({
        controls: true,
        progress: true,
        history: true,
        center: true,
        slideNumber: true,

        transition: 'slide', // none/fade/slide/convex/concave/zoom

        menu: {
          markers: true,
          openSlideNumber: true
        },
        dependencies: [
          { src: 'reveal.js/lib/js/classList.js', condition: function () { return !document.body.classList; } },
          { src: 'reveal.js/plugin/markdown/marked.js' },
          { src: 'reveal.js/plugin/markdown/markdown.js' },
          { src: 'reveal.js/plugin/zoom-js/zoom.js', async: true },
          { src: 'reveal.js/plugin/notes/notes.js', async: true },
          { src: 'reveal.js/plugin/highlight/highlight.js', async: true, callback: function () { hljs.initHighlightingOnLoad(); } }
        ]
      });
    </script>
    <script>
      (function (i, s, o, g, r, a, m) {
        i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function () {
          (i[r].q = i[r].q || []).push(arguments)
        }, i[r].l = 1 * new Date(); a = s.createElement(o),
          m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)
      })(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');

      ga('create', 'UA-45999426-3', 'auto');
      ga('send', 'pageview');

    </script>
  </body>
</html>