slides/spark-structured-streaming.html

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

    <title>Apache Spark™ Workshop | Spark Structured Streaming</title>

    <meta name="description" content="Apache Spark™ Workshop | Spark Structured Streaming">
    <meta name="author" content="Jacek Laskowski">

    <link rel="stylesheet" href="reveal.js/css/reveal.css">
    <link rel="stylesheet" href="reveal.js/css/theme/beige.css" id="theme">

    <!-- Theme used for syntax highlighting of code -->
    <link rel="stylesheet" href="reveal.js/lib/css/zenburn.css">

    <!-- Jacek: custom formatting -->
    <link rel="stylesheet" href="revealjs-css/jacek.css">

    <!-- Printing and PDF exports -->
    <script>
      var link = document.createElement('link');
      link.rel = 'stylesheet';
      link.type = 'text/css';
      link.href = window.location.search.match(/print-pdf/gi) ? 'reveal.js/css/print/pdf.css' : 'reveal.js/css/print/paper.css';
      document.getElementsByTagName('head')[0].appendChild(link);
    </script>
  </head>

  <body>
    <div class="reveal">

      <div class="footer">
        <footer style="font-size: small;">
          &copy; <a href="https://medium.com/@jaceklaskowski">Jacek Laskowski</a> 2019 / <a href="https://twitter.com/jaceklaskowski">@JacekLaskowski</a>
          / jacek@japila.pl
        </footer>
      </div>

      <div class="slides">

        <section class="intro" data-transition="zoom" id="home">
          <p>
            <img width="12%" style="background:none; border:none; box-shadow:none;" data-src="images/spark-logo.png">
            <img width="6%" src="images/jacek_laskowski_20141201_512px.png" style="border: 0">
          </p>
          <h1 style="font-size: 3.25em;">Spark<br/>Structured Streaming</h1>
          <h3>Apache Spark 2.4.4</h3>

          <hr />
          <h4 style="font-size: smaller;">
            <a href="https://twitter.com/jaceklaskowski">@jaceklaskowski</a> / <a href="https://stackoverflow.com/users/1305344/jacek-laskowski">StackOverflow</a> / <a href="https://github.com/jaceklaskowski">GitHub</a>
            <br>
            The "Internals" Books: <a href="https://bit.ly/apache-spark-internals">Apache Spark</a> / <a href="https://bit.ly/spark-sql-internals">Spark SQL</a> / <a href="https://bit.ly/spark-structured-streaming">Spark Structured Streaming</a>
          </h4>
        </section>

        <section id="agenda" data-markdown>
          <textarea data-template>
            ## Agenda

            1. [Spark Structured Streaming](#/intro)
            1. [DataStreamReader](#/datastreamreader)
            1. [DataStreamWriter](#/datastreamwriter)
            1. [Streaming Source](#/streaming-source)
            1. [Streaming Sink](#/streaming-sink)
            1. [Streaming Query](#/streaming-query)
            1. [StreamingQueryManager](#/streaming-query-manager)
            1. [Demo: Streaming Structured Query](#/demo)
          </textarea>
        </section>

        <section>
          <section id="intro" style="font-size: 90%">
            <h2>Spark Structured Streaming <small>(1 of 2)</small></h2>
            <ol>
              <li><b>Structured Streaming</b> is a computation model that attempts to unify streaming, interactive, and batch query execution engines</li>
              <li>Structured Streaming is a <b>stream processing engine</b> with a high-level declarative streaming API built on top of Spark SQL</li>
              <li>Continuous incremental execution of a structured query</li>
              <li>Switch to <a href="https://bit.ly/spark-structured-streaming">The Internals of Spark Structured Streaming</a>
                <ul>
                  <li><a href="https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-structured-streaming.html">Structured Streaming &mdash; Streaming Datasets</a></li>
                </ul>
              </li>
            </ol>
          </section>
          <section>
            <h2>Spark Structured Streaming <small>(2 of 2)</small></h2>

            <ol>
              <li>Spark Structured Streaming is part of Spark SQL</li>
              <li>When developing streaming applications use the following dependency</li>
            </ol>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 40px;" class="lang-scala hljs">
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
              </code></pre>
          </section>
        </section>

        <section id="datastreamreader" style="font-size: 90%">
          <h2>DataStreamReader</h2>
          <ol>
            <li><b>DataStreamReader</b> is the interface for loading data from streaming data source
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
    import org.apache.spark.sql.streaming.DataStreamReader
    val streamReader: DataStreamReader = spark.readStream
    // source + options
    val dataset: DataFrame = streamReader.load
              </code></pre>
            </li>
            <li>Streaming DataFrame represents an unbounded table</li>
            <li>Streaming query is described using Dataset API</li>
            <li>Switch to <a href="https://bit.ly/spark-structured-streaming">The Internals of Spark Structured Streaming</a>
              <ul>
                <li><a href="https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-DataStreamReader.html">DataStreamReader</a></li>
              </ul>
            </li>
          </ol>
        </section>

        <section>
          <section id="datastreamwriter">
            <h2>DataStreamWriter</h2>
            <ol>
              <li><b>DataStreamWriter</b> is the interface for writing result of a streaming query to a data sink
                <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
   val dataset: DataFrame = ...
   import org.apache.spark.sql.streaming.DataStreamWriter
   val streamWriter: DataStreamWriter = dataset.writeStream
                </code></pre>
              </li>
              <li>Switch to <a href="https://bit.ly/spark-structured-streaming">The Internals of Spark Structured Streaming</a>
                <ul>
                  <li><a href="https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-DataStreamWriter.html">DataStreamWriter</a></li>
                </ul>
              </li>
            </ol>
          </section>
          <section id="datastreamwriter-queryName" style="font-size: 85%">
            <h2>DataStreamWriter and Query Name</h2>
            <ol>
              <li><b>queryName</b> specifies the name of a streaming query</li>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
queryName(queryName: String): DataStreamWriter[T]
              </code></pre>
              <li>The name must be unique among all the currently active queries in the associated SparkSession</li>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
val streamWriter: DataStreamWriter = ...
val namedStreamWriter: DataStreamWriter = streamWriter.queryName("name")
              </code></pre>
            </ol>
          </section>
          <section id="datastreamwriter-outputMode" style="font-size: 85%">
            <h2>DataStreamWriter and Output Mode</h2>
            <ol>
              <li><b>Output mode</b> specifies <b>when</b> and <b>what</b> (output) rows of a streaming query are written to the sink</li>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
        outputMode(outputMode: String): DataStreamWriter[T]
        outputMode(outputMode: OutputMode): DataStreamWriter[T]
              </code></pre>
              <li><b>append</b> writes only the new rows in a streaming query</li>
              <li><b>complete</b> writes all the rows in an streaming aggregation query every time there are some updates</li>
              <li><b>update</b> writes only the rows that were updated in a streaming query every time there are some updates
                <ul>
                  <li>Equivalent to append mode if the query doesn't use aggregations</li>
                </ul>
              </li>
            </ol>
          </section>
          <section id="datastreamwriter-trigger" style="font-size: 85%">
            <h2>Setting Trigger</h2>
            <ol>
              <li><b>trigger</b> sets how often a streaming query is requested for a result</li>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
trigger(trigger: Trigger): DataStreamWriter[T]
              </code></pre>
              <li>Use <b>Trigger.ProcessingTime</b></li>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
trigger(Trigger.ProcessingTime("10 seconds"))

import scala.concurrent.duration._
trigger(Trigger.ProcessingTime(10.seconds))

import java.util.concurrent.TimeUnit
trigger(ProcessingTime.create(10, TimeUnit.SECONDS))
              </code></pre>
              <li>Defaults to <i>as fast as possible</i>, i.e. <b>Trigger.ProcessingTime(0)</b></li>
            </ol>
          </section>
          <section id="datastreamwriter-foreach" style="font-size: 85%">
            <h2>foreach and ForeachWriter <small>(1 of 2)</small></h2>
            <ol>
              <li><b>DataStreamWriter.foreach</b> allows for defining a custom data sink and will continually send results to the given <b>ForeachWriter</b> as a new data arrives</li>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
foreach(writer: ForeachWriter[T]): DataStreamWriter[T]
              </code></pre>
              <li><b>ForeachWriter</b> can be used to send the generated data to an external system</li>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
abstract class ForeachWriter[T] {
  abstract def close(errorOrNull: Throwable): Unit
  abstract def open(partitionId: Long, version: Long): Boolean
  abstract def process(value: T): Unit
}
              </code></pre>
            </ol>
          </section>
          <section id="datastreamwriter-foreach-2" style="font-size: 85%">
            <h2>foreach and ForeachWriter <small>(2 of 2)</small></h2>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
val streamWriter: DataStreamWriter = ...
import org.apache.spark.sql.ForeachWriter
val streamWriterWithForeachSink: DataStreamWriter =
  streamWriter.foreach(new ForeachWriter[Long] {
    override def open(partitionId: Long, version: Long) = true

    override def process(value: Long): Unit = {
      println(s">>> $value")
    }

    override def close(errorOrNull: Throwable): Unit = {}
  })
              </code></pre>
          </section>
          <section id="datastreamwriter-foreachBatch" style="font-size: 85%">
            <h2>foreachBatch <small>(1 of 2)</small></h2>
            <ol>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
foreachBatch(function: (Dataset[T], Long) => Unit): DataStreamWriter[T]
              </code></pre>
              <li><b>DataStreamWriter.foreachBatch</b> allows for defining a custom function that can work with the micro-batch output as a dataframe for the following:
                <ul>
                  <li>Pass the output rows of each batch to a library that is designed for the batch jobs only</li>
                  <li>Reuse batch data sources for output whose streaming version does not exist</li>
                  <li>Multi-writes where the output rows are written to multiple outputs by writing twice for every batch</li>
                </ul>
              </li>
              <li><b>New in 2.4.0</b></li>
            </ol>
          </section>
          <section id="datastreamwriter-foreachBatch-2" style="font-size: 85%">
            <h2>foreachBatch <small>(2 of 2)</small></h2>
            <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
        import org.apache.spark.sql.Dataset
        spark.readStream
          .format("rate")
          .load
          .writeStream
          .foreachBatch { (output: Dataset[_], batchId: Long) =>
            println(s"Batch ID: $batchId")
            output.show
          }
            </code></pre>
          </section>
          <section id="datastreamwriter-start" style="font-size: 85%">
            <h2>Starting Streaming Query</h2>
            <ol>
              <li><b>start</b> starts execution of a streaming query that will continually output results to a sink as new data arrives</li>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
start(): StreamingQuery
              </code></pre>
              <li>Returns <b>StreamingQuery</b> that can be used to interact with the streaming query</li>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
import org.apache.spark.sql.streaming.StreamingQuery
val query: StreamingQuery = counter.writeStream.start
              </code></pre>
            </ol>
          </section>
        </section>

        <section id="streaming-source" style="font-size: 90%">
          <h2>Streaming Source</h2>
          <ol>
            <li><b>Streaming Source</b> acts as a continuous stream of data for a streaming query</li>
            <li>Defined using <b>format</b> method on <b>DataStreamReader</b>
              <ul>
                <li>Uses <b>shortName</b> of a source</li>
              </ul>
            </li>
            <li><b>FileStreamSource</b> and <b>TextSocketSource</b>
            <li><b>KafkaSource</b> for Apache Kafka 0.10+</li>
            <li><b>RateStreamSource</b> and <b>MemoryStream</b> for unit tests, PoCs, tutorials and debugging</li>
            <li>Switch to <a href="https://bit.ly/spark-structured-streaming">The Internals of Spark Structured Streaming</a>
              <ul>
                <li><a href="https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-Source.html">Streaming Data Source</a></li>
              </ul>
            </li>
          </ol>
        </section>

        <section id="streaming-sink" style="font-size: 90%">
          <h2>Streaming Sink</h2>
          <ol>
            <li><b>Streaming Sink</b> represents an external storage to write streaming datasets to.</li>
            <li>Defined using <b>format</b> method on <b>DataStreamWriter</b>
              <ul>
                <li>Uses <b>shortName</b> of a sink</li>
              </ul>
            </li>
            <li><b>ConsoleSink</b>, <b>FileStreamSink</b> and <b>ForeachSink</b></li>
            <li><b>KafkaSink</b> for Apache Kafka 0.10+</li>
            <li><b>MemorySink</b> for unit tests, tutorials and debugging</li>
            <li>Switch to <a href="https://bit.ly/spark-structured-streaming">The Internals of Spark Structured Streaming</a>
              <ul>
                <li><a href="https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-Sink.html">Streaming Sink</a></li>
              </ul>
            </li>
          </ol>
        </section>

        <section id="streaming-query" style="font-size: 95%">
          <h2>StreamingQuery</h2>
          <ol>
            <li><b>StreamingQuery</b> represents a streaming query
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
    import org.apache.spark.sql.streaming.StreamingQuery
    val query: StreamingQuery = counter.writeStream.start
              </code></pre>
            </li>
            <li><b>id</b> is the unique id of a query</li>
            <li><b>runId</b> is the unique id of the run of a query</li>
            <li>Use <b>awaitTermination</b> to wait for the termination of a query, either by <b>query.stop</b> or by an exception</li>
            <li>Use <b>stop</b> to stop execution of a query</li>
            <li>Switch to <a href="https://bit.ly/spark-structured-streaming">The Internals of Spark Structured Streaming</a>
              <ul>
                <li><a href="https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-StreamingQuery.html">StreamingQuery</a></li>
              </ul>
            </li>
          </ol>
        </section>

        <section id="streaming-query-manager">
          <h2>StreamingQueryManager — Streaming Query Management</h2>
          <ol>
            <li><b>StreamingQueryManager</b> is the Management API for streaming queries in a <b>SparkSession</b>
              <pre style="margin-left: 0px;"><code style="width: 900px; padding-left: 50px;" class="lang-scala hljs">
        val qm: StreamingQueryManager = spark.streams
              </code></pre>
            </li>
            <li>Switch to <a href="https://bit.ly/spark-structured-streaming">The Internals of Spark Structured Streaming</a>
              <ul>
                <li><a href="https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-StreamingQueryManager.html">StreamingQueryManager &mdash; Streaming Query Management</a></li>
              </ul>
            </li>
          </ol>
        </section>

        <section id="demo" data-markdown>
          <textarea data-template>
            ## Demo: Streaming Structured Query

            1. Basic rate-source-to-console-sink pipeline
            1. Using **rate** streaming source and **console** streaming sink
            ```scala
        spark
          .readStream
          .format("rate") // <-- rate source
          .load
          .writeStream
          .format("console") // <-- console sink
          .trigger(Trigger.ProcessingTime(5.seconds))
          .option("truncate", false)
          .queryName("rate2console")
          .start
          .awaitTermination
            ```
          </textarea>
        </section>

        <section id="recap" data-markdown>
          <textarea data-template>
            ## Recap

            1. [Spark Structured Streaming](#/intro)
            1. [DataStreamReader](#/datastreamreader)
            1. [DataStreamWriter](#/datastreamwriter)
            1. [Streaming Source](#/streaming-source)
            1. [Streaming Sink](#/streaming-sink)
            1. [Streaming Query](#/streaming-query)
            1. [StreamingQueryManager](#/streaming-query-manager)
            1. [Demo: Streaming Structured Query](#/demo)
          </textarea>
        </section>

        <section style="text-align: left" data-markdown id="questions">
          <textarea data-template>
            # Questions?

            * Read [The Internals of Apache Spark](https://bit.ly/apache-spark-internals)
            * Read [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
            * Read [The Internals of Spark Structured Streaming](https://bit.ly/spark-structured-streaming)
            * Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on twitter
            * Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)
          </textarea>
        </section>

      </div>
    </div>

    <script src="reveal.js/lib/js/head.min.js"></script>
    <script src="reveal.js/js/reveal.js"></script>

    <script>
      // More info about config & dependencies:
      // - https://github.com/hakimel/reveal.js#configuration
      // - https://github.com/hakimel/reveal.js#dependencies
      Reveal.initialize({
        controls: true,
        progress: true,
        history: true,
        center: true,
        slideNumber: true,

        transition: 'slide', // none/fade/slide/convex/concave/zoom

        menu: {
          markers: true,
          openSlideNumber: true
        },
        dependencies: [
          { src: 'reveal.js/lib/js/classList.js', condition: function () { return !document.body.classList; } },
          { src: 'reveal.js/plugin/markdown/marked.js' },
          { src: 'reveal.js/plugin/markdown/markdown.js' },
          { src: 'reveal.js/plugin/zoom-js/zoom.js', async: true },
          { src: 'reveal.js/plugin/notes/notes.js', async: true },
          { src: 'reveal.js/plugin/highlight/highlight.js', async: true, callback: function () { hljs.initHighlightingOnLoad(); } }
        ]
      });
    </script>
    <script>
      (function (i, s, o, g, r, a, m) {
        i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function () {
          (i[r].q = i[r].q || []).push(arguments)
        }, i[r].l = 1 * new Date(); a = s.createElement(o),
          m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)
      })(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');

      ga('create', 'UA-45999426-3', 'auto');
      ga('send', 'pageview');

    </script>
  </body>
</html>