slides/spark-sql-Developing-Custom-Data-Source.html

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

    <title>Apache Spark™ Workshop | Spark SQL | Developing Custom Data Source</title>

    <meta name="description" content="Apache Spark™ Workshop | Spark SQL | Developing Custom Data Source">
    <meta name="author" content="Jacek Laskowski">

    <link rel="stylesheet" href="reveal.js/css/reveal.css">
    <link rel="stylesheet" href="reveal.js/css/theme/beige.css">

    <!-- Theme used for syntax highlighting of code -->
    <link rel="stylesheet" href="reveal.js/lib/css/zenburn.css">

    <!-- Jacek: custom formatting -->
    <link rel="stylesheet" href="revealjs-css/jacek.css">

    <!-- Printing and PDF exports -->
    <script>
      var link = document.createElement('link');
      link.rel = 'stylesheet';
      link.type = 'text/css';
      link.href = window.location.search.match(/print-pdf/gi) ? 'reveal.js/css/print/pdf.css' : 'reveal.js/css/print/paper.css';
      document.getElementsByTagName('head')[0].appendChild(link);
    </script>
  </head>

  <body>
    <div class="reveal">

      <div class="footer">
        <footer style="font-size: small;">
          &copy; <a href="https://medium.com/@jaceklaskowski">Jacek Laskowski</a> 2019 / <a href="https://twitter.com/jaceklaskowski">@JacekLaskowski</a>
          / jacek@japila.pl
        </footer>
      </div>

      <div class="slides">

        <section class="intro" data-transition="zoom" id="home">
          <p>
            <img width="12%" style="background:none; border:none; box-shadow:none;" data-src="images/spark-logo.png">
            <img width="6%" src="images/jacek_laskowski_20141201_512px.png" style="border: 0">
          </p>
          <h1 style="font-size: 3.37em;">Developing Custom Data Source</h1>
          <h3>Apache Spark 2.4.4 / Spark SQL</h3>

          <hr />
          <h4 style="font-size: smaller;">
            <a href="https://twitter.com/jaceklaskowski">@jaceklaskowski</a> / <a href="https://stackoverflow.com/users/1305344/jacek-laskowski">StackOverflow</a> / <a href="https://github.com/jaceklaskowski">GitHub</a>
            <br>
            The "Internals" Books: <a href="https://bit.ly/apache-spark-internals">Apache Spark</a> / <a href="https://bit.ly/spark-sql-internals">Spark SQL</a> / <a href="https://bit.ly/spark-structured-streaming">Spark Structured Streaming</a>
          </h4>
        </section>

        <section id="agenda" data-markdown>
          <textarea data-template>
            ## Agenda

            1. [Introduction](#/introduction)
            1. ["Final Product"](#/final-product)
            1. [RelationProvider](#/RelationProvider)
            1. [DataSourceRegister](#/DataSourceRegister)
            1. [BaseRelation](#/BaseRelation)
            1. [PrunedFilteredScan](#/PrunedFilteredScan)
            1. [RDD, Partitions and compute](#/rdd-partitions-compute)
            1. [Demo](#/demo)
          </textarea>
        </section>

        <section id="introduction" data-markdown style="font-size: 90%">
          <textarea data-template>
            ## Introduction

            1. **Data Source** - a Spark SQL component that is used to load or save datasets from external data storages or systems
            1. **Data Source API** - a set of interfaces (contracts) in Spark SQL to implement a custom data source
            1. **DataSource** - a class in Spark SQL that is responsible for **data source resolution**
            1. Data Source is referenced in a Spark SQL application using **DataFrameReader.format** or **DataFrameWriter.format**
            1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
              * [DataSource &mdash; Pluggable Data Provider Framework](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataSource.html)
          </textarea>
        </section>

        <section>
          <section id="final-product" data-markdown style="font-size: 90%">
            <textarea data-template>
              ## "Final Product" / Load Side (1 of 2)

              <pre style="margin-left: 0px;"><code style="width: 900px;" class="lang-scala hljs">spark
  .read
  .format("XXX") // <-- your custom data source / format
  .option("header", true) // <-- the data source supports options
  .load("data.xxx")
  .select("id", "name") // column pruning
  .where($"id" > 5)     // filter pushdown
  .show
              </code></pre>
            </textarea>
          </section>
          <section id="final-product-save-side" data-markdown style="font-size: 90%">
            <textarea data-template>
              ## "Final Product" / Save Side (2 of 2)

              <pre style="margin-left: 0px;"><code style="width: 900px;" class="lang-scala hljs">dataframe
  .write
  .format("XXX") // <-- your custom data source / format
  .option("header", true) // <-- the data source supports options
  .mode("overwrite") // modes
  .save("data.xxx")
              </code></pre>
            </textarea>
          </section>
        </section>

        <section>
          <section id="BaseRelation" data-markdown style="font-size: 85%">
            <textarea data-template>
              ## BaseRelation

              1. **BaseRelation** - the contract of relations with a known schema
                * Schema is the metadata of a data
                * Schema describes the shape of a data
              1. Does not define data itself
                * Use the data-specific contracts <small><i>(in the following slides)</i></small>


                <pre><code class="lang-scala hljs">
                  abstract class BaseRelation {
                    def sqlContext: SQLContext
                    def schema: StructType
                    def sizeInBytes: Long = sqlContext.conf.defaultSizeInBytes
                    def needConversion: Boolean = true
                    def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
                  }
                </code></pre>

              1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
                * [BaseRelation &mdash; Collection of Tuples with Schema](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-BaseRelation.html)
            </textarea>
          </section>
          <section id="PrunedScan" data-markdown style="font-size: 85%">
            <textarea data-template>
              ## PrunedScan

              1. **PrunedScan** - the contract of relations that support:
                * **Column Pruning** - eliminating unneeded columns


                <pre><code class="lang-scala hljs">
                  trait PrunedScan {
                    def buildScan(requiredColumns: Array[String]): RDD[Row]
                  }
                </code></pre>

              1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
                * [PrunedScan Contract &mdash; Relations with Column Pruning](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-PrunedScan.html)
            </textarea>
          </section>
          <section id="PrunedFilteredScan" data-markdown style="font-size: 85%">
            <textarea data-template>
              ## PrunedFilteredScan

              1. **PrunedFilteredScan** - the contract of relations that support:
                * **Column Pruning** - eliminating unneeded columns
                * **Filter Pushdown** - filtering using selected predicates


                <pre><code class="lang-scala hljs">
                  trait PrunedFilteredScan {
                    def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
                  }
                </code></pre>

              1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
                * [PrunedFilteredScan Contract &mdash; Relations with Column Pruning and Filter Pushdown](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-PrunedFilteredScan.html)
            </textarea>
          </section>
          <section id="TableScan" data-markdown style="font-size: 85%">
            <textarea data-template>
              ## TableScan

              1. **TableScan** - the contract of relations that support:
                * **Column Pruning** - eliminating unneeded columns


                <pre><code class="lang-scala hljs">
                  trait TableScan {
                    def buildScan(): RDD[Row]
                  }
                </code></pre>

              1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
                * [TableScan Contract &mdash; Relations with Column Pruning](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-TableScan.html)
            </textarea>
          </section>
        </section>

        <section id="RelationProvider" data-markdown>
          <textarea data-template>
            ## RelationProvider

            1. **RelationProvider** - the contract of `BaseRelation` providers with schema inference (discovery)
            1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
              * [RelationProvider Contract &mdash; Relation Providers With Schema Inference](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-RelationProvider.html)
          </textarea>
        </section>

        <section id="DataSourceRegister" data-markdown>
          <textarea data-template>
            ## DataSourceRegister

            1. **DataSourceRegister** is...FIXME
            1. **META-INF/services/org.apache.spark.sql.sources.DataSourceRegister**
              * DataSource uses Java's ServiceLoader to load DataSourceRegisters
            1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
              * FIXME
          </textarea>
        </section>

        <section id="rdd-partitions-compute" data-markdown>
          <textarea data-template>
            ## RDD, Partitions and compute

            1. **RDD** is...FIXME
            1. **Partition** is...FIXME
            1. **compute** is...FIXME
            1. Switch to [The Internals of Apache Spark](https://bit.ly/apache-spark-internals)
              * FIXME
          </textarea>
        </section>

        <section id="demo" data-markdown>
          <textarea data-template>
            ## Demo
          </textarea>
        </section>

        <section id="recap" data-markdown>
          <textarea data-template>
            ## Recap

            1. [DataSource](#/DataSource)
            1. ["Final Product"](#/final-product)
            1. [RelationProvider](#/RelationProvider)
            1. [DataSourceRegister](#/DataSourceRegister)
            1. [BaseRelation](#/BaseRelation)
            1. [PrunedFilteredScan](#/PrunedFilteredScan)
            1. [RDD, Partitions and compute](#/rdd-partitions-compute)
            1. [Demo](#/demo)
          </textarea>
        </section>

        <section style="text-align: left" data-markdown id="questions">
          <textarea data-template>
            # Questions?

            * Read [The Internals of Apache Spark](https://bit.ly/apache-spark-internals)
            * Read [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
            * Read [The Internals of Spark Structured Streaming](https://bit.ly/spark-structured-streaming)
            * Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on twitter
            * Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)
          </textarea>
        </section>

      </div>
    </div>

    <script src="reveal.js/lib/js/head.min.js"></script>
    <script src="reveal.js/js/reveal.js"></script>

    <script>
      // More info about config & dependencies:
      // - https://github.com/hakimel/reveal.js#configuration
      // - https://github.com/hakimel/reveal.js#dependencies
      Reveal.initialize({
        controls: true,
        progress: true,
        history: true,
        center: true,
        slideNumber: true,

        transition: 'slide', // none/fade/slide/convex/concave/zoom

        menu: {
          markers: true,
          openSlideNumber: true
        },
        dependencies: [
          { src: 'reveal.js/lib/js/classList.js', condition: function () { return !document.body.classList; } },
          { src: 'reveal.js/plugin/markdown/marked.js' },
          { src: 'reveal.js/plugin/markdown/markdown.js' },
          { src: 'reveal.js/plugin/zoom-js/zoom.js', async: true },
          { src: 'reveal.js/plugin/notes/notes.js', async: true },
          { src: 'reveal.js/plugin/highlight/highlight.js', async: true, callback: function () { hljs.initHighlightingOnLoad(); } }
        ]
      });
    </script>
    <script>
      (function (i, s, o, g, r, a, m) {
        i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function () {
          (i[r].q = i[r].q || []).push(arguments)
        }, i[r].l = 1 * new Date(); a = s.createElement(o),
          m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)
      })(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');

      ga('create', 'UA-45999426-3', 'auto');
      ga('send', 'pageview');

    </script>
  </body>
</html>