Merge pull request #37 from morazow/feature/df-insert-into-exasol
Spark DataFrame insert into Exasol table feature
morazow authored Jan 21, 2019
2 parents 79bab09 + cc6f7d6 commit 3a4c43e
Showing 23 changed files with 1,281 additions and 198 deletions.
8 changes: 4 additions & 4 deletions .travis.yml
@@ -45,19 +45,19 @@ matrix:
  include:
    - jdk: openjdk8
      scala: 2.11.12
      env: SPARK_VERSION="2.3.2"   # was "2.1.0"

    - jdk: oraclejdk8
      scala: 2.11.12
      env: SPARK_VERSION="2.3.2"   # was "2.1.0"

    - jdk: openjdk8
      scala: 2.11.12
      env: SPARK_VERSION="2.4.0"   # was "2.3.1"

    - jdk: oraclejdk8
      scala: 2.11.12
      env: SPARK_VERSION="2.4.0" RELEASE=false   # was "2.3.1"

script:
  - travis_wait 30 ./scripts/ci.sh
163 changes: 120 additions & 43 deletions README.md
@@ -2,47 +2,71 @@

[![Build Status][travis-badge]][travis-link]
[![Codecov][codecov-badge]][codecov-link]
[![Maven Central][maven-img-badge]][maven-link]

###### Please note that this is an open source project which is *not officially supported* by Exasol. We will try to help you as much as possible, but can't guarantee anything since this is not an official Exasol product.

## Overview

This is a connector library that supports an integration between
[Exasol][exasol] and [Apache Spark][spark]. Using this connector, users can
create Spark dataframes from Exasol queries and save Spark dataframes as Exasol
tables.

The implementation is based on Spark [DataSources API][spark-ds-api] and Exasol
[Sub Connections][sol-546].

* [Quick Start](#quick-start)
* [Usage](#usage)
* [Configuration](#configuration)
* [Building and Testing](#building-and-testing)

## Quick Start

Here are short code snippets showing how to use the connector in your Spark /
Scala applications.

Reading data from Exasol as a Spark dataframe:

```scala
// An Exasol SQL syntax query string
val exasolQueryString =
  """
    SELECT SALES_DATE, MARKET_ID, PRICE
    FROM RETAIL.SALES
    WHERE MARKET_ID IN (661, 534, 667)
  """

val df = sparkSession
  .read
  .format("exasol")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("query", exasolQueryString)
  .load()
```
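
The loaded `df` is a regular Spark dataframe, so all standard Spark operations
apply to it from here on. For example (standard Spark SQL only, nothing
connector-specific):

```scala
// Register the loaded dataframe and continue with standard Spark SQL.
df.createOrReplaceTempView("sales")

val salesPerMarket = sparkSession.sql(
  "SELECT MARKET_ID, COUNT(*) AS CNT FROM sales GROUP BY MARKET_ID"
)
salesPerMarket.show(10, false)
```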

Saving a Spark dataframe as an Exasol table:

```scala
df
  .write
  .mode("append")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("table", "RETAIL.ADJUSTED_SALES")
  .format("exasol")
  .save()
```

Additionally, you can set the parameters on `SparkConf`:

```scala
// Configure the Spark session
val sparkConf = new SparkConf()
  .setMaster("local[*]")
  .set("spark.exasol.host", "localhost")
  // @@ -51,45 +51,75 @@ lines elided in the diff view
  .set("spark.exasol.password", "exasol")
  .set("spark.exasol.max_nodes", "200")

val sparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()

val queryStr = "SELECT * FROM MY_SCHEMA.MY_TABLE"

val df = sparkSession
  .read
  .format("exasol")
  .option("query", queryStr)
  .load()

df.show(10, false)
```

Please note that parameter values set on the Spark configuration have higher
priority.

For an example walkthrough please check
[docs/example-walkthrough](docs/example-walkthrough.md).

## Usage

The latest release version ([![Maven Central][maven-reg-badge]][maven-link]) is
compiled against Scala 2.11 and Spark 2.1+.

To use the connector in your Java or Scala applications, include it as a
dependency by adding the artifact information to your `build.sbt` or `pom.xml`
file.

### build.sbt

```scala
resolvers ++= Seq("Exasol Releases" at "https://maven.exasol.com/artifactory/exasol-releases")

libraryDependencies += "com.exasol" % "spark-connector_2.11" % "$LATEST_VERSION"
```

### pom.xml

```xml
<repository>
<id>maven.exasol.com</id>
<url>https://maven.exasol.com/artifactory/exasol-releases</url>
</repository>

<dependency>
<groupId>com.exasol</groupId>
<artifactId>spark-connector_2.11</artifactId>
<version>$LATEST_VERSION</version>
</dependency>
```

### Alternative Option

As an alternative, you can provide `--repositories` and `--packages` artifact
coordinates to the **spark-submit**, **spark-shell** or **pyspark** commands.

For example:

```sh
spark-shell \
  --repositories https://maven.exasol.com/artifactory/exasol-releases \
  --packages com.exasol:spark-connector_2.11:$LATEST_VERSION
```
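
Once the shell starts, the `exasol` format is available to the predefined
`spark` session, so the Quick Start snippets work unchanged. For example (a
sketch; the connection values are the example ones from above):

```scala
// Inside the started spark-shell; `spark` is the predefined session.
val df = spark.read
  .format("exasol")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("query", "SELECT * FROM MY_SCHEMA.MY_TABLE")
  .load()

df.show(10, false)
```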

### Deployment

Similarly, you can submit a packaged application to the Spark cluster.

Using spark-submit:

```sh
spark-submit \
  --master spark://spark-master-url:7077 \
  --repositories https://maven.exasol.com/artifactory/exasol-releases \
  --packages com.exasol:spark-connector_2.11:$LATEST_VERSION \
  --class com.myorg.SparkExasolConnectorApp \
  --conf spark.exasol.password=exaTru3P@ss \
  path/to/project/folder/target/scala-2.11/sparkexasolconnectorapp_2.11-5.3.1.jar
```

This deployment example also shows that you can configure the Exasol parameters
at startup using `--conf spark.exasol.keyName=value` syntax.
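
For reference, here is a minimal sketch of what such an application might look
like. The package and object names mirror the hypothetical `--class` value from
the command above, the query string is the Quick Start example, and the
connection settings are assumed to arrive via `--conf`:

```scala
package com.myorg

import org.apache.spark.sql.SparkSession

object SparkExasolConnectorApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkExasolConnectorApp")
      .getOrCreate()

    // Connection settings such as spark.exasol.password are supplied at
    // submit time via --conf, so they are not repeated here.
    val df = spark.read
      .format("exasol")
      .option("query", "SELECT * FROM MY_SCHEMA.MY_TABLE")
      .load()

    df.show(10, false)
    spark.stop()
  }
}
```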

**Please update the `$LATEST_VERSION` placeholder with the latest artifact
version number.**

## Configuration

The following configuration parameters can be provided, mainly to facilitate a
connection to the Exasol cluster.

| Spark Configuration | Configuration | Default | Description |
|:---|:---|:---|:---|
| | `query` | *<none>* | A query string to send to Exasol |
| | `table` | *<none>* | A table name (with schema, e.g. `my_schema.my_table`) to save the dataframe into |
| `spark.exasol.host` | `host` | `localhost` | The IP address of the **first** Exasol node (e.g. `10.0.0.11`) |
| `spark.exasol.port` | `port` | `8888` | The port number for connecting to the Exasol nodes (e.g. `8563`) |
| `spark.exasol.username` | `username` | `sys` | An Exasol username for logging in |
| `spark.exasol.password` | `password` | `exasol` | An Exasol password for logging in |
| `spark.exasol.max_nodes` | `max_nodes` | `200` | The number of data nodes in the Exasol cluster |
| `spark.exasol.batch_size` | `batch_size` | `1000` | The number of records batched per execute statement when saving a dataframe |
| `spark.exasol.create_table` | `create_table` | `false` | Whether the connector may create the table if it does not exist when saving a dataframe |
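
As an illustration, the save-related parameters can be combined on a single
write (a sketch; the table name is an example and the option values shown are
assumptions, passed as strings):

```scala
// A sketch combining the save-related options from the table above.
df.write
  .mode("append")
  .format("exasol")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("table", "RETAIL.ADJUSTED_SALES")
  .option("batch_size", "5000")   // records per batched execute statement
  .option("create_table", "true") // allow creating the table if it is missing
  .save()
```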

## Building and Testing

Clone the repository,
@@ -141,20 +230,6 @@

This creates a jar file under the `target/` folder. The jar file can be used
with `spark-shell`:

```sh
spark-shell --jars /path/to/spark-exasol-connector-assembly-*.jar
```


## FAQ

- Getting a `Connection was lost and could not be reestablished` error
@@ -189,10 +264,12 @@
[travis-link]: https://travis-ci.org/exasol/spark-exasol-connector
[codecov-badge]: https://codecov.io/gh/exasol/spark-exasol-connector/branch/master/graph/badge.svg
[codecov-link]: https://codecov.io/gh/exasol/spark-exasol-connector
[maven-img-badge]: https://img.shields.io/maven-central/v/com.exasol/spark-connector_2.11.svg
[maven-reg-badge]: https://maven-badges.herokuapp.com/maven-central/com.exasol/spark-connector_2.11/badge.svg
[maven-link]: https://maven-badges.herokuapp.com/maven-central/com.exasol/spark-connector_2.11
[exasol]: https://www.exasol.com/en/
[spark]: https://spark.apache.org/
[spark-ds-api]: https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
[docker]: https://www.docker.com/
[exa-docker-db]: https://hub.docker.com/r/exasol/docker-db/
[testcontainers]: https://www.testcontainers.org/