Merge pull request #37 from morazow/feature/df-insert-into-exasol
Spark DataFrame insert into Exasol table feature
morazow authored Jan 21, 2019
2 parents 79bab09 + cc6f7d6 commit 3a4c43e
Showing 23 changed files with 1,281 additions and 198 deletions.
8 changes: 4 additions & 4 deletions .travis.yml
@@ -45,19 +45,19 @@ matrix:
  include:
    - jdk: openjdk8
      scala: 2.11.12
      env: SPARK_VERSION="2.3.2"   # was "2.1.0"

    - jdk: oraclejdk8
      scala: 2.11.12
      env: SPARK_VERSION="2.3.2"   # was "2.1.0"

    - jdk: openjdk8
      scala: 2.11.12
      env: SPARK_VERSION="2.4.0"   # was "2.3.1"

    - jdk: oraclejdk8
      scala: 2.11.12
      env: SPARK_VERSION="2.4.0" RELEASE=false   # was "2.3.1"

script:
  - travis_wait 30 ./scripts/ci.sh
163 changes: 120 additions & 43 deletions README.md
@@ -2,47 +2,71 @@

[![Build Status][travis-badge]][travis-link]
[![Codecov][codecov-badge]][codecov-link]
[![Maven Central][maven-img-badge]][maven-link]

###### Please note that this is an open source project which is *not officially supported* by Exasol. We will try to help you as much as possible, but can't guarantee anything since this is not an official Exasol product.

## Overview

This is a connector library that supports an integration between
[Exasol][exasol] and [Apache Spark][spark]. Using this connector, users can
create Spark dataframes from Exasol queries and save Spark dataframes as Exasol
tables.

The implementation is based on Spark [DataSources API][spark-ds-api] and Exasol
[Sub Connections][sol-546].

* [Quick Start](#quick-start)
* [Usage](#usage)
* [Configuration](#configuration)
* [Building and Testing](#building-and-testing)

## Quick Start

Here are short code snippets showing how to use the connector in your Spark /
Scala applications.

Reading data from Exasol as a Spark dataframe:

```scala
// An Exasol SQL syntax query string
val exasolQueryString =
  """
    SELECT SALES_DATE, MARKET_ID, PRICE
    FROM RETAIL.SALES
    WHERE MARKET_ID IN (661, 534, 667)
  """

val df = sparkSession
  .read
  .format("exasol")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("query", exasolQueryString)
  .load()
```
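
The loaded `df` is a regular Spark dataframe, so all standard Spark operations
apply to it from here on. For example (standard Spark SQL only, nothing
connector-specific):

```scala
// Register the loaded dataframe and continue with standard Spark SQL.
df.createOrReplaceTempView("sales")

val salesPerMarket = sparkSession.sql(
  "SELECT MARKET_ID, COUNT(*) AS CNT FROM sales GROUP BY MARKET_ID"
)
salesPerMarket.show(10, false)
```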

Saving a Spark dataframe as an Exasol table:

```scala
df
  .write
  .mode("append")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("table", "RETAIL.ADJUSTED_SALES")
  .format("exasol")
  .save()
```

Additionally, you can set the parameters on `SparkConf`:

```scala
// Configure the Spark session
val sparkConf = new SparkConf()
  .setMaster("local[*]")
  .set("spark.exasol.host", "localhost")
  // @@ -51,45 +51,75 @@ lines elided in the diff view
  .set("spark.exasol.password", "exasol")
  .set("spark.exasol.max_nodes", "200")

val sparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()

val queryStr = "SELECT * FROM MY_SCHEMA.MY_TABLE"

val df = sparkSession
  .read
  .format("exasol")
  .option("query", queryStr)
  .load()

df.show(10, false)
```

Please note that parameter values set on the Spark configuration have higher
priority.

For an example walkthrough please check
[docs/example-walkthrough](docs/example-walkthrough.md).

## Usage

The latest release version ([![Maven Central][maven-reg-badge]][maven-link]) is
compiled against Scala 2.11 and Spark 2.1+.

To use the connector in your Java or Scala applications, include it as a
dependency by adding the artifact information to your `build.sbt` or `pom.xml`
file.

### build.sbt

```scala
resolvers ++= Seq("Exasol Releases" at "https://maven.exasol.com/artifactory/exasol-releases")

libraryDependencies += "com.exasol" % "spark-connector_2.11" % "$LATEST_VERSION"
```

### pom.xml

```xml
<repository>
<id>maven.exasol.com</id>
<url>https://maven.exasol.com/artifactory/exasol-releases</url>
</repository>

<dependency>
<groupId>com.exasol</groupId>
<artifactId>spark-connector_2.11</artifactId>
<version>$LATEST_VERSION</version>
</dependency>
```

### Alternative Option

As an alternative, you can provide `--repositories` and `--packages` artifact
coordinates to the **spark-submit**, **spark-shell** or **pyspark** commands.

For example:

```sh
spark-shell \
  --repositories https://maven.exasol.com/artifactory/exasol-releases \
  --packages com.exasol:spark-connector_2.11:$LATEST_VERSION
```
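
Once the shell starts, the `exasol` format is available to the predefined
`spark` session, so the Quick Start snippets work unchanged. For example (a
sketch; the connection values are the example ones from above):

```scala
// Inside the started spark-shell; `spark` is the predefined session.
val df = spark.read
  .format("exasol")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("query", "SELECT * FROM MY_SCHEMA.MY_TABLE")
  .load()

df.show(10, false)
```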

### Deployment

Similarly, you can submit a packaged application to the Spark cluster.

Using spark-submit:

```sh
spark-submit \
  --master spark://spark-master-url:7077 \
  --repositories https://maven.exasol.com/artifactory/exasol-releases \
  --packages com.exasol:spark-connector_2.11:$LATEST_VERSION \
  --class com.myorg.SparkExasolConnectorApp \
  --conf spark.exasol.password=exaTru3P@ss \
  path/to/project/folder/target/scala-2.11/sparkexasolconnectorapp_2.11-5.3.1.jar
```

This deployment example also shows that you can configure the Exasol parameters
at startup using `--conf spark.exasol.keyName=value` syntax.
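
For reference, here is a minimal sketch of what such an application might look
like. The package and object names mirror the hypothetical `--class` value from
the command above, the query string is the Quick Start example, and the
connection settings are assumed to arrive via `--conf`:

```scala
package com.myorg

import org.apache.spark.sql.SparkSession

object SparkExasolConnectorApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkExasolConnectorApp")
      .getOrCreate()

    // Connection settings such as spark.exasol.password are supplied at
    // submit time via --conf, so they are not repeated here.
    val df = spark.read
      .format("exasol")
      .option("query", "SELECT * FROM MY_SCHEMA.MY_TABLE")
      .load()

    df.show(10, false)
    spark.stop()
  }
}
```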

**Please update the `$LATEST_VERSION` placeholder with the latest artifact
version number.**

## Configuration

The following configuration parameters can be provided, mainly to facilitate a
connection to the Exasol cluster.

| Spark Configuration | Configuration | Default | Description |
|:---|:---|:---|:---|
| | `query` | *<none>* | A query string to send to Exasol |
| | `table` | *<none>* | A table name (with schema, e.g. `my_schema.my_table`) to save the dataframe into |
| `spark.exasol.host` | `host` | `localhost` | The IP address of the **first** Exasol node (e.g. `10.0.0.11`) |
| `spark.exasol.port` | `port` | `8888` | The port number for connecting to the Exasol nodes (e.g. `8563`) |
| `spark.exasol.username` | `username` | `sys` | An Exasol username for logging in |
| `spark.exasol.password` | `password` | `exasol` | An Exasol password for logging in |
| `spark.exasol.max_nodes` | `max_nodes` | `200` | The number of data nodes in the Exasol cluster |
| `spark.exasol.batch_size` | `batch_size` | `1000` | The number of records batched per execute statement when saving a dataframe |
| `spark.exasol.create_table` | `create_table` | `false` | Whether the connector may create the table if it does not exist when saving a dataframe |
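
As an illustration, the save-related parameters can be combined on a single
write (a sketch; the table name is an example and the option values shown are
assumptions, passed as strings):

```scala
// A sketch combining the save-related options from the table above.
df.write
  .mode("append")
  .format("exasol")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("table", "RETAIL.ADJUSTED_SALES")
  .option("batch_size", "5000")   // records per batched execute statement
  .option("create_table", "true") // allow creating the table if it is missing
  .save()
```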

## Building and Testing

Clone the repository,
@@ -141,20 +230,6 @@

This creates a jar file under the `target/` folder. The jar file can be used
with `spark-shell`:

```sh
spark-shell --jars /path/to/spark-exasol-connector-assembly-*.jar
```


## FAQ

- Getting a `Connection was lost and could not be reestablished` error
@@ -189,10 +264,12 @@
[travis-link]: https://travis-ci.org/exasol/spark-exasol-connector
[codecov-badge]: https://codecov.io/gh/exasol/spark-exasol-connector/branch/master/graph/badge.svg
[codecov-link]: https://codecov.io/gh/exasol/spark-exasol-connector
[maven-img-badge]: https://img.shields.io/maven-central/v/com.exasol/spark-connector_2.11.svg
[maven-reg-badge]: https://maven-badges.herokuapp.com/maven-central/com.exasol/spark-connector_2.11/badge.svg
[maven-link]: https://maven-badges.herokuapp.com/maven-central/com.exasol/spark-connector_2.11
[exasol]: https://www.exasol.com/en/
[spark]: https://spark.apache.org/
[spark-ds-api]: https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
[docker]: https://www.docker.com/
[exa-docker-db]: https://hub.docker.com/r/exasol/docker-db/
[testcontainers]: https://www.testcontainers.org/