# Spark DataFrame insert into Exasol Table #37

**Merged** 13 commits on Jan 21, 2019.
8 changes: 4 additions & 4 deletions .travis.yml
@@ -45,19 +45,19 @@ matrix:
  include:
    - jdk: openjdk8
      scala: 2.11.12
-     env: SPARK_VERSION="2.1.0"
+     env: SPARK_VERSION="2.3.2"

    - jdk: oraclejdk8
      scala: 2.11.12
-     env: SPARK_VERSION="2.1.0"
+     env: SPARK_VERSION="2.3.2"

    - jdk: openjdk8
      scala: 2.11.12
-     env: SPARK_VERSION="2.3.1"
+     env: SPARK_VERSION="2.4.0"

    - jdk: oraclejdk8
      scala: 2.11.12
-     env: SPARK_VERSION="2.3.1" RELEASE=false
+     env: SPARK_VERSION="2.4.0" RELEASE=false

script:
  - travis_wait 30 ./scripts/ci.sh
163 changes: 120 additions & 43 deletions README.md
@@ -2,47 +2,71 @@

[![Build Status][travis-badge]][travis-link]
[![Codecov][codecov-badge]][codecov-link]
-[![Maven Central][maven-badge]][maven-link]
+[![Maven Central][maven-img-badge]][maven-link]

###### Please note that this is an open source project which is *not officially supported* by Exasol. We will try to help you as much as possible, but can't guarantee anything since this is not an official Exasol product.

## Overview

This is a connector library that supports an integration between
[Exasol][exasol] and [Apache Spark][spark]. Using this connector, users can
-read/write data from/to Exasol using Spark.
+create Spark dataframes from Exasol queries and save Spark dataframes as
+Exasol tables.

The implementation is based on Spark [DataSources API][spark-ds-api] and Exasol
[Sub Connections][sol-546].

* [Quick Start](#quick-start)
* [Usage](#usage)
-* [Building and Testing](#building-and-testing)
+* [Configuration](#configuration)
+* [Building and Testing](#building-and-testing)

## Quick Start

-Here is a short quick start on how to use the connector.
+Here are short code snippets on how to use the connector in your Spark /
+Scala applications.

-Reading data from Exasol,
+Reading data from Exasol as a Spark dataframe:

```scala
-// This is Exasol SQL syntax
-val exasolQueryString = "SELECT * FROM MY_SCHEMA.MY_TABLE"
+// An Exasol SQL syntax query string
+val exasolQueryString =
+  """
+    SELECT SALES_DATE, MARKET_ID, PRICE
+    FROM RETAIL.SALES
+    WHERE MARKET_ID IN (661, 534, 667)
+  """

val df = sparkSession
  .read
  .format("exasol")
-  .option("host", "localhost")
-  .option("port", "8888")
+  .option("host", "10.0.0.11")
+  .option("port", "8563")
  .option("username", "sys")
-  .option("password", "exasol")
+  .option("password", "exaPass")
  .option("query", exasolQueryString)
  .load()
```
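
Because the connector reads over parallel sub-connections, the loaded
dataframe arrives already partitioned. As a quick sanity check, you can
inspect that parallelism (an editorial sketch, not part of this change; it
assumes the `df` loaded in the snippet above):

```scala
// The partition count reflects the parallelism of the Exasol read; the
// exact number depends on the cluster layout and the `max_nodes` setting.
println(s"Partitions: ${df.rdd.getNumPartitions}")
```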

+Saving a Spark dataframe as an Exasol table:

-df.show(10, false)
+```scala
+df.write
+  .mode("append")
+  .option("host", "10.0.0.11")
+  .option("port", "8563")
+  .option("username", "sys")
+  .option("password", "exaPass")
+  .option("table", "RETAIL.ADJUSTED_SALES")
+  .format("exasol")
+  .save()
+```

-Or using spark configurations: (this will have higher priority)
+Additionally, you can set the parameters on `SparkConf`:

```scala
-// config spark session
+// Configure spark session
val sparkConf = new SparkConf()
  .setMaster("local[*]")
  .set("spark.exasol.host", "localhost")
@@ -51,45 +75,110 @@ val sparkConf = new SparkConf()
.set("spark.exasol.password", "exasol")
.set("spark.exasol.max_nodes", "200")

val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
val sparkSession = SparkSession
.builder()
.config(sparkConf)
.getOrCreate()

-// This is Exasol SQL syntax
-val exasolQueryString = "SELECT * FROM MY_SCHEMA.MY_TABLE"
+val queryStr = "SELECT * FROM MY_SCHEMA.MY_TABLE"

val df = sparkSession
  .read
  .format("exasol")
-  .option("query", exasolQueryString)
+  .option("query", queryStr)
  .load()

df.show(10, false)
```

-For more examples you can check [docs/examples](docs/examples.md).
+Please note that parameter values set on the Spark configuration will have
+higher priority.
+
+For an example walkthrough please check
+[docs/example-walkthrough](docs/example-walkthrough.md).
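
For instance, when the same setting appears in both places, the `SparkConf`
value is the one that takes effect. A minimal sketch of this precedence rule
(editorial; the host addresses are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setMaster("local[*]")
  .set("spark.exasol.host", "10.0.0.11") // this value takes effect

val spark = SparkSession.builder().config(conf).getOrCreate()

val df = spark.read
  .format("exasol")
  .option("host", "10.0.0.99") // overridden by spark.exasol.host above
  .option("query", "SELECT * FROM MY_SCHEMA.MY_TABLE")
  .load()
```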

## Usage

-You can include the connector as a dependency in your projects. Please find the
-latest versions at [Maven Central
-Repositories](https://mvnrepository.com/artifact/com.exasol) for
-`spark-connector`
+The latest release version ([![Maven Central][maven-reg-badge]][maven-link]) is
+compiled against Scala 2.11 and Spark 2.1+.

-Using SBT:
+In order to use the connector in your Java or Scala applications, include it
+as a dependency by adding the artifact information to your `build.sbt` or
+`pom.xml` file.

+### build.sbt

```scala
-libraryDependencies += "com.exasol" %% "spark-connector" % "<latest-version>"
+resolvers ++= Seq("Exasol Releases" at "https://maven.exasol.com/artifactory/exasol-releases")
+
+libraryDependencies += "com.exasol" % "spark-connector_2.11" % "$LATEST_VERSION"
```

-Using Maven:
+### pom.xml

```xml
<repository>
<id>maven.exasol.com</id>
<url>https://maven.exasol.com/artifactory/exasol-releases</url>
</repository>

<dependency>
<groupId>com.exasol</groupId>
<artifactId>spark-connector_2.11</artifactId>
<version>latest-version</version>
<version>$LATEST_VERSION</version>
</dependency>
```

+### Alternative Option
+
+As an alternative, you can provide `--repositories` and `--packages` artifact
+coordinates to the **spark-submit**, **spark-shell** or **pyspark** commands.
+
+For example:
+
+```sh
+spark-shell \
+  --repositories https://maven.exasol.com/artifactory/exasol-releases \
+  --packages com.exasol:spark-connector_2.11:$LATEST_VERSION
+```
+
+### Deployment
+
+Similarly, you can submit a packaged application to the Spark cluster.
+
+Using spark-submit:
+
+```sh
+spark-submit \
+  --master spark://spark-master-url:7077 \
+  --repositories https://maven.exasol.com/artifactory/exasol-releases \
+  --packages com.exasol:spark-connector_2.11:$LATEST_VERSION \
+  --class com.myorg.SparkExasolConnectorApp \
+  --conf spark.exasol.password=exaTru3P@ss \
+  path/to/project/folder/target/scala-2.11/sparkexasolconnectorapp_2.11-5.3.1.jar
+```
+
+This deployment example also shows that you can configure the Exasol
+parameters at startup using the `--conf spark.exasol.keyName=value` syntax.
+
+**Please update `$LATEST_VERSION` to the latest artifact version number.**

+## Configuration
+
+The following configuration parameters can be provided, mainly to facilitate a
+connection to the Exasol cluster.
+
+| Spark Configuration | Configuration | Default | Description
+| :--- | :--- | :--- | :---
+| | ``query`` | *<none>* | A query string to send to Exasol
+| | ``table`` | *<none>* | A table name (with schema, e.g. my_schema.my_table) to save the dataframe into
+| ``spark.exasol.host`` | ``host`` | ``localhost`` | An IP address of the **first** Exasol node (e.g. 10.0.0.11)
+| ``spark.exasol.port`` | ``port`` | ``8888`` | A port number to connect to Exasol nodes (e.g. 8563)
+| ``spark.exasol.username`` | ``username`` | ``sys`` | An Exasol username for logging in
+| ``spark.exasol.password`` | ``password`` | ``exasol`` | An Exasol password for logging in
+| ``spark.exasol.max_nodes`` | ``max_nodes`` | ``200`` | The number of data nodes in the Exasol cluster
+| ``spark.exasol.batch_size`` | ``batch_size`` | ``1000`` | The number of records to batch before running an execute statement when saving a dataframe
+| ``spark.exasol.create_table`` | ``create_table`` | ``false`` | A permission to create the table if it does not exist in Exasol when saving a dataframe
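
For example, a save that creates the target table when it is missing and
batches 5000 records per insert statement could look like this (an editorial
sketch; host, credentials and table name are placeholders):

```scala
df.write
  .mode("append")
  .format("exasol")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("table", "MY_SCHEMA.NEW_TABLE")
  .option("create_table", "true") // create the table if it does not exist
  .option("batch_size", "5000")   // batch 5000 records per execute
  .save()
```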

## Building and Testing

Clone the repository,
@@ -141,20 +230,6 @@
This creates a jar file under the `target/` folder. The jar file can then be
used with `spark-shell`:

```sh
spark-shell --jars /path/to/spark-exasol-connector-assembly-*.jar
```

-## Configuration
-
-The following configuration parameters can be provided mainly to facilitate a
-connection to Exasol cluster.
-
-| Spark Configuration | Configuration | Default | Description
-| :--- | :--- | :--- | :---
-| | ``query`` | *<none>* | A query string to send to Exasol
-| ``spark.exasol.host`` | ``host`` | ``localhost`` | A host ip address to the **first** Exasol node (e.g. 10.0.0.11)
-| ``spark.exasol.port`` | ``port`` | ``8888`` | A port number to connect to Exasol nodes (e.g. 8563)
-| ``spark.exasol.username`` | ``username`` | ``sys`` | An Exasol username for logging in
-| ``spark.exasol.password`` | ``password`` | ``exasol`` | An Exasol password for logging in
-| ``spark.exasol.max_nodes`` | ``max_nodes`` | ``200`` | The number of data nodes in Exasol cluster

## FAQ

- Getting a `Connection was lost and could not be reestablished` error

@@ -189,10 +264,12 @@
[travis-link]: https://travis-ci.org/exasol/spark-exasol-connector
[codecov-badge]: https://codecov.io/gh/exasol/spark-exasol-connector/branch/master/graph/badge.svg
[codecov-link]: https://codecov.io/gh/exasol/spark-exasol-connector
-[maven-badge]: https://img.shields.io/maven-central/v/com.exasol/spark-connector_2.11.svg
+[maven-img-badge]: https://img.shields.io/maven-central/v/com.exasol/spark-connector_2.11.svg
+[maven-reg-badge]: https://maven-badges.herokuapp.com/maven-central/com.exasol/spark-connector_2.11/badge.svg
[maven-link]: https://maven-badges.herokuapp.com/maven-central/com.exasol/spark-connector_2.11
[exasol]: https://www.exasol.com/en/
[spark]: https://spark.apache.org/
[spark-ds-api]: https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
[docker]: https://www.docker.com/
[exa-docker-db]: https://hub.docker.com/r/exasol/docker-db/
[testcontainers]: https://www.testcontainers.org/