Updated developer guide
Fixes #162
morazow committed Jul 7, 2023
1 parent 51a40ed commit c02f974
Showing 2 changed files with 21 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -58,6 +58,9 @@ tmp
.project
.scala_dependencies
*.sc
**/.settings/org.eclipse.core.resources.prefs
**/.settings/org.eclipse.jdt.apt.core.prefs
**/.settings/org.eclipse.m2e.core.prefs

# Ensime
.ensime
18 changes: 18 additions & 0 deletions doc/development/developer_guide.md
@@ -14,6 +14,24 @@ userProvidedS3Bucket/

The generated intermediate write path `<UUID>-<SparkApplicationId>/<SparkQueryId>/` is validated to be empty before the write and is cleaned up after the write query finishes.
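
For illustration, the staging path for one write query could be built like this minimal sketch (`intermediateWritePath` is a hypothetical helper, not the connector's actual code):

```scala
import java.util.UUID

// Hypothetical helper mirroring the layout described above:
//   userProvidedS3Bucket/<UUID>-<SparkApplicationId>/<SparkQueryId>/
def intermediateWritePath(bucket: String, appId: String, queryId: String): String =
  s"$bucket/${UUID.randomUUID()}-$appId/$queryId/"
```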

## S3 Staging Commit Process

The Spark job that writes data to Exasol uses an AWS S3 bucket as intermediate storage. In this process, the `ExasolS3Table` API implementation uses the Spark [`CSVTable`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVTable.scala) writer to create files in S3.

The write process proceeds as follows:

1. We ask Spark's `CSVTable` to commit the data into the S3 bucket
1. We import this data into the Exasol database using Exasol's `CSV` loader
1. Finally, we ask our `ExasolS3Table` API implementation to commit the write process

If any step fails, its `abort` method is triggered and the S3 bucket locations are cleaned up. If the job finishes successfully, the Spark job end listener triggers the cleanup process.
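
The flow above can be pictured as a `BatchWrite` that delegates to the CSV writer and then runs the Exasol import. The following is a minimal sketch against Spark's DataSource V2 interfaces; `DelegatingBatchWrite`, `runExasolImport`, and `cleanUpS3Path` are hypothetical names, not the connector's actual classes:

```scala
import org.apache.spark.sql.connector.write.{BatchWrite, DataWriterFactory, PhysicalWriteInfo, WriterCommitMessage}

// Hypothetical wrapper illustrating the three-step commit described above.
class DelegatingBatchWrite(csvWrite: BatchWrite, s3Path: String) extends BatchWrite {

  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory =
    csvWrite.createBatchWriterFactory(info) // tasks write CSV files into S3

  override def commit(messages: Array[WriterCommitMessage]): Unit = {
    csvWrite.commit(messages)  // step 1: CSVTable commits the files in S3
    runExasolImport(s3Path)    // step 2: IMPORT the staged CSV files into Exasol
  }                            // step 3: returning normally commits our write

  override def abort(messages: Array[WriterCommitMessage]): Unit = {
    csvWrite.abort(messages)   // a failure in any step ends up here ...
    cleanUpS3Path(s3Path)      // ... and the S3 locations are cleaned up
  }

  // Placeholders for the Exasol CSV loader call and the S3 cleanup.
  private def runExasolImport(path: String): Unit = ???
  private def cleanUpS3Path(path: String): Unit = ???
}
```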

## S3 Maximum Number of Files

For write Spark jobs, we allow a maximum of `1000` CSV files to be written as intermediate data into the S3 bucket. The main reason for this is that the S3 SDK `listObjects` call returns at most 1000 objects from a bucket path per request.

Even though we could improve this by listing more objects from the S3 bucket with multiple requests, we keep this threshold for now.
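
For reference, listing more than 1000 objects only requires following the continuation tokens across requests. A minimal sketch, assuming the AWS SDK for Java v2 (where `listObjectsV2Paginator` issues the follow-up requests transparently):

```scala
import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request

// Counts all objects under a prefix by following continuation tokens,
// so the result is not capped at the 1000-object page size.
def countObjects(s3: S3Client, bucket: String, prefix: String): Long = {
  val request = ListObjectsV2Request.builder().bucket(bucket).prefix(prefix).build()
  s3.listObjectsV2Paginator(request) // one page (up to 1000 keys) per request
    .iterator()
    .asScala
    .map(_.keyCount().toLong)
    .sum
}
```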

## Integration Tests

The integration tests are run using [Docker](https://www.docker.com) and [exasol-testcontainers](https://github.com/exasol/exasol-testcontainers/).
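
A minimal sketch of a test setup with exasol-testcontainers (the `SELECT 1` query is only illustrative):

```scala
import com.exasol.containers.ExasolContainer

// Starts a disposable Exasol instance in Docker for the duration of a test.
val exasol: ExasolContainer[_] = new ExasolContainer()
exasol.start()
try {
  // createConnection comes from testcontainers' JdbcDatabaseContainer base class.
  val connection = exasol.createConnection("")
  val result = connection.createStatement().executeQuery("SELECT 1")
  // ... assertions against the result set go here ...
} finally {
  exasol.stop()
}
```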
