Updated developer guide
Fixes #162
morazow committed Jul 7, 2023
1 parent 51a40ed commit c02f974
Showing 2 changed files with 21 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -58,6 +58,9 @@ tmp
.project
.scala_dependencies
*.sc
**/.settings/org.eclipse.core.resources.prefs
**/.settings/org.eclipse.jdt.apt.core.prefs
**/.settings/org.eclipse.m2e.core.prefs

# Ensime
.ensime
18 changes: 18 additions & 0 deletions doc/development/developer_guide.md
@@ -14,6 +14,24 @@ userProvidedS3Bucket/

The generated intermediate write path `<UUID>-<SparkApplicationId>/<SparkQueryId>/` is validated to be empty before the write and is cleaned up after the write query finishes.
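
For illustration, the staging path for one write query could be built like this minimal sketch (`intermediateWritePath` is a hypothetical helper, not the connector's actual code):

```scala
import java.util.UUID

// Hypothetical helper mirroring the layout described above:
//   userProvidedS3Bucket/<UUID>-<SparkApplicationId>/<SparkQueryId>/
def intermediateWritePath(bucket: String, appId: String, queryId: String): String =
  s"$bucket/${UUID.randomUUID()}-$appId/$queryId/"
```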

## S3 Staging Commit Process

The Spark job that writes data to Exasol uses an AWS S3 bucket as intermediate storage. In this process, the `ExasolS3Table` API implementation uses the Spark [`CSVTable`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVTable.scala) writer to create files in S3.

The write process proceeds as follows:

1. We ask Spark's `CSVTable` to commit the data into the S3 bucket
1. We import this data into the Exasol database using Exasol's `CSV` loader
1. Finally, we ask our `ExasolS3Table` API implementation to commit the write process

If any step fails, its `abort` method is triggered and the S3 bucket locations are cleaned up. If the job finishes successfully, the Spark job end listener triggers the cleanup process.
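
The flow above can be pictured as a `BatchWrite` that delegates to the CSV writer and then runs the Exasol import. The following is a minimal sketch against Spark's DataSource V2 interfaces; `DelegatingBatchWrite`, `runExasolImport`, and `cleanUpS3Path` are hypothetical names, not the connector's actual classes:

```scala
import org.apache.spark.sql.connector.write.{BatchWrite, DataWriterFactory, PhysicalWriteInfo, WriterCommitMessage}

// Hypothetical wrapper illustrating the three-step commit described above.
class DelegatingBatchWrite(csvWrite: BatchWrite, s3Path: String) extends BatchWrite {

  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory =
    csvWrite.createBatchWriterFactory(info) // tasks write CSV files into S3

  override def commit(messages: Array[WriterCommitMessage]): Unit = {
    csvWrite.commit(messages)  // step 1: CSVTable commits the files in S3
    runExasolImport(s3Path)    // step 2: IMPORT the staged CSV files into Exasol
  }                            // step 3: returning normally commits our write

  override def abort(messages: Array[WriterCommitMessage]): Unit = {
    csvWrite.abort(messages)   // a failure in any step ends up here ...
    cleanUpS3Path(s3Path)      // ... and the S3 locations are cleaned up
  }

  // Placeholders for the Exasol CSV loader call and the S3 cleanup.
  private def runExasolImport(path: String): Unit = ???
  private def cleanUpS3Path(path: String): Unit = ???
}
```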

## S3 Maximum Number of Files

For write Spark jobs, we allow a maximum of `1000` CSV files to be written as intermediate data into the S3 bucket. The main reason for this is that the S3 SDK `listObjects` call returns at most 1000 objects from a bucket path per request.

Even though we could improve this by listing more objects from the S3 bucket with multiple requests, we keep this threshold for now.
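
For reference, listing more than 1000 objects only requires following the continuation tokens across requests. A minimal sketch, assuming the AWS SDK for Java v2 (where `listObjectsV2Paginator` issues the follow-up requests transparently):

```scala
import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request

// Counts all objects under a prefix by following continuation tokens,
// so the result is not capped at the 1000-object page size.
def countObjects(s3: S3Client, bucket: String, prefix: String): Long = {
  val request = ListObjectsV2Request.builder().bucket(bucket).prefix(prefix).build()
  s3.listObjectsV2Paginator(request) // one page (up to 1000 keys) per request
    .iterator()
    .asScala
    .map(_.keyCount().toLong)
    .sum
}
```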

## Integration Tests

The integration tests are run using [Docker](https://www.docker.com) and [exasol-testcontainers](https://github.com/exasol/exasol-testcontainers/).
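
A minimal sketch of a test setup with exasol-testcontainers (the `SELECT 1` query is only illustrative):

```scala
import com.exasol.containers.ExasolContainer

// Starts a disposable Exasol instance in Docker for the duration of a test.
val exasol: ExasolContainer[_] = new ExasolContainer()
exasol.start()
try {
  // createConnection comes from testcontainers' JdbcDatabaseContainer base class.
  val connection = exasol.createConnection("")
  val result = connection.createStatement().executeQuery("SELECT 1")
  // ... assertions against the result set go here ...
} finally {
  exasol.stop()
}
```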
