From c02f974524adf64b72086e37c92c051e26afdec0 Mon Sep 17 00:00:00 2001
From: morazow
Date: Fri, 7 Jul 2023 10:35:19 +0200
Subject: [PATCH 1/2] Updated developer guide

Fixes #162

---
 .gitignore                         |  3 +++
 doc/development/developer_guide.md | 18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/.gitignore b/.gitignore
index 46b8a441..051f2f4e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -58,6 +58,9 @@ tmp
 .project
 .scala_dependencies
 *.sc
+**/.settings/org.eclipse.core.resources.prefs
+**/.settings/org.eclipse.jdt.apt.core.prefs
+**/.settings/org.eclipse.m2e.core.prefs

 # Ensime
 .ensime
diff --git a/doc/development/developer_guide.md b/doc/development/developer_guide.md
index 41fa3411..b9ba50f9 100644
--- a/doc/development/developer_guide.md
+++ b/doc/development/developer_guide.md
@@ -14,6 +14,24 @@ userProvidedS3Bucket/

 The generated intermediate write path `-//` is validated to be empty before the write, and it is cleaned up after the write query finishes.

+## S3 Staging Commit Process
+
+The Spark job that writes data to Exasol uses an AWS S3 bucket as intermediate storage. In this process, the `ExasolS3Table` API implementation uses the Spark [`CSVTable`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVTable.scala) writer to create files in S3.
+
+The write process proceeds as follows:
+
+1. We ask Spark's `CSVTable` to commit the data into the S3 bucket
+1. We import this data into the Exasol database using Exasol's `CSV` loader
+1. Finally, we ask our `ExasolS3Table` API implementation to commit the write process
+
+If any failure occurs, each step triggers the `abort` method and the S3 bucket locations are cleaned up. If the job finishes successfully, the Spark job end listener triggers the cleanup process.
+
+## S3 Maximum Number of Files
+
+For write Spark jobs, we allow a maximum of `1000` CSV files to be written as intermediate data into the S3 bucket. The main reason for this is that the S3 SDK `listObjects` command returns up to 1000 objects from a bucket path per request.
+
+Even though we could improve this by listing more objects from the S3 bucket with multiple requests, we keep this threshold for now.
+
 ## Integration Tests

 The integration tests are run using [Docker](https://www.docker.com) and [exasol-testcontainers](https://github.com/exasol/exasol-testcontainers/)

From e52d2410c3b1a835e47cfdbd017ca8acac1b2b6d Mon Sep 17 00:00:00 2001
From: Muhammet Orazov <916295+morazow@users.noreply.github.com>
Date: Fri, 7 Jul 2023 11:34:18 +0200
Subject: [PATCH 2/2] Apply suggestions from code review

Co-authored-by: Christoph Pirkl

---
 doc/development/developer_guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/development/developer_guide.md b/doc/development/developer_guide.md
index b9ba50f9..b741c521 100644
--- a/doc/development/developer_guide.md
+++ b/doc/development/developer_guide.md
@@ -16,7 +16,7 @@ The generated intermediate write path `-/
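
The three-step staging commit described in the patch above can be pictured with Spark's DataSource V2 write hooks, where `commit` and `abort` on a `BatchWrite` drive the flow. Below is a minimal, illustrative sketch of such a delegating write; the class and helper names (`DelegatingS3BatchWrite`, `importIntoExasol`, `cleanUpS3Path`) are assumptions for illustration, not the connector's actual implementation.

```java
import org.apache.spark.sql.connector.write.BatchWrite;
import org.apache.spark.sql.connector.write.DataWriterFactory;
import org.apache.spark.sql.connector.write.PhysicalWriteInfo;
import org.apache.spark.sql.connector.write.WriterCommitMessage;

// Sketch only: delegates the CSV write to Spark and layers the Exasol import
// and S3 cleanup on top of it. Names are hypothetical.
final class DelegatingS3BatchWrite implements BatchWrite {
    private final BatchWrite csvBatchWrite; // Spark's CSV writer targeting the S3 staging path

    DelegatingS3BatchWrite(final BatchWrite csvBatchWrite) {
        this.csvBatchWrite = csvBatchWrite;
    }

    @Override
    public DataWriterFactory createBatchWriterFactory(final PhysicalWriteInfo info) {
        return this.csvBatchWrite.createBatchWriterFactory(info);
    }

    @Override
    public void commit(final WriterCommitMessage[] messages) {
        this.csvBatchWrite.commit(messages); // step 1: CSV files become visible in the S3 bucket
        importIntoExasol();                  // step 2: run Exasol's CSV import from the staging path
        // step 3: the surrounding write commits; a job-end listener later removes the staging files
    }

    @Override
    public void abort(final WriterCommitMessage[] messages) {
        this.csvBatchWrite.abort(messages);  // undo the CSV write on failure
        cleanUpS3Path();                     // remove any staged files from the bucket
    }

    private void importIntoExasol() {
        // hypothetical helper: issue the Exasol CSV import over JDBC
    }

    private void cleanUpS3Path() {
        // hypothetical helper: delete objects under the intermediate write prefix
    }
}
```

Ordering the Exasol import before the final commit keeps a failed import on the abort path, so the staging prefix is cleaned up either way.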
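
The `1000`-file threshold mentioned in the patch mirrors the per-request listing limit of S3: a single list call returns at most 1000 keys. The following rough sketch shows how more objects could be listed with multiple requests using the AWS SDK for Java v2 paginator; the guide itself refers to the SDK's `listObjects` call, and the class and method names here (`StagingFileLister`, `listStagedKeys`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

// Sketch only: lists every staged CSV object under the intermediate write prefix.
public final class StagingFileLister {

    static List<String> listStagedKeys(final S3Client s3, final String bucket, final String prefix) {
        final ListObjectsV2Request request = ListObjectsV2Request.builder()
                .bucket(bucket)
                .prefix(prefix)
                .build();
        final List<String> keys = new ArrayList<>();
        // A single ListObjectsV2 call returns at most 1000 keys; the paginator
        // transparently follows continuation tokens when more objects exist.
        for (final S3Object object : s3.listObjectsV2Paginator(request).contents()) {
            keys.add(object.key());
        }
        return keys;
    }
}
```

The paginator only issues follow-up requests when a listing is truncated, so behaviour for up to 1000 staged files stays a single request.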