#1465 Updated readme and created combined job helper scripts
Adrian-Olosutean committed Jul 30, 2020
1 parent 946ccec commit 0e78f95
Showing 3 changed files with 80 additions and 2 deletions.
45 changes: 43 additions & 2 deletions README.md
@@ -39,6 +39,9 @@ This is a Spark job which reads an input dataset in any of the supported formats
### Conformance
This is a Spark job which **applies the Menas-specified conformance rules to the standardized dataset**.

### Standardization and Conformance
This is a Spark job which executes Standardization and Conformance together.

## How to build
#### Build requirements:
- **Maven 3.5.4+**
@@ -137,9 +140,31 @@ password=changeme
--report-date <date> \
--report-version <data_run_version>
```

#### Running Standardization and Conformance together
```
<spark home>/spark-submit \
--num-executors <num> \
--executor-memory <num>G \
--master yarn \
--deploy-mode <client/cluster> \
--driver-cores <num> \
--driver-memory <num>G \
--conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=<menas_api_uri:port> -Dstandardized.hdfs.path=<path_for_standardized_output>-{0}-{1}-{2}-{3} -Dspline.mongodb.url=<mongo_url_for_spline> -Dspline.mongodb.name=<spline_database_name> -Dhdp.version=<hadoop_version>" \
--class za.co.absa.enceladus.standardization_conformance.StandardizationAndConformanceJob \
<spark-jobs_<build_version>.jar> \
--menas-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version> \
--raw-format <data_format> \
--row-tag <tag>
```

* In case Menas is configured for in-memory authentication (e.g. in dev environments), replace `--menas-auth-keytab` with `--menas-credentials-file`.
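The flag swap can also be scripted. Below is a hypothetical sketch of how a launcher might pick the right option from an environment switch; `MENAS_AUTH_MODE`, `build_auth_args`, and both file paths are illustrative and not part of Enceladus itself:

```shell
#!/bin/bash
# Hypothetical sketch: choose the Menas auth flag from an environment switch.
# MENAS_AUTH_MODE, build_auth_args, and the paths below are illustrative only.
build_auth_args() {
  if [ "${MENAS_AUTH_MODE:-keytab}" = "inmemory" ]; then
    echo "--menas-credentials-file ${MENAS_CREDENTIALS_FILE}"
  else
    echo "--menas-auth-keytab ${MENAS_KEYTAB_FILE}"
  fi
}

# Dev environment: Menas uses in-memory authentication.
MENAS_AUTH_MODE=inmemory
MENAS_CREDENTIALS_FILE=/etc/enceladus/menas-credentials.properties
build_auth_args
```

The defaulting expansion `${MENAS_AUTH_MODE:-keytab}` keeps keytab authentication as the fallback when the switch is unset.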

#### Helper scripts for running Standardization, Conformance, or both together

The scripts in the `scripts` folder can be used to simplify the command lines for running Standardization and Conformance jobs.

@@ -179,6 +204,20 @@ The basic command to run Conformance becomes:
--report-version <data_run_version>
```

The basic command to run Standardization and Conformance combined becomes:
```
<path to scripts>/run_standardization_conformance.sh \
--num-executors <num> \
--deploy-mode <client/cluster> \
--menas-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version> \
--raw-format <data_format> \
--row-tag <tag>
```

The list of options for configuring Spark deployment mode in Yarn and resource specification:

| Option | Description |
@@ -197,7 +236,7 @@
For more information on these options see the official documentation on running Spark on Yarn:
[https://spark.apache.org/docs/latest/running-on-yarn.html](https://spark.apache.org/docs/latest/running-on-yarn.html)

The list of all options for running Standardization, Conformance, and the combined Standardization and Conformance job:

| Option | Description |
|---------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -239,6 +278,8 @@ The list of additional options available for running Conformance:
| --catalyst-workaround **true/false** | Turns on (`true`) or off (`false`) the workaround for a Catalyst optimizer issue. It is `true` by default. Turn this off only if you encounter timing freeze issues when running Conformance. |
| --autoclean-std-folder **true/false** | If `true`, the standardized folder will be cleaned automatically after successful execution of a Conformance job. |

All the additional options for Standardization and Conformance can also be specified when running the combined StandardizationAndConformance job.
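For illustration, a combined run that also sets the Conformance-specific switches from the table above could look like this (placeholders as in the earlier examples; the flag values shown are only an example, not recommended settings):

```
<path to scripts>/run_standardization_conformance.sh \
--num-executors <num> \
--deploy-mode <client/cluster> \
--menas-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version> \
--raw-format <data_format> \
--catalyst-workaround true \
--autoclean-std-folder true
```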

## Plugins

Standardization and Conformance support plugins that allow executing additional actions at certain points during the computation.
2 changes: 2 additions & 0 deletions scripts/bash/enceladus_env.template.sh
@@ -71,6 +71,8 @@ CONF_DEFAULT_DRA_MIN_EXECUTORS=0
CONF_DEFAULT_DRA_ALLOCATION_RATIO=0.5
CONF_DEFAULT_ADAPTIVE_TARGET_POSTSHUFFLE_INPUT_SIZE=134217728

STD_CONF_CLASS="za.co.absa.enceladus.standardization_conformance.StandardizationAndConformanceJob"

DEFAULT_DEPLOY_MODE="client"

LOG_DIR="/tmp"
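The wrapper scripts pick this class up by sourcing the generated env file. A minimal, self-contained sketch of that template/source pattern follows; the temp file stands in for the real `enceladus_env.sh` generated from `enceladus_env.template.sh`, and the final `echo` is only a placeholder for the actual submission logic:

```shell
#!/bin/bash
# Sketch of the template/source pattern: a generated env file holds job
# constants, and every wrapper script sources it and exports CLASS for
# the shared runner. File layout here is illustrative, not the real one.
ENV_FILE="$(mktemp)"
cat > "${ENV_FILE}" <<'EOF'
STD_CONF_CLASS="za.co.absa.enceladus.standardization_conformance.StandardizationAndConformanceJob"
DEFAULT_DEPLOY_MODE="client"
EOF

source "${ENV_FILE}"
export CLASS="${STD_CONF_CLASS}"

echo "Submitting ${CLASS} in ${DEFAULT_DEPLOY_MODE} mode"
rm -f "${ENV_FILE}"
```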
35 changes: 35 additions & 0 deletions scripts/bash/run_standardization_conformance.sh
@@ -0,0 +1,35 @@
#!/bin/bash

# Copyright 2018 ABSA Group Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

SRC_DIR="$(dirname "$0")"

source "${SRC_DIR}/enceladus_env.sh"

export CLASS=${STD_CONF_CLASS}

export DEFAULT_DRIVER_MEMORY="$STD_DEFAULT_DRIVER_MEMORY"
export DEFAULT_DRIVER_CORES="$STD_DEFAULT_DRIVER_CORES"
export DEFAULT_EXECUTOR_MEMORY="$STD_DEFAULT_EXECUTOR_MEMORY"
export DEFAULT_EXECUTOR_CORES="$STD_DEFAULT_EXECUTOR_CORES"
export DEFAULT_NUM_EXECUTORS="$STD_DEFAULT_NUM_EXECUTORS"

export DEFAULT_DRA_ENABLED="$STD_DEFAULT_DRA_ENABLED"

export DEFAULT_DRA_MIN_EXECUTORS="$STD_DEFAULT_DRA_MIN_EXECUTORS"
export DEFAULT_DRA_MAX_EXECUTORS="$STD_DEFAULT_DRA_MAX_EXECUTORS"
export DEFAULT_DRA_ALLOCATION_RATIO="$STD_DEFAULT_DRA_ALLOCATION_RATIO"
export DEFAULT_ADAPTIVE_TARGET_POSTSHUFFLE_INPUT_SIZE="$STD_DEFAULT_ADAPTIVE_TARGET_POSTSHUFFLE_INPUT_SIZE"

source "${SRC_DIR}/run_enceladus.sh"
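The script above works by exporting the Standardization defaults (`STD_*`) under the generic names the shared runner reads, then delegating to it. A condensed, runnable sketch of that export-and-delegate mechanism; the values are examples rather than real defaults, and `echo` stands in for `run_enceladus.sh`, whose logic is not reproduced here:

```shell
#!/bin/bash
# Sketch of the export-and-delegate pattern used by the wrapper scripts:
# job-specific defaults (STD_*) are mapped onto the generic names that the
# shared runner consumes. Values below are examples, not real defaults.
STD_DEFAULT_DRIVER_MEMORY="2"
STD_DEFAULT_NUM_EXECUTORS="4"

export DEFAULT_DRIVER_MEMORY="$STD_DEFAULT_DRIVER_MEMORY"
export DEFAULT_NUM_EXECUTORS="$STD_DEFAULT_NUM_EXECUTORS"

# In the real script this step is: source "${SRC_DIR}/run_enceladus.sh"
echo "spark-submit --driver-memory ${DEFAULT_DRIVER_MEMORY}G --num-executors ${DEFAULT_NUM_EXECUTORS} ..."
```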
