Feature/1556 file access PoC using Hadoop FS API #1586

dk1844 · 2020-11-02T09:05:07Z

File access PoC using Hadoop FS API

SDK S3 approach has been removed completely. In theory, there could be 3 different FileSystems (hdfs/s3 - based on the string path supplied for raw, std, and publish), these FSs are held in the FileSystems case class. Therefore, implicit filesystem passing is now less frequent, because we should be passing the correct FS to work with in different cases (raw vs std vs publish).
The Filesystems are cached and reused (FileSystemUtils.FileSystemExt.toFsUtils).

Test-ran on EMR (EMRFS-backed with having all input/output/intermediates on S3). No changes in output from the previous atum-3.0.0-based version that leveraged sdk-s3 approach.

~~Blocked by: AbsaOSS/atum#44~~

Closes #1556

… is derived form the path internally)

AdrianOlosutean

Code reviwed. Looks good to me

AdrianOlosutean · 2020-11-03T13:13:34Z

spark-jobs/src/main/scala/za/co/absa/enceladus/common/CommonJobExecution.scala

    dao.authenticate()
+
+    implicit val hadoopConf = spark.sparkContext.hadoopConfiguration


I'm thinking if this would make more sense to be a field

field of what? Do you mean a non-implicit parameter of the method prepareJob[T]()?

benedeki

Nice!

utils/src/test/scala/za/co/absa/enceladus/utils/fs/HdfsUtilsSpec.scala

benedeki · 2020-11-04T22:50:58Z

utils/src/main/scala/za/co/absa/enceladus/utils/fs/HadoopFsUtils.scala

@@ -31,10 +28,10 @@ import scala.util.Try
 * A set of functions to help with the date partitioning and version control
 */

-class HdfsUtils(conf: Configuration) extends DistributedFsUtils {
+class HadoopFsUtils()(implicit fs: FileSystem) extends DistributedFsUtils {


Why the class name change? Generates so much changes all around and the old ones doesn't seem bad/wrong?

I was trying to insinuate to the user/reader that this is not bound to the "standard" HDFS - the usually used distributed FS used with Spark, but backed by Hadoop FS API - which can turn out to be HDFS, S3, on anything else.

I know that the stack layering can be confusing, that is roughly what I was trying to demonstrate.

What is your option on this when you know my reasoning? Would you suggest keeping it as is, even though it may be a bit misleading?

utils/src/main/scala/za/co/absa/enceladus/utils/fs/FileSystemUtils.scala

benedeki · 2020-11-04T23:01:52Z

utils/src/main/scala/za/co/absa/enceladus/utils/fs/FileSystemUtils.scala

+
+  /**
+   * Will yeild a [[FileSystem]] for path. If path prefix suggest S3, S3 FS is returned, HDFS otherwise.
+   * @param path


Missing descriptions for parameters - I wonder if it's better to have them empty or not at all.

Thanks, desc added.

benedeki · 2020-11-04T23:18:48Z

...jobs/src/test/scala/za/co/absa/enceladus/conformance/interpreter/ArrayConformanceSuite.scala


-class ArrayConformanceSuite extends AnyFunSuite with SparkTestBase with BeforeAndAfterAll {
+class ArrayConformanceSuite extends AnyFunSuite with SparkTestBase with BeforeAndAfterAll with HadoopFsTestBase {


What about having EnceladusTestBase as a combination of SparkTestBase and HadoopFsTestBase as they are combined so often?

I have checked the numbers, with SparkTestBase is used about 64 times, and in conjunction of with HadoopFsTestBase it is used 16 times. This means, that I could only cover a minority of SparkTestBase encounters. I could change those 16 instances of SparkTestBase with HadoopFsTestBase for EnceladusTestBase.

I actually find this way much more explanatory. With this, you could directly see what is needed (spark, Hadoop FS), for other Enceladus-related suites, there may be other necessary traits, ... Or the other point - I don't think that all Enceladus test suites always need Spark + Hadoop FS.

So I am siding to keeping it as-is. But if you see where I am wrong, go ahead and point it out to me, please.

spark-jobs/src/main/scala/za/co/absa/enceladus/common/config/FileSystems.scala

utils/src/main/scala/za/co/absa/enceladus/utils/fs/FileSystemUtils.scala

spark-jobs/src/main/scala/za/co/absa/enceladus/common/CommonJobExecution.scala

spark-jobs/src/main/scala/za/co/absa/enceladus/conformance/ConformanceExecution.scala

spark-jobs/src/main/scala/za/co/absa/enceladus/standardization/StandardizationExecution.scala

utils/src/main/scala/za/co/absa/enceladus/utils/testUtils/HadoopFsTestBase.scala

PR touch ups:

+ moved to utils to stand be used for PerformanceMetricTools.addJobInfoToAtumMetadata etc.

…ystem) -> readStandardizationInputData(input: PathWithFs)

…OrCreate(fs: FileSystem)` - hadoopFsUtils can only be created this way (from the outside PoV) which is to force cache control.

spark-jobs/src/main/scala/za/co/absa/enceladus/standardization/StandardizationExecution.scala

…non-implicit)

sonarqubecloud · 2020-11-09T09:13:47Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities (and 0 Security Hotspots to review)
1 Code Smell

No Coverage information
0.0% Duplication

Zejnilovic

Code Reviewed

#1422 and 1423 Remove HDFS and Oozie from Menas #1422 Fix HDFS location validation #1424 Add Menas Dockerfile #1416 hadoop-aws 2.8.5 + s3 aws sdk 2.13.65 compiles. #1416 - enceladus on S3: * - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking) # Add menasfargate into hosts # paste # save & exit (ctrl+O, ctrl+X) #1416 - enceladus on S3 - (crude) conformance works on s3 (s3 std input, s3 conf output) * Merge spline 0.5.3 into aws-poc * Update spline to 0.5.4 for AWS PoC #1503 Remove HDFS url Validation * New dockerfile - smaller image * s3 persistence (atum, sdk fs usage, ...) (#1526) #1526 * FsUtils divided into LocalFsUtils & HdfsUtils * PathConfigSuite update * S3FsUtils with tail-recursive pagination accumulation - now generic with optional short-circuit breakOut * TestRunnerJob updated to manually cover the cases - should serve as a basis for tests * HdfsUtils replace by trait DistributedFsUtils (except for MenasCredentials loading & nonSplittable splitting) * using final version of s3-powered Atum (3.0.0) * mockito-update version update, scalatest version update * S3FsUtilsSuite: exists, read, sizeDir(hidden, non-hidden, reucursive), non-splittable (simple, recursive with breakOut), delete (recursive), version find (simple - empty, recursive) * explicit stubbing fix for hyperdrive #1556 file access PoC using Hadoop FS API (#1586) * s3 using hadoop fs api * s3 sdk usage removed (pom, classes, tests) * atum final version 3.1.0 used * readStandardizationInputData(... path: String)(implicit ... fs: FileSystem) -> readStandardizationInputData(input: PathWithFs) #1554 Tomcat with TLS container in Docker container #1554 Added envoy config + enabling running unencrypted container #1499 Add authentication to /lineage + update spline to 0.5.5 #1618 - fixes failing spline 0.5.5 integration by providing compatible commons library version. Test-ran on EMR. (#1619) #1434 Add new way of serving properties to Docker #1622: Merge of aws-poc to develop brach * put back HDFS browser * put back Oozie * downgraded Spline * Scopt 4.0.0 * AWS SDK Exclusion * ATUM version 3.2.2 Co-authored-by: Saša Zejnilović <[email protected]> Co-authored-by: Daniel Kavan <[email protected]> Co-authored-by: Adrian Olosutean <[email protected]> Co-authored-by: Adrian Olosutean <[email protected]> Co-authored-by: Jan Scherbaum <[email protected]>

* 1422 and 1423 Remove HDFS and Oozie from Menas * #1422 Fix HDFS location validation * #1424 Add Menas Dockerfile * #1416 hadoop-aws 2.8.5 + s3 aws sdk 2.13.65 compiles. * #1416 - enceladus on S3: - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking) # Add menasfargate into hosts sudo nano /etc/hosts # paste 20.0.63.69 menasfargate # save & exit (ctrl+O, ctrl+X) # Running standardization works via: spark-submit --class za.co.absa.enceladus.standardization.StandardizationJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --raw-format json --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/stderr.txt * #1416 - enceladus on S3 - (crude) conformance works on s3 (s3 std input, s3 conf output) 0- all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking) # Add menasfargate into hosts sudo nano /etc/hosts # paste 20.0.63.69 menasfargate # save & exit (ctrl+O, ctrl+X) # Running conformance works via: spark-submit --class za.co.absa.enceladus.conformance.DynamicConformanceJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/conf-log.txt * ref issue = 1416 * related test cases ignored (issue reference added) * PR updates * Merge spline 0.5.3 into aws-poc * Update spline to 0.5.4 for AWS PoC * #1503 Remove HDFS url Validation This is a temporary solution. We currently experiment with many forms of URLs, and having a regex there now slows us down. * New dockerfile - smaller image * s3 persistence (atum, sdk fs usage, ...) (#1526) #1526 * FsUtils divided into LocalFsUtils & HdfsUtils * PathConfigSuite update * S3FsUtils with tail-recursive pagination accumulation - now generic with optional short-circuit breakOut TestRunnerJob updated to manually cover the cases - should serve as a basis for tests * HdfsUtils replace by trait DistributedFsUtils (except for MenasCredentials loading & nonSplittable splitting) * using final version of s3-powered Atum (3.0.0) * mockito-update version update, scalatest version update * S3FsUtilsSuite: exists, read, sizeDir(hidden, non-hidden, reucursive), non-splittable (simple, recursive with breakOut), delete (recursive), version find (simple - empty, recursive) * explicit stubbing fix for hyperdrive * Feature/1556 file access PoC using Hadoop FS API (#1586) * s3 using hadoop fs api * s3 sdk usage removed (pom, classes, tests) * atum final version 3.1.0 used * readStandardizationInputData(... path: String)(implicit ... fs: FileSystem) -> readStandardizationInputData(input: PathWithFs) * 1554 Tomcat with TLS in Docker container (#1585) * #1554 Tomcat with TLS container * #1554 Added envoy config + enabling running unencrypted container * #1499 Add authentication to /lineage + update spline to 0.5.5 * #1618 - fixes failing spline 0.5.5 integration by providing compatible commons library version. Test-ran on EMR. (#1619) * #1612 Separation start * #1612 Updated DAO for spark-jobs * #1612 Fixed spline integration and schema, removed redundant code * #1612 Fixed tests, removed unused dependency * #1612 Added back dependency * WIP fixing merge issues * * Merge compiles * Tests pass * Depends on ATUM 3.1.1-SNAPSHOT (the bugfix for AbsaOSS/atum#48) * #1612 Removed Spring from menas-web, enabled building war and static resources. Removed version subpath in menas-web + added theme dependencies in repo * #1612 Cookies + updated lineage * * put back HDFS browser * put back Oozie * downgraded Spline * * AWS SDK Exclusion * #1612 Included HDFSFolder + missing Oozie parts * * New ATUM version * * Adding missing files * #1612 menas-web on nginx container and passing API_URL * #1612 Working TLS on nginx, resources not included in code * 1622: Merge of aws-poc to develop brach * Addressed issues identified by reviewers * * comments improvement * 1434 Add new way of serving properties to Docker * #1612 Building using ui5 + reused /api route * #1612 Project version * #713 Add favicon * #1612 Merges * #1612 pom parent version * #1648 Fix war deployment + adding back spline to menas * #1612 other fixes * #1612 added pom package.json version sync * #1612 newline * #1612 fix version sync + cleaning dist * 1648 merge to develop * 1648 merge fix * 1648 Fixes schema upload * 1648 Fixes schema registry request * 1648 pom version * 1612 add docker build * #601 Swagger 2 PoC * #601 Swagger 2 PoC * #601 Swagger 2 PoC * #1648 Updating menas-web to 3.0 * #1612 Updated npm project versions + mvn plugin * #1612 license_check.yml * #1612 licence check fix Co-authored-by: Saša Zejnilović <[email protected]> Co-authored-by: Daniel Kavan <[email protected]> Co-authored-by: Jan Scherbaum <[email protected]> Co-authored-by: David Benedeki <[email protected]>

dk1844 added 3 commits October 26, 2020 14:08

s3 using hadoop fs api - first attempt to run

a26b4c9

s3 sdk usage removed (pom, classes, tests)

5e7a503

import cleanup, tiny updates

17dcfbc

dk1844 added the work in progress Work on this item is not yet finished (mainly intended for PRs) label Nov 2, 2020

dk1844 requested review from AdrianOlosutean, benedeki, DzMakatun, HuvarVer and lokm01 as code owners November 2, 2020 09:05

dk1844 self-assigned this Nov 2, 2020

dk1844 added 2 commits November 2, 2020 15:02

atum snapshot update (fs not needed for control measurement tracking,…

37f440d

… is derived form the path internally)

cleanup

95f4179

dk1844 removed the work in progress Work on this item is not yet finished (mainly intended for PRs) label Nov 2, 2020

AdrianOlosutean previously approved these changes Nov 3, 2020

View reviewed changes

benedeki reviewed Nov 4, 2020

View reviewed changes

Zejnilovic reviewed Nov 5, 2020

View reviewed changes

atum final version usage

33c7ee2

PR touch ups:

dk1844 dismissed AdrianOlosutean’s stale review via 33c7ee2 November 5, 2020 12:54

dk1844 added 5 commits November 5, 2020 16:36

pr touch-ups

cb48db7

FileSystems are now part of PathConfig

0b21ba5

PathConfigEntry -> PathWithFs

c88bd35

+ moved to utils to stand be used for PerformanceMetricTools.addJobInfoToAtumMetadata etc.

readStandardizationInputData(... path: String)(implicit ... fs: FileS…

768dac3

…ystem) -> readStandardizationInputData(input: PathWithFs)

Infamous FileSystem.toFsUtils is now replaced by `HadoopFsUtils.get…

912678c

…OrCreate(fs: FileSystem)` - hadoopFsUtils can only be created this way (from the outside PoV) which is to force cache control.

dk1844 requested review from AdrianOlosutean, benedeki and Zejnilovic November 6, 2020 14:40

Zejnilovic reviewed Nov 6, 2020

View reviewed changes

spark-jobs/src/main/scala/za/co/absa/enceladus/standardization/StandardizationExecution.scala Outdated Show resolved Hide resolved

PR touchup: convertToSplittable's fsUtils is now a regular paramter (…

1e36265

…non-implicit)

Zejnilovic approved these changes Nov 9, 2020

View reviewed changes

AdrianOlosutean approved these changes Nov 9, 2020

View reviewed changes

dk1844 merged commit 5b5628e into aws-poc Nov 9, 2020

dk1844 deleted the feature/1556-s3-over-hadoop-fs-api branch November 9, 2020 12:44

dk1844 mentioned this pull request Nov 9, 2020

S3 file access PoC using Hadoop FS API #1556

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/1556 file access PoC using Hadoop FS API #1586

Feature/1556 file access PoC using Hadoop FS API #1586

dk1844 commented Nov 2, 2020 •

edited

Loading

AdrianOlosutean left a comment

AdrianOlosutean Nov 3, 2020

dk1844 Nov 5, 2020

benedeki left a comment

benedeki Nov 4, 2020

dk1844 Nov 5, 2020

benedeki Nov 4, 2020

dk1844 Nov 5, 2020

benedeki Nov 4, 2020

dk1844 Nov 5, 2020

sonarqubecloud bot commented Nov 9, 2020

Zejnilovic left a comment

		dao.authenticate()

		implicit val hadoopConf = spark.sparkContext.hadoopConfiguration


		class ArrayConformanceSuite extends AnyFunSuite with SparkTestBase with BeforeAndAfterAll {
		class ArrayConformanceSuite extends AnyFunSuite with SparkTestBase with BeforeAndAfterAll with HadoopFsTestBase {

Feature/1556 file access PoC using Hadoop FS API #1586

Feature/1556 file access PoC using Hadoop FS API #1586

Conversation

dk1844 commented Nov 2, 2020 • edited Loading

AdrianOlosutean left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

benedeki left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sonarqubecloud bot commented Nov 9, 2020

Zejnilovic left a comment

Choose a reason for hiding this comment

dk1844 commented Nov 2, 2020 •

edited

Loading