Adding Resize Spark #630

Merged
merged 4 commits into from
Sep 27, 2024

blublinsky
Collaborator

Why are these changes needed?

Adding additional transforms to Spark pipeline

Related issue number (if any).

#586

blublinsky requested a review from daw3rd on September 26, 2024, 09:15
@@ -0,0 +1,44 @@
ARG BASE_IMAGE=quay.io/dataprep1/data-prep-kit/data-prep-kit-spark-3.5.2:0.2.1.dev0
Member

0.2.1.dev0 is no longer used in dev. To be consistent with the other Spark transforms (a recent change), use latest as the tag. Note, however, that this is generally overridden from the Makefile anyway by setting BASE_IMAGE when docker build is called. Still, for consistency, it would be nice to change it to latest.

Collaborator Author

done
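
For reference, a minimal sketch of the pattern under discussion, not the PR's actual Dockerfile: the `ARG` line supplies a default base image, and a `--build-arg` passed to `docker build` replaces it. The `latest` tag follows the suggestion above; the `FROM ${BASE_IMAGE}` line is an assumption about how the argument is consumed.

```dockerfile
# Default base image, used only when the build does not pass BASE_IMAGE.
# The "latest" tag follows the review suggestion above.
ARG BASE_IMAGE=quay.io/dataprep1/data-prep-kit/data-prep-kit-spark-3.5.2:latest
# Assumed: the argument is consumed by FROM, which Docker allows for an ARG declared before it.
FROM ${BASE_IMAGE}
```

With a file like this, running `docker build --build-arg BASE_IMAGE=<image> .` would build from `<image>` instead of the default, which is why the tag on the `ARG` line matters mostly for consistency.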


# set the version of python transform that this depends on.
set-versions:
	$(MAKE) TRANSFORM_PYTHON_VERSION=${NOOP_PYTHON_VERSION} TOML_VERSION=$(NOOP_SPARK_VERSION) .transforms.set-versions
Member

NOOP -> RESIZE?

Collaborator Author

Done

The set of dictionary keys holding [BlockListTransform](src/blocklist_transform.py)
configuration for values are as follows:

* _max_rows_per_table_ - specifies max documents per table
Member

To better future-proof this file, shouldn't it defer to the Python README for configuration and CLI options?

daw3rd merged commit 49ebd51 into dev on Sep 27, 2024
6 checks passed