vdk-core: New feature: StandaloneDataJob #793
Thanks. I think it's heading in the right direction. But first, what would integration with Dagster look like using this? I am asking because I also made some small comments on the code, but we should probably clear up the interfaces and expected usage first.
projects/vdk-core/src/vdk/internal/builtin_plugins/run/noop_datajob.py
Thanks for the quick feedback!
The way we've approached it is to integrate into Dagster's Resource system, which basically stores an instance of the VDK. It looks roughly like this:
The An example
We have a similar set of datawarehouse_resources backed by SQLite that we use during local development & testing. For full details see the internal implementation & discussion at https://gitlab.eng.vmware.com/tanzu-portfolio-insights/tanzu-dm/-/merge_requests/24, specifically:
Agreed. I'll loop back and address the points mentioned above once we've got the high-level structure sorted out.
Sorry for taking my time. I was a bit busy and I wanted to research Dagster a bit more. It will also make it easier to integrate with other frameworks (like Jupyter notebooks, e.g. to be able to execute data jobs through a notebook). That is something I am planning to integrate soon, so I'd like to reuse this. I would love it if we could contribute a dagster-vdk integration library to Dagster. I don't know if it's something you can consider.
Oh btw, there's now a vdk-sqlite plugin: if you install it and set db_default_type = sqlite, you'd get a SQLite-based data job.
That seems like a good long term goal - although I'm not sure when I'd have time to contribute towards that. Since having a way to instantiate a DataJob from code (the subject of this PR) is a necessary building block to enable a |
Most certainly yes. I didn't mean to imply otherwise. It's definitely a separate effort.
To avoid ambiguity: to close the PR, I think we need the following things
We have not yet set up our CI/CD to trigger from a fork, so as long as the tests pass locally (./cicd/build.sh succeeds when run from the vdk-core folder), I think that's enough and I'll merge the change.
projects/vdk-core/src/vdk/internal/builtin_plugins/run/datajob_initializer.py
Still working on this, just got sidetracked a bit. Will hopefully push up some new code early next week.
TODO
That looks good to me. My suggestion is to postpone the refactoring (de-duplication) to a separate PR. We just need some automated tests and we can merge it.
Sounds good. I've been fighting a bit trying to get the build working on a Mac M1 (which doesn't officially support Python 3.7). Should hopefully have something to commit later in the week.
You don't need to use Python 3.7. VDK works with all recent Python versions.
Is there a way you can disable/skip the use of Python 3.7 in the commit process?
Interesting, that seems to be a configuration issue with our pre-commit hooks. I opened a PR to fix it: #826
projects/vdk-core/tests/functional/run/test_run_standalone_data_job.py
It looks good to me. Have you run ./cicd/build.sh locally from the vdk-core folder?
Unfortunately, we have not yet configured CI to run from forks, so I'd need to merge your change manually. I will run the tests myself, but it would be good to get confirmation first.
@tozka Let me address the remaining Code Analysis warnings and squash the commits to make the merge nice and clean. I'll ping you here when that is done.
- Instantiate and execute plugin lifecycle from code rather than via the VDK CLI
- Gives access to an instantiated job_input object
- Can be run without needing any data job files
- Implemented as a contextmanager to reduce API surface area
- Triggers all plugin hooks except:
  * CoreHookSpecs.vdk_command_line

Sample usage:

    with StandaloneDataJobFactory.create(datajob_directory) as job_input:
        #... use job_input object to interact with SuperCollider
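For readers unfamiliar with the pattern, the context-manager design described in the commit message can be illustrated with a minimal, standard-library-only sketch. This is not the vdk-core implementation: FakeJobInput, standalone_job and the event names are hypothetical stand-ins, and the only point is to show how setup runs before the caller's code, a job_input-like object is handed out, and finalization always runs afterwards.

```python
from contextlib import contextmanager
from typing import Iterator, List


class FakeJobInput:
    """Hypothetical stand-in for a job_input object (not the real VDK interface)."""

    def __init__(self) -> None:
        self.queries: List[str] = []

    def execute_query(self, sql: str) -> None:
        # Record the query instead of sending it to a database.
        self.queries.append(sql)


@contextmanager
def standalone_job(events: List[str]) -> Iterator[FakeJobInput]:
    """Run setup, yield a job_input-like object, then always run finalization."""
    events.append("initialize")  # assumption: stands in for initialization hooks
    job_input = FakeJobInput()
    try:
        yield job_input  # the caller's `with` body runs here
    finally:
        events.append("finalize")  # runs even if the caller's code raises


events: List[str] = []
with standalone_job(events) as job_input:
    job_input.execute_query("SELECT 1")
    events.append("user code")
```

Exposing a single `with` entry point means callers cannot forget the finalization step, which is one way to read the "reduce API surface area" motivation above.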
9b34c08 to cd9e4cd
@tozka I've addressed the Code Analysis warnings and squashed the commits into a single commit to ease merging. I've also run
Are you happy to progress to merging this PR?
Yes. I ran all the tests (with your PR) locally a few times and they passed.
I've merged your change. Congrats and thanks for your first contribution (I hope one of many ;) ). The CI has kicked in (https://gitlab.com/vmware-analytics/versatile-data-kit/-/pipelines/536648262); hopefully it will pass and your change will be released automatically.
@tozka Am I correct in deducing that this feature landed in https://pypi.org/project/vdk-core/0.2.536648262/ ?
Yes
The PR contains a WIP implementation of a vdk-core extension that addresses #791
StandaloneDataJob - a context manager that:
- [...] job_input object before it is finalized
Example usage: