
vdk-core: New feature: StandaloneDataJob #793

Merged

Conversation

mrdavidlaing
Contributor

@mrdavidlaing mrdavidlaing commented Apr 7, 2022

The PR contains a WIP implementation of a vdk-core extension that addresses #791

StandaloneDataJob - a context manager that:

  • can be instantiated from code rather than via the VDK CLI
  • can be run without needing any data job files
  • gives access to an instantiated job_input object before it is finalized

Example usage:

 with StandaloneDataJobFactory.create(datajob_directory) as job_input:
      #... use job_input object to interact with SuperCollider
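For readers unfamiliar with the pattern: StandaloneDataJob follows Python's standard context-manager protocol (initialize on enter, finalize on exit). A self-contained sketch of that shape, where StandaloneJobSketch and its attributes are purely illustrative stand-ins for the real VDK internals:

```python
from types import SimpleNamespace


class StandaloneJobSketch:
    """Toy stand-in for StandaloneDataJob: initialize on enter, finalize on exit."""

    def __init__(self, directory):
        self._directory = directory
        self._finalized = False

    def __enter__(self):
        # The real implementation would run VDK's initialize/run hooks here
        # and hand back the resulting job_input object.
        self.job_input = SimpleNamespace(directory=self._directory)
        return self.job_input

    def __exit__(self, exc_type, exc_value, traceback):
        # Finalize hooks always run, even if the body raised.
        self._finalized = True
        return False  # do not swallow exceptions


with StandaloneJobSketch("/tmp/job") as job_input:
    print(job_input.directory)  # the caller works with job_input directly
```

The point of the shape is that callers never touch initialize/finalize explicitly; the `with` block guarantees cleanup.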

@antoniivanov
Collaborator

antoniivanov commented Apr 8, 2022

Thanks. I think it's heading in the right direction.

But first, what would an integration with Dagster look like using this?
If you could post a snippet or some pseudo-code, it would help us picture it better.

I am asking because
a) I want us to make sure this would work as an integration point, and
b) I would like to check whether a similar integration with other tools (like Prefect or Flyte) is possible.

I also made some small comments on the code, but we should probably settle the interfaces and expected usage first.

@mrdavidlaing
Contributor Author

mrdavidlaing commented Apr 8, 2022

Thanks. I think it's heading in the right direction.

Thanks for the quick feedback!

But first, what would an integration with Dagster look like using this? If you could post a snippet or some pseudo-code, it would help us picture it better.

The way we've approached it is to integrate with Dagster's Resource system, which basically stores an instance of the VDK job_input object on an object that Dagster passes to every op() it executes.

It looks roughly like this:

@resource(...snip...)
def supercollider_datawarehouse_resource(init_context):
    datajob_directory = Path(abspath(file_relative_path(__file__, '../')))

    datajob = NoOpStepDataJob(datajob_directory)
    try:
        datajob.initialize_job()
        vdk_job_input = datajob.run_and_return_job_input()

        yield SuperColliderDwhResource(
            vdk_job_input=vdk_job_input,
            ...snip...
        )
    finally:
        datajob.finalize_job()

Internally, SuperColliderDwhResource() is a fairly thin wrapper around vdk_job_input.get_managed_connection(), vdk_job_input.load() and vdk_job_input.send_tabular_data_for_ingestion().

An example op() using the resource looks like this:

@op(
    required_resource_keys={"datawarehouse", "daily_partition_config"}
)
def enrich_dim_org_data_tmc(context, df_fact_entitlements: fact_entitlement.DataFrame, df_dim_deployments: dim_deployment.DataFrame) -> dim_org.DataFrame:

    dim_org_ids = df_fact_entitlements.dim_org_id.append(df_dim_deployments.dim_org_id).unique().tolist()
    org_ids = extract_csp_org_ids_from_dim_org_ids(dim_org_ids)
    df_raw_data = context.resources.datawarehouse.read_sql_query(text("""
        WITH all_tmc_org_entries_for_day AS (
            SELECT
                CONCAT('CSP:', oci.organization_id)      AS dim_org_id 
     ...snip...

We have a similar set of datawarehouse_resources backed by SQLite that we use during local development & testing.
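Stripped of Dagster and VDK specifics, the initialize / yield / finalize shape of the resource above is the standard generator-based resource pattern. A self-contained sketch (FakeJob and all names here are illustrative stand-ins, not real VDK or Dagster APIs):

```python
from contextlib import contextmanager


@contextmanager
def datawarehouse_resource(job_factory):
    """Generic resource shape: set up a job, yield its job_input, always finalize."""
    job = job_factory()
    job.initialize_job()
    try:
        yield job.run_and_return_job_input()
    finally:
        # Mirrors the try/finally in the Dagster @resource above.
        job.finalize_job()


class FakeJob:
    """Illustrative stand-in for a data job; not a real VDK class."""

    def __init__(self):
        self.finalized = False

    def initialize_job(self):
        pass

    def run_and_return_job_input(self):
        return {"connection": "sqlite://"}

    def finalize_job(self):
        self.finalized = True


fake = FakeJob()
with datawarehouse_resource(lambda: fake) as job_input:
    assert job_input["connection"] == "sqlite://"
assert fake.finalized  # cleanup ran even though we never called it explicitly
```

The try/finally around the yield is what guarantees finalize_job() runs even if an op using the resource raises.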

For full details, see the internal implementation & discussion at https://gitlab.eng.vmware.com/tanzu-portfolio-insights/tanzu-dm/-/merge_requests/24

I also made some small comments on the code, but we should probably settle the interfaces and expected usage first.

Agreed. I'll loop back and address the points mentioned above once we've got the high-level structure sorted out.

@antoniivanov
Collaborator

antoniivanov commented Apr 12, 2022

The way we've approached it is to integrate with Dagster's Resource system, which basically stores an instance of the VDK job_input object on an object that Dagster passes to every op() it executes.

Sorry for taking so long. I was a bit busy, and I wanted to research Dagster a bit more.
I like the approach. It's really simple, and it's very much in line with how other, somewhat similar, tools have integrated with Dagster. I took a look at dagster-airbyte, dagster-pyspark and dagster-mlflow.

It will also make it easier to integrate with other frameworks (like Jupyter notebooks, e.g. to be able to execute data jobs through a notebook) - something I am planning to work on soon, so I'd like to re-use this.

I would love it if we could contribute a dagster-vdk integration library to Dagster. I don't know if it's something you could consider.

@antoniivanov
Collaborator

We have a similar set of datawarehouse_resources backed by SQLite that we use during local development & testing.

Oh, by the way, there's now a vdk-sqlite plugin - if you install it and set db_default_type = sqlite, you'd get a SQLite-based data job.
Not sure if it helps here, but I wanted to point it out.

@mrdavidlaing
Contributor Author

I would love it if we could contribute a dagster-vdk integration library to Dagster. I don't know if it's something you could consider.

That seems like a good long-term goal - although I'm not sure when I'd have time to contribute towards that.

Since having a way to instantiate a DataJob from code (the subject of this PR) is a necessary building block for a dagster-vdk library, could we keep this PR focused on that narrower objective?

@antoniivanov
Collaborator

I would love it if we could contribute a dagster-vdk integration library to Dagster. I don't know if it's something you could consider.

That seems like a good long-term goal - although I'm not sure when I'd have time to contribute towards that.

Since having a way to instantiate a DataJob from code (the subject of this PR) is a necessary building block for a dagster-vdk library, could we keep this PR focused on that narrower objective?

Most certainly, yes. I didn't mean to imply otherwise. It's definitely a separate effort.

@antoniivanov
Collaborator

@mrdavidlaing

To avoid ambiguity: to close the PR, I think we need the following things

  1. Finalize the naming discussion
  2. Take care of the comments on the code - the code duplication in particular is something I'd like us to fix.
  3. Create the public interface in vdk.api.
    One simple way to do that is to create vdk/api/data_job.py with something like
class SomeNameWeAgreeOnDataJob:

    def __init__(self):
        from vdk.internal.builtin_plugins.run import DataJobImpl
        self.__job = DataJobImpl()

    def __enter__(self):
        return self.__job.__enter__()

    def __exit__(self, exc_type, exc_value, traceback):
        return self.__job.__exit__(exc_type, exc_value, traceback)

    def get_job_input(self):
        return self.__job.get_job_input()
  4. Tests. I suggest some kind of "functional" test similar to any of those at https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-core/tests/functional

We have not yet set up our CI/CD to trigger from forks. So as long as the tests pass locally (running ./cicd/build.sh from the vdk-core folder succeeds), I think that's enough and I'll merge the change.

@mrdavidlaing
Contributor Author

Still working on this - just got sidetracked a bit.

Will hopefully push up some new code early next week.

@mrdavidlaing
Contributor Author

mrdavidlaing commented Apr 27, 2022

TODO

  • Finalize naming discussion - StandaloneDataJob
  • Create the public interface in vdk.api.
  • Take care of comments on the code - the code duplication in particular is something I'd like us to fix.
  • Tests.

Collaborator

@antoniivanov antoniivanov left a comment


That looks good to me. My suggestion is to postpone the refactoring (de-duplication) to a separate PR. We just need some automated tests, and then we can merge it.

@mrdavidlaing
Contributor Author

That looks good to me. My suggestion is to postpone the refactoring (de-duplication) to a separate PR. We just need some automated tests, and then we can merge it.

Sounds good.

I've been fighting a bit trying to get the build working on a Mac M1 (which doesn't officially support Python 3.7). I should have something to commit later in the week.

@antoniivanov
Collaborator

I've been fighting a bit trying to get the build working on a Mac M1 (which doesn't officially support Python 3.7). I should have something to commit later in the week.

You don't need to use Python 3.7. VDK works with all recent Python versions.

@mrdavidlaing
Contributor Author

Is there a way to disable/skip the use of Python 3.7 in the commit process?

❯ git commit
[INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Initializing environment for https://github.com/jorisroovers/gitlint.
[INFO] Initializing environment for https://github.com/jorisroovers/gitlint:./gitlint-core[trusted-deps].
[INFO] Initializing environment for https://github.com/psf/black.
[INFO] Initializing environment for https://github.com/pycqa/pydocstyle.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-pylint.
[INFO] Initializing environment for https://github.com/asottile/reorder_python_imports.
[INFO] Initializing environment for https://github.com/asottile/pyupgrade.
[INFO] Initializing environment for https://github.com/Lucas-C/pre-commit-hooks.
[INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/psf/black.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
An unexpected error has occurred: CalledProcessError: command: ('/Users/mrdavidlaing/.pyenv/versions/3.8.13/bin/python3.8', '-mvirtualenv', '/Users/mrdavidlaing/.cache/pre-commit/repoej7q_26p/py_env-python3.7', '-p', 'python3.7')
return code: 1
expected return code: 0
stdout:
    RuntimeError: failed to find interpreter for Builtin discover of python_spec='python3.7'
    
stderr: (none)
Check the log at /Users/mrdavidlaing/.cache/pre-commit/pre-commit.log
❯ cat /Users/mrdavidlaing/.cache/pre-commit/pre-commit.log
### version information

pre-commit version: 2.18.1
git --version: git version 2.32.0 (Apple Git-132)
sys.version:
    3.8.13 (default, May  4 2022, 18:37:14) 
    [Clang 13.1.6 (clang-1316.0.21.2.3)]
sys.executable: /Users/mrdavidlaing/.pyenv/versions/3.8.13/bin/python3.8
os.name: posix
sys.platform: darwin

### error information

An unexpected error has occurred: CalledProcessError: command: ('/Users/mrdavidlaing/.pyenv/versions/3.8.13/bin/python3.8', '-mvirtualenv', '/Users/mrdavidlaing/.cache/pre-commit/repoej7q_26p/py_env-python3.7', '-p', 'python3.7')
return code: 1
expected return code: 0
stdout:
    RuntimeError: failed to find interpreter for Builtin discover of python_spec='python3.7'
    
stderr: (none)


Traceback (most recent call last):
  File "/Users/mrdavidlaing/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pre_commit/error_handler.py", line 73, in error_handler
    yield
  File "/Users/mrdavidlaing/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pre_commit/main.py", line 343, in main
    return hook_impl(
  File "/Users/mrdavidlaing/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pre_commit/commands/hook_impl.py", line 237, in hook_impl
    return retv | run(config, store, ns)
  File "/Users/mrdavidlaing/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pre_commit/commands/run.py", line 414, in run
    install_hook_envs(to_install, store)
  File "/Users/mrdavidlaing/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pre_commit/repository.py", line 221, in install_hook_envs
    _hook_install(hook)
  File "/Users/mrdavidlaing/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pre_commit/repository.py", line 79, in _hook_install
    lang.install_environment(
  File "/Users/mrdavidlaing/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pre_commit/languages/python.py", line 202, in install_environment
    cmd_output_b(*venv_cmd, cwd='/')
  File "/Users/mrdavidlaing/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pre_commit/util.py", line 146, in cmd_output_b
    raise CalledProcessError(returncode, cmd, retcode, stdout_b, stderr_b)
pre_commit.util.CalledProcessError: command: ('/Users/mrdavidlaing/.pyenv/versions/3.8.13/bin/python3.8', '-mvirtualenv', '/Users/mrdavidlaing/.cache/pre-commit/repoej7q_26p/py_env-python3.7', '-p', 'python3.7')
return code: 1
expected return code: 0
stdout:
    RuntimeError: failed to find interpreter for Builtin discover of python_spec='python3.7'
    
stderr: (none)

@antoniivanov
Collaborator

Is there a way to disable/skip the use of Python 3.7 in the commit process?

Interesting - that seems to be a configuration issue with our pre-commit hooks. I opened a PR to fix it: #826
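For context (a sketch only - the actual fix is whatever #826 changes): pre-commit resolves each hook's interpreter from the language_version / default_language_version settings in .pre-commit-config.yaml, so an error like the one above typically means a hook pins python3.7. Loosening such a pin might look like:

```yaml
# Hypothetical .pre-commit-config.yaml fragment - not the actual contents of #826.
# Instead of pinning an interpreter that may not exist on contributors' machines:
#   default_language_version:
#     python: python3.7
# let pre-commit use whatever python3 is available:
default_language_version:
  python: python3
```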

@mrdavidlaing
Contributor Author

@tozka With 54f4a09 done, I'm happy that we have a basic set of functional tests. Are there any other tests you'd like to see added?

Collaborator

@antoniivanov antoniivanov left a comment


It looks good to me. Have you run ./cicd/build.sh locally from the vdk-core folder?

We have not yet configured CI to run from forks, unfortunately, so I'd need to merge your change manually. I will run the tests myself, but it would be good to get confirmation first.

@mrdavidlaing
Contributor Author

@tozka Let me address the remaining Code Analysis warnings and squash the commits to make the merge nice and clean.

I'll ping you here when that is done

- Instantiate and execute plugin lifecycle from code rather than via the VDK CLI
- Gives access to an instantiated job_input object
- Can be run without needing any data job files
- Implemented as a context manager to reduce API surface area
- Triggers all plugin hooks except:
        * CoreHookSpecs.vdk_command_line

Sample usage:

    with StandaloneDataJobFactory.create(datajob_directory) as job_input:
        #... use job_input object to interact with SuperCollider
@mrdavidlaing force-pushed the 785-instantiate_job_input_from_code branch from 9b34c08 to cd9e4cd on May 11, 2022 at 13:35
@mrdavidlaing changed the title from "WIP: vdk-core: New feature: NoOpDataJob" to "vdk-core: New feature: StandaloneDataJob" on May 11, 2022
@mrdavidlaing
Contributor Author

mrdavidlaing commented May 11, 2022

@tozka I've addressed the Code Analysis warnings and squashed the commits into a single commit to ease merging.

I've also run projects/vdk-core/cicd/build.sh on my local machine; it seemed to succeed barring the failure below, which I don't think is related to the changes in this PR:

============================================================================== short test summary info ===============================================================================
FAILED tests/functional/run/test_run_sql_queries.py::test_run_dbapi_connection_no_such_db_type - assert 'VdkConfigurationError' in '2022-05-11 14:34:25,169 [VDK] simple-create-ins...

Results (42.93s):
     253 passed
       1 failed
         - tests/functional/run/test_run_sql_queries.py:72 test_run_dbapi_connection_no_such_db_type

Are you happy to progress to merging this PR?

@antoniivanov
Collaborator

@tozka I've addressed the Code Analysis warnings and squashed the commits into a single commit to ease merging.

I've also run projects/vdk-core/cicd/build.sh on my local machine; it seemed to succeed barring the failure below, which I don't think is related to the changes in this PR:

============================================================================== short test summary info ===============================================================================
FAILED tests/functional/run/test_run_sql_queries.py::test_run_dbapi_connection_no_such_db_type - assert 'VdkConfigurationError' in '2022-05-11 14:34:25,169 [VDK] simple-create-ins...

Results (42.93s):
     253 passed
       1 failed
         - tests/functional/run/test_run_sql_queries.py:72 test_run_dbapi_connection_no_such_db_type

Are you happy to progress to merging this PR?

Yes.

I ran all the tests (with your PR) locally a few times and they passed.

@antoniivanov antoniivanov merged commit 21a9163 into vmware:main May 11, 2022
@antoniivanov
Collaborator

I've merged your change. Congrats and thanks for your first contribution (I hope the first of many ;) )

The CI has kicked in - https://gitlab.com/vmware-analytics/versatile-data-kit/-/pipelines/536648262 - hopefully it will pass and your change will be automatically released.

@mrdavidlaing
Contributor Author

@tozka Am I correct in deducing that this feature landed in https://pypi.org/project/vdk-core/0.2.536648262/ ?

@antoniivanov
Collaborator

@tozka Am I correct in deducing that this feature landed in https://pypi.org/project/vdk-core/0.2.536648262/ ?

Yes
