
Added upstream wheel uploads for Databricks Workspaces without Public Internet access #99

Merged · 7 commits · May 12, 2024

Conversation

@aminmovahed-db (Contributor) commented May 4, 2024

The output of the build wheel function will now be a list of all the dependent wheel packages. The upload functions get a new flag as input to either include the dependencies in the upload or limit it to the main wheel package.

Relates to databrickslabs/ucx#573

codecov bot commented May 4, 2024

Codecov Report

Attention: Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.

Project coverage is 78.72%. Comparing base (c959367) to head (8445066).

| Files | Patch % | Lines |
| --- | --- | --- |
| src/databricks/labs/blueprint/wheels.py | 95.45% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main      #99      +/-   ##
==========================================
+ Coverage   78.49%   78.72%   +0.23%     
==========================================
  Files          14       14              
  Lines        1479     1495      +16     
  Branches      263      269       +6     
==========================================
+ Hits         1161     1177      +16     
  Misses        230      230              
  Partials       88       88              
```

☔ View full report in Codecov by Sentry.

```python
self._installation.save(Version(self._product_info.version(), remote_wheel, self._now_iso()))
return remote_wheel
for wheel in self._local_wheel:
    if force_dependencies or self._product_info.product_name() in wheel.name:
```
Collaborator:

not breaking existing code

What if a dependency has the product name in it? Also, there are no strict guarantees that the product name will be part of the wheel name - unlikely, but possible, and it'll break this code.

Given the fact that we must not break the existing installation process because of this change, we'll have to add `upload_upstream_dependencies_to_wsfs() -> list[str]` to return the adjusted list of wheel dependencies. I've been researching the way of getting the current wheel package name, but it's either not trivial or not portable without additional pip dependencies. E.g. it's not deterministic from this list to see the name of the current package, let's say, `databricks_labs_ucx`:

```
PyYAML-6.0.1-cp310-cp310-macosx_11_0_arm64.whl
cachetools-5.3.3-py3-none-any.whl
certifi-2024.2.2-py3-none-any.whl
charset_normalizer-3.3.2-cp310-cp310-macosx_11_0_arm64.whl
databricks_labs_blueprint-0.4.4-py3-none-any.whl
databricks_labs_lsql-0.4.2-py3-none-any.whl
databricks_labs_ucx-0.22.0-py3-none-any.whl
databricks_sdk-0.26.0-py3-none-any.whl
google_auth-2.29.0-py2.py3-none-any.whl
idna-3.7-py3-none-any.whl
pyasn1-0.6.0-py2.py3-none-any.whl
pyasn1_modules-0.4.0-py3-none-any.whl
requests-2.31.0-py3-none-any.whl
rsa-4.9-py3-none-any.whl
sqlglot-23.12.2-py3-none-any.whl
urllib3-2.2.1-py3-none-any.whl
```

filtering wheels

the other aspect is the filtering of wheels, and we have a few constraints there:

  1. platform-independent wheels that end in `-none-any.whl` vs. those that are pre-built for certain architectures (e.g. `-macosx_11_0_arm64.whl`)
  2. wheels that are already present in Databricks Runtime (see https://github.com/databrickslabs/sandbox/blob/main/runtime-packages/sample-output.txt) and those that are not.
  3. databricks_sdk-* is part of DBR, but its versions are slightly older, and we need to update them along with things like UCX
  4. pyyaml is part of DBR already (and it's os/arch-specific)
  5. UCX depends on things that are not in DBR, like sqlglot
  6. meta question - filter-out vs filter-in

direction

If our assumption is that we're adding this functionality specifically for no-internet workspaces, then it can take a bit longer to execute (all deps will be cached after we build the main wheel anyway). Based on that, I suggest:

  • add a new method, `upload_wheel_dependencies(prefixes: list[str] | None = None) -> list[str]`
  • slightly modify `_build_wheel` to conditionally pass the `--no-deps` flag, e.g. it's an explicit `_build_wheel(no_deps=True)` for existing functionality and `_build_wheel(no_deps=False)` for this increment
  • filter in for prefixes, so that we can call it as `for remote_wheel in self._wheels.upload_wheel_dependencies(prefixes=['databricks', 'sqlglot']): ...`. In the future, we'll remove the explicit need for prefixes and depend on the whitelist of DBR packages that would be refreshed with this repo, but for now, prefixes is our shortcut
  • filter out everything that doesn't end in `-none-any.whl`
  • filter out the main wheel file, so that we don't upload UCX twice (see the sketch after this list)
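
A minimal sketch of that direction, as a method on `WheelsV2` — assuming `self._local_wheel` is the `list[Path]` returned by `_build_wheel(no_deps=False)`, and treating `self._main_wheel` and `self._upload_wheel(...)` as hypothetical helpers for the product's own wheel and the per-file WSFS upload (illustrative, not the merged implementation):

```python
class WheelsV2:
    # ...existing context-manager machinery elided...

    def upload_wheel_dependencies(self, prefixes: list[str] | None = None) -> list[str]:
        """Uploads dependency wheels to WSFS and returns their remote paths."""
        remote_paths: list[str] = []
        for wheel in self._local_wheel:  # all wheels built with no_deps=False
            if not wheel.name.endswith("-none-any.whl"):
                continue  # filter out platform-specific wheels (e.g. -macosx_11_0_arm64.whl)
            if wheel.name == self._main_wheel.name:
                continue  # filter out the main wheel, so we don't upload UCX twice
            if prefixes and not any(wheel.name.startswith(p) for p in prefixes):
                continue  # filter in only the caller-supplied prefixes
            remote_paths.append(self._upload_wheel(wheel))  # hypothetical per-file upload
        return remote_paths
```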

Contributor Author:


> slightly modify the `_build_wheel` to conditionally pass `--no-deps` flag, e.g. it's explicit `_build_wheel(no_deps=True)` for existing functionality and `_build_wheel(no_deps=False)` for this increment

Thanks, all the items above make sense and will be actioned. Just regarding the modification in `_build_wheel`: since it is called in `WheelsV2.__enter__`, it is not straightforward to pass `no_deps=False` while running `with self._wheels:` in the UCX code #L527C1-L529C49. Any suggestion on a workaround? If I'm not mistaken, a couple of changes are required in `WorkspaceInstaller` and `GlobalContext` accordingly.

The changes to add `no_deps` in `_build_wheel` would be:

```python
def __init__(self, installation: Installation, product_info: ProductInfo, *, verbose: bool = False, no_deps: bool = True):
    ...
    self._no_deps = no_deps

def __enter__(self) -> "WheelsV2":
    ...
    self._local_wheel = self._build_wheel(self._tmp_dir.name, verbose=self._verbose, no_deps=self._no_deps)
    return self

def _build_wheel(self, tmp_dir: str, *, verbose: bool = False, no_deps: bool = True) -> list[Path]: ...
```
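
Purely for illustration, the call site under that constructor-flag approach could look like this (hypothetical; the reply below steers towards new methods instead):

```python
# hypothetical opt-in at construction time;
# existing callers keep the default no_deps=True and see no change
with WheelsV2(installation, product_info, no_deps=False) as wheels:
    remote_wheel = wheels.upload_to_wsfs()
```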

Collaborator:


Build wheel is a small change:

  • Add a keyword argument; the default value keeps the current behavior
  • it should return a list of paths instead of a single one

The rest of the increment is in new methods that call build wheel a second time.
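
A hedged sketch of threading that keyword argument through to pip (`pip wheel --no-deps` and `--wheel-dir` are real pip flags; the project path and exact invocation here are assumptions, not blueprint's actual code):

```python
import subprocess
import sys
from pathlib import Path


def _build_wheel(self, tmp_dir: str, *, verbose: bool = False, no_deps: bool = True) -> list[Path]:
    """Builds wheel(s) into tmp_dir and returns every wheel produced, not just the main one."""
    cmd = [sys.executable, "-m", "pip", "wheel", "--wheel-dir", tmp_dir, "."]
    if no_deps:
        cmd.append("--no-deps")  # default keeps the current single-wheel behavior
    subprocess.run(cmd, check=True, capture_output=not verbose)
    return sorted(Path(tmp_dir).glob("*.whl"))
```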

```python
main_wheel_remote_path = remote_wheel
return main_wheel_remote_path

def upload_to_wsfs(self, force_dependencies: bool = False) -> str:
```
Collaborator:


The approach is slightly incorrect, because the consuming code has to know the remote wheel path of every dependency in order to add them as `Library(whl=...)`.

That being said, this functionality has to be added in a second method, let's say, `upload_upstream_dependencies_to_wsfs() -> list[str]`, and `upload_to_wsfs(self) -> str` has to keep the same interface so as not to cause breakage for downstream dependencies (also during upgrades)

see https://pypistats.org/packages/databricks-labs-blueprint
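
For context, a sketch of the consuming side — attaching both the main wheel and the returned dependency paths as libraries via the Databricks SDK (`Library` from `databricks.sdk.service.compute` is real; `upload_upstream_dependencies_to_wsfs()` is the method name proposed above, and `wheels` is assumed to be a `WheelsV2` instance):

```python
from databricks.sdk.service.compute import Library

libraries = [Library(whl=wheels.upload_to_wsfs())]
for remote_path in wheels.upload_upstream_dependencies_to_wsfs():
    libraries.append(Library(whl=remote_path))
# libraries can then be passed to a job task or cluster configuration
```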

```python
wheels = WheelsV2(installation, product_info)
with wheels:
    remote_on_wsfs = wheels.upload_to_wsfs(force_dependencies=True)
    installation.assert_file_uploaded(re.compile("wheels/databricks_labs_blueprint-*"))
```

```python
assert not os.path.exists(wheel)


def test_build_and_upload_wheel_with_no_dependency():
```
Collaborator:


also add an integration test to verify that things get uploaded to the installation directory.
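
A sketch of such an integration test, modeled on the snippet from the follow-up commit further down (the `new_installation` fixture and the single-file assertion are assumptions):

```python
def test_upload_wheel_dependencies(new_installation):
    product_info = ProductInfo.from_class(WheelsV2)
    with WheelsV2(new_installation, product_info) as wheels:
        wheels.upload_wheel_dependencies(["databricks"])
        installation_files = new_installation.files()
        # only the Databricks SDK wheel should land in the installation directory
        assert len(installation_files) == 1
```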

Contributor Author:


Currently facing an issue with the test environment - to be completed

@CLAassistant commented May 8, 2024

CLA assistant check
All committers have signed the CLA.

… download)

Resolves #573

The output of the build wheel function will be a list of all the dependent wheel packages. Upload functions will get a new flag as input to either include the dependencies in the download list or just limit the download to the main wheel package.
@nfx (Collaborator) left a comment:


Address the last comments and add documentation to README.md, and we should be good to go.

@nfx marked this pull request as ready for review May 10, 2024 15:54
@nfx (Collaborator) left a comment:


Can you also add:

  • an integration test (the one using a real workspace)
  • documentation in README.md, so that others know about this feature

```python
    :param prefixes : A list of prefixes to match against the wheel names. If a prefix matches, the wheel is uploaded.
    """
    remote_paths = []
    for wheel in list(
```
Collaborator:


Wrapping in list() is redundant here - you're only iterating over paths
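
In other words (illustrative):

```python
# redundant copy:
for wheel in list(self._local_wheel):
    ...
# equivalent, since _local_wheel is already an iterable of paths:
for wheel in self._local_wheel:
    ...
```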

@nfx changed the title from "Updating _build_wheel, upload_to_wsfs and upload_to_dbfs (to download all dependant wheels)" to "Added upstream wheel uploads for Databricks Workspaces without internet access" May 12, 2024
@nfx (Collaborator) left a comment:


fixed documentation

Comment on lines +955 to +960
```python
wheel_paths = wheels.upload_wheel_dependencies(['databricks_sdk', 'pandas'])
for path in wheel_paths:
    print(f'Uploaded dependency to {path}')
```

In this example, the `upload_wheel_dependencies(['databricks_sdk', 'pandas'])` call will upload all the dependencies of the wheel that have names starting with 'databricks_sdk' or 'pandas'. This method excludes any platform-specific dependencies (i.e. those not ending with `-none-any.whl`). The main wheel file is also not uploaded. The method returns a list of paths to the uploaded dependencies on WorkspaceFS.
Collaborator:


Suggested change:

```python
remote_wheel = wheels.upload_to_wsfs()
logger.info(f'Uploaded main wheel: {remote_wheel}')
wheel_paths = wheels.upload_wheel_dependencies(prefixes=['databricks', 'sqlglot'])
for path in wheel_paths:
    logger.info(f'Uploaded upstream dependency: {path}')
```

In this example, the `upload_wheel_dependencies(prefixes=['databricks', 'sqlglot'])` call will upload all the dependencies of the wheel that have names starting with 'databricks' or 'sqlglot'. This method excludes any platform-specific dependencies (i.e. those not ending with `-none-any.whl`) as well as the main wheel file itself. The method returns a list of paths to the uploaded dependencies on WorkspaceFS.

@nfx changed the title from "Added upstream wheel uploads for Databricks Workspaces without internet access" to "Added upstream wheel uploads for Databricks Workspaces without Public Internet access" May 12, 2024
@nfx merged commit 50b5474 into databrickslabs:main May 12, 2024
12 of 13 checks passed
nfx added a commit that referenced this pull request May 12, 2024
With the changes from #99, we were not hitting `if wheel.name == self._local_wheel.name` condition for unreleased wheels, resulting in undefined behavior:

```
product_info = ProductInfo.from_class(WheelsV2)
with WheelsV2(new_installation, product_info) as whl:
    whl.upload_wheel_dependencies(["databricks"])
    installation_files = new_installation.files()
    # only Databricks SDK has to be uploaded
    assert len(installation_files) == 1
```
nfx added a commit that referenced this pull request May 12, 2024

…ups (#103)
nfx added a commit that referenced this pull request May 12, 2024
* Added upstream wheel uploads for Databricks Workspaces without Public Internet access ([#99](#99)). This commit introduces a new feature for uploading upstream wheel dependencies to Databricks Workspaces without Public Internet access. A new flag has been added to upload functions, allowing users to include or exclude dependencies in the download list. The `WheelsV2` class has been updated with a new method, `upload_wheel_dependencies(prefixes)`, which checks if each wheel's name starts with any of the provided prefixes before uploading it to the Workspace File System (WSFS). This feature also includes two new tests to verify the functionality of uploading the main wheel package and dependent wheel packages, optimizing downloads based on specific use cases. This enables users to more easily use the package in offline environments with restricted internet access, particularly for Databricks Workspaces with extra layers of network security.
* Fixed bug for double-uploading of unreleased wheels in air-gapped setups ([#103](#103)). In this release, we have addressed a bug in the `upload_wheel_dependencies` method of the `WheelsV2` class, which caused double-uploading of unreleased wheels in air-gapped setups. This issue occurred due to the condition `if wheel.name == self._local_wheel.name` not being met, resulting in undefined behavior. We have introduced a cached property `_current_version` to tackle this bug for unreleased versions uploaded to air-gapped workspaces. We also added a new method, `upload_to_wsfs()`, that uploads files to the workspace file system (WSFS) in the integration test. This release also includes new tests to ensure that only the Databricks SDK is uploaded and that the number of installation files is correct. These changes have resolved the double-uploading issue, and the number of installation files, Databricks SDK, Blueprint, and version.json metadata are now uploaded correctly to WSFS.
@nfx mentioned this pull request May 12, 2024