Added upstream wheel uploads for Databricks Workspaces without Public Internet access #99
Conversation
Codecov Report. Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main      #99      +/-   ##
==========================================
+ Coverage   78.49%   78.72%   +0.23%
==========================================
  Files          14       14
  Lines        1479     1495      +16
  Branches      263      269       +6
==========================================
+ Hits         1161     1177      +16
  Misses        230      230
  Partials       88       88
```

☔ View full report in Codecov by Sentry.
Force-pushed from ca58db5 to cee03b4.
```python
self._installation.save(Version(self._product_info.version(), remote_wheel, self._now_iso()))
return remote_wheel
for wheel in self._local_wheel:
    if force_dependencies or self._product_info.product_name() in wheel.name:
```
not breaking existing code

What if a dependency has the product name in it? Also, there are no strict guarantees that the product name will be part of the wheel name - unlikely, but possible, and it would break this code.

Given that we must not break the existing installation process with this change, we'll have to add `upload_upstream_dependencies_to_wsfs() -> list[str]` to return the adjusted list of wheel dependencies. I've been researching ways of getting the current wheel package name, but it's either not trivial or not portable without additional pip dependencies. For example, from the following list it's not deterministic which entry is the current package (say, `databricks_labs_ucx`):
```
PyYAML-6.0.1-cp310-cp310-macosx_11_0_arm64.whl
cachetools-5.3.3-py3-none-any.whl
certifi-2024.2.2-py3-none-any.whl
charset_normalizer-3.3.2-cp310-cp310-macosx_11_0_arm64.whl
databricks_labs_blueprint-0.4.4-py3-none-any.whl
databricks_labs_lsql-0.4.2-py3-none-any.whl
databricks_labs_ucx-0.22.0-py3-none-any.whl
databricks_sdk-0.26.0-py3-none-any.whl
google_auth-2.29.0-py2.py3-none-any.whl
idna-3.7-py3-none-any.whl
pyasn1-0.6.0-py2.py3-none-any.whl
pyasn1_modules-0.4.0-py3-none-any.whl
requests-2.31.0-py3-none-any.whl
rsa-4.9-py3-none-any.whl
sqlglot-23.12.2-py3-none-any.whl
urllib3-2.2.1-py3-none-any.whl
```
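Parsing the distribution name out of a wheel filename is mechanical, since PEP 427 fixes the layout, but nothing in the parsed names says which one is the package currently being installed - which is why prefix matching ends up as the practical shortcut. A minimal sketch (the helper name is ours, not from the codebase):

```python
from pathlib import Path

def wheel_distribution(wheel: Path) -> str:
    # PEP 427 wheel filename: {distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    return wheel.name.split("-")[0]

print(wheel_distribution(Path("databricks_labs_ucx-0.22.0-py3-none-any.whl")))  # databricks_labs_ucx
print(wheel_distribution(Path("sqlglot-23.12.2-py3-none-any.whl")))             # sqlglot
# both parse cleanly, but neither result tells us which package is "ours"
```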
filtering wheels

The other aspect is filtering of wheels, and we have two constraints there:
- platform-independent wheels that end in `-none-any.whl` versus those pre-built for certain architectures (e.g. `-macosx_11_0_arm64.whl`)
- wheels that are already present in Databricks Runtime (see https://github.com/databrickslabs/sandbox/blob/main/runtime-packages/sample-output.txt) versus those that are not:
  - `databricks_sdk-*` is part of DBR, but its versions are slightly older and we need to update them along with things like UCX
  - pyyaml is part of DBR already (and it's os/arch-specific)
  - UCX depends on things that are not in DBR, like `sqlglot`
- meta question: `filter-out` vs `filter-in` direction
If our assumption is that we're adding this functionality specifically for the no-internet case, then it can take a bit longer to execute (all dependencies will be cached after we build the main wheel anyway). Based on that, I suggest (see the sketch after this list):
- add a new method, `upload_wheel_dependencies(prefixes: list[str] | None = None) -> list[str]`
- slightly modify `_build_wheel` to conditionally pass the `--no-deps` flag, i.e. an explicit `_build_wheel(no_deps=True)` for existing functionality and `_build_wheel(no_deps=False)` for this increment
- filter in by prefixes, so that we can call it as `for remote_wheel in self._wheels.upload_wheel_dependencies(prefixes=['databricks', 'sqlglot']): ...`. In the future, we'll remove the explicit need for `prefixes` and depend on a whitelist of DBR packages that gets refreshed with this repo; for now, `prefixes` is our shortcut
- filter out everything that doesn't end in `-none-any.whl`
- filter out the main wheel file, so that we don't upload UCX twice
> slightly modify `_build_wheel` to conditionally pass the `--no-deps` flag, i.e. an explicit `_build_wheel(no_deps=True)` for existing functionality and `_build_wheel(no_deps=False)` for this increment
Thanks, all the items above make sense and will be actioned. Regarding the modification in `_build_wheel`: since it is called in `WheelsV2.__enter__`, it is not straightforward to pass `no_deps=False` while running `with self._wheels:` in the UCX code (#L527C1-L529C49). Any suggestion on a workaround? If I'm not mistaken, a couple of changes are required in WorkspaceInstaller and GlobalContext accordingly.
The changes to add `no_deps` to `_build_wheel` would be:

```python
def __init__(self, installation: Installation, product_info: ProductInfo, *, verbose: bool = False, no_deps: bool = True):
    ...
    self._no_deps = no_deps

def __enter__(self) -> "WheelsV2":
    ...
    self._local_wheel = self._build_wheel(self._tmp_dir.name, verbose=self._verbose, no_deps=self._no_deps)
    return self

def _build_wheel(self, tmp_dir: str, *, verbose: bool = False, no_deps: bool = True) -> list[Path]: ...
```
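For the consuming side, one way the proposed keyword could be wired (a hypothetical sketch of the intermediate proposal; the reply below instead keeps `__enter__` unchanged and moves dependency building into new methods):

```python
# hypothetical: opt in to dependency wheels at construction time
with WheelsV2(installation, product_info, no_deps=False) as wheels:
    remote_wheel = wheels.upload_to_wsfs()
```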
Build wheel is a small change:
- add a keyword argument whose default value keeps the current behavior
- it should return a list of paths instead of a single one

The rest of the increment is in new methods that call build wheel a second time.
```python
main_wheel_remote_path = remote_wheel
return main_wheel_remote_path

def upload_to_wsfs(self, force_dependencies: bool = False) -> str:
```
The approach is slightly incorrect, because the consuming code has to know the remote wheel paths for every dependency in order to add them as `Library(whl=...)`.

That being said, this functionality has to be added in a second method, let's say `upload_upstream_dependencies_to_wsfs() -> list[str]`, and `upload_to_wsfs(self) -> str` has to remain with the same interface, so as not to cause breakage for downstream dependencies (also during upgrades) - see https://pypistats.org/packages/databricks-labs-blueprint
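To illustrate why the paths must be returned - a sketch of hypothetical consuming code attaching every uploaded wheel to a cluster or job definition via the Databricks SDK (the paths are made up for the example):

```python
from databricks.sdk.service.compute import Library

remote_wheel = "/Workspace/Users/me/.ucx/wheels/databricks_labs_ucx-0.22.0-py3-none-any.whl"  # illustrative
dependency_paths = [
    "/Workspace/Users/me/.ucx/wheels/databricks_sdk-0.26.0-py3-none-any.whl",  # illustrative
    "/Workspace/Users/me/.ucx/wheels/sqlglot-23.12.2-py3-none-any.whl",
]

# every remote path, main wheel and dependencies alike, becomes its own Library entry
libraries = [Library(whl=remote_wheel)]
libraries.extend(Library(whl=path) for path in dependency_paths)
```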
tests/unit/test_wheels.py (outdated)
```python
wheels = WheelsV2(installation, product_info)
with wheels:
    remote_on_wsfs = wheels.upload_to_wsfs(force_dependencies=True)
    installation.assert_file_uploaded(re.compile("wheels/databricks_labs_blueprint-*"))
```
also assert that SDK is uploaded:
https://github.com/databrickslabs/blueprint/blob/main/pyproject.toml#L17
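A sketch of the extra assertion, mirroring the existing one above (the exact filename pattern is an assumption):

```python
# assume the SDK wheel lands next to the blueprint wheel under wheels/
installation.assert_file_uploaded(re.compile("wheels/databricks_sdk-*"))
```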
tests/unit/test_wheels.py (outdated)
```python
assert not os.path.exists(wheel)


def test_build_and_upload_wheel_with_no_dependency():
```
Also add an integration test to verify that things get uploaded to the installation directory.
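A sketch of what such an integration test could look like, assuming a `new_installation` fixture backed by a real workspace (the fixture and test names are assumptions):

```python
from databricks.labs.blueprint.wheels import ProductInfo, WheelsV2


def test_upload_wheel_dependencies_to_workspace(new_installation):
    # new_installation is assumed to be the real-workspace installation fixture
    product_info = ProductInfo.from_class(WheelsV2)
    with WheelsV2(new_installation, product_info) as wheels:
        remote_paths = wheels.upload_wheel_dependencies(prefixes=["databricks"])
    assert remote_paths  # at least the SDK wheel should land in the installation directory
```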
Currently facing an issue with the test environment; to be completed.
… download) Resolves #573. The output of the build wheel function will be a list of all the dependent wheel packages. Upload functions will get a new flag as input to either include the dependencies in the upload or limit it to the main wheel package.
Force-pushed from 104ffbc to b14c6ca.
Address the last comments and add documentation to README.md, and we should be good to go.
Can you also add:
- an integration test (the one using a real workspace)
- documentation in README.md, so that others know about this feature
```python
    :param prefixes: A list of prefixes to match against the wheel names. If a prefix matches, the wheel is uploaded.
    """
    remote_paths = []
    for wheel in list(
```
Wrapping in list() is redundant here - you're only iterating over paths
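In other words (an illustrative fragment, assuming the loop above):

```python
# redundant: _build_wheel already returns a list of paths
for wheel in list(self._build_wheel(self._tmp_dir.name, no_deps=False)):
    ...

# sufficient
for wheel in self._build_wheel(self._tmp_dir.name, no_deps=False):
    ...
```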
Fixed the documentation.
```python
wheel_paths = wheels.upload_wheel_dependencies(['databricks_sdk', 'pandas'])
for path in wheel_paths:
    print(f'Uploaded dependency to {path}')
```

In this example, the `upload_wheel_dependencies(['databricks_sdk', 'pandas'])` call will upload all the dependencies of the wheel that have names starting with 'databricks_sdk' or 'pandas'. This method excludes any platform-specific dependencies (i.e. those not ending with `-none-any.whl`). Also, the main wheel file is not uploaded. The method returns a list of paths to the uploaded dependencies on WorkspaceFS.
Suggested change (replacing the snippet and paragraph above):

```python
remote_wheel = wheels.upload_to_wsfs()
logger.info(f'Uploaded main wheel: {remote_wheel}')
wheel_paths = wheels.upload_wheel_dependencies(prefixes=['databricks', 'sqlglot'])
for path in wheel_paths:
    logger.info(f'Uploaded upstream dependency: {path}')
```
In this example, the `upload_wheel_dependencies(prefixes=['databricks', 'sqlglot'])` call will upload all the dependencies of the wheel that have names starting with 'databricks' or 'sqlglot'. This method excludes any platform-specific dependencies (i.e. those not ending with `-none-any.whl`). Also, the main wheel file is not uploaded. The method returns a list of paths to the uploaded dependencies on WorkspaceFS.
Fixed bug for double-uploading of unreleased wheels in air-gapped setups (#103): with the changes from #99, we were not hitting the `if wheel.name == self._local_wheel.name` condition for unreleased wheels, resulting in undefined behavior:

```python
product_info = ProductInfo.from_class(WheelsV2)
with WheelsV2(new_installation, product_info) as whl:
    whl.upload_wheel_dependencies(["databricks"])
installation_files = new_installation.files()
# only the Databricks SDK has to be uploaded
assert len(installation_files) == 1
```
* Added upstream wheel uploads for Databricks Workspaces without Public Internet access ([#99](#99)). This commit introduces a new feature for uploading upstream wheel dependencies to Databricks Workspaces without Public Internet access. A new flag has been added to the upload functions, allowing users to include or exclude dependencies in the upload. The `WheelsV2` class has been updated with a new method, `upload_wheel_dependencies(prefixes)`, which checks whether each wheel's name starts with any of the provided prefixes before uploading it to the Workspace File System (WSFS). The feature also includes two new tests that verify uploading of the main wheel package and of the dependent wheel packages. This enables users to use the package in offline environments with restricted internet access, particularly Databricks Workspaces with extra layers of network security.

* Fixed bug for double-uploading of unreleased wheels in air-gapped setups ([#103](#103)). This release addresses a bug in the `upload_wheel_dependencies` method of the `WheelsV2` class that caused double-uploading of unreleased wheels in air-gapped setups. The issue occurred because the condition `if wheel.name == self._local_wheel.name` was not met, resulting in undefined behavior. A cached property `_current_version` was introduced to handle unreleased versions uploaded to air-gapped workspaces, and the integration test now uploads files to the workspace file system via `upload_to_wsfs()`. New tests ensure that only the Databricks SDK is uploaded and that the number of installation files is correct: the Databricks SDK, Blueprint, and version.json metadata are now uploaded to WSFS exactly once.
The output of the build wheel function will be a list of all the dependent wheel packages. Upload functions will get a new flag as input to either include the dependencies in the upload or limit it to the main wheel package.
Relates to databrickslabs/ucx#573