databricks bundle init template v1 #686

Merged: 54 commits, Sep 5, 2023
59c1c73
Add foundation for builtin templates
lennartkats-db Aug 21, 2023
2b67262
Avoid nil context
lennartkats-db Aug 21, 2023
737c612
Add missing file
lennartkats-db Aug 21, 2023
2955e84
Fix test in Windows
lennartkats-db Aug 21, 2023
ea5666a
Add initial default-python template
lennartkats-db Aug 21, 2023
a1c6b3b
Avoid ipynb for now
lennartkats-db Aug 21, 2023
6ae0279
Fix typos
lennartkats-db Aug 21, 2023
568afe0
Remove the hard requirement of a target.prod.git setting
lennartkats-db Aug 21, 2023
37de1e7
Merge remote-tracking branch 'databricks/main' into template-foundation
lennartkats-db Aug 21, 2023
a3c7e54
Don't guess the main branch
lennartkats-db Aug 21, 2023
a7c6928
Remove old test
lennartkats-db Aug 21, 2023
9435670
Fix typo
lennartkats-db Aug 21, 2023
cfa8a83
Cleanup
lennartkats-db Aug 23, 2023
4340356
Add caching to is_service_principal
lennartkats-db Aug 24, 2023
9d41ccc
Fix tests
lennartkats-db Aug 24, 2023
581aae6
Update template
lennartkats-db Aug 24, 2023
bc57e53
Merge branch 'template-foundation' into template
lennartkats-db Aug 24, 2023
79370f8
Update template
lennartkats-db Aug 25, 2023
078805f
Add comment
lennartkats-db Aug 25, 2023
c1ae220
Merge remote-tracking branch 'databricks/main' into template
lennartkats-db Aug 25, 2023
8554b4e
Restore removed file
lennartkats-db Aug 25, 2023
6a82aee
Fix name
lennartkats-db Aug 25, 2023
44bc9e9
Copy-edit
lennartkats-db Aug 28, 2023
3660191
Add missing file
lennartkats-db Aug 28, 2023
6c1f80a
Support cluster overrides with cluster_key and compute_key (#696)
lennartkats-db Aug 28, 2023
2a3b86c
Fix newline
lennartkats-db Aug 28, 2023
9f937ed
Process review comments
lennartkats-db Aug 29, 2023
30c5e04
Update
lennartkats-db Aug 30, 2023
35350b4
Update README
lennartkats-db Aug 30, 2023
b071ad7
Allow referencing local Python wheels without artifacts section defin…
andrewnester Aug 28, 2023
f6c769b
Fixed --environment flag (#705)
andrewnester Aug 28, 2023
a7407f0
Correctly identify local paths in libraries section (#702)
andrewnester Aug 29, 2023
886a4e5
Fixed path joining in FindFilesWithSuffixInPath (#704)
andrewnester Aug 29, 2023
9451d16
Added transformation mutator for Python wheel task for them to work …
andrewnester Aug 30, 2023
da854ef
Minor tweaks
lennartkats-db Sep 1, 2023
0667a10
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
dd0eba4
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
53b1c2d
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
fde7eb3
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
c87a648
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
81c2ff4
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
d351fbf
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
6bed4d1
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
369c578
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
01c9868
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
79e9211
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
a5068bb
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Aug 30, 2023
aea403f
Update libs/template/templates/default-python/template/{{.project_nam…
lennartkats-db Sep 1, 2023
391ab24
Don't show this template just yet
lennartkats-db Sep 1, 2023
9063af7
Merge remote-tracking branch 'databricks/main' into template
lennartkats-db Sep 1, 2023
2833db9
Merge remote-tracking branch 'databricks/main' into template
lennartkats-db Sep 4, 2023
2b33682
Remove artifacts section; no longer required
lennartkats-db Sep 4, 2023
f650a3f
Restore scratch directory
lennartkats-db Sep 4, 2023
db52765
Backport Databricks connect suggestion
lennartkats-db Sep 5, 2023
2 changes: 1 addition & 1 deletion cmd/bundle/init.go
@@ -59,7 +59,7 @@ func newInitCommand() *cobra.Command {
} else {
return errors.New("please specify a template")

/* TODO: propose to use default-python (once template is ready)
/* TODO: propose to use default-python (once #708 is merged)
var err error
if !cmdio.IsOutTTY(ctx) || !cmdio.IsInTTY(ctx) {
return errors.New("please specify a template")
@@ -3,7 +3,7 @@
"project_name": {
"type": "string",
"default": "my_project",
"description": "Name of the directory"
"description": "Unique name for this project"
}
}
}
@@ -0,0 +1,9 @@

.databricks/
build/
dist/
__pycache__/
*.egg-info
.venv/
scratch/**
!scratch/README.md
@@ -0,0 +1,3 @@
# Typings for Pylance in Visual Studio Code
# see https://github.com/microsoft/pyright/blob/main/docs/builtins.md
from databricks.sdk.runtime import *
@@ -0,0 +1,7 @@
{
"recommendations": [
"databricks.databricks",
"ms-python.vscode-pylance",
"redhat.vscode-yaml"
]
}
@@ -0,0 +1,14 @@
{
"python.analysis.stubPath": ".vscode",
"databricks.python.envFile": "${workspaceFolder}/.env",
"jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])",
"jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------",
"python.testing.pytestArgs": [
"."
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"files.exclude": {
"**/*.egg-info": true
},
}
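The `cellMarker` regex in the settings above decides which comment lines start a new interactive-window cell. A quick standalone check of that pattern (a sketch, independent of VS Code):

```python
import re

# The same cell-marker pattern as in settings.json, compiled directly.
CELL_MARKER = re.compile(
    r"^# COMMAND ----------|^# Databricks notebook source"
    r"|^(#\s*%%|#\s*\<codecell\>|#\s*In\[\d*?\]|#\s*In\[ \])"
)

for line in ["# COMMAND ----------", "# %%", "import os"]:
    print(repr(line), "is cell marker:", bool(CELL_MARKER.match(line)))
```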

This file was deleted.

@@ -0,0 +1,37 @@
# {{.project_name}}

The '{{.project_name}}' project was generated using the default-python template.

## Getting started

1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html

2. Authenticate to your Databricks workspace:
```
$ databricks configure
```

3. To deploy a development copy of this project, type:
```
$ databricks bundle deploy --target dev
```
(Note that "dev" is the default target, so the `--target` parameter
is optional here.)

This deploys everything that's defined for this project.
For example, the default template would deploy a job called
`[dev yourname] {{.project_name}}-job` to your workspace.
You can find that job by opening your workspace and clicking on **Workflows**.

4. Similarly, to deploy a production copy, type:
```
$ databricks bundle deploy --target prod
```

5. Optionally, install developer tools such as the Databricks extension for Visual Studio Code from
https://docs.databricks.com/dev-tools/vscode-ext.html. Or read the "getting started" documentation for
**Databricks Connect** for instructions on running the included Python code from a different IDE.

6. For documentation on the Databricks asset bundles format used
for this project, and for CI/CD configuration, see
https://docs.databricks.com/dev-tools/bundles/index.html.
@@ -0,0 +1,52 @@
# This is a Databricks asset bundle definition for {{.project_name}}.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
name: {{.project_name}}

include:
- resources/*.yml

targets:
# The 'dev' target, used for development purposes.
# Whenever a developer deploys using 'dev', they get their own copy.
dev:
# We use 'mode: development' so that everything deployed to this target gets a prefix
# like '[dev my_user_name]'. Setting this mode also disables any schedules and
# automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
mode: development
default: true
workspace:
host: {{workspace_host}}

# Optionally, there could be a 'staging' target here.
# (See Databricks docs on CI/CD at https://docs.databricks.com/dev-tools/bundles/index.html.)
#
# staging:
# workspace:
# host: {{workspace_host}}

# The 'prod' target, used for production deployment.
prod:
# For production deployments, we only have a single copy, so we override the
# workspace.root_path default of
# /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
# to a path that is not specific to the current user.
{{- /*
Explaining 'mode: production' isn't as pressing as explaining 'mode: development'.
As we already talked about the other mode above, users can just
look at documentation or ask the assistant about 'mode: production'.
#
# By making use of 'mode: production' we enable strict checks
# to make sure we have correctly configured this target.
*/}}
mode: production
workspace:
host: {{workspace_host}}
root_path: /Shared/.bundle/prod/${bundle.name}
{{- if not is_service_principal}}
run_as:
# This runs as {{user_name}} in production. Alternatively,
# a service principal could be specified here using service_principal_name
# (see the Databricks documentation).
user_name: {{user_name}}
{{end -}}
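The `root_path` above relies on `${...}` variable interpolation. A toy Python version of that substitution, for illustration only (the real bundle engine resolves many more variables than this sketch supports):

```python
import re

def interpolate(template: str, variables: dict) -> str:
    # Replace each ${name} occurrence with its value from the mapping.
    return re.sub(r"\$\{([^}]+)\}", lambda m: variables[m.group(1)], template)

print(interpolate("/Shared/.bundle/prod/${bundle.name}", {"bundle.name": "my_project"}))
# -> /Shared/.bundle/prod/my_project
```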
@@ -0,0 +1,27 @@
# Fixtures
{{- /*
We don't want to have too many README.md files, since they
stand out so much. But we do need to have a file here to make
sure the folder is added to Git.
*/}}

This folder is reserved for fixtures, such as CSV files.

Below is an example of how to load fixtures as a data frame:

```
import pandas as pd
import os

def get_absolute_path(*relative_parts):
    if 'dbutils' in globals():
        base_dir = os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) # type: ignore
        path = os.path.normpath(os.path.join(base_dir, *relative_parts))
        return path if path.startswith("/Workspace") else os.path.join("/Workspace", path.lstrip("/"))
    else:
        return os.path.join(*relative_parts)

csv_file = get_absolute_path("..", "fixtures", "mycsv.csv")
df = pd.read_csv(csv_file)
display(df)
```
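The `/Workspace`-prefix branch of the helper above can be exercised in isolation. A minimal sketch with illustrative paths (the function name and inputs are hypothetical, not part of the template):

```python
import os

def ensure_workspace_prefix(path: str) -> str:
    # Notebook paths reported by dbutils are absolute but lack the
    # /Workspace mount prefix; add it when missing.
    path = os.path.normpath(path)
    if path.startswith("/Workspace"):
        return path
    return os.path.join("/Workspace", path.lstrip("/"))

print(ensure_workspace_prefix("/Users/someone/project/fixtures"))
# -> /Workspace/Users/someone/project/fixtures
```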
Contributor: Loading a fixture with a relative path requires us to set $PWD.

lennartkats-db (author), Aug 29, 2023:

Ah yes there was supposed to be some path mangling here. Which is not required on serverless, btw.

I don't think there's any way we can do this without the (broadly published) .entry_point API of dbutils. Or is there, @fjakobs?

This is what we could suggest:

```
import pandas as pd
import os

def get_relative_path(*relative_parts):
    if 'dbutils' in globals():
        base_dir = os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) # type: ignore
        path = os.path.normpath(os.path.join(base_dir, *relative_parts))
        return path if path.startswith("/Workspace") else os.path.join("/Workspace", path)
    else:
        return os.path.join(*relative_parts)

csv_file = get_relative_path("..", "fixtures", "mycsv.csv")
df = pd.read_csv(csv_file)
display(df)
```

Contributor: The IDE adds some bootstrap code that sets CWD and the Python path correctly. We were considering doing the same for code deployed through DABs.

lennartkats-db (author):

@fjakobs We need something that also works from a notebook in the workspace, though. Is there anything cleaner for that than the option above?

And, as we get to a solution here, we should look at moving this to public docs, e.g. to https://docs.databricks.com/en/external-data/csv.html. cc @PaulCornellDB

@@ -0,0 +1,3 @@
[pytest]
testpaths = tests
pythonpath = src
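The `pythonpath = src` line makes the package under `src/` importable from the tests without installing it. Outside pytest, the same effect can be had by hand (a minimal sketch):

```python
import os
import sys

# Equivalent of pytest.ini's `pythonpath = src`, done manually so that
# `import <project_name>` resolves to the code under src/.
src_dir = os.path.abspath("src")
if src_dir not in sys.path:
    sys.path.insert(0, src_dir)

print(src_dir in sys.path)
```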
@@ -0,0 +1,42 @@
# The main job for {{.project_name}}
resources:

jobs:
{{.project_name}}_job:
name: {{.project_name}}_job

schedule:
quartz_cron_expression: '44 37 8 * * ?'
timezone_id: Europe/Amsterdam

{{- if not is_service_principal}}
email_notifications:
on_failure:
- {{user_name}}
{{end -}}

tasks:
- task_key: notebook_task
job_cluster_key: job_cluster
notebook_task:
notebook_path: ../src/notebook.ipynb

- task_key: python_wheel_task
depends_on:
- task_key: notebook_task
job_cluster_key: job_cluster
python_wheel_task:
package_name: {{.project_name}}
entry_point: main
libraries:
- whl: ../dist/*.whl

job_clusters:
- job_cluster_key: job_cluster
new_cluster:
{{- /* we should always use an LTS version in our templates */}}
spark_version: 13.3.x-scala2.12
node_type_id: {{smallest_node_type}}
autoscale:
min_workers: 1
max_workers: 4
@@ -0,0 +1,4 @@
# scratch

This folder is reserved for personal, exploratory notebooks.
By default these are not committed to Git, as 'scratch' is listed in .gitignore.
@@ -0,0 +1,50 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
"byteLimit": 2048000,
"rowLimit": 10000
},
"inputWidgets": {},
"nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append('../src')\n",
"from {{.project_name}} import main\n",
"\n",
"main.get_taxis().show(10)"
]
}
],
"metadata": {
"application/vnd.databricks.v1+notebook": {
"dashboards": [],
"language": "python",
"notebookMetadata": {
"pythonIndentUnit": 2
},
"notebookName": "ipynb-notebook",
"widgets": {}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@@ -0,0 +1,24 @@
"""
Setup script for {{.project_name}}.

This script packages and distributes the associated wheel file(s).
Source code is in ./src/. Run 'python setup.py sdist bdist_wheel' to build.
"""
from setuptools import setup, find_packages

import sys
sys.path.append('./src')

import {{.project_name}}

setup(
name="{{.project_name}}",
version={{.project_name}}.__version__,
url="https://databricks.com",
author="{{.user_name}}",
description="my test wheel",
packages=find_packages(where='./src'),
package_dir={'': 'src'},
entry_points={"console_scripts": ["main={{.project_name}}.main:main"]},
install_requires=["setuptools"],
)
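The `entry_points` mapping above ties a console command name to a module attribute. A simplified sketch of how such a spec string decomposes (not the real setuptools parser; `my_project` stands in for `{{.project_name}}`):

```python
def parse_entry_point(spec: str):
    # "name=package.module:attribute" -> (name, module path, attribute)
    name, _, target = spec.partition("=")
    module, _, attr = target.partition(":")
    return name.strip(), module.strip(), attr.strip()

print(parse_entry_point("main=my_project.main:main"))
# -> ('main', 'my_project.main', 'main')
```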
@@ -0,0 +1,65 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "ee353e42-ff58-4955-9608-12865bd0950e",
"showTitle": false,
"title": ""
}
},
"source": [
"# Default notebook\n",
"\n",
"This default notebook is executed using Databricks Workflows as defined in resources/{{.project_name}}_job.yml."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
"byteLimit": 2048000,
"rowLimit": 10000
},
"inputWidgets": {},
"nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"from {{.project_name}} import main\n",
"\n",
"main.get_taxis().show(10)\n"
]
}
],
"metadata": {
"application/vnd.databricks.v1+notebook": {
"dashboards": [],
"language": "python",
"notebookMetadata": {
"pythonIndentUnit": 2
},
"notebookName": "notebook",
"widgets": {}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@@ -0,0 +1 @@
__version__ = "0.0.1"
@@ -0,0 +1,16 @@
{{- /*
We use pyspark.sql rather than DatabricksSession.builder.getOrCreate()
for compatibility with older runtimes. With a new runtime, it's
equivalent to DatabricksSession.builder.getOrCreate().
*/ -}}
from pyspark.sql import SparkSession

def get_taxis():
spark = SparkSession.builder.getOrCreate()
return spark.read.table("samples.nyctaxi.trips")

def main():
get_taxis().show(5)

if __name__ == '__main__':
main()
@@ -0,0 +1,5 @@
from {{.project_name}} import main

def test_main():
taxis = main.get_taxis()
assert taxis.count() > 5