Add documentation for versioning with DVC in Kedro #4443

Merged on Feb 12, 2025 (32 commits).
Changes from 13 commits
Commits
1bb2cd1
Add kedro dvc page
lrcouto Jan 27, 2025
3ec3cd6
Merge branch 'main' into kedro-dvc-documentation
lrcouto Jan 27, 2025
7d88e90
Lint
lrcouto Jan 27, 2025
fdc3ca9
Add new page to index
lrcouto Jan 27, 2025
ae04428
Lint
lrcouto Jan 27, 2025
3b2fca8
Update docs/source/data/kedro_dvc_versioning.md
lrcouto Jan 27, 2025
fe13392
Merge branch 'main' into kedro-dvc-documentation
lrcouto Jan 27, 2025
d5a2bdf
Formatting, add more examples
lrcouto Jan 28, 2025
a019343
Merge branch 'main' into kedro-dvc-documentation
lrcouto Jan 30, 2025
5eee394
Add additional information about starters
lrcouto Feb 3, 2025
b1f7b84
Elaborate information about the gitignore file
lrcouto Feb 3, 2025
467deba
Merge branch 'main' into kedro-dvc-documentation
lrcouto Feb 4, 2025
80822de
Elaborate on the instructions
lrcouto Feb 5, 2025
d957b8d
Merge branch 'main' into kedro-dvc-documentation
lrcouto Feb 6, 2025
6188949
Further clarification on the .gitignore file
lrcouto Feb 6, 2025
9d8e886
Merge branch 'main' into kedro-dvc-documentation
lrcouto Feb 7, 2025
2ab8b5d
Change details on the template gitignore
lrcouto Feb 11, 2025
c55aa23
Elaborate on explanations
lrcouto Feb 11, 2025
7c19b19
Merge branch 'main' into kedro-dvc-documentation
lrcouto Feb 11, 2025
e44dd9a
Add more detail on version checkout
lrcouto Feb 12, 2025
38c2cba
Merge branch 'kedro-dvc-documentation' of github.com:kedro-org/kedro …
lrcouto Feb 12, 2025
ea2ca74
Update docs/source/data/kedro_dvc_versioning.md
lrcouto Feb 12, 2025
657cd8d
Style and grammar corrections
lrcouto Feb 12, 2025
a5918bf
Merge branch 'kedro-dvc-documentation' of github.com:kedro-org/kedro …
lrcouto Feb 12, 2025
ec7fc8d
More style/grammar
lrcouto Feb 12, 2025
65cf95a
Merge branch 'main' into kedro-dvc-documentation
ankatiyar Feb 12, 2025
4bab41f
Change 'we' to 'you'
lrcouto Feb 12, 2025
090625b
Merge branch 'kedro-dvc-documentation' of github.com:kedro-org/kedro …
lrcouto Feb 12, 2025
ca384e9
Update docs/source/data/kedro_dvc_versioning.md
lrcouto Feb 12, 2025
0180459
Update docs/source/data/kedro_dvc_versioning.md
lrcouto Feb 12, 2025
d67db83
Add link to the dvc.yaml docs
lrcouto Feb 12, 2025
026a84d
Merge branch 'kedro-dvc-documentation' of github.com:kedro-org/kedro …
lrcouto Feb 12, 2025
1 change: 1 addition & 0 deletions docs/source/data/index.md
@@ -37,6 +37,7 @@ Further pages describe more advanced concepts:

advanced_data_catalog_usage
partitioned_and_incremental_datasets
kedro_dvc_versioning
```

This section on handling data with Kedro concludes with an advanced use case, illustrated with a tutorial that explains how to create your own custom dataset:
231 changes: 231 additions & 0 deletions docs/source/data/kedro_dvc_versioning.md
@@ -0,0 +1,231 @@
# Data and pipeline versioning with Kedro and DVC

This document explains how to use [DVC](https://dvc.org/) to version datasets and pipelines in your Kedro project. DVC is a command line tool and VS Code extension that helps you develop reproducible machine learning projects.

This tutorial assumes you have experience with the Git CLI and Kedro CLI commands but does not require any prior knowledge of DVC.

## Versioning data with .dvc files

### Initialising the repository

This example uses the Kedro `spaceflights-pandas` starter project, which includes preconfigured datasets and pipelines. To create this starter project locally, use the command:

`kedro new --starter=spaceflights-pandas --name=space-dvc`

For more information about starter projects, see the [Kedro starters documentation](https://docs.kedro.org/en/stable/starters/starters.html) page.

To use DVC as a Python library, install it using `pip` or `conda`, for example:
`pip install dvc`

Since DVC works alongside Git to track data changes, initialise the Kedro project as a Git repository: `git init`.

Then, initialise DVC in the project: `dvc init`.

You should see a message such as:

```bash
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+
```

### First commits

DVC helps manage large datasets that should not be stored directly in Git. Rather than adding dataset files to Git, DVC generates small metadata files for Git to track.

These metadata files store information about the actual dataset, such as its hash and location. For more information about the structure of a `.dvc` file, see the [DVC documentation](https://dvc.org/doc/user-guide/project-structure/dvc-files#dvc-files).
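
As a rough illustration of the information such a metadata file holds, the Python sketch below computes a content hash and byte size for a file. The `describe_file` helper is hypothetical and is not part of DVC. It assumes the recorded hash is an MD5 digest of the raw file contents, which is DVC's default for local files.

```python
import hashlib
from pathlib import Path


def describe_file(path: str, chunk_size: int = 1024 * 1024) -> dict:
    """Return the kind of facts a .dvc file records: hash, size and path."""
    digest = hashlib.md5()
    file_path = Path(path)
    with file_path.open("rb") as handle:
        # Read in chunks so a large dataset is never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": file_path.stat().st_size,
        "path": file_path.name,
    }
```

Calling `describe_file("data/01_raw/companies.csv")` returns a small dictionary. Because the hash depends only on the file contents, any change to the dataset produces a different value, which is what lets Git track dataset versions through the metadata file alone.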

Suppose you have a dataset in your project, such as:

```yaml
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv
```

Use `dvc add` to start tracking a dataset file:

```bash
dvc add data/01_raw/companies.csv
```

This generates the `companies.csv.dvc` file, which you can commit to Git. This small, human-readable metadata file acts as a placeholder for the original data in Git.

The `spaceflights-pandas` starter ignores everything under the `data/` directory by default, so Git also ignores the new `.dvc` file. To fix this, remove the following lines from the `.gitignore` file provided by the template:

```bash
# ignore everything in the following folders
data/**

# except their sub-folders
!data/**/
```

Once updated, add the `.dvc` file to Git and commit the changes:

```bash
git add data/01_raw/companies.csv.dvc
git commit -m "Track companies.csv dataset with DVC"
```

### Going back to a previous version of the data

DVC integrates with Git to manage different dataset versions. If you need to restore a previous version of a dataset, first identify the commit containing the desired version. To display the commit hashes associated with the `.dvc` file, run:

```bash
git log -- data/01_raw/companies.csv.dvc
```

Once you find the desired commit, run:

```bash
git checkout <commit_hash> data/01_raw/companies.csv.dvc
dvc checkout
```

The first command restores the `.dvc` metadata file to its previous version. The second uses the hash recorded in the metadata file to restore the corresponding dataset from the DVC cache.
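
The idea behind this two-step restore can be sketched in Python as a toy content-addressed cache. This is a simplified model, not DVC's implementation: the function names `add_to_cache` and `checkout` and the cache layout are assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash the file contents; the digest identifies one version of the data."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def add_to_cache(data_file: Path, cache_dir: Path) -> str:
    """Copy the file into the cache under its own hash and return that hash."""
    digest = file_md5(data_file)
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copyfile(data_file, cached)
    return digest


def checkout(digest: str, data_file: Path, cache_dir: Path) -> None:
    """Restore the version identified by `digest` into the workspace."""
    cached = cache_dir / digest[:2] / digest[2:]
    shutil.copyfile(cached, data_file)
```

In this model, the hash returned by `add_to_cache` plays the role of the value stored in the `.dvc` file. Checking out an older commit changes which hash the metadata file contains, and `checkout` then copies the matching cached content back into place.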

### Storing data remotely

DVC remotes provide access to external storage locations to track and share your data and ML models with the `dvc push` and `dvc pull` commands. Remotes are typically shared between devices or team members working on the same project. DVC supports [several different storage types](https://dvc.org/doc/user-guide/data-management/remote-storage#supported-storage-types).

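For example, to register an S3 bucket as the default remote, run the pipeline, and push the tracked data:

```bash
# -d makes "myremote" the default remote, so `dvc push` needs no extra flags
dvc remote add -d myremote s3://mybucket
kedro run
git add .
git commit -m "Update"
# Upload the DVC-tracked data to the remote
dvc push
```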

### Going back to a previous version of the data, stored remotely

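To restore a previous version whose data is stored remotely, check out the corresponding `.dvc` file and then pull the data:

```bash
git checkout <commit_hash> data/01_raw/companies.csv.dvc
# Restores the file from the local cache, if that version is still cached
dvc checkout
# Downloads anything missing from the remote and places it in the workspace
dvc pull
```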

## Versioning with DVC data pipelines

While the previous method allows you to version datasets, it comes with some limitations:

- Intermediate and output datasets must be added to DVC manually.
- Parameters and code changes are not explicitly tracked.
- Artefacts and metrics cannot be tracked effectively.

To address these issues, you can define Kedro pipelines as DVC stages in the `dvc.yaml` file. The list of stages is typically the most important part of a `dvc.yaml` file. The file can also configure artifacts, metrics, params, and plots, either as part of a stage definition or on their own.

### Defining Kedro pipelines as DVC stages

Here is an example configuration for `dvc.yaml`:

```yaml
stages:
  data_processing:
    cmd: kedro run --pipeline data_processing
    deps:
      - data/01_raw/companies.csv
      - data/01_raw/reviews.csv
      - data/01_raw/shuttles.xlsx
    outs:
      - data/02_intermediate/preprocessed_companies.parquet
      - data/02_intermediate/preprocessed_shuttles.parquet
      - data/03_primary/model_input_table.parquet

  data_science:
    cmd: kedro run --pipeline data_science
    deps:
      - data/03_primary/model_input_table.parquet
    outs:
      - data/06_models/regressor.pickle
```

Run the pipeline with:

```bash
dvc repro
```
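
The stage definitions above imply an execution order, because `data_science` depends on a file that `data_processing` produces. The Python sketch below is a simplified model of that ordering, not DVC's own code. It copies the `deps` and `outs` from the example, and the `execution_order` helper is hypothetical. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Stage definitions copied from the dvc.yaml example (deps and outs only).
STAGES = {
    "data_processing": {
        "deps": [
            "data/01_raw/companies.csv",
            "data/01_raw/reviews.csv",
            "data/01_raw/shuttles.xlsx",
        ],
        "outs": [
            "data/02_intermediate/preprocessed_companies.parquet",
            "data/02_intermediate/preprocessed_shuttles.parquet",
            "data/03_primary/model_input_table.parquet",
        ],
    },
    "data_science": {
        "deps": ["data/03_primary/model_input_table.parquet"],
        "outs": ["data/06_models/regressor.pickle"],
    },
}


def execution_order(stages: dict) -> list:
    """Order stages so each one runs after the stages that produce its deps."""
    # Map every output file to the stage that writes it.
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    # A stage depends on whichever stages produce the files it reads.
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))  # ['data_processing', 'data_science']
```

Raw files such as `companies.csv` have no producing stage, so they act as the entry points of the graph.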

### Updating a dataset

If one of the datasets is updated, you can rerun only the pipelines affected by the change.

The `dvc repro` command reruns only the stages, and therefore only the Kedro pipelines, whose dependencies or outputs have changed since the last run.
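
The change detection can be pictured with the Python sketch below. It is a simplified model under stated assumptions, not DVC's logic: `snapshot` stands in for the hashes DVC records after a successful run, and `stage_is_stale` is a hypothetical helper that compares the current files against that record.

```python
import hashlib
from pathlib import Path
from typing import Optional


def digest_of(path: str) -> Optional[str]:
    """Return the MD5 digest of a file, or None if the file does not exist."""
    file_path = Path(path)
    if not file_path.is_file():
        return None
    return hashlib.md5(file_path.read_bytes()).hexdigest()


def snapshot(paths: list) -> dict:
    """Record the current digest of every path, as done after a successful run."""
    return {path: digest_of(path) for path in paths}


def stage_is_stale(deps: list, outs: list, recorded: dict) -> bool:
    """A stage must rerun if any dep or out is missing or differs from the record."""
    for path in [*deps, *outs]:
        current = digest_of(path)
        if current is None or current != recorded.get(path):
            return True
    return False
```

A stage whose files all match the recorded digests is skipped, which is why editing `companies.csv` triggers `data_processing` while leaving unrelated stages untouched.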

### Tracking code changes

You can track changes to your code by adding the relevant files to the `deps` section in `dvc.yaml`.

```yaml
stages:
  data_processing:
    cmd: kedro run --pipeline data_processing
    deps:
      - data/01_raw/companies.csv
      - data/01_raw/reviews.csv
      - data/01_raw/shuttles.xlsx
      - src/space_dvc/pipelines/data_processing/nodes.py
      - src/space_dvc/pipelines/data_processing/pipeline.py
    outs:
      - data/02_intermediate/preprocessed_companies.parquet
      - data/02_intermediate/preprocessed_shuttles.parquet
      - data/03_primary/model_input_table.parquet
```

After applying the desired code changes, run `dvc repro`. The output should confirm any updates to the `dvc.lock` file:

```bash
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
```

You can then push the updated outputs to remote storage with the `dvc push` command.

### Tracking parameters

To track parameters, include them under the `params` section in `dvc.yaml`.

```yaml
stages:
  data_science:
    cmd: kedro run --pipeline data_science
    deps:
      - data/03_primary/model_input_table.parquet
      - src/space_dvc/pipelines/data_science/nodes.py
      - src/space_dvc/pipelines/data_science/pipeline.py
    params:
      - conf/base/parameters_data_science.yml:
          - model_options
    outs:
      - data/06_models/regressor.pickle
```

Run the pipeline and push the changes:

```bash
dvc repro
dvc push
```

### Running experiments with different parameters

To experiment with different parameter values, update the parameter in `conf/base/parameters_data_science.yml` and then rerun the pipelines with `dvc repro`.

Compare parameter changes between runs with `dvc params diff`:

```bash
Path Param HEAD workspace
conf/base/parameters_data_science.yml model_options.features - ['engines', 'passenger_capacity', 'crew', 'd_check_complete', 'moon_clearance_complete', 'iata_approved', 'company_rating', 'review_scores_rating']
conf/base/parameters_data_science.yml model_options.random_state - 3
conf/base/parameters_data_science.yml model_options.test_size - 0.2
```
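
The dotted names in the `Param` column, such as `model_options.test_size`, come from flattening the nested parameter file. The Python sketch below shows one way to produce a comparable diff from two parameter dictionaries. The `flatten` and `params_diff` helpers are hypothetical and only approximate what `dvc params diff` reports.

```python
def flatten(params: dict, prefix: str = "") -> dict:
    """Turn nested parameters into dotted keys, e.g. model_options.test_size."""
    flat = {}
    for key, value in params.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def params_diff(old: dict, new: dict) -> dict:
    """Return {dotted_key: (old_value, new_value)} for every changed parameter."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        key: (old_flat.get(key), new_flat.get(key))
        for key in sorted(old_flat.keys() | new_flat.keys())
        if old_flat.get(key) != new_flat.get(key)
    }


old = {"model_options": {"test_size": 0.2, "random_state": 3}}
new = {"model_options": {"test_size": 0.3, "random_state": 3}}
print(params_diff(old, new))  # {'model_options.test_size': (0.2, 0.3)}
```

Unchanged parameters such as `random_state` are left out, so the result lists only the values that differ between the two versions.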