feat: port docs to mdBook #1360

1 change: 1 addition & 0 deletions .gitignore
@@ -27,3 +27,4 @@ Cargo.lock
!/delta-inspect/Cargo.lock
!/proofs/Cargo.lock

docs/book
19 changes: 19 additions & 0 deletions docs/README.md
@@ -0,0 +1,19 @@
# Python deltalake book

This book is built with [mdBook](https://github.com/rust-lang/mdBook).

This book is hosted here: https://mrpowers.github.io/

## Deploy

Build the latest files with `mdbook build`.

TBD: Work out a production deploy process.

## Install

Download the `mdbook` binary.

Then open it manually once, so that your Mac grants permission to run it.

Add it to your `PATH` so you can easily run the commands, for example:
`mv ~/Downloads/mdbook /Users/matthew.powers/.local/bin`.
6 changes: 6 additions & 0 deletions docs/book.toml
@@ -0,0 +1,6 @@
[book]
authors = ["Delta Rust Team"]
language = "en"
multilingual = false
src = "src"
title = "Python deltalake book"
10 changes: 10 additions & 0 deletions docs/src/SUMMARY.md
@@ -0,0 +1,10 @@
# Summary

- [Index](./index.md)
- [Installation](./installation.md)
- [Usage](./usage/README.md)
- [Loading a Delta Table](./usage/loading-table.md)
- [Examining a Delta Table](./usage/examining-table.md)
- [Querying a Delta Table](./usage/querying-delta-tables.md)
- [Managing a Delta Table](./usage/managing-tables.md)
- [Writing Delta Tables](./usage/writing-delta-tables.md)
19 changes: 19 additions & 0 deletions docs/src/index.md
@@ -0,0 +1,19 @@
# Python deltalake package

This is the documentation for the native Python implementation of
deltalake. It is based on the delta-rs Rust library and requires no
Spark or JVM dependencies. For the PySpark implementation, see
[delta-spark](https://docs.delta.io/latest/api/python/index.html)
instead.

This module provides the capability to read, write, and manage [Delta
Lake](https://delta.io/) tables from Python without Spark or Java. It
uses [Apache Arrow](https://arrow.apache.org/) under the hood, so it is
compatible with other Arrow-native or integrated libraries such as
[Pandas](https://pandas.pydata.org/), [DuckDB](https://duckdb.org/), and
[Polars](https://www.pola.rs/).

Note: This module is under active development and some features are
experimental. It is not yet as feature-complete as the PySpark
implementation of Delta Lake. If you encounter a bug, please let us know
in our [GitHub repo](https://github.com/delta-io/delta-rs/issues).
11 changes: 11 additions & 0 deletions docs/src/installation.md
@@ -0,0 +1,11 @@
# Installation

## Using Pip

``` bash
pip install deltalake
```

NOTE: official binary wheels are statically linked against OpenSSL for
remote object store communication. Please file a GitHub issue to
request a critical OpenSSL upgrade.
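
To verify the installation, you can query the installed package version (a minimal sketch; the version string shown is only an example):

``` python
>>> from importlib.metadata import version
>>> version("deltalake")  # exact output depends on the installed release
'0.8.1'
```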
17 changes: 17 additions & 0 deletions docs/src/usage/README.md
@@ -0,0 +1,17 @@
# Usage

A `DeltaTable` represents the state of a
delta table at a particular version. This includes which files are
currently part of the table, the schema of the table, and other metadata
such as creation time.

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0")
>>> dt.version()
3
>>> dt.files()
['part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet',
'part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet',
'part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet']
```
115 changes: 115 additions & 0 deletions docs/src/usage/examining-table.md
@@ -0,0 +1,115 @@
# Examining a Table

## Metadata

The delta log maintains basic metadata about a table, including:

- A unique `id`
- A `name`, if provided
- A `description`, if provided
- The list of `partitionColumns`.
- The `created_time` of the table
- A map of table `configuration`. This includes fields such as
`delta.appendOnly`, which if `true` indicates the table is not meant
to have data deleted from it.

Get metadata from a table with the
`DeltaTable.metadata` method:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.metadata()
Metadata(id: 5fba94ed-9794-4965-ba6e-6ee3c0d22af9, name: None, description: None, partitionColumns: [], created_time: 1587968585495, configuration={})
```

## Schema

The schema for the table is also saved in the transaction log. It can
either be retrieved in the Delta Lake form as
`deltalake.schema.Schema` or as a
PyArrow schema. The first allows you to introspect any column-level
metadata stored in the schema, while the latter represents the schema
the table will be loaded into.

Use `DeltaTable.schema` to retrieve the delta lake schema:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.schema()
Schema([Field(id, PrimitiveType("long"), nullable=True)])
```

These schemas have a JSON representation that can be retrieved. To
reconstruct from json, use
`deltalake.schema.Schema.from_json()`.

``` python
>>> dt.schema().json()
'{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}}]}'
```
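
For example, the JSON representation can be round-tripped back into a schema object (a minimal sketch using the table above):

``` python
>>> from deltalake.schema import Schema
>>> Schema.from_json(dt.schema().json())
Schema([Field(id, PrimitiveType("long"), nullable=True)])
```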

Use `deltalake.schema.Schema.to_pyarrow()` to retrieve the PyArrow schema:

``` python
>>> dt.schema().to_pyarrow()
id: int64
```

## History

Depending on what system wrote the table, the delta table may have
provenance information describing what operations were performed on the
table, when, and by whom. This information is retained for 30 days by
default, unless otherwise specified by the table configuration
`delta.logRetentionDuration`.

> **Note:** This information is not written by all writers, and different
> writers may use different schemas to encode the actions. For Spark's
> format, see:
> <https://docs.delta.io/latest/delta-utility.html#history-schema>

To view the available history, use `DeltaTable.history`:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.history()
[{'timestamp': 1587968626537, 'operation': 'DELETE', 'operationParameters': {'predicate': '["((`id` % CAST(2 AS BIGINT)) = CAST(0 AS BIGINT))"]'}, 'readVersion': 3, 'isBlindAppend': False},
{'timestamp': 1587968614187, 'operation': 'UPDATE', 'operationParameters': {'predicate': '((id#697L % cast(2 as bigint)) = cast(0 as bigint))'}, 'readVersion': 2, 'isBlindAppend': False},
{'timestamp': 1587968604143, 'operation': 'WRITE', 'operationParameters': {'mode': 'Overwrite', 'partitionBy': '[]'}, 'readVersion': 1, 'isBlindAppend': False},
{'timestamp': 1587968596254, 'operation': 'MERGE', 'operationParameters': {'predicate': '(oldData.`id` = newData.`id`)'}, 'readVersion': 0, 'isBlindAppend': False},
{'timestamp': 1587968586154, 'operation': 'WRITE', 'operationParameters': {'mode': 'ErrorIfExists', 'partitionBy': '[]'}, 'isBlindAppend': True}]
```
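
If you only need the most recent operations, here is a sketch, assuming your installed version supports the `limit` parameter:

``` python
>>> dt.history(limit=1)  # only the latest entry
[{'timestamp': 1587968626537, 'operation': 'DELETE', 'operationParameters': {'predicate': '["((`id` % CAST(2 AS BIGINT)) = CAST(0 AS BIGINT))"]'}, 'readVersion': 3, 'isBlindAppend': False}]
```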

## Current Add Actions

The active state for a delta table is determined by the Add actions,
which provide the list of files that are part of the table and metadata
about them, such as creation time, size, and statistics. You can get a
data frame of the add actions data using `DeltaTable.get_add_actions`:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/delta-0.8.0")
>>> dt.get_add_actions(flatten=True).to_pandas()
path size_bytes modification_time data_change num_records null_count.value min.value max.value
0 part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a... 440 2021-03-06 15:16:07 True 2 0 0 2
1 part-00000-04ec9591-0b73-459e-8d18-ba5711d6cbe... 440 2021-03-06 15:16:16 True 2 0 2 4
```

This works even with past versions of the table:

``` python
>>> dt = DeltaTable("../rust/tests/data/delta-0.8.0", version=0)
>>> dt.get_add_actions(flatten=True).to_pandas()
path size_bytes modification_time data_change num_records null_count.value min.value max.value
0 part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a... 440 2021-03-06 15:16:07 True 2 0 0 2
1 part-00001-911a94a2-43f6-4acb-8620-5e68c265498... 445 2021-03-06 15:16:07 True 3 0 2 4
```
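
Since the add actions come back as Arrow data, they are easy to aggregate, for example to total the size of the files backing this version (a sketch based on the output above):

``` python
>>> df = dt.get_add_actions(flatten=True).to_pandas()
>>> df["size_bytes"].sum()  # 440 + 445 for the two files at version 0
885
```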
120 changes: 120 additions & 0 deletions docs/src/usage/loading-table.md
@@ -0,0 +1,120 @@
# Loading a Delta Table

To load the current version, use the constructor:

``` python
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0")
```

Depending on your storage backend, you could use the `storage_options`
parameter to provide some configuration. Configuration is defined for
specific backends - [s3
options](https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html#variants),
[azure
options](https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html#variants),
[gcs
options](https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html#variants).

``` python
>>> storage_options = {"AWS_ACCESS_KEY_ID": "THE_AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY":"THE_AWS_SECRET_ACCESS_KEY"}
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0", storage_options=storage_options)
```

The configuration can also be provided via environment variables, in
which case the basic service provider is derived from the scheme of the
URL (see the sketch after the URL formats below). We try to support
many of the well-known formats to identify basic service properties.

**S3**:

- `s3://<bucket>/<path>`
- `s3a://<bucket>/<path>`

**Azure**:

- `az://<container>/<path>`
- `adl://<container>/<path>`
- `abfs://<container>/<path>`

**GCS**:

- `gs://<bucket>/<path>`
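
As a sketch, with credentials supplied through the environment, a table can be opened directly from its URI (bucket and path are placeholders):

``` python
import os
from deltalake import DeltaTable

# Credentials are picked up from the environment; the s3:// scheme
# selects the S3 backend automatically.
os.environ["AWS_ACCESS_KEY_ID"] = "THE_AWS_ACCESS_KEY_ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "THE_AWS_SECRET_ACCESS_KEY"

dt = DeltaTable("s3://<bucket>/<path>")
```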

Alternatively, if you have a data catalog you can load it by reference
to a database and table name. Currently only AWS Glue is supported.

For AWS Glue catalog, use AWS environment variables to authenticate.

``` python
>>> from deltalake import DeltaTable
>>> from deltalake import DataCatalog
>>> database_name = "simple_database"
>>> table_name = "simple_table"
>>> data_catalog = DataCatalog.AWS
>>> dt = DeltaTable.from_data_catalog(data_catalog=data_catalog, database_name=database_name, table_name=table_name)
>>> dt.to_pyarrow_table().to_pydict()
{'id': [5, 7, 9, 5, 6, 7, 8, 9]}
```

## Custom Storage Backends

While delta always needs its internal storage backend, properly
configured, in order to manage the delta log, it may sometimes be
advantageous (and is common practice in the Arrow world) to customize
the storage interface used for reading the bulk data.

`deltalake` will work with any storage compliant with `pyarrow.fs.FileSystem`; however, the root of the filesystem has to be adjusted to point at the root of the Delta table. We can achieve this by wrapping the custom filesystem in a `pyarrow.fs.SubTreeFileSystem`.

``` python
import pyarrow.fs as fs
from deltalake import DeltaTable

path = "<path/to/table>"
filesystem = fs.SubTreeFileSystem(path, fs.LocalFileSystem())

dt = DeltaTable(path)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
```

When using the pyarrow factory method for file systems, the normalized
path is provided on creation. In case of S3 this would look something
like:

``` python
import pyarrow.fs as fs
from deltalake import DeltaTable

table_uri = "s3://<bucket>/<path>"
raw_fs, normalized_path = fs.FileSystem.from_uri(table_uri)
filesystem = fs.SubTreeFileSystem(normalized_path, raw_fs)

dt = DeltaTable(table_uri)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
```

## Time Travel

To load previous table states, you can provide the version number you
wish to load:

``` python
>>> dt = DeltaTable("../rust/tests/data/simple_table", version=2)
```

Once you've loaded a table, you can also change versions using either a
version number or datetime string:

``` python
>>> dt.load_version(1)
>>> dt.load_with_datetime("2021-11-04 00:05:23.283+00:00")
```
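
After switching, the table reflects the requested state (a small sketch building on the calls above):

``` python
>>> dt.load_version(1)
>>> dt.version()
1
```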

> **Warning:** Previous table versions may not exist if they have been
> vacuumed, in which case an exception will be thrown. See [Vacuuming
> tables](#vacuuming-tables) for more information.
29 changes: 29 additions & 0 deletions docs/src/usage/managing-tables.md
@@ -0,0 +1,29 @@
# Managing Delta Tables

## Vacuuming tables

Vacuuming a table deletes any files that have been marked for deletion.
This may invalidate some past versions of the table, so it can break
time travel; however, it saves storage space. Vacuum retains files
within a certain window, one week by default, so time travel still
works within that shorter range.

Delta tables usually don't delete old files automatically, so vacuuming
regularly is considered good practice, unless the table is only ever
appended to.

Use `DeltaTable.vacuum` to perform the vacuum operation. Note that to prevent accidental deletion, the function performs a dry-run by default: it will only list the files to be deleted. Pass `dry_run=False` to actually delete files.

``` python
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.vacuum()
['../rust/tests/data/simple_table/part-00006-46f2ff20-eb5d-4dda-8498-7bfb2940713b-c000.snappy.parquet',
'../rust/tests/data/simple_table/part-00190-8ac0ae67-fb1d-461d-a3d3-8dc112766ff5-c000.snappy.parquet',
'../rust/tests/data/simple_table/part-00164-bf40481c-4afd-4c02-befa-90f056c2d77a-c000.snappy.parquet',
...]
>>> dt.vacuum(dry_run=False) # Don't run this unless you are sure!
```
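
The retention window can also be set explicitly (a sketch, assuming the `retention_hours` parameter in your installed version; values below the table's configured retention may be rejected unless the retention-duration check is disabled):

``` python
>>> dt.vacuum(retention_hours=720)  # dry run: list files older than 30 days
```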

## Optimizing tables

Optimizing tables is not currently supported.