Port docs to mkdocs
MrPowers authored and rtyler committed Jul 26, 2023
1 parent f4a4341 commit 433fb6b
Showing 9 changed files with 481 additions and 0 deletions.
7 changes: 7 additions & 0 deletions docs/index.md
@@ -0,0 +1,7 @@
# Python deltalake package

This is the documentation for the native Python implementation of Delta Lake. It is based on the delta-rs Rust library and requires no Spark or JVM dependencies. For the PySpark implementation, see [delta-spark](https://docs.delta.io/latest/api/python/index.html) instead.

This module provides the capability to read, write, and manage [Delta Lake](https://delta.io/) tables from Python without Spark or Java. It uses [Apache Arrow](https://arrow.apache.org/) under the hood, so is compatible with other Arrow-native or integrated libraries such as [Pandas](https://pandas.pydata.org/), [DuckDB](https://duckdb.org/), and [Polars](https://www.pola.rs/).
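
For example, a table can be read into a Pandas DataFrame by going through PyArrow (a minimal sketch; the table path is a hypothetical local table and `pandas` must be installed):

``` python
from deltalake import DeltaTable

# Hypothetical path to a local Delta table
dt = DeltaTable("./data/my_table")

# Load the table as a PyArrow table, then convert to a Pandas DataFrame
df = dt.to_pyarrow_table().to_pandas()
print(df.head())
```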

Note: This module is under active development and some features are experimental. It is not yet as feature-complete as the PySpark implementation of Delta Lake. If you encounter a bug, please let us know in our [GitHub repo](https://github.com/delta-io/delta-rs/issues).
9 changes: 9 additions & 0 deletions docs/installation.md
@@ -0,0 +1,9 @@
# Installation

## Using Pip

``` bash
pip install deltalake
```
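
To verify the installation, import the package and print its version (a quick sanity check; the exact version string depends on the release you installed):

``` python
import deltalake

# Prints the installed deltalake version
print(deltalake.__version__)
```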

NOTE: official binary wheels are linked statically against OpenSSL for remote object store communication. Please file a GitHub issue to request a critical OpenSSL upgrade.
115 changes: 115 additions & 0 deletions docs/usage/examining-table.md
@@ -0,0 +1,115 @@
# Examining a Table

## Metadata

The delta log maintains basic metadata about a table, including:

- A unique `id`
- A `name`, if provided
- A `description`, if provided
- The list of `partitionColumns`
- The `created_time` of the table
- A map of table `configuration`. This includes fields such as
  `delta.appendOnly`, which, if `true`, indicates the table is not meant
  to have data deleted from it.

Get metadata from a table with the
`DeltaTable.metadata` method:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.metadata()
Metadata(id: 5fba94ed-9794-4965-ba6e-6ee3c0d22af9, name: None, description: None, partitionColumns: [], created_time: 1587968585495, configuration={})
```
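
The returned object also exposes these fields as attributes, so individual values can be read directly (a short sketch continuing the session above; the snake_case attribute names are assumed):

``` python
>>> metadata = dt.metadata()
>>> metadata.id
'5fba94ed-9794-4965-ba6e-6ee3c0d22af9'
>>> metadata.partition_columns
[]
>>> metadata.configuration
{}
```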

## Schema

The schema for the table is also saved in the transaction log. It can
be retrieved either in its Delta Lake form as a
`deltalake.schema.Schema` or as a
PyArrow schema. The former allows you to introspect any column-level
metadata stored in the schema, while the latter represents the schema
the table will be loaded into.

Use `DeltaTable.schema` to retrieve the Delta Lake schema:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.schema()
Schema([Field(id, PrimitiveType("long"), nullable=True)])
```

These schemas have a JSON representation that can be retrieved. To
reconstruct a schema from JSON, use
`deltalake.schema.Schema.from_json()`.

``` python
>>> dt.schema().json()
'{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}}]}'
```
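
A round trip through JSON might look like the following (a short sketch continuing the session above):

``` python
>>> from deltalake.schema import Schema
>>> json_repr = dt.schema().json()
>>> Schema.from_json(json_repr)
Schema([Field(id, PrimitiveType("long"), nullable=True)])
```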

Use `deltalake.schema.Schema.to_pyarrow()` to retrieve the PyArrow schema:

``` python
>>> dt.schema().to_pyarrow()
id: int64
```

## History

Depending on what system wrote the table, the delta table may have
provenance information describing what operations were performed on the
table, when, and by whom. This information is retained for 30 days by
default, unless otherwise specified by the table configuration
`delta.logRetentionDuration`.

Note: This information is not written by all writers, and different writers may
use different schemas to encode the actions. For Spark's format, see
<https://docs.delta.io/latest/delta-utility.html#history-schema>.

To view the available history, use `DeltaTable.history`:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.history()
[{'timestamp': 1587968626537, 'operation': 'DELETE', 'operationParameters': {'predicate': '["((`id` % CAST(2 AS BIGINT)) = CAST(0 AS BIGINT))"]'}, 'readVersion': 3, 'isBlindAppend': False},
{'timestamp': 1587968614187, 'operation': 'UPDATE', 'operationParameters': {'predicate': '((id#697L % cast(2 as bigint)) = cast(0 as bigint))'}, 'readVersion': 2, 'isBlindAppend': False},
{'timestamp': 1587968604143, 'operation': 'WRITE', 'operationParameters': {'mode': 'Overwrite', 'partitionBy': '[]'}, 'readVersion': 1, 'isBlindAppend': False},
{'timestamp': 1587968596254, 'operation': 'MERGE', 'operationParameters': {'predicate': '(oldData.`id` = newData.`id`)'}, 'readVersion': 0, 'isBlindAppend': False},
{'timestamp': 1587968586154, 'operation': 'WRITE', 'operationParameters': {'mode': 'ErrorIfExists', 'partitionBy': '[]'}, 'isBlindAppend': True}]
```

## Current Add Actions

The active state for a delta table is determined by the Add actions,
which provide the list of files that are part of the table and metadata
about them, such as creation time, size, and statistics. You can get a
data frame of the add actions data using `DeltaTable.get_add_actions`:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/delta-0.8.0")
>>> dt.get_add_actions(flatten=True).to_pandas()
path size_bytes modification_time data_change num_records null_count.value min.value max.value
0 part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a... 440 2021-03-06 15:16:07 True 2 0 0 2
1 part-00000-04ec9591-0b73-459e-8d18-ba5711d6cbe... 440 2021-03-06 15:16:16 True 2 0 2 4
```

This works even with past versions of the table:

``` python
>>> dt = DeltaTable("../rust/tests/data/delta-0.8.0", version=0)
>>> dt.get_add_actions(flatten=True).to_pandas()
path size_bytes modification_time data_change num_records null_count.value min.value max.value
0 part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a... 440 2021-03-06 15:16:07 True 2 0 0 2
1 part-00001-911a94a2-43f6-4acb-8620-5e68c265498... 445 2021-03-06 15:16:07 True 3 0 2 4
```
17 changes: 17 additions & 0 deletions docs/usage/index.md
@@ -0,0 +1,17 @@
# Usage

A `DeltaTable` represents the state of a
delta table at a particular version. This includes which files are
currently part of the table, the schema of the table, and other metadata
such as creation time.

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0")
>>> dt.version()
3
>>> dt.files()
['part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet',
'part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet',
'part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet']
```
120 changes: 120 additions & 0 deletions docs/usage/loading-table.md
@@ -0,0 +1,120 @@
# Loading a Delta Table

To load the current version, use the constructor:

``` python
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0")
```

Depending on your storage backend, you could use the `storage_options`
parameter to provide some configuration. Configuration is defined for
specific backends - [s3
options](https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html#variants),
[azure
options](https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html#variants),
[gcs
options](https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html#variants).

``` python
>>> storage_options = {"AWS_ACCESS_KEY_ID": "THE_AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY":"THE_AWS_SECRET_ACCESS_KEY"}
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0", storage_options=storage_options)
```

The configuration can also be provided via the environment, and the
basic service provider is derived from the URL being used. We try to
support many of the well-known formats to identify basic service
properties.

**S3**:

- s3://\<bucket\>/\<path\>
- s3a://\<bucket\>/\<path\>

**Azure**:

- az://\<container\>/\<path\>
- adl://\<container\>/\<path\>
- abfs://\<container\>/\<path\>

**GCS**:

- gs://\<bucket\>/\<path\>
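
As a sketch of the environment-based configuration, the same AWS credentials shown above could be exported before constructing the table from an S3 URL (the bucket and path are placeholders):

``` python
import os

from deltalake import DeltaTable

# Credentials are picked up from the environment instead of storage_options
os.environ["AWS_ACCESS_KEY_ID"] = "THE_AWS_ACCESS_KEY_ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "THE_AWS_SECRET_ACCESS_KEY"

dt = DeltaTable("s3://<bucket>/<path>")
```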

Alternatively, if you have a data catalog, you can load a table by reference
to a database and table name. Currently only AWS Glue is supported.

For the AWS Glue catalog, use AWS environment variables to authenticate.

``` python
>>> from deltalake import DeltaTable
>>> from deltalake import DataCatalog
>>> database_name = "simple_database"
>>> table_name = "simple_table"
>>> data_catalog = DataCatalog.AWS
>>> dt = DeltaTable.from_data_catalog(data_catalog=data_catalog, database_name=database_name, table_name=table_name)
>>> dt.to_pyarrow_table().to_pydict()
{'id': [5, 7, 9, 5, 6, 7, 8, 9]}
```

## Custom Storage Backends

While `deltalake` always needs its internal, properly configured storage
backend in order to manage the delta log, it may sometimes be
advantageous - and is common practice in the Arrow world - to
customize the storage interface used for reading the bulk data.

`deltalake` will work with any storage compliant with `pyarrow.fs.FileSystem`, however the root of the filesystem has to be adjusted to point at the root of the Delta table. We can achieve this by wrapping the custom filesystem into a `pyarrow.fs.SubTreeFileSystem`.

``` python
import pyarrow.fs as fs
from deltalake import DeltaTable

path = "<path/to/table>"
filesystem = fs.SubTreeFileSystem(path, fs.LocalFileSystem())

dt = DeltaTable(path)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
```

When using the pyarrow factory method for file systems, the normalized
path is provided on creation. In the case of S3, this would look something
like:

``` python
import pyarrow.fs as fs
from deltalake import DeltaTable

table_uri = "s3://<bucket>/<path>"
raw_fs, normalized_path = fs.FileSystem.from_uri(table_uri)
filesystem = fs.SubTreeFileSystem(normalized_path, raw_fs)

dt = DeltaTable(table_uri)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
```

## Time Travel

To load previous table states, you can provide the version number you
wish to load:

``` python
>>> dt = DeltaTable("../rust/tests/data/simple_table", version=2)
```

Once you've loaded a table, you can also change versions using either a
version number or datetime string:

``` python
>>> dt.load_version(1)
>>> dt.load_with_datetime("2021-11-04 00:05:23.283+00:00")
```

Warning: Previous table versions may not exist if they have been vacuumed, in
which case an exception will be thrown. See [Vacuuming
tables](#vacuuming-tables) for more information.
29 changes: 29 additions & 0 deletions docs/usage/managing-tables.md
@@ -0,0 +1,29 @@
# Managing Delta Tables

## Vacuuming tables

Vacuuming a table will delete any files that have been marked for
deletion. This may make some past versions of the table invalid, so it
can break time travel to those versions. However, it will save storage
space. Vacuum retains files within a retention window, one week by
default, so time travel still works for versions inside that window.

Delta tables usually don't delete old files automatically, so vacuuming
regularly is considered good practice, unless the table is only appended
to.

Use `DeltaTable.vacuum` to perform the vacuum operation. Note that to prevent accidental deletion, the function performs a dry-run by default: it will only list the files to be deleted. Pass `dry_run=False` to actually delete files.

``` python
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.vacuum()
['../rust/tests/data/simple_table/part-00006-46f2ff20-eb5d-4dda-8498-7bfb2940713b-c000.snappy.parquet',
'../rust/tests/data/simple_table/part-00190-8ac0ae67-fb1d-461d-a3d3-8dc112766ff5-c000.snappy.parquet',
'../rust/tests/data/simple_table/part-00164-bf40481c-4afd-4c02-befa-90f056c2d77a-c000.snappy.parquet',
...]
>>> dt.vacuum(dry_run=False) # Don't run this unless you are sure!
```
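
To use a longer retention window than the default, a `retention_hours` argument can be passed (a sketch continuing the session above; this is still a dry run unless `dry_run=False` is given):

``` python
>>> dt.vacuum(retention_hours=24 * 30)  # list files that would be removed, keeping 30 days of history
```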

## Optimizing tables

Optimizing tables is not currently supported.