# Python deltalake package

This is the documentation for the native Python implementation of Delta Lake. It is based on the delta-rs Rust library and requires no Spark or JVM dependencies. For the PySpark implementation, see [delta-spark](https://docs.delta.io/latest/api/python/index.html) instead.

This module provides the capability to read, write, and manage [Delta Lake](https://delta.io/) tables from Python without Spark or Java. It uses [Apache Arrow](https://arrow.apache.org/) under the hood, so it is compatible with other Arrow-native or integrated libraries such as [Pandas](https://pandas.pydata.org/), [DuckDB](https://duckdb.org/), and [Polars](https://www.pola.rs/).

Note: This module is under active development and some features are experimental. It is not yet as feature-complete as the PySpark implementation of Delta Lake. If you encounter a bug, please let us know in our [GitHub repo](https://github.com/delta-io/delta-rs/issues).
# Installation

## Using Pip

``` bash
pip install deltalake
```

Note: the official binary wheels are statically linked against OpenSSL for remote object store communication. Please file a GitHub issue to request a critical OpenSSL upgrade.
# Examining a Table

## Metadata

The delta log maintains basic metadata about a table, including:

- A unique `id`
- A `name`, if provided
- A `description`, if provided
- The list of `partitionColumns`
- The `created_time` of the table
- A map of table `configuration`. This includes fields such as
  `delta.appendOnly`, which if `true` indicates the table is not meant
  to have data deleted from it.

Get metadata from a table with the `DeltaTable.metadata` method:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.metadata()
Metadata(id: 5fba94ed-9794-4965-ba6e-6ee3c0d22af9, name: None, description: None, partitionColumns: [], created_time: 1587968585495, configuration={})
```
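The `created_time` field shown above is a Unix epoch timestamp in milliseconds. A small sketch (plain Python, no deltalake required) converting it to a readable datetime:

``` python
from datetime import datetime, timezone

# created_time from the Metadata output above, in epoch milliseconds
created_time_ms = 1587968585495

# Convert milliseconds to seconds, then to an aware UTC datetime
created = datetime.fromtimestamp(created_time_ms / 1000, tz=timezone.utc)
print(created.isoformat())  # 2020-04-27T06:23:05.495000+00:00
```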
## Schema

The schema for the table is also saved in the transaction log. It can
either be retrieved in the Delta Lake form as `deltalake.schema.Schema`
or as a PyArrow schema. The former allows you to introspect any
column-level metadata stored in the schema, while the latter represents
the schema the table will be loaded into.

Use `DeltaTable.schema` to retrieve the delta lake schema:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.schema()
Schema([Field(id, PrimitiveType("long"), nullable=True)])
```

These schemas have a JSON representation that can be retrieved. To
reconstruct from JSON, use `deltalake.schema.Schema.from_json()`.

``` python
>>> dt.schema().json()
'{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}}]}'
```
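Since the JSON form is plain text, it can also be inspected with the standard library. A sketch (using the schema string shown above) pulling out the column names, types, and nullability:

``` python
import json

# The JSON schema string returned by dt.schema().json() above
schema_json = '{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}}]}'

schema = json.loads(schema_json)
# Collect (name, type, nullable) for each field in the struct
columns = [(f["name"], f["type"], f["nullable"]) for f in schema["fields"]]
print(columns)  # [('id', 'long', True)]
```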
Use `deltalake.schema.Schema.to_pyarrow()` to retrieve the PyArrow schema:

``` python
>>> dt.schema().to_pyarrow()
id: int64
```

## History

Depending on what system wrote the table, the delta table may have
provenance information describing what operations were performed on the
table, when, and by whom. This information is retained for 30 days by
default, unless otherwise specified by the table configuration
`delta.logRetentionDuration`.

Note: This information is not written by all writers, and different
writers may use different schemas to encode the actions. For Spark's
format, see:
<https://docs.delta.io/latest/delta-utility.html#history-schema>

To view the available history, use `DeltaTable.history`:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.history()
[{'timestamp': 1587968626537, 'operation': 'DELETE', 'operationParameters': {'predicate': '["((`id` % CAST(2 AS BIGINT)) = CAST(0 AS BIGINT))"]'}, 'readVersion': 3, 'isBlindAppend': False},
 {'timestamp': 1587968614187, 'operation': 'UPDATE', 'operationParameters': {'predicate': '((id#697L % cast(2 as bigint)) = cast(0 as bigint))'}, 'readVersion': 2, 'isBlindAppend': False},
 {'timestamp': 1587968604143, 'operation': 'WRITE', 'operationParameters': {'mode': 'Overwrite', 'partitionBy': '[]'}, 'readVersion': 1, 'isBlindAppend': False},
 {'timestamp': 1587968596254, 'operation': 'MERGE', 'operationParameters': {'predicate': '(oldData.`id` = newData.`id`)'}, 'readVersion': 0, 'isBlindAppend': False},
 {'timestamp': 1587968586154, 'operation': 'WRITE', 'operationParameters': {'mode': 'ErrorIfExists', 'partitionBy': '[]'}, 'isBlindAppend': True}]
```
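Each history entry is a plain dict, so the returned list can be filtered with ordinary Python. A sketch (using a trimmed-down subset of the entries shown above) that keeps only write operations:

``` python
# A trimmed-down subset of the history entries shown above
history = [
    {"timestamp": 1587968626537, "operation": "DELETE", "readVersion": 3},
    {"timestamp": 1587968604143, "operation": "WRITE", "readVersion": 1},
    {"timestamp": 1587968586154, "operation": "WRITE", "isBlindAppend": True},
]

# Keep only the entries whose operation is WRITE
writes = [entry for entry in history if entry["operation"] == "WRITE"]
print(len(writes))  # 2
```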
## Current Add Actions

The active state for a delta table is determined by the Add actions,
which provide the list of files that are part of the table and metadata
about them, such as creation time, size, and statistics. You can get a
data frame of the add actions data using `DeltaTable.get_add_actions`:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/delta-0.8.0")
>>> dt.get_add_actions(flatten=True).to_pandas()
                                                path  size_bytes   modification_time  data_change  num_records  null_count.value  min.value  max.value
0  part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a...         440 2021-03-06 15:16:07         True            2                 0          0          2
1  part-00000-04ec9591-0b73-459e-8d18-ba5711d6cbe...         440 2021-03-06 15:16:16         True            2                 0          2          4
```

This works even with past versions of the table:

``` python
>>> dt = DeltaTable("../rust/tests/data/delta-0.8.0", version=0)
>>> dt.get_add_actions(flatten=True).to_pandas()
                                                path  size_bytes   modification_time  data_change  num_records  null_count.value  min.value  max.value
0  part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a...         440 2021-03-06 15:16:07         True            2                 0          0          2
1  part-00001-911a94a2-43f6-4acb-8620-5e68c265498...         445 2021-03-06 15:16:07         True            3                 0          2          4
```
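Because the add actions describe every active file, simple aggregates over them yield table-level figures. A sketch (the rows from the version-0 output above, restated as plain dicts with shortened paths, no deltalake required) totalling size and row count:

``` python
# The flattened add actions for version 0, as plain dicts (paths shortened)
add_actions = [
    {"path": "part-00000-...", "size_bytes": 440, "num_records": 2},
    {"path": "part-00001-...", "size_bytes": 445, "num_records": 3},
]

# Total on-disk size and total row count across all active files
total_bytes = sum(a["size_bytes"] for a in add_actions)
total_rows = sum(a["num_records"] for a in add_actions)
print(total_bytes, total_rows)  # 885 5
```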
# Usage

A `DeltaTable` represents the state of a delta table at a particular
version. This includes which files are currently part of the table, the
schema of the table, and other metadata such as creation time.

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0")
>>> dt.version()
3
>>> dt.files()
['part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet',
 'part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet',
 'part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet']
```
# Loading a Delta Table

To load the current version, use the constructor:

``` python
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0")
```

Depending on your storage backend, you could use the `storage_options`
parameter to provide some configuration. Configuration is defined for
specific backends: [S3
options](https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html#variants),
[Azure
options](https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html#variants),
and [GCS
options](https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html#variants).

``` python
>>> storage_options = {"AWS_ACCESS_KEY_ID": "THE_AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY": "THE_AWS_SECRET_ACCESS_KEY"}
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0", storage_options=storage_options)
```

The configuration can also be provided via the environment, and the
basic service provider is derived from the URL being used. We try to
support many of the well-known formats to identify basic service
properties.
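One way to combine the two approaches is to assemble the `storage_options` dict from the environment yourself. A sketch of that idea (the key names follow the S3 option keys linked above; the helper function itself is hypothetical, not part of the deltalake API):

``` python
import os

# Hypothetical helper: collect whichever of the given keys are set in the
# environment into a storage_options-style dict.
def storage_options_from_env(keys=("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")):
    return {key: os.environ[key] for key in keys if key in os.environ}

# Example: with AWS_REGION exported, the dict picks it up automatically
os.environ.setdefault("AWS_REGION", "us-east-1")
print(storage_options_from_env())
```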
**S3**:

- `s3://<bucket>/<path>`
- `s3a://<bucket>/<path>`

**Azure**:

- `az://<container>/<path>`
- `adl://<container>/<path>`
- `abfs://<container>/<path>`

**GCS**:

- `gs://<bucket>/<path>`

Alternatively, if you have a data catalog you can load a table by
reference to a database and table name. Currently only AWS Glue is
supported.

For the AWS Glue catalog, use AWS environment variables to authenticate.

``` python
>>> from deltalake import DeltaTable
>>> from deltalake import DataCatalog
>>> database_name = "simple_database"
>>> table_name = "simple_table"
>>> data_catalog = DataCatalog.AWS
>>> dt = DeltaTable.from_data_catalog(data_catalog=data_catalog, database_name=database_name, table_name=table_name)
>>> dt.to_pyarrow_table().to_pydict()
{'id': [5, 7, 9, 5, 6, 7, 8, 9]}
```
## Custom Storage Backends

While delta always needs its internal storage backend, properly
configured, to manage the delta log, it may sometimes be advantageous
(and is common practice in the Arrow world) to customize the storage
interface used for reading the bulk data.

`deltalake` will work with any storage compliant with
`pyarrow.fs.FileSystem`; however, the root of the filesystem has to be
adjusted to point at the root of the Delta table. We can achieve this by
wrapping the custom filesystem in a `pyarrow.fs.SubTreeFileSystem`.

``` python
import pyarrow.fs as fs
from deltalake import DeltaTable

path = "<path/to/table>"
filesystem = fs.SubTreeFileSystem(path, fs.LocalFileSystem())

dt = DeltaTable(path)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
```

When using the pyarrow factory method for file systems, the normalized
path is provided on creation. In the case of S3 this would look
something like:

``` python
import pyarrow.fs as fs
from deltalake import DeltaTable

table_uri = "s3://<bucket>/<path>"
raw_fs, normalized_path = fs.FileSystem.from_uri(table_uri)
filesystem = fs.SubTreeFileSystem(normalized_path, raw_fs)

dt = DeltaTable(table_uri)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
```
## Time Travel

To load previous table states, you can provide the version number you
wish to load:

``` python
>>> dt = DeltaTable("../rust/tests/data/simple_table", version=2)
```

Once you've loaded a table, you can also change versions using either a
version number or datetime string:

``` python
>>> dt.load_version(1)
>>> dt.load_with_datetime("2021-11-04 00:05:23.283+00:00")
```

Warning: Previous table versions may not exist if they have been
vacuumed, in which case an exception will be thrown. See [Vacuuming
tables](#vacuuming-tables) for more information.
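Conceptually, loading by datetime resolves to the latest version whose commit timestamp is at or before the requested time. A pure-Python sketch of that rule (the version timestamps below are made up for illustration; the real resolution happens inside `load_with_datetime`):

``` python
from datetime import datetime, timezone

# Hypothetical commit timestamps, one per table version
commits = {
    0: datetime(2021, 11, 1, tzinfo=timezone.utc),
    1: datetime(2021, 11, 3, tzinfo=timezone.utc),
    2: datetime(2021, 11, 5, tzinfo=timezone.utc),
}

def version_at(target):
    """Latest version committed at or before `target`, or None."""
    eligible = [v for v, ts in commits.items() if ts <= target]
    return max(eligible) if eligible else None

# Nov 4 falls between the commits of versions 1 and 2, so version 1 wins
print(version_at(datetime(2021, 11, 4, tzinfo=timezone.utc)))  # 1
```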
# Managing Delta Tables

## Vacuuming tables

Vacuuming a table will delete any files that have been marked for
deletion. This may make some past versions of a table invalid, so it
can break time travel. However, it will save storage space. Vacuum will
retain files within a certain window, by default one week, so time
travel will still work over shorter ranges.

Delta tables usually don't delete old files automatically, so vacuuming
regularly is considered good practice, unless the table is only ever
appended to.

Use `DeltaTable.vacuum` to perform the vacuum operation. Note that to
prevent accidental deletion, the function performs a dry run by
default: it only lists the files to be deleted. Pass `dry_run=False` to
actually delete files.

``` python
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.vacuum()
['../rust/tests/data/simple_table/part-00006-46f2ff20-eb5d-4dda-8498-7bfb2940713b-c000.snappy.parquet',
 '../rust/tests/data/simple_table/part-00190-8ac0ae67-fb1d-461d-a3d3-8dc112766ff5-c000.snappy.parquet',
 '../rust/tests/data/simple_table/part-00164-bf40481c-4afd-4c02-befa-90f056c2d77a-c000.snappy.parquet',
 ...]
>>> dt.vacuum(dry_run=False)  # Don't run this unless you are sure!
```
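The retention window behaves like a cutoff on file age: only files marked for deletion longer ago than the window are eligible for removal. A pure-Python sketch of that rule (the file names and timestamps are made up; the real logic lives inside `DeltaTable.vacuum`):

``` python
from datetime import datetime, timedelta, timezone

now = datetime(2021, 11, 8, tzinfo=timezone.utc)
retention = timedelta(days=7)  # default window: one week

# Hypothetical tombstoned files and when they were marked for deletion
tombstones = {
    "part-old.snappy.parquet": datetime(2021, 10, 1, tzinfo=timezone.utc),
    "part-recent.snappy.parquet": datetime(2021, 11, 6, tzinfo=timezone.utc),
}

# Only files deleted before the cutoff fall outside the retention window
cutoff = now - retention
eligible = [path for path, deleted_at in tombstones.items() if deleted_at < cutoff]
print(eligible)  # ['part-old.snappy.parquet']
```

The recently deleted file stays untouched, which is why time travel over short ranges keeps working after a vacuum.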
## Optimizing tables

Optimizing tables is not currently supported.