Port docs to mkdocs
MrPowers authored and rtyler committed Jul 26, 2023
1 parent f4a4341 commit 433fb6b
Showing 9 changed files with 481 additions and 0 deletions.
7 changes: 7 additions & 0 deletions docs/index.md
@@ -0,0 +1,7 @@
# Python deltalake package

This is the documentation for the native Python implementation of Delta Lake. It is based on the delta-rs Rust library and requires no Spark or JVM dependencies. For the PySpark implementation, see [delta-spark](https://docs.delta.io/latest/api/python/index.html) instead.

This module provides the capability to read, write, and manage [Delta Lake](https://delta.io/) tables from Python without Spark or Java. It uses [Apache Arrow](https://arrow.apache.org/) under the hood, so is compatible with other Arrow-native or integrated libraries such as [Pandas](https://pandas.pydata.org/), [DuckDB](https://duckdb.org/), and [Polars](https://www.pola.rs/).
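
For example, a table can be read into a Pandas DataFrame by going through PyArrow (a minimal sketch; the table path is a hypothetical local table and `pandas` must be installed):

``` python
from deltalake import DeltaTable

# Hypothetical path to a local Delta table
dt = DeltaTable("./data/my_table")

# Load the table as a PyArrow table, then convert to a Pandas DataFrame
df = dt.to_pyarrow_table().to_pandas()
print(df.head())
```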

Note: This module is under active development and some features are experimental. It is not yet as feature-complete as the PySpark implementation of Delta Lake. If you encounter a bug, please let us know in our [GitHub repo](https://github.com/delta-io/delta-rs/issues).
9 changes: 9 additions & 0 deletions docs/installation.md
@@ -0,0 +1,9 @@
# Installation

## Using Pip

``` bash
pip install deltalake
```
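
To verify the installation, import the package and print its version (a quick sanity check; the exact version string depends on the release you installed):

``` python
import deltalake

# Prints the installed deltalake version
print(deltalake.__version__)
```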

NOTE: official binary wheels are linked statically against OpenSSL for remote object store communication. Please file a GitHub issue to request a critical OpenSSL upgrade.
115 changes: 115 additions & 0 deletions docs/usage/examining-table.md
@@ -0,0 +1,115 @@
# Examining a Table

## Metadata

The delta log maintains basic metadata about a table, including:

- A unique `id`
- A `name`, if provided
- A `description`, if provided
- The list of `partitionColumns`
- The `created_time` of the table
- A map of table `configuration`. This includes fields such as
  `delta.appendOnly`, which, if `true`, indicates the table is not meant
  to have data deleted from it.

Get metadata from a table with the
`DeltaTable.metadata` method:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.metadata()
Metadata(id: 5fba94ed-9794-4965-ba6e-6ee3c0d22af9, name: None, description: None, partitionColumns: [], created_time: 1587968585495, configuration={})
```
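
The returned object also exposes these fields as attributes, so individual values can be read directly (a short sketch continuing the session above; the snake_case attribute names are assumed):

``` python
>>> metadata = dt.metadata()
>>> metadata.id
'5fba94ed-9794-4965-ba6e-6ee3c0d22af9'
>>> metadata.partition_columns
[]
>>> metadata.configuration
{}
```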

## Schema

The schema for the table is also saved in the transaction log. It can
be retrieved either in its Delta Lake form as a
`deltalake.schema.Schema` or as a
PyArrow schema. The former allows you to introspect any column-level
metadata stored in the schema, while the latter represents the schema
the table will be loaded into.

Use `DeltaTable.schema` to retrieve the Delta Lake schema:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.schema()
Schema([Field(id, PrimitiveType("long"), nullable=True)])
```

These schemas have a JSON representation that can be retrieved. To
reconstruct a schema from JSON, use
`deltalake.schema.Schema.from_json()`.

``` python
>>> dt.schema().json()
'{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}}]}'
```
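
A round trip through JSON might look like the following (a short sketch continuing the session above):

``` python
>>> from deltalake.schema import Schema
>>> json_repr = dt.schema().json()
>>> Schema.from_json(json_repr)
Schema([Field(id, PrimitiveType("long"), nullable=True)])
```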

Use `deltalake.schema.Schema.to_pyarrow()` to retrieve the PyArrow schema:

``` python
>>> dt.schema().to_pyarrow()
id: int64
```

## History

Depending on what system wrote the table, the delta table may have
provenance information describing what operations were performed on the
table, when, and by whom. This information is retained for 30 days by
default, unless otherwise specified by the table configuration
`delta.logRetentionDuration`.

Note: This information is not written by all writers, and different writers may
use different schemas to encode the actions. For Spark's format, see
<https://docs.delta.io/latest/delta-utility.html#history-schema>.

To view the available history, use `DeltaTable.history`:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.history()
[{'timestamp': 1587968626537, 'operation': 'DELETE', 'operationParameters': {'predicate': '["((`id` % CAST(2 AS BIGINT)) = CAST(0 AS BIGINT))"]'}, 'readVersion': 3, 'isBlindAppend': False},
{'timestamp': 1587968614187, 'operation': 'UPDATE', 'operationParameters': {'predicate': '((id#697L % cast(2 as bigint)) = cast(0 as bigint))'}, 'readVersion': 2, 'isBlindAppend': False},
{'timestamp': 1587968604143, 'operation': 'WRITE', 'operationParameters': {'mode': 'Overwrite', 'partitionBy': '[]'}, 'readVersion': 1, 'isBlindAppend': False},
{'timestamp': 1587968596254, 'operation': 'MERGE', 'operationParameters': {'predicate': '(oldData.`id` = newData.`id`)'}, 'readVersion': 0, 'isBlindAppend': False},
{'timestamp': 1587968586154, 'operation': 'WRITE', 'operationParameters': {'mode': 'ErrorIfExists', 'partitionBy': '[]'}, 'isBlindAppend': True}]
```

## Current Add Actions

The active state for a delta table is determined by the Add actions,
which provide the list of files that are part of the table and metadata
about them, such as creation time, size, and statistics. You can get a
data frame of the add actions data using `DeltaTable.get_add_actions`:

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/delta-0.8.0")
>>> dt.get_add_actions(flatten=True).to_pandas()
path size_bytes modification_time data_change num_records null_count.value min.value max.value
0 part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a... 440 2021-03-06 15:16:07 True 2 0 0 2
1 part-00000-04ec9591-0b73-459e-8d18-ba5711d6cbe... 440 2021-03-06 15:16:16 True 2 0 2 4
```

This works even with past versions of the table:

``` python
>>> dt = DeltaTable("../rust/tests/data/delta-0.8.0", version=0)
>>> dt.get_add_actions(flatten=True).to_pandas()
path size_bytes modification_time data_change num_records null_count.value min.value max.value
0 part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a... 440 2021-03-06 15:16:07 True 2 0 0 2
1 part-00001-911a94a2-43f6-4acb-8620-5e68c265498... 445 2021-03-06 15:16:07 True 3 0 2 4
```
17 changes: 17 additions & 0 deletions docs/usage/index.md
@@ -0,0 +1,17 @@
# Usage

A `DeltaTable` represents the state of a
delta table at a particular version. This includes which files are
currently part of the table, the schema of the table, and other metadata
such as creation time.

``` python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0")
>>> dt.version()
3
>>> dt.files()
['part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet',
'part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet',
'part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet']
```
120 changes: 120 additions & 0 deletions docs/usage/loading-table.md
@@ -0,0 +1,120 @@
# Loading a Delta Table

To load the current version, use the constructor:

``` python
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0")
```

Depending on your storage backend, you could use the `storage_options`
parameter to provide some configuration. Configuration is defined for
specific backends - [s3
options](https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html#variants),
[azure
options](https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html#variants),
[gcs
options](https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html#variants).

``` python
>>> storage_options = {"AWS_ACCESS_KEY_ID": "THE_AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY":"THE_AWS_SECRET_ACCESS_KEY"}
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0", storage_options=storage_options)
```

The configuration can also be provided via the environment, and the
basic service provider is derived from the URL being used. We try to
support many of the well-known formats to identify basic service
properties.

**S3**:

- s3://\<bucket\>/\<path\>
- s3a://\<bucket\>/\<path\>

**Azure**:

- az://\<container\>/\<path\>
- adl://\<container\>/\<path\>
- abfs://\<container\>/\<path\>

**GCS**:

- gs://\<bucket\>/\<path\>
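
As a sketch of the environment-based configuration, the same AWS credentials shown above could be exported before constructing the table from an S3 URL (the bucket and path are placeholders):

``` python
import os

from deltalake import DeltaTable

# Credentials are picked up from the environment instead of storage_options
os.environ["AWS_ACCESS_KEY_ID"] = "THE_AWS_ACCESS_KEY_ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "THE_AWS_SECRET_ACCESS_KEY"

dt = DeltaTable("s3://<bucket>/<path>")
```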

Alternatively, if you have a data catalog, you can load a table by reference
to a database and table name. Currently only AWS Glue is supported.

For the AWS Glue catalog, use AWS environment variables to authenticate.

``` python
>>> from deltalake import DeltaTable
>>> from deltalake import DataCatalog
>>> database_name = "simple_database"
>>> table_name = "simple_table"
>>> data_catalog = DataCatalog.AWS
>>> dt = DeltaTable.from_data_catalog(data_catalog=data_catalog, database_name=database_name, table_name=table_name)
>>> dt.to_pyarrow_table().to_pydict()
{'id': [5, 7, 9, 5, 6, 7, 8, 9]}
```

## Custom Storage Backends

While `deltalake` always needs its internal, properly configured storage
backend in order to manage the delta log, it may sometimes be
advantageous - and is common practice in the Arrow world - to
customize the storage interface used for reading the bulk data.

`deltalake` will work with any storage compliant with `pyarrow.fs.FileSystem`, however the root of the filesystem has to be adjusted to point at the root of the Delta table. We can achieve this by wrapping the custom filesystem into a `pyarrow.fs.SubTreeFileSystem`.

``` python
import pyarrow.fs as fs
from deltalake import DeltaTable

path = "<path/to/table>"
filesystem = fs.SubTreeFileSystem(path, fs.LocalFileSystem())

dt = DeltaTable(path)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
```

When using the pyarrow factory method for file systems, the normalized
path is provided on creation. In the case of S3, this would look something
like:

``` python
import pyarrow.fs as fs
from deltalake import DeltaTable

table_uri = "s3://<bucket>/<path>"
raw_fs, normalized_path = fs.FileSystem.from_uri(table_uri)
filesystem = fs.SubTreeFileSystem(normalized_path, raw_fs)

dt = DeltaTable(table_uri)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
```

## Time Travel

To load previous table states, you can provide the version number you
wish to load:

``` python
>>> dt = DeltaTable("../rust/tests/data/simple_table", version=2)
```

Once you've loaded a table, you can also change versions using either a
version number or datetime string:

``` python
>>> dt.load_version(1)
>>> dt.load_with_datetime("2021-11-04 00:05:23.283+00:00")
```

Warning: Previous table versions may not exist if they have been vacuumed, in
which case an exception will be thrown. See [Vacuuming
tables](#vacuuming-tables) for more information.
29 changes: 29 additions & 0 deletions docs/usage/managing-tables.md
@@ -0,0 +1,29 @@
# Managing Delta Tables

## Vacuuming tables

Vacuuming a table will delete any files that have been marked for
deletion. This may make some past versions of the table invalid, so it
can break time travel to those versions. However, it will save storage
space. Vacuum retains files within a retention window, one week by
default, so time travel still works for versions inside that window.

Delta tables usually don't delete old files automatically, so vacuuming
regularly is considered good practice, unless the table is only appended
to.

Use `DeltaTable.vacuum` to perform the vacuum operation. Note that to prevent accidental deletion, the function performs a dry-run by default: it will only list the files to be deleted. Pass `dry_run=False` to actually delete files.

``` python
>>> dt = DeltaTable("../rust/tests/data/simple_table")
>>> dt.vacuum()
['../rust/tests/data/simple_table/part-00006-46f2ff20-eb5d-4dda-8498-7bfb2940713b-c000.snappy.parquet',
'../rust/tests/data/simple_table/part-00190-8ac0ae67-fb1d-461d-a3d3-8dc112766ff5-c000.snappy.parquet',
'../rust/tests/data/simple_table/part-00164-bf40481c-4afd-4c02-befa-90f056c2d77a-c000.snappy.parquet',
...]
>>> dt.vacuum(dry_run=False) # Don't run this unless you are sure!
```
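
To use a longer retention window than the default, a `retention_hours` argument can be passed (a sketch continuing the session above; this is still a dry run unless `dry_run=False` is given):

``` python
>>> dt.vacuum(retention_hours=24 * 30)  # list files that would be removed, keeping 30 days of history
```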

## Optimizing tables

Optimizing tables is not currently supported.