python/README.md  +10 −10
@@ -2,20 +2,20 @@
This project provides extensions to the [Apache Spark project](https://spark.apache.org/) in Scala and Python:

-**[Diff](https://github.com/G-Research/spark-extension/blob/v2.11.0/DIFF.md):** A `diff` transformation and application for `Dataset`s that computes the differences between
+**[Diff](https://github.com/G-Research/spark-extension/blob/v2.12.0/DIFF.md):** A `diff` transformation and application for `Dataset`s that computes the differences between
two datasets, i.e. which rows to _add_, _delete_ or _change_ to get from one dataset to the other.

-**[Histogram](https://github.com/G-Research/spark-extension/blob/v2.11.0/HISTOGRAM.md):** A `histogram` transformation that computes the histogram DataFrame for a value column.
+**[Histogram](https://github.com/G-Research/spark-extension/blob/v2.12.0/HISTOGRAM.md):** A `histogram` transformation that computes the histogram DataFrame for a value column.

-**[Global Row Number](https://github.com/G-Research/spark-extension/blob/v2.11.0/ROW_NUMBER.md):** A `withRowNumbers` transformation that provides the global row number w.r.t.
+**[Global Row Number](https://github.com/G-Research/spark-extension/blob/v2.12.0/ROW_NUMBER.md):** A `withRowNumbers` transformation that provides the global row number w.r.t.
the current order of the Dataset, or any given order. In contrast to the existing SQL function `row_number`, which
requires a window spec, this transformation provides the row number across the entire Dataset without scaling problems.

-**[Inspect Parquet files](https://github.com/G-Research/spark-extension/blob/v2.11.0/PARQUET.md):** The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected similar to [parquet-tools](https://pypi.org/project/parquet-tools/)
+**[Inspect Parquet files](https://github.com/G-Research/spark-extension/blob/v2.12.0/PARQUET.md):** The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected similar to [parquet-tools](https://pypi.org/project/parquet-tools/)
or [parquet-cli](https://pypi.org/project/parquet-cli/) by reading from a simple Spark data source.
This simplifies identifying why some Parquet files cannot be split by Spark into scalable partitions.

-**[Install Python packages into PySpark job](https://github.com/G-Research/spark-extension/blob/v2.11.0/PYSPARK-DEPS.md):** Install Python dependencies via PIP or Poetry programatically into your running PySpark job (PySpark ≥ 3.1.0):
+**[Install Python packages into PySpark job](https://github.com/G-Research/spark-extension/blob/v2.12.0/PYSPARK-DEPS.md):** Install Python dependencies via PIP or Poetry programatically into your running PySpark job (PySpark ≥ 3.1.0):

```python
# noinspection PyUnresolvedReferences
@@ -94,7 +94,7 @@ Running your Python application on a Spark cluster will still require one of the
to add the Scala package to the Spark environment.

```shell script
-pip install pyspark-extension==2.11.0.3.4
+pip install pyspark-extension==2.12.0.3.4
```
Note: Pick the right Spark version (here 3.4) depending on your PySpark version.
@@ -108,7 +108,7 @@ from pyspark.sql import SparkSession
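For context on the `diff` transformation whose docs link is bumped to v2.12.0 in the first hunk, here is a minimal PySpark sketch along the lines of the linked DIFF.md; the `gresearch.spark.diff` import, the `diff()` method and the example columns are assumptions taken from that documentation, not part of this change:

```python
# Minimal sketch: diff two DataFrames on their id column (API as described in DIFF.md; assumed).
from pyspark.sql import SparkSession
# noinspection PyUnresolvedReferences
from gresearch.spark.diff import *  # assumed to add a diff() method to DataFrame

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])

# One row per id, with a "diff" column marking N (no change), C (changed), D (deleted), I (inserted)
left.diff(right, "id").show()
```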
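Likewise for the `histogram` transformation from the same hunk; the argument order (thresholds, then value column, then aggregate columns) is an assumption based on HISTOGRAM.md:

```python
# Minimal sketch: bucket scores into threshold-based bins per user
# (argument order thresholds, value column, aggregate columns is assumed from HISTOGRAM.md).
from pyspark.sql import SparkSession
# noinspection PyUnresolvedReferences
import gresearch.spark  # assumed to add a histogram() method to DataFrame

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", 120), ("a", 310), ("b", 95)], ["user", "score"])

# One row per user, one bucket column per threshold boundary
df.histogram([100, 200, 300], "score", "user").show()
```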
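For the `withRowNumbers` transformation, a sketch assuming the Python API exposes it as `with_row_numbers`, following the linked ROW_NUMBER.md:

```python
# Minimal sketch: add a global row number without a window spec
# (the snake_case name with_row_numbers is assumed from ROW_NUMBER.md).
from pyspark.sql import SparkSession
# noinspection PyUnresolvedReferences
import gresearch.spark  # assumed to add a with_row_numbers() method to DataFrame

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(v,) for v in ["a", "b", "c", "d"]], ["value"])

# Adds a consecutive row_number column across the entire DataFrame
df.with_row_numbers().show()
```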
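For the Parquet inspection feature, a sketch assuming a `parquet_metadata` reader method as described in PARQUET.md; the path is a placeholder:

```python
# Minimal sketch: read Parquet metadata (not the data) as a DataFrame
# (the parquet_metadata reader method is assumed from PARQUET.md; the path is a placeholder).
from pyspark.sql import SparkSession
# noinspection PyUnresolvedReferences
import gresearch.spark.parquet  # assumed to add parquet_metadata() and friends to DataFrameReader

spark = SparkSession.builder.getOrCreate()

# One row per Parquet file: schema, row count, block and compression details
spark.read.parquet_metadata("/path/to/parquet").show()
```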
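The truncated `python` block at the end of the first hunk belongs to the PYSPARK-DEPS feature; a hedged sketch of such a call, assuming an `install_pip_package` method on the session and a purely illustrative package name:

```python
# Minimal sketch: install a package into the running PySpark job (PySpark >= 3.1.0)
# (install_pip_package is assumed from PYSPARK-DEPS.md; the package name is illustrative only).
from pyspark.sql import SparkSession
# noinspection PyUnresolvedReferences
import gresearch.spark  # assumed to add install_pip_package() to SparkSession

spark = SparkSession.builder.getOrCreate()

# Installs the package on the driver and executors of this job
spark.install_pip_package("emoji")
```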
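Regarding the note in the second hunk about picking the right Spark version: the package version encodes both the spark-extension release (2.12.0) and the Spark minor version (here 3.4). A small sketch for checking the installed PySpark version before choosing the suffix; the version pattern for other Spark releases is an assumption based on that note:

```python
# Minimal sketch: derive the matching pyspark-extension version suffix from the installed PySpark
# (the 2.12.0.<spark-major>.<spark-minor> pattern is assumed from the pip line above).
import pyspark

major, minor = pyspark.__version__.split(".")[:2]
print(f"pip install pyspark-extension==2.12.0.{major}.{minor}")
```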