askap-vast · ddobie · Jan 29, 2025 · Jan 22, 2025
diff --git a/docs/adminusage/cli.md b/docs/adminusage/cli.md
@@ -17,7 +17,7 @@ Output:
 
 [vast_pipeline]
   clearpiperun
-  createmeasarrow
+  createmeasparquet
   debugrun
   ingestimages
   initingest
@@ -52,7 +52,7 @@ positional arguments:
 optional arguments:
   -h, --help            show this help message and exit
   --keep-parquet        Flag to keep the pipeline run(s) parquet files. Will
-                        also apply to arrow files if present.
+                        also apply to parquet files if present.
   --remove-all          Flag to remove all the content of the pipeline run(s)
                         folder.
   --version             show program's version number and exit
@@ -83,30 +83,26 @@ Example usage:
 !!!tip
     Further information on clearing a specific run, or resetting the database, can be found in the [Contributing and Developing](../developing/localdevenv.md#removingclearing-data) section.
 
-### createmeasarrow
+### createmeasparquet
 
-This command allows for the creation of the `measurements.arrow` and `measurement_pairs.arrow` files after a run has been successfully completed. See [Arrow Files](../outputs/outputs.md#arrow-files) for more information.
-
-!!!info
-    The `measurement_pairs.arrow` file will only be created if the run was configured to calculate pair metrics.
+This command creates the `measurements.parquet` file after a run has successfully completed. See [Parquet Files](../outputs/outputs.md#parquet-files) for more information.
 
 ```terminal
-./manage.py createmeasarrow --help
-usage: manage.py createmeasarrow [-h] [--overwrite] [--version] [-v {0,1,2,3}]
+./manage.py createmeasparquet --help
+usage: manage.py createmeasparquet [-h] [--overwrite] [--version] [-v {0,1,2,3}]
                                  [--settings SETTINGS]
                                  [--pythonpath PYTHONPATH] [--traceback]
                                  [--no-color] [--force-color] [--skip-checks]
                                  piperun
 
-Create `measurements.arrow` and `measurement_pairs.arrow` files for a
-completed pipeline run.
+Create `measurements.parquet` files for a completed pipeline run.
 
 positional arguments:
   piperun               Path or name of the pipeline run.
 
 optional arguments:
   -h, --help            show this help message and exit
-  --overwrite           Overwrite previous 'measurements.arrow' file.
+  --overwrite           Overwrite previous 'measurements.parquet' file.
   --version             show program's version number and exit
   -v {0,1,2,3}, --verbosity {0,1,2,3}
                         Verbosity level; 0=minimal output, 1=normal output,
@@ -126,11 +122,9 @@ optional arguments:
 Example usage:
 
 ```terminal
-./manage.py createmeasarrow docs_example_run
-2021-03-30 10:48:40,952 createmeasarrow INFO Creating measurements arrow file for 'docs_example_run'.
-2021-03-30 10:48:40,952 utils INFO Creating measurements.arrow for run docs_example_run.
-2021-03-30 10:48:41,829 createmeasarrow INFO Creating measurement pairs arrow file for 'docs_example_run'.
-2021-03-30 10:48:41,829 utils INFO Creating measurement_pairs.arrow for run docs_example_run.
+./manage.py createmeasparquet docs_example_run
+2021-03-30 10:48:40,952 createmeasparquet INFO Creating measurements parquet file for 'docs_example_run'.
+2021-03-30 10:48:40,952 utils INFO Creating measurements.parquet for run docs_example_run.
 ```
 
 ### debugrun
@@ -482,7 +476,7 @@ measurements:
   flux_fractional_error: 0.0
   condon_errors: yes
   selavy_local_rms_fill_value: 0.2
-  write_arrow_files: no
+  write_measurements_parquet: no
   ra_uncertainty: 1.0
   dec_uncertainty: 1.0
 variability:
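
The hunk above renames the run configuration option `measurements.write_arrow_files` to `measurements.write_measurements_parquet`. Existing run configs that still carry the old key would need updating. The following is a minimal sketch of one way to do that, assuming the config has already been loaded into a plain Python dict and that the flag keeps its on/off meaning. `migrate_measurements_option` is a hypothetical helper for illustration and is not part of `vast_pipeline`.

```python
from copy import deepcopy

OLD_KEY = "write_arrow_files"           # option name removed in this diff
NEW_KEY = "write_measurements_parquet"  # option name added in this diff


def migrate_measurements_option(config: dict) -> dict:
    """Return a copy of `config` with the old measurements option renamed.

    Hypothetical helper (an assumption for illustration), not pipeline code.
    """
    migrated = deepcopy(config)  # leave the caller's dict untouched
    measurements = migrated.get("measurements") or {}
    if OLD_KEY in measurements:
        old_value = measurements.pop(OLD_KEY)
        # If the new key was already set explicitly, keep that value.
        measurements.setdefault(NEW_KEY, old_value)
    return migrated


# Example input mirroring part of the `measurements` block shown in the docs.
old_config = {
    "measurements": {
        "condon_errors": True,
        "write_arrow_files": False,
        "ra_uncertainty": 1.0,
    }
}
new_config = migrate_measurements_option(old_config)
print(new_config["measurements"])
```

Working on a deep copy keeps the original config available for comparison, and `setdefault` avoids overwriting a value that was already set under the new name.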

diff --git a/docs/exploringwebsite/runpages.md b/docs/exploringwebsite/runpages.md
@@ -22,7 +22,7 @@ Explanation of the table options can be found in the [DataTables section](datata
 
 ## Pipeline Run Detail Page
 
-This page presents all the information about the pipeline run, including options to edit the configuration file and to schedule the run for processing, restore the run, delete the run and generate the arrow measurement files.
+This page presents all the information about the pipeline run, including options to edit the configuration file and to schedule the run for processing, restore the run, delete the run and generate the measurement parquet file.
 
 ![!Pipeline Run detail page.](../img/run-detail1.png){: loading=lazy }
 
@@ -32,9 +32,9 @@ This page presents all the information about the pipeline run, including options
 
 For admins and creators of runs there are four action buttons available:
 
-* **Generate Arrow Files**  
-     A process to generate the arrow measurement files.
-     See [Generating Arrow Files](../../using/genarrow).
+* **Generate Measurements Parquet File**  
+     A process to generate the measurements parquet file.
+     See [Generating Measurements Parquet File](../../using/genparquet).
 * **Delete Run**  
      Delete the pipeline run.
      See [Deleting a Run](../../using/deleterun).
@@ -118,11 +118,11 @@ The log file of the restore run action.
 
 ![!Restore log file.](../img/run-detail8.png){: loading=lazy }
 
-#### Generate Arrow Files Log File
+#### Generate Measurements Parquet Log File
 
-The log file of the generate arrow files action.
+The log file of the generate measurements parquet file action.
 
-![!Generate arrow files log file.](../img/run-detail9.png){: loading=lazy }
+![!Generate measurements parquet log file.](../img/generate-measurements-parquet-log.png){: loading=lazy }
 
 
 ### Image and Measurements Tables

diff --git a/docs/img/action-buttons.png b/docs/img/action-buttons.png
diff --git a/docs/img/arrow-files-available.png b/docs/img/arrow-files-available.png
diff --git a/docs/img/docs-example-run-detail.png b/docs/img/docs-example-run-detail.png
diff --git a/docs/img/generate-arrow-button.png b/docs/img/generate-arrow-button.png
diff --git a/docs/img/generate-arrow-files-log.png b/docs/img/generate-arrow-files-log.png
diff --git a/docs/img/generate-arrow-modal.png b/docs/img/generate-arrow-modal.png
diff --git a/docs/img/generate-arrow-notification.png b/docs/img/generate-arrow-notification.png
diff --git a/docs/img/generate-measurements-parquet-log.png b/docs/img/generate-measurements-parquet-log.png
diff --git a/docs/img/generate-parquet-button.png b/docs/img/generate-parquet-button.png
diff --git a/docs/img/generate-parquet-modal.png b/docs/img/generate-parquet-modal.png
diff --git a/docs/img/generate-parquet-notification.png b/docs/img/generate-parquet-notification.png
diff --git a/docs/img/measurements-parquet-available.png b/docs/img/measurements-parquet-available.png
diff --git a/docs/img/run-detail1.png b/docs/img/run-detail1.png
diff --git a/docs/img/run-detail2.png b/docs/img/run-detail2.png
diff --git a/docs/img/run-detail3.png b/docs/img/run-detail3.png
diff --git a/docs/img/run-detail6.png b/docs/img/run-detail6.png
diff --git a/docs/img/run-detail7.png b/docs/img/run-detail7.png
diff --git a/docs/img/run-detail9.png b/docs/img/run-detail9.png
diff --git a/docs/outputs/outputs.md b/docs/outputs/outputs.md
@@ -13,7 +13,7 @@ A sub-directory will exist for each pipeline run that contains the output produc
 
 The pipeline uses the [Apache Parquet](https://parquet.apache.org){:target="_blank"} file format to write results to disk. Details on how to read these files can be found below in [Reading the Outputs](#reading-the-outputs).
 
-Below is the output structure for a pipeline run named `new-test-data` when the pipeline run option `measurements.write_arrow_files` has been set to `True` and the working directory is named `pipeline-runs` (see [File Details](#file-details) for descriptions):
+Below is the output structure for a pipeline run named `new-test-data` when the pipeline run option `measurements.write_measurements_parquet` has been set to `True` and the working directory is named `pipeline-runs` (see [File Details](#file-details) for descriptions):
 
 ```bash
 pipeline-runs
@@ -38,29 +38,14 @@ pipeline-runs
 │   ├── forced_measurements_VAST_2118-06A_EPOCH12_I_cutout_fits.parquet
 │   ├── images.parquet
 │   ├── YYYY-MM-DD-HH-MM-SS_log.txt
-│   ├── measurements.arrow
-│   ├── measurement_pairs.arrow
+│   ├── measurements.parquet
+│   ├── measurement_pairs.parquet
 │   ├── measurement_pairs.parquet
 │   ├── relations.parquet
 │   ├── skyregions.parquet
 │   └── sources.parquet
 ```
 
-### Arrow Files
-
-Large pipeline runs (hundreds of images) mean that to read the measurements, hundreds of parquet files need to be read in, and can contain millions of rows.
-This can be slow using libraries such as pandas, and also consumes a lot of system memory.
-A solution to this is to save all the measurements associated with the pipeline run into one single file in the [Apache Arrow](https://arrow.apache.org/overview/){:target="_blank"} format.
-
-The library `vaex` is able to open `.arrow` files in an out-of-core context so the memory footprint is hugely reduced along with the reading of the file being very fast.
-The two-epoch measurement pairs are also saved to arrow format due to the same reasons. See [Reading with vaex](usingoutputs.md#reading-with-vaex) for further details on using `vaex`.
-
-!!! note
-    At the time of development `vaex` could not open parquets in an out-of-core context. This will be reviewed in the future if such functionality is added to `vaex`.
-
-To enable the arrow files to be produced, the option `measurements.write_arrow_files` is required to be set to `True` in the pipeline run config.
-Alternatively, the arrow files can be generated after the completion of the run, see the [Generating Arrow Files page](../../using/genarrow) for full details.
-
 ### Image Data
 
 The data for the images [ingested](../design/imageingest.md) into the pipeline is also stored in the pipeline working directory under the subdirectory `images`:
@@ -111,8 +96,8 @@ Here, for each image, the selavy measurements that have been ingested are stored
 | `forced_measurements*.parquet` | Multiple files that contain the forced measurements extracted from the respective image denoted in the filename. |
 | `images.parquet` | Contains the information of the images processed in the pipeline run. |
 | `YYYY-MM-DD-HH-MM-SS_log.txt` | The log file of the pipeline run. It is timestamped with the date and time of the run start. |
-| `measurements.arrow` | An [Apache Arrow](https://arrow.apache.org/overview/){:target="_blank"} format file containing all the measurements associated with the pipeline run (see [Arrow Files](#arrow-files)).|
-| `measurement_pairs.arrow` | An [Apache Arrow](https://arrow.apache.org/overview/){:target="_blank"} format file containing all the measurement pair metrics (see [Arrow Files](#arrow-files)). |
+| `measurements.parquet` | An [Apache Parquet](https://parquet.apache.org/){:target="_blank"} format file containing all the measurements associated with the pipeline run (see [Parquet Files](#parquet-files)).|
+| `measurement_pairs.parquet` | An [Apache Parquet](https://parquet.apache.org/){:target="_blank"} format file containing all the measurement pair metrics (see [Parquet Files](#parquet-files)). |
 | `measurement_pairs.parquet` | Contains all the measurement pairs metrics. |
 | `relations.parquet` | Contains the relation information between sources. |
 | `skyregions.parquet` | Contains the sky region information of the pipeline run. |

diff --git a/docs/outputs/usingoutputs.md b/docs/outputs/usingoutputs.md
@@ -2,11 +2,7 @@
 
 This page gives details on how to open and use the pipeline output files.
 
-It is recommended to use `pandas` or `vaex` to read the pipeline results from the parquet files. See the sections below for more information on using each library.
-
-!!! note
-    It is also possible to use [`Dask`](https://docs.dask.org/en/latest/){:target="_blank"} to read the parquets in an out-of-core context but the general performance can sometimes be poor with many parquet files. 
-    `vaex` is the preferred out-of-core method.
+It is recommended to use `pandas` or `dask` to read the pipeline results from the parquet files. Even for runs with thousands of images, `pandas` is sufficient for reading most of the VAST pipeline outputs. However, loading a large set of measurements (e.g. `measurements.parquet` or a large subset of the individual parquet files) usually requires `dask` because the data may not fit in memory. See the sections below for more information on using each library.
 
 !!! tip
     Be sure to look at [`vast-tools`](#vast-tools), a ready-made library for exploring pipeline results!
@@ -55,23 +51,16 @@ measurements = pd.concat(data, ignore_index=True)
     sources = pd.read_parquet('pipeline-runs/new-test-data/sources.parquet', columns=['id', 'n_meas'])
     ```
 
-### Reading with vaex
-
-[vaex documentation](https://vaex.io/docs/index.html){:target="_blank"}.
-
-!!! warning
-    vaex is a young project so bugs may be expected along with frequent updates. It has currently been tested with version `3.0.0`. 
-    Version `4.0.0` promises opening parquet files in an out-of-core context.
+### Reading with dask
 
-!!! warning
-    Some pipeline `parquet` format files do not open with vaex 3.0.0. `arrow` format files should open successfully.
+[dask documentation](https://docs.dask.org/en/stable/dataframe.html){:target="_blank"}.
 
-A parquet, or arrow file, can be opened using the `open()` method:
+A parquet file can be opened using the `read_parquet()` function:
 
 ```python
-import vaex
+import dask.dataframe as dd
 
-measurements = vaex.open('pipeline-runs/new-test-data/measurements.arrow')
+measurements = dd.read_parquet('pipeline-runs/new-test-data/measurements.parquet')
 
 measurements.head()
   #    source  island_id         component_id            local_rms       ra       ra_err       dec      dec_err    flux_peak    flux_peak_err    flux_int    flux_int_err    bmaj     err_bmaj    bmin     err_bmin      pa      err_pa    psf_bmaj    psf_bmin    psf_pa  flag_c4      chi_squared_fit    spectral_index  spectral_index_from_TT    has_siblings      image_id  time                           name                                    snr    compactness    ew_sys_err    ns_sys_err    error_radius    uncertainty_ew    uncertainty_ns    weight_ew    weight_ns  forced      flux_int_isl_ratio    flux_peak_isl_ratio    id
@@ -87,31 +76,21 @@ measurements.head()
   9       730  SB00013_island_1  SB00013_component_1a     0.437279  321.901  3.14407e-06  -4.20052  2.36161e-06      294.141         0.451346     340.92         0.864347   18.38  7.55055e-06   12.12  5.36009e-06  106.18  0.00262701        6.01        4        51.55  False                2368.93               -99  True                      True                    12  2020-01-12 05:36:03.834000000  VAST_2118-06A_SB00013_component_1a  672.663       1.15903    0.000277778   0.000277778     4.00455e-06       0.000277807       0.000277807  1.29573e+07  1.29573e+07  False                 0.640807               0.72508   1740
 ```
 
-Multiple parquet files can be opened at once using the `open_many()` method:
-
-```python
-import glob
-import vaex
-
-files = glob.glob("pipeline-runs/images/*/measurements.parquet")
-measurements = vaex.open_many(files)
-```
-
 !!! tip
-    You can convert a vaex dataframe to pandas by using the `to_pandas_df()` method:
+    You can convert a dask dataframe to pandas by using the `compute()` method:
     ```python
-    import vaex
+    import dask.dataframe as dd
 
-    sources = vaex.open('pipeline-runs/new-test-data/sources.parquet')
-    sources = sources.to_pandas_df()
+    sources = dd.read_parquet('pipeline-runs/new-test-data/sources.parquet')
+    sources = sources.compute()
     ```
 
 ### Linking the Results
 
 The table below shows what parameters act as keys to link data from the different results tables.
 
 !!! tip
-    If loading the measurements via the `.arrow` file, then the measurements already have the `source` column in-place.
+    If loading the measurements via the `measurements.parquet` file, the measurements already have the `source` column in place.
 
 !!! tip
     The `images.parquet` file contains the column `measurements_path` which can be used to get the filepaths for all the selavy `parquet` files.
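
The changes above mean a run's measurements can live either in a single `measurements.parquet` in the run directory (when the option is enabled or the file is generated afterwards) or in the per-image `measurements.parquet` files under `images/`. The sketch below shows one way to locate whichever is available, assuming the directory layout shown in the docs. `find_measurement_files` is a hypothetical helper, and the temporary directory only stands in for a real working directory.

```python
import tempfile
from pathlib import Path


def find_measurement_files(working_dir, run_name):
    """List the parquet files holding measurements (hypothetical helper)."""
    working_dir = Path(working_dir)
    combined = working_dir / run_name / "measurements.parquet"
    if combined.is_file():
        # Single combined file; per the docs it already has the `source` column.
        return [combined]
    # Fallback: per-image files. Note this matches every image in the working
    # directory; `measurements_path` in `images.parquet` gives the run's own set.
    return sorted(working_dir.glob("images/*/measurements.parquet"))


# Demonstration on an empty stand-in layout (files contain no data).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "pipeline-runs"
    (root / "new-test-data").mkdir(parents=True)
    for image in ("image_a", "image_b"):
        (root / "images" / image).mkdir(parents=True)
        (root / "images" / image / "measurements.parquet").touch()

    per_image = find_measurement_files(root, "new-test-data")

    # Simulate the combined file having been generated for the run.
    (root / "new-test-data" / "measurements.parquet").touch()
    combined = find_measurement_files(root, "new-test-data")

print(len(per_image), len(combined))
```

The returned list can then be handed to `dd.read_parquet`, which accepts either a single path or a list of paths.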

diff --git a/docs/using/genarrow.md b/docs/using/genarrow.md