Update the measurements creation to use parquet/dask rather than arrow/vaex files #800

Merged
merged 66 commits into from
Jan 29, 2025
3957434
Switched to output parquet file by default
ddobie Jan 22, 2025
5dee5f9
Remove timezone stuff
ddobie Jan 22, 2025
f47ba4a
Dependencies
ddobie Jan 22, 2025
f425524
Updated commands file
ddobie Jan 22, 2025
a49ae2b
Updated pipeline.utils.py
ddobie Jan 22, 2025
5697ad1
Updated forms.py
ddobie Jan 22, 2025
c2be6eb
Updated views.py
ddobie Jan 22, 2025
fd24ade
Updated config_template.yaml.j2
ddobie Jan 22, 2025
d344040
Updated templates/run_detail.html
ddobie Jan 22, 2025
613221f
More updates
ddobie Jan 22, 2025
a26c043
Fix run_detail template
ddobie Jan 22, 2025
6254837
Correctly handle overwrite check - check exists rather than isfile
ddobie Jan 23, 2025
6016e42
Check it for meas too
ddobie Jan 23, 2025
25f7e12
Correctly handle directory deletion
ddobie Jan 23, 2025
afaac16
Committed missing file?
ddobie Jan 23, 2025
9f233d8
Resolve merge conflicts
ddobie Jan 23, 2025
16faba0
Resolve merge conflicts
ddobie Jan 23, 2025
7c88fc3
Updated genparquet.md - still need to update a lot of the associated …
ddobie Jan 23, 2025
1c3e1aa
Done docs/using and docs/adminusage
ddobie Jan 23, 2025
891f9cb
Updated outputs docs
ddobie Jan 23, 2025
9c8a4c4
updated outputs docs
ddobie Jan 23, 2025
14bc5c3
Removed final vaex references
ddobie Jan 23, 2025
3e3746a
Fixed typo
ddobie Jan 23, 2025
664ef65
Merge branch 'v2.0' into v2-measurements-creation
ddobie Jan 24, 2025
c85fe0a
Update dependencies to remove vaex
ddobie Jan 24, 2025
1b970c3
Add initial batch of updated screenshots
ddobie Jan 24, 2025
4000761
Renamed docs/imgs arrow->parquet
ddobie Jan 24, 2025
b7a0ec3
Updated run_detail page
ddobie Jan 24, 2025
6eda109
Add files via upload
ddobie Jan 24, 2025
95c0772
Replaced parquet-modal
ddobie Jan 24, 2025
069f615
Updated Apache Parquet link
ddobie Jan 24, 2025
54fada1
Add files via upload
ddobie Jan 24, 2025
91649e4
Temp fix
ddobie Jan 24, 2025
0774cee
Added notes
ddobie Jan 24, 2025
05b55b2
Merge branch 'v2-measurements-creation' of github.com:askap-vast/vast…
ddobie Jan 24, 2025
c7168d6
Scrap pairs parquet generation
ddobie Jan 24, 2025
1d3c73d
Updated docs to reflect generating a single parquet file vs measureme…
ddobie Jan 28, 2025
3d08182
Updated webpage templates
ddobie Jan 28, 2025
4a9c756
First pass update of webform options
ddobie Jan 28, 2025
212772f
Updated run_detail.html
ddobie Jan 28, 2025
f83c23f
Fixed typo in logging
ddobie Jan 28, 2025
e78aec8
Fixed measurements parquet existence check
ddobie Jan 28, 2025
1fe50f2
Fixed parquet removal and pipeline config variable name
ddobie Jan 28, 2025
106ed89
write_parquet_files -> write_measurements_parquet in docs
ddobie Jan 28, 2025
d1692fa
Fix naming
ddobie Jan 28, 2025
af3a483
Update screenshots
ddobie Jan 28, 2025
9dc40b7
Merge branch 'v2-measurements-creation' of github.com:askap-vast/vast…
ddobie Jan 28, 2025
893ac3c
Update screenshot names
ddobie Jan 28, 2025
274ad12
Reorganisation
ddobie Jan 28, 2025
281cb07
Maybe commit uncommitted changes?
ddobie Jan 28, 2025
535382f
Remove unused import
ddobie Jan 28, 2025
368c049
Missed commit?
ddobie Jan 28, 2025
3637d53
Update docs/using/genparquet.md
ddobie Jan 29, 2025
51f645b
Update docs/using/runconfig.md
ddobie Jan 29, 2025
fcd6a6b
Added delete_file_or_dir function to utils
ddobie Jan 29, 2025
094f208
Implemented delete_file_or_dir
ddobie Jan 29, 2025
94ccf5d
Added missing import
ddobie Jan 29, 2025
e7c52ae
Remove arrow backup
ddobie Jan 29, 2025
5a239df
Implement copy_file_or_dir
ddobie Jan 29, 2025
c54ca92
Fix backup_parquets
ddobie Jan 29, 2025
4066850
Update variable names in backup_parquets
ddobie Jan 29, 2025
48f20dd
Added delete_file_or_dir import to pipeline.utils.py
ddobie Jan 29, 2025
6dbc6e0
Added missing shutils import
ddobie Jan 29, 2025
7667227
Fixed final os.remove
ddobie Jan 29, 2025
6984081
Fix deprecated dask config
ddobie Jan 29, 2025
cb74bf1
Remove unused shutil import - stupid linter
ddobie Jan 29, 2025
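
Several of the commits above (checking `exists` rather than `isfile` before overwriting, and the `delete_file_or_dir` helper) are consistent with the measurements parquet being written by dask as a directory of part files rather than as a single file. The following is a minimal standard-library sketch of what such helpers could look like. Only the name `delete_file_or_dir` comes from the commit messages; the signature, the return value, and the `can_write` helper are assumptions made for illustration and are not the code merged in this PR.

```python
import os
import shutil


def delete_file_or_dir(path: str) -> bool:
    """Remove `path` whether it is a single file or a directory tree.

    Hypothetical sketch: the name is taken from the commit message,
    the signature and return value are assumptions.
    Returns True if something was removed, False if nothing existed.
    """
    # A real directory (not a symlink to one) is removed recursively.
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)
        return True
    # Regular files and symlinks (including broken ones) are unlinked.
    if os.path.lexists(path):
        os.remove(path)
        return True
    return False


def can_write(path: str, overwrite: bool = False) -> bool:
    """Return True if it is safe to write an output at `path`.

    Hypothetical helper. It uses os.path.exists rather than
    os.path.isfile, because isfile is False for a directory-style
    parquet dataset and an existing output would go unnoticed.
    """
    return overwrite or not os.path.exists(path)
```

With helpers like these, an overwrite could be handled as `if can_write(path, overwrite): delete_file_or_dir(path)` before writing the new dataset.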
30 changes: 12 additions & 18 deletions docs/adminusage/cli.md
@@ -17,7 +17,7 @@ Output:

[vast_pipeline]
clearpiperun
createmeasarrow
createmeasparquet
debugrun
ingestimages
initingest
@@ -52,7 +52,7 @@ positional arguments:
optional arguments:
-h, --help show this help message and exit
--keep-parquet Flag to keep the pipeline run(s) parquet files. Will
also apply to arrow files if present.
also apply to parquet files if present.
--remove-all Flag to remove all the content of the pipeline run(s)
folder.
--version show program's version number and exit
@@ -83,30 +83,26 @@ Example usage:
!!!tip
Further information on clearing a specific run, or resetting the database, can be found in the [Contributing and Developing](../developing/localdevenv.md#removingclearing-data) section.

### createmeasarrow
### createmeasparquet

This command allows for the creation of the `measurements.arrow` and `measurement_pairs.arrow` files after a run has been successfully completed. See [Arrow Files](../outputs/outputs.md#arrow-files) for more information.

!!!info
The `measurement_pairs.arrow` file will only be created if the run was configured to calculate pair metrics.
This command allows for the creation of the `measurements.parquet` file after a run has been successfully completed. See [Parquet Files](../outputs/outputs.md#parquet-files) for more information.

```terminal
./manage.py createmeasarrow --help
usage: manage.py createmeasarrow [-h] [--overwrite] [--version] [-v {0,1,2,3}]
./manage.py createmeasparquet --help
usage: manage.py createmeasparquet [-h] [--overwrite] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color] [--force-color] [--skip-checks]
piperun

Create `measurements.arrow` and `measurement_pairs.arrow` files for a
completed pipeline run.
Create `measurements.parquet` files for a completed pipeline run.

positional arguments:
piperun Path or name of the pipeline run.

optional arguments:
-h, --help show this help message and exit
--overwrite Overwrite previous 'measurements.arrow' file.
--overwrite Overwrite previous 'measurements.parquet' file.
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
@@ -126,11 +122,9 @@ optional arguments:
Example usage:

```terminal
./manage.py createmeasarrow docs_example_run
2021-03-30 10:48:40,952 createmeasarrow INFO Creating measurements arrow file for 'docs_example_run'.
2021-03-30 10:48:40,952 utils INFO Creating measurements.arrow for run docs_example_run.
2021-03-30 10:48:41,829 createmeasarrow INFO Creating measurement pairs arrow file for 'docs_example_run'.
2021-03-30 10:48:41,829 utils INFO Creating measurement_pairs.arrow for run docs_example_run.
./manage.py createmeasparquet docs_example_run
2021-03-30 10:48:40,952 createmeasparquet INFO Creating measurements parquet file for 'docs_example_run'.
2021-03-30 10:48:40,952 utils INFO Creating measurements.parquet for run docs_example_run.
```

### debugrun
@@ -482,7 +476,7 @@ measurements:
flux_fractional_error: 0.0
condon_errors: yes
selavy_local_rms_fill_value: 0.2
write_arrow_files: no
write_measurements_parquet: no
ra_uncertainty: 1.0
dec_uncertainty: 1.0
variability:
14 changes: 7 additions & 7 deletions docs/exploringwebsite/runpages.md
@@ -22,7 +22,7 @@ Explanation of the table options can be found in the [DataTables section](datata

## Pipeline Run Detail Page

This page presents all the information about the pipeline run, including options to edit the configuration file and to schedule the run for processing, restore the run, delete the run and generate the arrow measurement files.
This page presents all the information about the pipeline run, including options to edit the configuration file and to schedule the run for processing, restore the run, delete the run and generate the measurement parquet file.

![!Pipeline Run detail page.](../img/run-detail1.png){: loading=lazy }

@@ -32,9 +32,9 @@ This page presents all the information about the pipeline run, including options

For admins and creators of runs there are four action buttons available:

* **Generate Arrow Files**
A process to generate the arrow measurement files.
See [Generating Arrow Files](../../using/genarrow).
* **Generate Measurements Parquet File**
A process to generate the parquet measurement files.
See [Generating Measurements Parquet File](../../using/genparquet).
* **Delete Run**
Delete the pipeline run.
See [Deleting a Run](../../using/deleterun).
@@ -118,11 +118,11 @@ The log file of the restore run action.

![!Restore log file.](../img/run-detail8.png){: loading=lazy }

#### Generate Arrow Files Log File
#### Generate Measurements Parquet Log File

The log file of the generate arrow files action.
The log file of the generate measurements parquet file action.

![!Generate arrow files log file.](../img/run-detail9.png){: loading=lazy }
![!Generate measurements parquet log file.](../img/generate-measurements-parquet-log.png){: loading=lazy }


### Image and Measurements Tables
Binary file modified docs/img/action-buttons.png
Binary file removed docs/img/arrow-files-available.png
Binary file modified docs/img/docs-example-run-detail.png
Binary file removed docs/img/generate-arrow-button.png
Binary file removed docs/img/generate-arrow-files-log.png
Binary file removed docs/img/generate-arrow-modal.png
Binary file removed docs/img/generate-arrow-notification.png
Binary file added docs/img/generate-measurements-parquet-log.png
Binary file added docs/img/generate-parquet-button.png
Binary file added docs/img/generate-parquet-modal.png
Binary file added docs/img/generate-parquet-notification.png
Binary file added docs/img/measurements-parquet-available.png
Binary file modified docs/img/run-detail1.png
Binary file modified docs/img/run-detail2.png
Binary file modified docs/img/run-detail3.png
Binary file modified docs/img/run-detail6.png
Binary file modified docs/img/run-detail7.png
Binary file removed docs/img/run-detail9.png
25 changes: 5 additions & 20 deletions docs/outputs/outputs.md
@@ -13,7 +13,7 @@ A sub-directory will exist for each pipeline run that contains the output produc

The pipeline uses the [Apache Parquet](https://parquet.apache.org){:target="_blank"} file format to write results to disk. Details on how to read these files can be found below in [Reading the Outputs](#reading-the-outputs).

Below is the output structure for a pipeline run named `new-test-data` when the pipeline run option `measurements.write_arrow_files` has been set to `True` and the working directory is named `pipeline-runs` (see [File Details](#file-details) for descriptions):
Below is the output structure for a pipeline run named `new-test-data` when the pipeline run option `measurements.write_measurements_parquet` has been set to `True` and the working directory is named `pipeline-runs` (see [File Details](#file-details) for descriptions):

```bash
pipeline-runs
@@ -38,29 +38,14 @@ pipeline-runs
│   ├── forced_measurements_VAST_2118-06A_EPOCH12_I_cutout_fits.parquet
│   ├── images.parquet
│   ├── YYYY-MM-DD-HH-MM-SS_log.txt
│   ├── measurements.arrow
│   ├── measurement_pairs.arrow
│   ├── measurements.parquet
│   ├── measurement_pairs.parquet
│   ├── measurement_pairs.parquet
│   ├── relations.parquet
│   ├── skyregions.parquet
│   └── sources.parquet
```

### Arrow Files

Large pipeline runs (hundreds of images) mean that to read the measurements, hundreds of parquet files need to be read in, and can contain millions of rows.
This can be slow using libraries such as pandas, and also consumes a lot of system memory.
A solution to this is to save all the measurements associated with the pipeline run into one single file in the [Apache Arrow](https://arrow.apache.org/overview/){:target="_blank"} format.

The library `vaex` is able to open `.arrow` files in an out-of-core context so the memory footprint is hugely reduced along with the reading of the file being very fast.
The two-epoch measurement pairs are also saved to arrow format due to the same reasons. See [Reading with vaex](usingoutputs.md#reading-with-vaex) for further details on using `vaex`.

!!! note
At the time of development `vaex` could not open parquets in an out-of-core context. This will be reviewed in the future if such functionality is added to `vaex`.

To enable the arrow files to be produced, the option `measurements.write_arrow_files` is required to be set to `True` in the pipeline run config.
Alternatively, the arrow files can be generated after the completion of the run, see the [Generating Arrow Files page](../../using/genarrow) for full details.

### Image Data

The data for the images [ingested](../design/imageingest.md) into the pipeline is also stored in the pipeline working directory under the subdirectory `images`:
@@ -111,8 +96,8 @@ Here, for each image, the selavy measurements that have been ingested are stored
| `forced_measurements*.parquet` | Multiple files that contain the forced measurements extracted from the respective image denoted in the filename. |
| `images.parquet` | Contains the information of the images processed in the pipeline run. |
| `YYYY-MM-DD-HH-MM-SS_log.txt` | The log file of the pipeline run. It is timestamped with the date and time of the run start. |
| `measurements.arrow` | An [Apache Arrow](https://arrow.apache.org/overview/){:target="_blank"} format file containing all the measurements associated with the pipeline run (see [Arrow Files](#arrow-files)).|
| `measurement_pairs.arrow` | An [Apache Arrow](https://arrow.apache.org/overview/){:target="_blank"} format file containing all the measurement pair metrics (see [Arrow Files](#arrow-files)). |
| `measurements.parquet` | An [Apache Parquet](https://parquet.apache.org/){:target="_blank"} format file containing all the measurements associated with the pipeline run (see [Arrow Files](#parquet-files)).|
| `measurement_pairs.parquet` | An [Apache Parquet](https://parquet.apache.org/){:target="_blank"} format file containing all the measurement pair metrics (see [Arrow Files](#parquet-files)). |
| `measurement_pairs.parquet` | Contains all the measurement pairs metrics. |
| `relations.parquet` | Contains the relation information between sources. |
| `skyregions.parquet` | Contains the sky region information of the pipeline run. |
43 changes: 11 additions & 32 deletions docs/outputs/usingoutputs.md
@@ -2,11 +2,7 @@

This page gives details on how to open and use the pipeline output files.

It is recommended to use `pandas` or `vaex` to read the pipeline results from the parquet files. See the sections below for more information on using each library.

!!! note
It is also possible to use [`Dask`](https://docs.dask.org/en/latest/){:target="_blank"} to read the parquets in an out-of-core context but the general performance can sometimes be poor with many parquet files.
`vaex` is the preferred out-of-core method.
It is recommended to use `pandas` or `dask` to read the pipeline results from the parquet files. In general, even for runs with thousands of images, pandas is sufficient for reading most of the VAST pipeline outputs. However, loading a large set of measurements (e.g. `measurements.parquet` or a large subset of the individual parquet files) usually requires `dask` due to memory constraints. See the sections below for more information on using each library.

!!! tip
Be sure to look at [`vast-tools`](#vast-tools), a ready-made library for exploring pipeline results!
@@ -55,23 +51,16 @@ measurements = pd.concat(data, ignore_index=True)
sources = pd.read_parquet('pipeline-runs/new-test-data/sources.parquet', columns=['id', 'n_meas'])
```

### Reading with vaex

[vaex documentation](https://vaex.io/docs/index.html){:target="_blank"}.

!!! warning
vaex is a young project so bugs may be expected along with frequent updates. It has currently been tested with version `3.0.0`.
Version `4.0.0` promises opening parquet files in an out-of-core context.
### Reading with dask

!!! warning
Some pipeline `parquet` format files do not open with vaex 3.0.0. `arrow` format files should open successfully.
[dask documentation](https://docs.dask.org/en/stable/dataframe.html){:target="_blank"}.

A parquet, or arrow file, can be opened using the `open()` method:
A parquet can be opened using the `read_parquet()` method:

```python
import vaex
import dask.dataframe as dd

measurements = vaex.open('pipeline-runs/new-test-data/measurements.arrow')
measurements = dd.read_parquet('pipeline-runs/new-test-data/measurements.parquet')

measurements.head()
# source island_id component_id local_rms ra ra_err dec dec_err flux_peak flux_peak_err flux_int flux_int_err bmaj err_bmaj bmin err_bmin pa err_pa psf_bmaj psf_bmin psf_pa flag_c4 chi_squared_fit spectral_index spectral_index_from_TT has_siblings image_id time name snr compactness ew_sys_err ns_sys_err error_radius uncertainty_ew uncertainty_ns weight_ew weight_ns forced flux_int_isl_ratio flux_peak_isl_ratio id
@@ -87,31 +76,21 @@ measurements.head()
9 730 SB00013_island_1 SB00013_component_1a 0.437279 321.901 3.14407e-06 -4.20052 2.36161e-06 294.141 0.451346 340.92 0.864347 18.38 7.55055e-06 12.12 5.36009e-06 106.18 0.00262701 6.01 4 51.55 False 2368.93 -99 True True 12 2020-01-12 05:36:03.834000000 VAST_2118-06A_SB00013_component_1a 672.663 1.15903 0.000277778 0.000277778 4.00455e-06 0.000277807 0.000277807 1.29573e+07 1.29573e+07 False 0.640807 0.72508 1740
```

Multiple parquet files can be opened at once using the `open_many()` method:

```python
import glob
import vaex

files = glob.glob("pipeline-runs/images/*/measurements.parquet")
measurements = vaex.open_many(files)
```

!!! tip
You can convert a vaex dataframe to pandas by using the `to_pandas_df()` method:
You can convert a dask dataframe to pandas by using the `compute()` method:
```python
import vaex
import dask.dataframe as dd

sources = vaex.open('pipeline-runs/new-test-data/sources.parquet')
sources = sources.to_pandas_df()
sources = dd.read_parquet('pipeline-runs/new-test-data/sources.parquet')
sources = sources.compute()
```

### Linking the Results

The table below shows what parameters act as keys to link data from the different results tables.

!!! tip
If loading the measurements via the `.arrow` file, then the measurements already have the `source` column in-place.
If loading the measurements via the `.parquet` file, then the measurements already have the `source` column in-place.

!!! tip
The `images.parquet` file contains the column `measurements_path` which can be used to get the filepaths for all the selavy `parquet` files.
66 changes: 0 additions & 66 deletions docs/using/genarrow.md

This file was deleted.
