Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query/vds] Add LEN field to VDS #14675

Merged
merged 11 commits into from
Oct 5, 2024

Conversation

chrisvittal
Copy link
Collaborator

@chrisvittal chrisvittal commented Sep 9, 2024

This is the beginning of a series of changes to support export of VDS to VCF 4.5, the version of VCF that contains the standardized form of our work that culminated in SVCR/VDS.

Reference blocks were standardized with a LEN rather than an END. So, now, by default, add LEN to all VDS reads and drop END in favor of LEN on all VDS writes. Our optimizer will be able to take care of pruning away the dead field in pipelines that don't use it.

We make sure that all VDS creation (other than the combiner), such as read_vds and from_merged_representation, contains both LEN and END preserving user code that depends on the presence of the END field.

Furthermore, this change contains necessary combiner updates to prefer LEN over END, and to use LEN in the combiner itself.

@chrisvittal chrisvittal mentioned this pull request Sep 9, 2024
7 tasks
@chrisvittal chrisvittal marked this pull request as ready for review September 9, 2024 20:44
Copy link
Collaborator

@patrick-schultz patrick-schultz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just a few small things

Copy link
Collaborator

@patrick-schultz patrick-schultz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Chris!

Copy link
Collaborator

@cjllanwarne cjllanwarne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I've left a few minor comments but happy to switch over to Approve if you prefer to leave things as they are.

.or_error(
hl.str(
'cannot create VDS from merged representation -' ' found END field with non-reference genotype at '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL python lets you concatenate strings by just putting them next to each other like this 🤯

chrisvittal added a commit to chrisvittal/hail that referenced this pull request Sep 18, 2024
After split_multi, LGT is dropped from the variant data of a VDS. After
PR hail-is#14560, LGT is added to datasets after creation via the combiner.
After hail-is#14675 the same is true for `from_merged_representation`. We
should keep the GT/LGT field consistent across ref and var data. This
change does so for split_multi.

Resolves hail-is#14694
chrisvittal added a commit to chrisvittal/hail that referenced this pull request Sep 19, 2024
After split_multi, LGT is dropped from the variant data of a VDS. After
PR hail-is#14560, LGT is added to datasets after creation via the combiner.
After hail-is#14675 the same is true for `from_merged_representation`. We
should keep the GT/LGT field consistent across ref and var data. This
change does so for split_multi.

Resolves hail-is#14694
hail-ci-robot pushed a commit that referenced this pull request Sep 19, 2024
After split_multi, LGT is dropped from the variant data of a VDS. After
PR #14560, LGT is added to datasets after creation via the combiner.
After #14675 the same is true for `from_merged_representation`. We
should keep the GT/LGT field consistent across ref and var data. This
change does so for split_multi.

Resolves #14694
@chrisvittal chrisvittal force-pushed the vds/len-end-translation branch from 9bce5c9 to d7d31cb Compare September 20, 2024 17:14
We can't do this anymore since genotype may be something other than
diploid.

Missed this in the original VDS ploidy changes
chrisvittal and others added 9 commits September 23, 2024 14:20
This is the beginning of a series of changes to support export of VDS to
VCF 4.5, the version of VCF that contains the standardized form of our
work that culminated in SVCR/VDS.

Reference blocks were standardized with a LEN rather than an END. So,
now, by default, add LEN to all VDS reads and drop END in favor of LEN
on all VDS writes. Our optimizer will be able to take care of pruning
away the dead field.

We make sure that all VDS creation (other than the combiner), such as
read_vds and from_merged_representation, contains both LEN and END
preserving user code that depends on the presence of the END field.
Co-authored-by: Patrick Schultz <[email protected]>
@chrisvittal chrisvittal force-pushed the vds/len-end-translation branch from 473e90f to f39364c Compare September 23, 2024 18:21
@chrisvittal chrisvittal dismissed patrick-schultz’s stale review September 30, 2024 16:27

dismissing this since I'd like another look after working around one of the core issues discoverd here

Copy link
Collaborator

@patrick-schultz patrick-schultz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks Chris!

@hail-ci-robot hail-ci-robot merged commit 97e7833 into hail-is:main Oct 5, 2024
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants