Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcf_sr_seek() still fails #1362

Closed
freeseek opened this issue Nov 30, 2021 · 6 comments
Closed

bcf_sr_seek() still fails #1362

freeseek opened this issue Nov 30, 2021 · 6 comments
Assignees

Comments

@freeseek
Copy link

freeseek commented Nov 30, 2021

This issue seems related to issue #691 but I was not able to understand the issue in the source code:

Say you have a simple VCF:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##contig=<ID=chr2>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr1\t10000\tx\tA\tC\t.\t.\t.") | bcftools view --no-version -Oz -o A.vcf.gz
bcftools index -ft A.vcf.gz

And a test.c BCFtools plugin with the following code:

#include <htslib/synced_bcf_reader.h>
#include "bcftools.h"

const char *about(void) { return "\n"; }

int run(int argc, char *argv[]) {
  bcf_srs_t *sr = bcf_sr_init();
  bcf_sr_set_opt(sr, BCF_SR_REQUIRE_IDX);
  if (!bcf_sr_add_reader(sr, "A.vcf.gz")) error("Failed to open %s: %s\n", "A.vcf.gz", bcf_sr_strerror(sr->errnum));
  bcf_hdr_t *hdr = bcf_sr_get_header(sr, 0);
  int rid = 0;

  int i;
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, rid), 0);
  for (i = 0; bcf_sr_next_line(sr); i++) {
    fprintf(stderr, "round1 chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }

  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, rid), 0);
  for (i = 0; bcf_sr_next_line(sr); i++) {
    fprintf(stderr, "round2 chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }

  return 0;
}

The plugin seeks at the beginning of contig chr1 two times and each time it tries to read all the records.

When running the plugin with HTSlib 1.13:

$ ./bcftools +plugins/test.so 
round1 chr=chr1, pos=10000
round2 chr=chr1, pos=10000

When running the plugin with HTSlib 1.14:

$ ./bcftools +plugins/test.so 
round1 chr=chr1, pos=10000

The second time contig chr1 does not produce any records.

@daviesrob
Copy link
Member

I think the problem here is that bcf_sr_seek() doesn't like to seek backwards in the same chromosome. If you want to do this without closing and reopening the synced reader, one possible work-around might be to seek to the start of the file before seeking to the chromosome you want:

  bcf_sr_seek(sr, NULL, 0);
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, rid), 0);

At least for your example plugin, that seems to be enough to reset the reader and allow it to start again at the desired position.

If that still isn't enough to fix the problem, then you could, of course, fall back to closing the reader and opening a new one for the second pass through.

@freeseek
Copy link
Author

freeseek commented Dec 1, 2021

@daviesrob this seems to work only once and then weird stuff happens. My application segfaults when using the bcf_write() function, though I don't have yet a simple way to replicate that.

Here a more complicated example with two contigs:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##contig=<ID=chr2>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr1\t10000\tx\tA\tC\t.\t.\t."
echo -e "chr2\t20000\tx\tA\tC\t.\t.\t.") | bcftools view --no-version -Oz -o A.vcf.gz
bcftools index -ft A.vcf.gz

The following plugin tries to read twice the first contig and twice the second contig while using bcf_sr_seek(sr, NULL, 0):

#include <htslib/synced_bcf_reader.h>
#include "bcftools.h"

const char *about(void) { return "\n"; }

int run(int argc, char *argv[]) {
  bcf_srs_t *sr = bcf_sr_init();
  bcf_sr_set_opt(sr, BCF_SR_REQUIRE_IDX);
  if (!bcf_sr_add_reader(sr, "A.vcf.gz")) error("Failed to open %s: %s\n", "A.vcf.gz", bcf_sr_strerror(sr->errnum));
  bcf_hdr_t *hdr = bcf_sr_get_header(sr, 0);
  int i;

  // read first contig
  bcf_sr_seek(sr, NULL, 0);
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 0), 0);
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 0) {
      fprintf(stderr, "round1: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round1: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }

  // read first contig a second time
  bcf_sr_seek(sr, NULL, 0);
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 0), 0);
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 0) {
      fprintf(stderr, "round2: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round2: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }

  // read second contig
  bcf_sr_seek(sr, NULL, 0);
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 1), 0);
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 1) {
      fprintf(stderr, "round3: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round3: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }

  // read second contig a second time
  bcf_sr_seek(sr, NULL, 0);
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 1), 0);
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 1) {
      fprintf(stderr, "round4: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round4: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }

  bcf_sr_destroy(sr);

  return 0;
}

The output is as follows:

round1: chr=chr1, pos=10000
round1: observed contig chr2 so stopping
round2: chr=chr1, pos=10000
round2: observed contig chr2 so stopping
round3: chr=chr2, pos=20000
round4: observed contig chr1 so stopping

When trying to seek on the second contig for the second time, bcf_sr_seek() silently fails and somehow returns an iterator to the wrong contig

This issue could arise with an application that is trying for example to phase all chromosomes one by one and needs to read them once to import the genotypes in memory and a second time to write the updated output VCFs with the phased genotypes. And it used to work fine with HTSlib 1.13

@daviesrob
Copy link
Member

Hmm, it looks like it won't be quite so easy to work around then. If the problem appeared between 1.13 and 1.14, then it must have been in #1327 as that was the only change that touched synced_bcf_reader. We will have to dig a little deeper to see exactly what it was in that change that triggered this problem.

Meanwhile, I think closing and reopening the reader may be your best solution. This should be alright as long as you don't do it too often. Or for your phasing example, you could open two readers on the same file - using the first to read in the genotypes, then the second to make the updates.

I tried rewriting your test plugin to close and reopen the files, and it does work correctly:

#include <htslib/synced_bcf_reader.h>
#include "bcftools.h"

const char *about(void) { return "\n"; }

static int get_reader(bcf_srs_t **sr_out, bcf_hdr_t **hdr_out) {
  bcf_srs_t *sr = bcf_sr_init();
  bcf_hdr_t *hdr = NULL;
  int err;
  if (!sr)
    return -1;
  bcf_sr_set_opt(sr, BCF_SR_REQUIRE_IDX);
  if (!bcf_sr_add_reader(sr, "A.vcf.gz"))
    goto fail;
  hdr = bcf_sr_get_header(sr, 0);
  if (!hdr)
    goto fail;
  *sr_out = sr;
  *hdr_out = hdr;
  return 0;

  fail:
  err = sr->errnum;
  bcf_sr_destroy(sr);
  return err ? err : -1;
}

int run(int argc, char *argv[]) {
  bcf_srs_t *sr = NULL;
  bcf_hdr_t *hdr = NULL;
  int i, err;

  // read first contig
  err = get_reader(&sr, &hdr);
  if (err)
    error("Failed to open %s: %s\n", "A.vcf.gz", bcf_sr_strerror(sr->errnum));
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 0), 0);
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 0) {
      fprintf(stderr, "round1: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round1: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }
  bcf_sr_destroy(sr);

  // read first contig a second time
  err = get_reader(&sr, &hdr);
  if (err)
    error("Failed to open %s: %s\n", "A.vcf.gz", bcf_sr_strerror(sr->errnum));
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 0), 0);
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 0) {
      fprintf(stderr, "round2: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round2: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }
  bcf_sr_destroy(sr);

  // read second contig
  err = get_reader(&sr, &hdr);
  if (err)
    error("Failed to open %s: %s\n", "A.vcf.gz", bcf_sr_strerror(sr->errnum));
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 1), 0);
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 1) {
      fprintf(stderr, "round3: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round3: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }
  bcf_sr_destroy(sr);

  // read second contig a second time
  err = get_reader(&sr, &hdr);
  if (err)
    error("Failed to open %s: %s\n", "A.vcf.gz", bcf_sr_strerror(sr->errnum));
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 1), 0);
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 1) {
      fprintf(stderr, "round4: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round4: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }

  bcf_sr_destroy(sr);

  return 0;
}
round1: chr=chr1, pos=10000
round1: observed contig chr2 so stopping
round2: chr=chr1, pos=10000
round2: observed contig chr2 so stopping
round3: chr=chr2, pos=20000
round4: chr=chr2, pos=20000

@daviesrob daviesrob self-assigned this Dec 7, 2021
@freeseek
Copy link
Author

freeseek commented Dec 7, 2021

It seems like using bcf_sr_regions_seek(sr->regions, seq) almost recovers the previous behavior. The following BCFtools plugin code:

#include <htslib/synced_bcf_reader.h>
#include "bcftools.h"

const char *about(void) { return "\n"; }

int run(int argc, char *argv[]) {
  bcf_srs_t *sr = bcf_sr_init();
  bcf_sr_set_opt(sr, BCF_SR_REQUIRE_IDX);
  if (!bcf_sr_add_reader(sr, "A.vcf.gz")) error("Failed to open %s: %s\n", "A.vcf.gz", bcf_sr_strerror(sr->errnum));
  bcf_hdr_t *hdr = bcf_sr_get_header(sr, 0);
  int i;

  // read first contig
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 0), 0);
  bcf_sr_regions_seek(sr->regions, bcf_hdr_id2name(hdr, 0));
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 0) {
      fprintf(stderr, "round1: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round1: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }

  // read first contig a second time
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 0), 0);
  bcf_sr_regions_seek(sr->regions, bcf_hdr_id2name(hdr, 0));
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 0) {
      fprintf(stderr, "round2: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round2: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }

  // read second contig
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 1), 0);
  bcf_sr_regions_seek(sr->regions, bcf_hdr_id2name(hdr, 1));
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 1) {
      fprintf(stderr, "round3: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round3: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }

  // read second contig a second time
  bcf_sr_seek(sr, bcf_hdr_id2name(hdr, 1), 0);
  bcf_sr_regions_seek(sr->regions, bcf_hdr_id2name(hdr, 1));
  for (i = 0; bcf_sr_next_line(sr); i++) {
    if (bcf_sr_get_line(sr, 0)->rid != 1) {
      fprintf(stderr, "round4: observed contig %s so stopping\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid));
      break;
    }
    fprintf(stderr, "round4: chr=%s, pos=%ld\n", bcf_hdr_id2name(hdr, bcf_sr_get_line(sr, 0)->rid), bcf_sr_get_line(sr, 0)->pos + 1);
  }

  bcf_sr_destroy(sr);

  return 0;
}

Gives the following output:

round1: chr=chr1, pos=10000
round1: observed contig chr2 so stopping
round2: chr=chr1, pos=10000
round3: chr=chr2, pos=20000
round4: chr=chr2, pos=20000

And by using the full:

bcf_sr_seek(sr, NULL, 0);
bcf_sr_seek(sr, seq, 0);
bcf_sr_regions_seek(sr->regions, seq);

The previous behavior is restored:

round1: chr=chr1, pos=10000
round1: observed contig chr2 so stopping
round2: chr=chr1, pos=10000
round2: observed contig chr2 so stopping
round3: chr=chr2, pos=20000
round4: chr=chr2, pos=20000

@pd3 pd3 self-assigned this Dec 9, 2021
pd3 added a commit to pd3/htslib that referenced this issue Dec 9, 2021
Repeated seeks with implicitly created region list wouldn't initialize
all internal structures to the original clean state, as demonstrated by
the issue samtools#1362, resolved by this commit
@pd3
Copy link
Member

pd3 commented Dec 9, 2021

Thanks for reporting the problem, this should be solved by the pull request #1363

pd3 added a commit to pd3/htslib that referenced this issue Dec 16, 2021
Repeated seeks with implicitly created region list wouldn't initialize
all internal structures to the original clean state, as demonstrated by
the issue samtools#1362, resolved by this commit
pd3 added a commit to pd3/htslib that referenced this issue Jan 4, 2022
Repeated seeks with implicitly created region list wouldn't initialize
all internal structures to the original clean state, as demonstrated by
the issue samtools#1362, resolved by this commit
daviesrob pushed a commit that referenced this issue Jan 4, 2022
Repeated seeks with implicitly created region list wouldn't initialize
all internal structures to the original clean state, as demonstrated by
the issue #1362, resolved by this commit
@daviesrob
Copy link
Member

This should now have been fixed by #1363.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants