Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_int for interleaved_bloom_filter #77

Closed
10 tasks
eseiler opened this issue Apr 28, 2020 · 1 comment · Fixed by seqan/seqan3#1922
Closed
10 tasks

get_int for interleaved_bloom_filter #77

eseiler opened this issue Apr 28, 2020 · 1 comment · Fixed by seqan/seqan3#1922
Labels
ready to tackle This story was discussed and can be immidietly tackled
Milestone

Comments

@eseiler
Copy link
Member

eseiler commented Apr 28, 2020

Description

Provide functionality similar to get_int for the seqan3::interleaved_bloom_filter:

uint64_t get_int(size_type idx, uint8_t len = 64) const

returns an uint64_t that includes the bits [idx, idx + len).

This provides an efficient way to access multiple bits of the seqan3::interleaved_bloom_filter.

I already did a benchmark of the different ways one might access the IBF:

----------------------------
Benchmark         hashes/sec
----------------------------
native/64/512     23.9736M/s
native/64/16384   24.1353M/s
native/8192/4     207.304k/s
native/8192/128   210.824k/s

iterator/64/512   11.9472M/s
iterator/64/16384 11.3783M/s
iterator/8192/4   92.3077k/s
iterator/8192/128 96k/s

get_int/64/512    133.358M/s
get_int/64/16384  124.121M/s
get_int/8192/4    3.50958M/s
get_int/8192/128  3.50958M/s

data/64/512       132.741M/s
data/64/16384     127.431M/s
data/8192/4       2.96145M/s
data/8192/128     2.89564M/s

The format of the benchmark names is method / number of bins / size per bin. All of them use 2 hash functions and sequences of length 1'000.
Links:

The general setup is that we have a query and want to count the occurrence of all k-mers in all bins.

native uses the bulk_contains and uses the subscript operator to increase the counters of a counting vector.
iterator uses bulk_contains and then iterates over the binning bitvector to increase the counters of a counting vector.
get_int uses bulk_contains and then accesses batches of 64 bit via get_int to increase the counters of a counting vector.
data uses bulk_contains and then accesses batches of 64 bit via data to increase the counters of a counting vector.

Acceptance Criteria

  • The seqan3::interleaved_bloom_filter exposes functionality similar to get_int.

Tasks

  • Adapt seqan3::interleaved_bloom_filter::binning_bitvector to allow access of multiple bits.

Definition of Done

  • Implementation and design approved
  • Unit tests pass
  • Test coverage = 100%
  • Microbenchmarks added and/or affected microbenchmarks < 5% performance drop
  • API documentation added
  • Tutorial/teaching material added
  • Test suite compiles in less than 30 seconds (on travis)
  • Changelog entry added
@rrahn rrahn added the needs refinement A story that was not discussed and/or estimated by the team yet but is planned for upcoming sprints. label Apr 30, 2020
@marehr
Copy link
Member

marehr commented Jun 15, 2020

2020-06-15 Core-Meeting:

Problem:
seqan3::interleaved_bloom_filter::bulk_contains returns seqan3::binning_bit_vector.

And seqan3::binning_bit_vector::operator[] is inefficient on sequential access patterns which is bad for counting.

Proposals:

  • expose get_int: gives one integer e.g. 0b1010 where each bit position corresponds whether a k-mer was present or not.
  • expose data: similar as get_int, but the whole raw-data
  • Add a new data structure "counting_vector" (this is currently the only use-case) which has an operator+= overload for binning_bit_vector which can do the efficient counting.
       counting_vector{1,2,4,5,6,7,4} +=
    binning_bit_vector{0,1,0,1,0,1,0}
    
    =>
    
    counting_vector{1,3,4,6,6,8,4}

Resolution:

We implement a counting_vector (name is up to discussion) which can do this use-case more efficient.

Add a documentation note for seqan3::binning_bit_vector::operator[] that this is bad for sequential access pattern and if someone has a different use-case than counting we can provide a better abstraction.

@rrahn rrahn added ready to tackle This story was discussed and can be immidietly tackled and removed needs refinement A story that was not discussed and/or estimated by the team yet but is planned for upcoming sprints. labels Jul 6, 2020
@rrahn rrahn added this to the Sprint 7 milestone Jul 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready to tackle This story was discussed and can be immidietly tackled
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants