Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor selection of Myers' bitparallel edit distance algorithm. #3254

Merged
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/tutorial/08_pairwise_alignment/configurations.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@

//! [include_scoring_scheme]
#include <seqan3/alignment/scoring/aminoacid_scoring_scheme.hpp>
#include <seqan3/alignment/scoring/hamming_scoring_scheme.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
//! [include_scoring_scheme]

Expand Down
36 changes: 21 additions & 15 deletions doc/tutorial/08_pairwise_alignment/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,15 +40,16 @@ Let us first have a look at an example of computing a global alignment in SeqAn.
\includelineno doc/tutorial/08_pairwise_alignment/pairwise_alignment_first_global.cpp

In the above example, we want to compute the similarity of two seqan3::dna4 sequences using a global alignment.
For this, we need to import the dna4 alphabet, the seqan3::nucleotide_scoring_scheme and the seqan3::align_pairwise in
For this, we need to import the dna4 alphabet, the seqan3::hamming_scoring_scheme and the seqan3::align_pairwise in
the beginning of our file. We also import the seqan3::debug_stream which allows us to print various types in a nice
formatted manner.

At the beginning of the file, we are defining our two DNA sequences `s1` and `s2`.
If you feel puzzled about what the `_dna4` suffix does, we recommend reading the \ref tutorial_alphabets and
\ref tutorial_ranges before.
In line 16-17 we configure the alignment job with the most simplistic configuration possible.
In this case, it is a global alignment with edit distance.
In this case, it is a global alignment with edit distance, i.e. mismatches, insertions and deletions are scored with `-1`
and matches are scored with `0`.
Later in this tutorial we will give a more detailed description of the \ref alignment_configurations "configuration" and
how it can be used.
The minimum requirement for computing a pairwise alignment is to specify the alignment method
Expand Down Expand Up @@ -156,30 +157,36 @@ for sequence 1 set to `true`, and those for sequence 2 with `false`.
## Scoring schemes and gap schemes

A scoring scheme can be queried to get the score for substituting two alphabet values using the `score` member function.
Currently, SeqAn supports two scoring schemes: seqan3::nucleotide_scoring_scheme and
seqan3::aminoacid_scoring_scheme. You can import them with the following includes:
Currently, SeqAn supports three scoring schemes. Besides seqan3::hamming_scoring_scheme there is the
rrahn marked this conversation as resolved.
Show resolved Hide resolved
seqan3::nucleotide_scoring_scheme used for aligning nucleotide sequences and
seqan3::aminoacid_scoring_scheme used for aligning amino acid sequences.
You can import them with the following includes:

\snippet doc/tutorial/08_pairwise_alignment/configurations.cpp include_scoring_scheme

As the names suggests, you need to use the former when scoring \ref alphabet_nucleotide "nucleotides" and the latter
when working with \ref alphabet_aminoacid "aminoacids". You have already used the seqan3::nucleotide_scoring_scheme in
the assignments before.
The scoring scheme was default initialised using the
[Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance). The scoring schemes can also be configured by either
In many biological applications using the seqan3::hamming_scoring_scheme
rrahn marked this conversation as resolved.
Show resolved Hide resolved
([Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance)) is not enough, as it completely ignores the biological
rrahn marked this conversation as resolved.
Show resolved Hide resolved
background of sequence alignments.
If required, you can use the nucleotide scoring scheme when aligning \ref alphabet_nucleotide "nucleotides" and the
the amino acid scoring scheme when working with \ref alphabet_aminoacid "aminoacids".
These scoring schemes allow a finer control of scoring two aligned letters.
For example, you can set different match and mismatch scores for different pairs of aligned letters.
By default, the scoring schemes are initialised by
setting a \ref seqan3::scoring_scheme_base::set_simple_scheme "simple scheme" consisting of seqan3::match_score and
seqan3::mismatch_score or by providing a \ref seqan3::scoring_scheme_base::set_custom_matrix "custom matrix".
seqan3::mismatch_score.
But it is also possible to provide a \ref seqan3::scoring_scheme_base::set_custom_matrix "custom matrix".
The amino acid scoring scheme can additionally be \ref seqan3::aminoacid_scoring_scheme::set_similarity_matrix
"initialised" with a predefined substitution matrix that can be accessed via the seqan3::aminoacid_similarity_matrix
enumeration class.

\snippet doc/tutorial/08_pairwise_alignment/configurations.cpp scoring_scheme

\note You can also provide your own scoring scheme implementation if it models seqan3::scoring_scheme.
\note You can also provide your own scoring scheme implementation. It only has to model the seqan3::scoring_scheme_for concept.

Similarly to the scoring scheme, you can use the seqan3::align_cfg::gap_cost_affine to set the gap penalties used for
Similarly to the scoring scheme, you can use the seqan3::align_cfg::gap_cost_affine to customise the gap penalties used for
the alignment computation. The default initialised seqan3::align_cfg::gap_cost_affine sets the score for a gap to `-1`
and for a gap opening to `0`. Note that the gap open score is added to the gap score when a gap is opened within the
alignment computation. Therefore setting the gap open score to `0` disables affine gaps.
alignment computation. Therefore setting the gap open score to `0` disables affine gaps altogether.
rrahn marked this conversation as resolved.
Show resolved Hide resolved
You can pass a seqan3::align_cfg::extension_score and a seqan3::align_cfg::open_score object to initialise the scheme
with custom gap penalties. The penalties can be changed later by using the respective member variables
`extension_score` and `open_score`.
Expand Down Expand Up @@ -291,8 +298,7 @@ edits necessary to transform one sequence into the other. The cost model for the
the match score is `0` and the scores for a mismatch and a gap is `-1`. Due to the special metric, a fast
[bitvector](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.9395&rep=rep1&type=pdf) implementation can be
used to compute the edit distance. This happens in SeqAn automatically if the respective configurations are used.
To do so, you need a scoring scheme initialised with Manhattan distance (at the moment only
seqan3::nucleotide_scoring_scheme supports this) and a gap scheme initialised with `-1` for a gap and `0`
To do so, you need to explicitly set seqan3::hamming_scoring_scheme and the default gap scheme initialised with `-1` for a gap and `0`
rrahn marked this conversation as resolved.
Show resolved Hide resolved
for a gap open score and computing a seqan3::global_alignment.
rrahn marked this conversation as resolved.
Show resolved Hide resolved
To make the configuration easier, we added a shortcut called seqan3::align_cfg::edit_scheme.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -32,8 +32,10 @@ int main()
seqan3::align_cfg::free_end_gaps_sequence2_leading{true},
seqan3::align_cfg::free_end_gaps_sequence1_trailing{false},
seqan3::align_cfg::free_end_gaps_sequence2_trailing{true}}
| seqan3::align_cfg::scoring_scheme{
seqan3::aminoacid_scoring_scheme{seqan3::aminoacid_similarity_matrix::blosum62}};
| seqan3::align_cfg::scoring_scheme{seqan3::aminoacid_scoring_scheme{
seqan3::aminoacid_similarity_matrix::blosum62}}
| seqan3::align_cfg::gap_cost_affine{seqan3::align_cfg::open_score{-10},
seqan3::align_cfg::extension_score{-1}};

for (auto const & res : seqan3::align_pairwise(source, config))
seqan3::debug_stream << "Score: " << res.score() << '\n';
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
#include <utility>

#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alignment/scoring/hamming_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>

Expand All @@ -18,7 +18,7 @@ int main()

// Configure the alignment kernel.
auto config =
seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{seqan3::nucleotide_scoring_scheme{}};
seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{seqan3::hamming_scoring_scheme{}};

// Invoke the pairwise alignment which returns a lazy range over alignment results.
auto results = seqan3::align_pairwise(std::tie(s1, s2), config);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
#include <vector>

#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alignment/scoring/hamming_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/utility/views/pairwise_combine.hpp>
Expand All @@ -19,7 +19,7 @@ int main()

// Configure the alignment kernel.
auto config =
seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{seqan3::nucleotide_scoring_scheme{}};
seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{seqan3::hamming_scoring_scheme{}};

for (auto const & res : seqan3::align_pairwise(seqan3::views::pairwise_combine(vec), config))
seqan3::debug_stream << "Score: " << res.score() << '\n';
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
#include <vector>

#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alignment/scoring/hamming_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/utility/views/pairwise_combine.hpp>
Expand All @@ -22,7 +22,7 @@ int main()
seqan3::align_cfg::free_end_gaps_sequence2_leading{false},
seqan3::align_cfg::free_end_gaps_sequence1_trailing{true},
seqan3::align_cfg::free_end_gaps_sequence2_trailing{false}}
| seqan3::align_cfg::scoring_scheme{seqan3::nucleotide_scoring_scheme{}};
| seqan3::align_cfg::scoring_scheme{seqan3::hamming_scoring_scheme{}};

for (auto const & res : seqan3::align_pairwise(seqan3::views::pairwise_combine(vec), config))
seqan3::debug_stream << "Score: " << res.score() << '\n';
Expand Down
4 changes: 2 additions & 2 deletions include/seqan3/alignment/configuration/align_config_edit.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@

#include <seqan3/alignment/configuration/align_config_gap_cost_affine.hpp>
#include <seqan3/alignment/configuration/align_config_scoring_scheme.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alignment/scoring/hamming_scoring_scheme.hpp>
#include <seqan3/core/configuration/configuration.hpp>

namespace seqan3::align_cfg
Expand Down Expand Up @@ -46,6 +46,6 @@ namespace seqan3::align_cfg
* first sequence.
*/
inline constexpr configuration edit_scheme =
scoring_scheme{nucleotide_scoring_scheme{}} | gap_cost_affine{open_score{0}, extension_score{-1}};
scoring_scheme{hamming_scoring_scheme{}} | gap_cost_affine{open_score{0}, extension_score{-1}};

} // namespace seqan3::align_cfg
33 changes: 14 additions & 19 deletions include/seqan3/alignment/pairwise/alignment_configurator.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -320,27 +320,22 @@ struct alignment_configurator
if constexpr (config_t::template exists<seqan3::align_cfg::method_global>())
{
// Only use edit distance if ...
auto method_global_cfg = get<seqan3::align_cfg::method_global>(config_with_result_type);
// Only use edit distance if ...
if (gap_cost.open_score == 0 && // gap open score is not set,
!(method_global_cfg.free_end_gaps_sequence2_leading
|| method_global_cfg.free_end_gaps_sequence2_trailing)
&& // none of the free end gaps are set for second seq,
(method_global_cfg.free_end_gaps_sequence1_leading
== method_global_cfg
.free_end_gaps_sequence1_trailing)) // free ends for leading and trailing gaps are equal in first seq.
if constexpr (std::same_as<std::remove_cvref_t<decltype(scoring_scheme)>, hamming_scoring_scheme>)
{
// TODO: Instead of relying on nucleotide scoring schemes we need to be able to determine the edit distance
// option via the scheme.
if constexpr (is_type_specialisation_of_v<std::remove_cvref_t<decltype(scoring_scheme)>,
nucleotide_scoring_scheme>)
auto method_global_cfg = get<seqan3::align_cfg::method_global>(config_with_result_type);
// and ...
if (gap_cost.open_score == 0 && gap_cost.extension_score == -1
&& // gap open score is not set and extension is -1,
!(method_global_cfg.free_end_gaps_sequence2_leading
|| method_global_cfg.free_end_gaps_sequence2_trailing)
&& // none of the free end gaps are set for second seq,
(method_global_cfg.free_end_gaps_sequence1_leading
== method_global_cfg
.free_end_gaps_sequence1_trailing)) // free ends for leading and trailing gaps are equal in first seq.
rrahn marked this conversation as resolved.
Show resolved Hide resolved
{
if ((scoring_scheme.score('A'_dna15, 'A'_dna15) == 0)
&& (scoring_scheme.score('A'_dna15, 'C'_dna15)) == -1)
{
return std::pair{configure_edit_distance<function_wrapper_t>(config_with_result_type),
config_with_result_type};
}

return std::pair{configure_edit_distance<function_wrapper_t>(config_with_result_type),
config_with_result_type};
rrahn marked this conversation as resolved.
Show resolved Hide resolved
}
}
}
Expand Down
7 changes: 4 additions & 3 deletions include/seqan3/alignment/pairwise/all.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -26,9 +26,10 @@
*
* \include doc/tutorial/08_pairwise_alignment/pairwise_alignment_first_global.cpp
*
* In this snippet a global alignment over two nucleotide sequences is computed. Here the helper function std::tie
* is used to pass the two sequences as a tuple to the alignment algorithm. The special interface of std::tie allows
* to forward the two sequences as lvalue references such that no copy of the data is involved.
* In this snippet a global alignment over two nucleotide sequences using an edit scoring scheme is computed.
* Here the helper function std::tie is used to pass the two sequences as a tuple to the alignment algorithm.
* The special interface of std::tie allows to forward the two sequences as lvalue references such that no copy of the
* data is involved.
*
* There are a lot of applications that need to compute many pairwise sequence alignments. Accordingly, the
* seqan3::align_pairwise interface offers an overload for ranges over sequence pairs. The following snippet shows
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -182,7 +182,7 @@ class affine_gap_policy
constexpr void initialise_alignment_state(alignment_configuration_t const & config) noexcept
{
auto scheme =
config.get_or(align_cfg::gap_cost_affine{align_cfg::open_score{-10}, align_cfg::extension_score{-1}});
config.get_or(align_cfg::gap_cost_affine{align_cfg::open_score{0}, align_cfg::extension_score{-1}});

alignment_state.gap_extension_score = static_cast<score_t>(scheme.extension_score);
alignment_state.gap_open_score = static_cast<score_t>(scheme.extension_score + scheme.open_score);
Expand Down
1 change: 1 addition & 0 deletions include/seqan3/alignment/scoring/all.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
#pragma once

#include <seqan3/alignment/scoring/aminoacid_scoring_scheme.hpp>
#include <seqan3/alignment/scoring/hamming_scoring_scheme.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alignment/scoring/scoring_scheme_base.hpp>
#include <seqan3/alignment/scoring/scoring_scheme_concept.hpp>
92 changes: 92 additions & 0 deletions include/seqan3/alignment/scoring/hamming_scoring_scheme.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: BSD-3-Clause

/*!\file
* \brief Provides seqan3::hamming_scoring_scheme.
* \author Rene Rahn <rene.rahn AT fu-berlin.de>
*/

#pragma once

#include <concepts>

#include <seqan3/core/platform.hpp>

namespace seqan3
{

/*!\brief A scoring scheme that assigns a score of 0 to matching letters and -1 to mismatching letters.
rrahn marked this conversation as resolved.
Show resolved Hide resolved
* \ingroup alignment_scoring
*
* \details
*
* This stateless scoring scheme is equivalent to the Hamming distance and assigns a score of 0 to matching letters and -1 to
rrahn marked this conversation as resolved.
Show resolved Hide resolved
* mismatching letters.
rrahn marked this conversation as resolved.
Show resolved Hide resolved
* This scheme is independent of the alphabet type and can be used whenever the two compared alphabets model the
* std::equality_comparable_with concept.
* Use this scheme if you want to use the use the more efficient bitparallel alignment algorithm often used in the
* context of computing the edit distance.
* See the documentation for \ref seqan3::align_cfg::edit_scheme for more details.
*/
class hamming_scoring_scheme
{
public:
//!\privatesection
//!\brief The alphabet type of the scoring scheme. This type is only used for the alignment configuration machinery.
using alphabet_type = char;
//!\publicsection

//!\brief The underlying score type.
using score_type = int32_t;

/*!\name Constructors, destructor and assignment
* \{
*/
hamming_scoring_scheme() noexcept = default; //!< Defaulted.
rrahn marked this conversation as resolved.
Show resolved Hide resolved
//!\}

/*!\name Accessors
* \{
*/

/*!\brief Returns the score of two letters.
* \tparam alph1_t The type of the first letter.
* \tparam alph2_t The type of the second letter.
* \param[in] alph1 The first letter to compare.
* \param[in] alph2 The second letter to compare.
*
* \details
*
* This function returns 0 if the two letters are equal and -1 otherwise. Note both the alphabet types of the
rrahn marked this conversation as resolved.
Show resolved Hide resolved
* compared letters must model the std::equality_comparable_with concept, otherwise no overload of this function
rrahn marked this conversation as resolved.
Show resolved Hide resolved
* is available.
*
* \returns The score of the two letters. 0 if the letters are equal, -1 otherwise.
rrahn marked this conversation as resolved.
Show resolved Hide resolved
*/
template <typename alph1_t, typename alph2_t>
requires std::equality_comparable_with<std::remove_reference_t<alph1_t>, std::remove_reference_t<alph2_t>>
rrahn marked this conversation as resolved.
Show resolved Hide resolved
constexpr score_type score(alph1_t const alph1, alph2_t const alph2) const noexcept
{
return alph1 == alph2 ? 0 : -1;
}
//!\}

//!\name Comparison operators
//!\{

//!\brief Always true.
constexpr bool operator==(hamming_scoring_scheme const &) const noexcept
{
return true;
}

//!\brief Always false.
constexpr bool operator!=(hamming_scoring_scheme const &) const noexcept
{
return false;
}
//!\}
};

} // namespace seqan3
Original file line number Diff line number Diff line change
Expand Up @@ -15,8 +15,10 @@ int main()
{
// configure a global alignment for DNA sequences
auto min_cfg = seqan3::align_cfg::method_global{}
| seqan3::align_cfg::scoring_scheme{
seqan3::nucleotide_scoring_scheme{seqan3::match_score{4}, seqan3::mismatch_score{-5}}};
| seqan3::align_cfg::scoring_scheme{seqan3::nucleotide_scoring_scheme{seqan3::match_score{4},
seqan3::mismatch_score{-5}}}
| seqan3::align_cfg::gap_cost_affine{seqan3::align_cfg::open_score{-10},
seqan3::align_cfg::extension_score{-1}};

auto seq1 = "TCGT"_dna4;
auto seq2 = "ACGA"_dna4;
Expand Down
Original file line number Diff line number Diff line change
@@ -1 +1 @@
7
10
3 changes: 2 additions & 1 deletion test/snippet/alignment/pairwise/align_pairwise_range.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@

#include <seqan3/alignment/configuration/all.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/hamming_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>

Expand All @@ -19,7 +20,7 @@ int main()

// Configure the alignment kernel.
auto config =
seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{seqan3::nucleotide_scoring_scheme{}};
seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{seqan3::hamming_scoring_scheme{}};

// Compute the alignment over a range of pairs.
for (auto const & res : seqan3::align_pairwise(seqan3::views::zip(data1, data2), config))
Expand Down
Loading
Loading