Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Add --save-matches to fastmultigather #397

Merged
merged 23 commits into from
Aug 17, 2024
Merged

[MRG] Add --save-matches to fastmultigather #397

merged 23 commits into from
Aug 17, 2024

Conversation

mr-eyes
Copy link
Member

@mr-eyes mr-eyes commented Jul 25, 2024

TODO:

  • Implement the logic
  • Make it optional
  • Testing it
  • Docs

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 25, 2024

Initial testing is not bad at all!

w/ matches=True
sourmash scripts fastmultigather sig_paths.l hg38-entire.sig.zip -k 51  10000  705.08s user 422.49s system 1036% cpu 1:48.81 total

w/ matches=False
sourmash scripts fastmultigather sig_paths.l hg38-entire.sig.zip -k 51  10000  555.32s user 5.15s system 1159% cpu 48.319 total

main branch version
sourmash scripts fastmultigather sig_paths.l hg38-entire.sig.zip -k 51  10000  447.15s user 66.81s system 1113% cpu 46.153 total

@mr-eyes mr-eyes changed the title [WIP] Add --save-matches to fastgather & fastmultigather [WIP] Add --save-matches to fastmultigather Jul 25, 2024
@mr-eyes mr-eyes changed the title [WIP] Add --save-matches to fastmultigather [MRG] Add --save-matches to fastmultigather Jul 25, 2024
@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 25, 2024

Ready for review @ctb @bluegenes
I couldn't test the matched hashes length because hashes can overlap across targets.

@@ -70,12 +79,20 @@ pub fn fastmultigather(
let prefix = name.split(' ').next().unwrap_or_default().to_string();
let location = PathBuf::new(&prefix).file_name().unwrap();
if let Some(query_mh) = query_sig.minhash() {
let mut matching_hashes = if save_matches { Vec::new() } else { Vec::new() };
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line confuses me - isn't it creating a new vec either way? What does the if/else do here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line confused me as well - late-hour coding ;|


// Save matching hashes to .sig file if save_matches is true
if save_matches {
let sig_filename = format!("{}.matches.sig", name);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest using .gz or zip output here. .sig format is bulky.

@ctb
Copy link
Collaborator

ctb commented Jul 25, 2024

Initial testing is not bad at all!

I'm confused as to why --save-matches=False is resulting in any slowdown at all!

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 25, 2024

Initial testing is not bad at all!

I'm confused as to why --save-matches=False is resulting in any slowdown at all!

Here are the latest results after b546eef (#397)

// --save-matches=True
sourmash scripts fastmultigather sig_paths.l hg38-entire.sig.zip -k 51  10000  659.54s user 784.33s system 917% cpu 2:37.31 total

// --save-matches=False
sourmash scripts fastmultigather sig_paths.l hg38-entire.sig.zip -k 51  10000  443.37s user 3.25s system 1177% cpu 37.942 total

// main branch
sourmash scripts fastmultigather sig_paths.l hg38-entire.sig.zip -k 51  10000  453.75s user 3.63s system 1158% cpu 39.463 total

@ctb
Copy link
Collaborator

ctb commented Aug 12, 2024

I'll take a look at this when I can, but a few quick thoughts -

  • would be good to gzip the output sketches (if we're going to output a sig or sig.gz instead of a zip)
  • I would like to compare to the sourmash output somehow; will think on the test

@ctb
Copy link
Collaborator

ctb commented Aug 13, 2024

Testing at the command line -

sourmash scripts fastmultigather 47.sig.zip 47.sig.zip -s 10000 --save-matches

I get:

NC_009661.1 Shewanella baltica OS185 plasmid pS18501, complete sequence.matches.sig
NC_009661.1.gather.csv
NC_009661.1.prefetch.csv

so I think I'd suggest splitting off the identifier as with gather.csv and prefetch.csv.

In addition, I'd like to suggest compressing the matched sketches on writing; example code to do so appears to be over in sourmash, file src/core/src/ffi/signature.rs. If you don't feel like adding it please just create an issue to do so ;).

@ctb
Copy link
Collaborator

ctb commented Aug 13, 2024

I've created a PR here (into your PR) with a few suggested changes.

#420

* add pyo3 decoration

* do not ignore result

* fix fmt
@ctb
Copy link
Collaborator

ctb commented Aug 17, 2024

ok! a few more suggested changes & we're good to go: see #423

Please merge that, and then this, & then create issues as requested above ;)

* make use of in_directory

* add more tests
@mr-eyes
Copy link
Member Author

mr-eyes commented Aug 17, 2024

I will create the mentioned issues to be added later by myself or others. Many thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants