Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Downsample a bam by duplicate sets instead of reads #6512

Merged
merged 1 commit into from
Apr 23, 2020
Merged

Conversation

takutosato
Copy link
Contributor

@lbergelson we've chatted about this about a month ago. I recall you wanted me to write a "ReadWalkerBase," which would be the parent class of ReadWalker and DuplicateSetWalker. I haven't done that just yet—in this code DuplicateSetWalker is a subclass of ReadWalker. I wanted to show you this code first before embarking on further refactoring.

@gatk-bot
Copy link

Travis reported job failures from build 29677
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 29677.14 logs

@droazen
Copy link
Contributor

droazen commented Apr 1, 2020

@lbergelson Are you happy with this branch as-is or do you want further refactoring as @takutosato hinted at above?

@takutosato
Copy link
Contributor Author

@droazen or @jamesemery, could one of you take a look at this PR?

@lbergelson
Copy link
Member

I'm reviewing this, sorry for the slow response.

**/
public class DownsampleByDuplicateSet extends DuplicateSetWalker {
@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, doc = "")
public File outputBam;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public File outputBam;
public GATKPathSpecifier outputBam;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're trying to eliminate the use of File wherever possible, since its limiting, and rely the more general GATKPathSpecifier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad should I avoid createTempFile() in test classes as well? It creates a File object.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takutosato No - using createTempFile and File in tests is fine for now. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks!

public void onTraversalStart() {
super.onTraversalStart();
rng = RandomGeneratorFactory.createRandomGenerator(new Random(RANDOM_SEED));
outputWriter = createSAMWriter(IOUtils.getPath(outputBam.getAbsolutePath()), false);
Copy link
Collaborator

@cmnbroad cmnbroad Apr 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use my GATKPathSpecifier suggestion above, you can just call .toPath() on the GATKPathSpecifier instance here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-sorted = false means this will resort the output I believe. Is that what you want?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.toPath -> done.
Hmmm. That's not what I want. But I don't think it sorts the output. Will investigate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably doesn't sort it if the sort order is listed as "unsorted"

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takutosato Checkpointing my review so you can start taking a look at what I'm thinking. I still haven't looked at the tool itself DownsampleByDuplicateSet or the DuplicateSet code. I'll try to get to those tomorrow.

* http://fulcrumgenomics.github.io/fgbio/tools/latest/GroupReadsByUmi.html
*/
public abstract class DuplicateSetWalker extends ReadWalker {
DuplicateSet currentDuplicateSet; // TODO: does it make sense to call clear() and recycle the same object?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't. I don't think the performance gain is worth the potential confusion if someone does something like try to accumulate duplicate sets during the traversal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be private, protected, or marked visiblefortesting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Done


@Override
public final void onStartup(){
// TODO: I had to make onStartUp of ReadWalker not final to do this. The right thing to do here is to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super happy about that, it makes it less clear if downstream tools are meant to override it.

// write a ReadWalkerBase parent class for both ReadWalker and DuplicateSet Walker, as Louis suggested.
super.onStartup();
currentDuplicateSet = new DuplicateSet();
currentDuplicateSet.setMoleduleId(INITIAL_MOLECULAR_ID);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm being dense, but couldn't you initialize this in place when you define the variable and not need to do it here? That would avoid the need for overriding onStartup. If you can't instantiate a DuplicateSet and set the ID at the same time, I would just change duplicate set. It would probably be better though if instead of exposing a magic number, DuplicateSet had a function like isEmpty() that would answer the question of if a read has been added to it yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. Done.

* read6: ... MI:Z:1/B ...
* read7: ... MI:Z:1/B ...
*
* Thus it's sufficient to go through the reads in order and collect them in a list until
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to tell if the file is sorted in the correct way or not? It would be good to check that it is somehow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do something silly like using a bloom filter to track already seen molecules ID's and warn if you see a surplus of collisions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would check if the bam have any sorting subkeys specified, it's a thing Nil's and Tim wanted to add to the spec so they might be outputting the. I don't think htsjdk knows about them yet but if it exists in the bams it would be a good reason to make htsjdk aware of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bam header has SO:unsorted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check to ensure that the molecular ID is monotonically increasing and that the tag exists. They should catch it if the input bam is not sorted as required—is that good enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sounds great.

*/
public abstract class DuplicateSetWalker extends ReadWalker {
DuplicateSet currentDuplicateSet; // TODO: does it make sense to call clear() and recycle the same object?
public static int INITIAL_MOLECULAR_ID = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed variable

}

// Strict match e.g. equalsExactly("AGT-GCT", "GCT-AGT") returns false
public boolean equalsExactly(final UMI that) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be javadoc.

// Check for whether the two umis came from the same molecule
// e.g. equalsModuloOrder("AGT-GCT", "GCT-AGT") returns true
// equalsModuloOrder("AGT-GCT", "AGT-GCT") also returns true
public boolean equalsModuloOrder(final UMI that) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be javadoc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this post error correction?

return that.umiSmall.equals(this.umiSmall) && that.umiLarge.equals(this.umiLarge);
}

public String getStandardizedUMI(){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc


public String getStandardizedUMI(){
if (umi1.compareTo(umi2) > 0){
// umi1 is lexicographically greater e.g. umi1 = TAT, umi2 = AAC
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant since you sorted them already, just return umiSmall + umiLong, or better yet just save the combination.

import org.apache.commons.lang3.tuple.Pair;
import org.broadinstitute.hellbender.engine.filters.UMIReadFilter;
import org.broadinstitute.hellbender.utils.read.GATKRead;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class has lots of unused/untested stuff that probably should be used / tested. The umi logic should be implemented here and called from the filter.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takutosato Some more comments. I have some thoughts about changing DuplicateSet to disallow non-matching values instead of reporting false when adding them. It's possible I don't understand a necessary use case though.

You need unit tests for the rest of the classes here that aren't tested. UMI, UMIReadFilter, and DuplicateSet itself.

*
* The input bam must have been sorted by UMI using FGBio GroupReadsByUmi (http://fulcrumgenomics.github.io/fgbio/tools/latest/GroupReadsByUmi.html).
*
* Use this tool to create, for instance, an insilico mixture of duplex-sequenced samples to simulate tumor subclone.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo? tumor subclones

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, doc = "")
public File outputBam;

@Argument(fullName = "DS", doc = "This fraction of duplicate sets in the input bam will be retained")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably give this a more descriptive name. Maybe "fraction-to-keep"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add bounds to this, i.e. minValue = 0.0 maxValue = 1.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done*2


private static final int RANDOM_SEED = 142;
private RandomGenerator rng;
private static int numFragments;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these fields should be static.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public void onTraversalStart() {
super.onTraversalStart();
rng = RandomGeneratorFactory.createRandomGenerator(new Random(RANDOM_SEED));
outputWriter = createSAMWriter(IOUtils.getPath(outputBam.getAbsolutePath()), false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-sorted = false means this will resort the output I believe. Is that what you want?


@Override
public void onTraversalStart() {
super.onTraversalStart();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary because the walker bases are obligated to NOT implement this. It's fine to do it anyway though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

.add("O", out.getAbsolutePath());
runCommandLine(args, DownsampleByDuplicateSet.class.getSimpleName());

final ReadsDataSource originalBam = new ReadsDataSource(Paths.get(NA12878_GROUPED));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

declare datasources in a try-with-resources

final Iterator<GATKRead> iterator = readsDataSource.iterator();
while (iterator.hasNext()){
final GATKRead read = iterator.next();
final String molecularID = read.getAttributeAsString("MI");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the constant you defined

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


/** When down-sampling rate is 1.0, the input file is returned unchanged **/
@Test
public void testNoDownsampling(){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test that shows that DS = 0.0 produces an empty file?

* This tool address the above problem by dropping a set fraction of _duplicate sets_, rather than reads, at random.
* Implicit in this approach is that a read and its mate are dropped or retained together.
* While trivial when the input bam is sorted by UMI and query name, this is far from trivial when one attempts
* to downsample reads naively with a tool like {@link PrintReads}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the doc include an example usage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example usage added.

.add("O", out.getAbsolutePath());
runCommandLine(args, DownsampleByDuplicateSet.class.getSimpleName());

final ReadsDataSource originalBam = new ReadsDataSource(Paths.get(NA12878_GROUPED));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try-with-resources

@lbergelson
Copy link
Member

@takutosato Woops, I just realized you might have just wanted a rough look at this and not an in depth review... sorry if this was premature.

@takutosato
Copy link
Contributor Author

@lbergelson no I appreciate the review. Having read the code, do you think it's worth writing a "ReadWalkerBase," which would be a parent class of ReadWalker and DuplicateSetWalker?

@lbergelson
Copy link
Member

I don't think it's super necessary as long as the weirdness about the double apply method is documented. It wouldn't be bad, but it's a lot of extra work for not too much benefit. I think we should push it to the future. If we have more classes like this it might be worth it.

What might be worthwhile is changing the DuplicateSetWalker to be more abstract, so that it is a walker that groups reads by something instead of always molecular id. An obvious similar walker would be one that groups by readname in readname sorted bams, or by cell in single cell RNA bams. Since we don't have an obvious need for either of those though we could also wait until we have a second example of a walker that needs to do this.

@tedsharpe
Copy link
Contributor

I'm working on a tool that processes aligned contigs from local assemblies that could use the group-by-queryname functionality. It would be nice not to have to buffer everything that comes through the apply method to process at the end of the traversal. Not hugely important, but there's your 2nd example.

@lbergelson
Copy link
Member

@tedsharpe @takutosato Lets do it then, but lets do it in a second PR since this one has a ton of comments already?

@lbergelson
Copy link
Member

@tedsharpe Your tool is a little more sophisticated in it's grouping though isn't it? It's going out and searching for long range mates I thought? This is just batching things as they come in assuming everything is in order.

@takutosato
Copy link
Contributor Author

@lbergelson making DuplicateSetWalker more abstract in a separate PR sounds good!

@tedsharpe
Copy link
Contributor

@lbergelson Not in this part. Just grouping a queryname sorted input by read name. (Need to gather the primary line together with all supplementary alignments.) I think we'd find other uses, too.
@takutosato Separate PR is fine with me.

@takutosato
Copy link
Contributor Author

@lbergelson I addressed your comments, cleaned up the code, and wrote tests. I removed UMI.java and UMIReadFilter.java. They will likely come back in a future PR.

@gatk-bot
Copy link

gatk-bot commented Apr 21, 2020

Travis reported job failures from build 30111
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 30111.14 logs
unit openjdk8 30111.3 logs

@droazen
Copy link
Contributor

droazen commented Apr 22, 2020

@takutosato Did you want to try to get this into this week's release?

@takutosato
Copy link
Contributor Author

@droazen that was my goal but next release is fine too

@fleharty
Copy link
Contributor

@takutosato Is there anything holding this up so that it can't be ready for the upcoming release? If it is possible, I would like to see this in the next release.

@takutosato
Copy link
Contributor Author

@fleharty nope just pending review.

@fleharty
Copy link
Contributor

@lbergelson Is there time to re-review this PR before the release, or have we run out of time?

@droazen
Copy link
Contributor

droazen commented Apr 22, 2020

If @lbergelson is able to do a final review on this branch tomorrow, I'd be happy to wait until it's in before releasing.

@takutosato
Copy link
Contributor Author

@droazen incidentally, are you OK with me adding finalizeTraversal() do ReadWalker? From inside apply() we don't know whether the read we're looking at is the last one, so we need to process whatever hasn't been processed before ending traversal.

@droazen
Copy link
Contributor

droazen commented Apr 22, 2020

@takutosato Why can't you override the existing method onTraversalSuccess()? It gets called right after the last apply() call.

@takutosato
Copy link
Contributor Author

Because the reference variable is package private. If I want to call apply() from onTraversalSuccess, it needs to be protected—will push a commit to shortly to show you what I mean.

@droazen
Copy link
Contributor

droazen commented Apr 22, 2020

@takutosato Just call the protected method directlyAccessEngineReferenceDataSource() from onTraversalSuccess() :)

@takutosato
Copy link
Contributor Author

@droazen it looked like it was going to work but then—

@Override
    final protected ReferenceDataSource directlyAccessEngineReferenceDataSource() {
        throw new GATKException("Should never directly access the engine ReferenceDataSource in walker tool classes " +
                "outside of the engine package. Walker tools should get their data via apply() instead.");
    }

Also, this is really not the responsibility of the tool class—it should be handled as part of the walker.

@droazen
Copy link
Contributor

droazen commented Apr 23, 2020

@takutosato Ah yes, I forgot about the final override of that method in WalkerBase. In that case, my suggestion would be to add the additional method in DuplicateSetWalker rather than ReadWalker, since the need for it is caused by the accumulation of reads specific to DuplicateSetWalker.

@tedsharpe
Copy link
Contributor

Stash the reference context you get when passing through the original apply method, and use it in onTraversalSuccess when calling the new apply method to push out the final set.

@droazen
Copy link
Contributor

droazen commented Apr 23, 2020

@tedsharpe The problem with that is that apply() gets a ReferenceContext, not the ReferenceDataSource needed here. I think that an additional finalization method in DuplicateSetWalker is probably the cleanest solution for now.

@takutosato
Copy link
Contributor Author

@tedsharpe @droazen thanks guys!

Just pushed a commit with suggested changes. I defined onTraversalSuccess in duplicateSetWalker to process the remaining set of reads. Then the tool subclass calls super.onTraversalSuccess() first thing inside its onTraversalSuccess.

Is that a good enough workaround for now?

@droazen
Copy link
Contributor

droazen commented Apr 23, 2020

@takutosato A separate method called automatically in DuplicateSetWalker would be more foolproof than requiring the tool author to remember to call super in onTraversalSuccess(), I think.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takutosato Looks great! Much improved since the rough draft! Thank you.

I have a few comments about minor stuff but I think it's good to merge now and those can be addressed later.

@Argument(fullName = MIN_REQUIRED_READS_NAME, doc = "The mininum total number of reads required in the set", optional = true)
private int minimumRequiredReadsPerUMI = DEFAULT_MINIMUM_READS_PER_SET;

// The user may choose to keep read sets containing both strands (i.d. duplex evidence) only by setting this argument to a positive number
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wording. This would be clearer as "The user may choose to only keep read sets containing both strands by setting this argument to a positive number."

I misread this at first as the "read sets containing both strands will be kept ONLY if this is set to a positive number"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I went back and forth about this and I clearly chose wrong. done.

private int minimumRequiredReadsPerUMI = DEFAULT_MINIMUM_READS_PER_SET;

// The user may choose to keep read sets containing both strands (i.d. duplex evidence) only by setting this argument to a positive number
@Argument(fullName = MIN_REQUIRED_READS_PER_STRAND_NAME, doc = "The mininum total number of reads in each strand", optional = true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add a minValue = 0 here to keep it positive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private static final int DEFAULT_MINIMUM_READS_PER_STRAND = 0;

@Argument(fullName = MIN_REQUIRED_READS_NAME, doc = "The mininum total number of reads required in the set", optional = true)
private int minimumRequiredReadsPerUMI = DEFAULT_MINIMUM_READS_PER_SET;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a minValue to the argument to make sure it's positive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


if (duplicateSetMoleculeNumber < readMoleculeNumber) {
// The incoming read's molecule ID does not match that of the current duplicate, meaning we've reached the end of the current set.
// Call the apply() method to process the current set and start a new set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment should go on the line below, it's not about the rejection stuff

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return;
}

// The incoming read has the same UMI as the current set; simply add to the current set
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to put the exclusive conditions in an else block to make it clear that none of the options above fall down into this, but that's a style point that some people disagree with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I didn't like the way the if else blocks looked so I did it this way, hope that's ok.


@Override
public void onTraversalStart() {
rng = new Random(RANDOM_SEED);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just initialize this where you declare rng and make the variable final.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return moleculeNumber + ReadsWithSameUMI.FGBIO_MI_TAG_DELIMITER + strand;
}

public static int getMoleculeNumberOfRead(final GATKRead read){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should probably have javadoc too even though they're pretty self explanatory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* @returns Counts of reads from each strand, the first element is always larger than the second
**/
public static Pair<Integer, Integer> countStrands(final List<GATKRead> reads){
final int strandACount = (int) reads.stream().filter(r -> getStrandOfRead(r).equals("A")).count();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's definitely a more efficient way of doing this that only loops through once but this is fine.

public void addRead(final GATKRead read){
Utils.validate(reads.isEmpty() || moleculeID.getMoleculeNumber() == MoleculeID.getMoleculeNumberOfRead(read),
String.format("Molecule number of the set and that of the new read don't match: set number = %d, read number = %d", moleculeID.getMoleculeNumber(), MoleculeID.getMoleculeNumberOfRead(read)));
Utils.validate(interval.getContig().equals(read.getContig()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a interval.contigsMatch(read) method that makes this slightly less verbose (and could be implemented slightly more efficiently than getContig.equals(getContig) in some cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private void init(GATKRead read){
Utils.validate(reads.isEmpty(), String.format("Initializing a non-empty set"));
moleculeID = new MoleculeID(read);
interval = new SimpleInterval(read.getContig(), read.getStart(), read.getEnd());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just write this as new SimpleInterval(read)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

beautiful

@takutosato
Copy link
Contributor Author

I think inside onTraversalSuccess might be the only place in which we can call such a method automatically—right?

@droazen
Copy link
Contributor

droazen commented Apr 23, 2020

@takutosato Why not in the traverse() method of DuplicateSetWalker?

@droazen
Copy link
Contributor

droazen commented Apr 23, 2020

(ie., override traverse() in DuplicateSetWalker, call super.traverse() (the ReadWalker version) first, then do your extra cleanup work)

@takutosato
Copy link
Contributor Author

@droazen that is GENIUS

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@droazen droazen merged commit 4ec2a4b into master Apr 23, 2020
@droazen droazen deleted the ts_downsample branch April 23, 2020 22:12
jonn-smith pushed a commit that referenced this pull request Jul 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants