-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Downsample a bam by duplicate sets instead of reads #6512
Conversation
@lbergelson Are you happy with this branch as-is or do you want further refactoring as @takutosato hinted at above? |
@droazen or @jamesemery, could one of you take a look at this PR? |
I'm reviewing this, sorry for the slow response. |
**/ | ||
public class DownsampleByDuplicateSet extends DuplicateSetWalker { | ||
@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, doc = "") | ||
public File outputBam; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public File outputBam; | |
public GATKPathSpecifier outputBam; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're trying to eliminate the use of File
wherever possible, since its limiting, and rely the more general GATKPathSpecifier
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad should I avoid createTempFile() in test classes as well? It creates a File
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takutosato No - using createTempFile
and File
in tests is fine for now. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks!
public void onTraversalStart() { | ||
super.onTraversalStart(); | ||
rng = RandomGeneratorFactory.createRandomGenerator(new Random(RANDOM_SEED)); | ||
outputWriter = createSAMWriter(IOUtils.getPath(outputBam.getAbsolutePath()), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use my GATKPathSpecifier
suggestion above, you can just call .toPath()
on the GATKPathSpecifier
instance here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-sorted = false means this will resort the output I believe. Is that what you want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.toPath -> done.
Hmmm. That's not what I want. But I don't think it sorts the output. Will investigate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably doesn't sort it if the sort order is listed as "unsorted"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takutosato Checkpointing my review so you can start taking a look at what I'm thinking. I still haven't looked at the tool itself DownsampleByDuplicateSet
or the DuplicateSet
code. I'll try to get to those tomorrow.
* http://fulcrumgenomics.github.io/fgbio/tools/latest/GroupReadsByUmi.html | ||
*/ | ||
public abstract class DuplicateSetWalker extends ReadWalker { | ||
DuplicateSet currentDuplicateSet; // TODO: does it make sense to call clear() and recycle the same object? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't. I don't think the performance gain is worth the potential confusion if someone does something like try to accumulate duplicate sets during the traversal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be private, protected, or marked visiblefortesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Done
|
||
@Override | ||
public final void onStartup(){ | ||
// TODO: I had to make onStartUp of ReadWalker not final to do this. The right thing to do here is to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super happy about that, it makes it less clear if downstream tools are meant to override it.
// write a ReadWalkerBase parent class for both ReadWalker and DuplicateSet Walker, as Louis suggested. | ||
super.onStartup(); | ||
currentDuplicateSet = new DuplicateSet(); | ||
currentDuplicateSet.setMoleduleId(INITIAL_MOLECULAR_ID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm being dense, but couldn't you initialize this in place when you define the variable and not need to do it here? That would avoid the need for overriding onStartup
. If you can't instantiate a DuplicateSet and set the ID at the same time, I would just change duplicate set. It would probably be better though if instead of exposing a magic number, DuplicateSet had a function like isEmpty()
that would answer the question of if a read has been added to it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points. Done.
* read6: ... MI:Z:1/B ... | ||
* read7: ... MI:Z:1/B ... | ||
* | ||
* Thus it's sufficient to go through the reads in order and collect them in a list until |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way to tell if the file is sorted in the correct way or not? It would be good to check that it is somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do something silly like using a bloom filter to track already seen molecules ID's and warn if you see a surplus of collisions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would check if the bam have any sorting subkeys specified, it's a thing Nil's and Tim wanted to add to the spec so they might be outputting the. I don't think htsjdk knows about them yet but if it exists in the bams it would be a good reason to make htsjdk aware of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bam header has SO:unsorted
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a check to ensure that the molecular ID is monotonically increasing and that the tag exists. They should catch it if the input bam is not sorted as required—is that good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sounds great.
*/ | ||
public abstract class DuplicateSetWalker extends ReadWalker { | ||
DuplicateSet currentDuplicateSet; // TODO: does it make sense to call clear() and recycle the same object? | ||
public static int INITIAL_MOLECULAR_ID = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed variable
} | ||
|
||
// Strict match e.g. equalsExactly("AGT-GCT", "GCT-AGT") returns false | ||
public boolean equalsExactly(final UMI that) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be javadoc.
// Check for whether the two umis came from the same molecule | ||
// e.g. equalsModuloOrder("AGT-GCT", "GCT-AGT") returns true | ||
// equalsModuloOrder("AGT-GCT", "AGT-GCT") also returns true | ||
public boolean equalsModuloOrder(final UMI that) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also be javadoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this post error correction?
return that.umiSmall.equals(this.umiSmall) && that.umiLarge.equals(this.umiLarge); | ||
} | ||
|
||
public String getStandardizedUMI(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc
|
||
public String getStandardizedUMI(){ | ||
if (umi1.compareTo(umi2) > 0){ | ||
// umi1 is lexicographically greater e.g. umi1 = TAT, umi2 = AAC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is redundant since you sorted them already, just return umiSmall + umiLong, or better yet just save the combination.
import org.apache.commons.lang3.tuple.Pair; | ||
import org.broadinstitute.hellbender.engine.filters.UMIReadFilter; | ||
import org.broadinstitute.hellbender.utils.read.GATKRead; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class has lots of unused/untested stuff that probably should be used / tested. The umi logic should be implemented here and called from the filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takutosato Some more comments. I have some thoughts about changing DuplicateSet
to disallow non-matching values instead of reporting false when adding them. It's possible I don't understand a necessary use case though.
You need unit tests for the rest of the classes here that aren't tested. UMI, UMIReadFilter, and DuplicateSet itself.
* | ||
* The input bam must have been sorted by UMI using FGBio GroupReadsByUmi (http://fulcrumgenomics.github.io/fgbio/tools/latest/GroupReadsByUmi.html). | ||
* | ||
* Use this tool to create, for instance, an insilico mixture of duplex-sequenced samples to simulate tumor subclone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo? tumor subclones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, doc = "") | ||
public File outputBam; | ||
|
||
@Argument(fullName = "DS", doc = "This fraction of duplicate sets in the input bam will be retained") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably give this a more descriptive name. Maybe "fraction-to-keep"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should add bounds to this, i.e. minValue = 0.0 maxValue = 1.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done*2
|
||
private static final int RANDOM_SEED = 142; | ||
private RandomGenerator rng; | ||
private static int numFragments; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these fields should be static.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public void onTraversalStart() { | ||
super.onTraversalStart(); | ||
rng = RandomGeneratorFactory.createRandomGenerator(new Random(RANDOM_SEED)); | ||
outputWriter = createSAMWriter(IOUtils.getPath(outputBam.getAbsolutePath()), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-sorted = false means this will resort the output I believe. Is that what you want?
|
||
@Override | ||
public void onTraversalStart() { | ||
super.onTraversalStart(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unnecessary because the walker bases are obligated to NOT implement this. It's fine to do it anyway though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
.add("O", out.getAbsolutePath()); | ||
runCommandLine(args, DownsampleByDuplicateSet.class.getSimpleName()); | ||
|
||
final ReadsDataSource originalBam = new ReadsDataSource(Paths.get(NA12878_GROUPED)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
declare datasources in a try-with-resources
final Iterator<GATKRead> iterator = readsDataSource.iterator(); | ||
while (iterator.hasNext()){ | ||
final GATKRead read = iterator.next(); | ||
final String molecularID = read.getAttributeAsString("MI"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the constant you defined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
/** When down-sampling rate is 1.0, the input file is returned unchanged **/ | ||
@Test | ||
public void testNoDownsampling(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test that shows that DS = 0.0 produces an empty file?
* This tool address the above problem by dropping a set fraction of _duplicate sets_, rather than reads, at random. | ||
* Implicit in this approach is that a read and its mate are dropped or retained together. | ||
* While trivial when the input bam is sorted by UMI and query name, this is far from trivial when one attempts | ||
* to downsample reads naively with a tool like {@link PrintReads}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the doc include an example usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example usage added.
.add("O", out.getAbsolutePath()); | ||
runCommandLine(args, DownsampleByDuplicateSet.class.getSimpleName()); | ||
|
||
final ReadsDataSource originalBam = new ReadsDataSource(Paths.get(NA12878_GROUPED)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try-with-resources
@takutosato Woops, I just realized you might have just wanted a rough look at this and not an in depth review... sorry if this was premature. |
@lbergelson no I appreciate the review. Having read the code, do you think it's worth writing a "ReadWalkerBase," which would be a parent class of ReadWalker and DuplicateSetWalker? |
I don't think it's super necessary as long as the weirdness about the double What might be worthwhile is changing the DuplicateSetWalker to be more abstract, so that it is a walker that groups reads by something instead of always molecular id. An obvious similar walker would be one that groups by readname in readname sorted bams, or by cell in single cell RNA bams. Since we don't have an obvious need for either of those though we could also wait until we have a second example of a walker that needs to do this. |
I'm working on a tool that processes aligned contigs from local assemblies that could use the group-by-queryname functionality. It would be nice not to have to buffer everything that comes through the apply method to process at the end of the traversal. Not hugely important, but there's your 2nd example. |
@tedsharpe @takutosato Lets do it then, but lets do it in a second PR since this one has a ton of comments already? |
@tedsharpe Your tool is a little more sophisticated in it's grouping though isn't it? It's going out and searching for long range mates I thought? This is just batching things as they come in assuming everything is in order. |
@lbergelson making DuplicateSetWalker more abstract in a separate PR sounds good! |
@lbergelson Not in this part. Just grouping a queryname sorted input by read name. (Need to gather the primary line together with all supplementary alignments.) I think we'd find other uses, too. |
@lbergelson I addressed your comments, cleaned up the code, and wrote tests. I removed UMI.java and UMIReadFilter.java. They will likely come back in a future PR. |
@takutosato Did you want to try to get this into this week's release? |
@droazen that was my goal but next release is fine too |
@takutosato Is there anything holding this up so that it can't be ready for the upcoming release? If it is possible, I would like to see this in the next release. |
@fleharty nope just pending review. |
@lbergelson Is there time to re-review this PR before the release, or have we run out of time? |
If @lbergelson is able to do a final review on this branch tomorrow, I'd be happy to wait until it's in before releasing. |
@droazen incidentally, are you OK with me adding |
@takutosato Why can't you override the existing method |
Because the |
@takutosato Just call the protected method |
@droazen it looked like it was going to work but then— @Override
final protected ReferenceDataSource directlyAccessEngineReferenceDataSource() {
throw new GATKException("Should never directly access the engine ReferenceDataSource in walker tool classes " +
"outside of the engine package. Walker tools should get their data via apply() instead.");
} Also, this is really not the responsibility of the tool class—it should be handled as part of the walker. |
@takutosato Ah yes, I forgot about the |
Stash the reference context you get when passing through the original apply method, and use it in onTraversalSuccess when calling the new apply method to push out the final set. |
@tedsharpe The problem with that is that |
@tedsharpe @droazen thanks guys! Just pushed a commit with suggested changes. I defined Is that a good enough workaround for now? |
@takutosato A separate method called automatically in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takutosato Looks great! Much improved since the rough draft! Thank you.
I have a few comments about minor stuff but I think it's good to merge now and those can be addressed later.
@Argument(fullName = MIN_REQUIRED_READS_NAME, doc = "The mininum total number of reads required in the set", optional = true) | ||
private int minimumRequiredReadsPerUMI = DEFAULT_MINIMUM_READS_PER_SET; | ||
|
||
// The user may choose to keep read sets containing both strands (i.d. duplex evidence) only by setting this argument to a positive number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wording. This would be clearer as "The user may choose to only keep read sets containing both strands by setting this argument to a positive number."
I misread this at first as the "read sets containing both strands will be kept ONLY if this is set to a positive number"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I went back and forth about this and I clearly chose wrong. done.
private int minimumRequiredReadsPerUMI = DEFAULT_MINIMUM_READS_PER_SET; | ||
|
||
// The user may choose to keep read sets containing both strands (i.d. duplex evidence) only by setting this argument to a positive number | ||
@Argument(fullName = MIN_REQUIRED_READS_PER_STRAND_NAME, doc = "The mininum total number of reads in each strand", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add a minValue = 0 here to keep it positive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
private static final int DEFAULT_MINIMUM_READS_PER_STRAND = 0; | ||
|
||
@Argument(fullName = MIN_REQUIRED_READS_NAME, doc = "The mininum total number of reads required in the set", optional = true) | ||
private int minimumRequiredReadsPerUMI = DEFAULT_MINIMUM_READS_PER_SET; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a minValue to the argument to make sure it's positive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
if (duplicateSetMoleculeNumber < readMoleculeNumber) { | ||
// The incoming read's molecule ID does not match that of the current duplicate, meaning we've reached the end of the current set. | ||
// Call the apply() method to process the current set and start a new set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment should go on the line below, it's not about the rejection stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
return; | ||
} | ||
|
||
// The incoming read has the same UMI as the current set; simply add to the current set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like to put the exclusive conditions in an else block to make it clear that none of the options above fall down into this, but that's a style point that some people disagree with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I didn't like the way the if else blocks looked so I did it this way, hope that's ok.
|
||
@Override | ||
public void onTraversalStart() { | ||
rng = new Random(RANDOM_SEED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just initialize this where you declare rng and make the variable final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
return moleculeNumber + ReadsWithSameUMI.FGBIO_MI_TAG_DELIMITER + strand; | ||
} | ||
|
||
public static int getMoleculeNumberOfRead(final GATKRead read){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should probably have javadoc too even though they're pretty self explanatory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* @returns Counts of reads from each strand, the first element is always larger than the second | ||
**/ | ||
public static Pair<Integer, Integer> countStrands(final List<GATKRead> reads){ | ||
final int strandACount = (int) reads.stream().filter(r -> getStrandOfRead(r).equals("A")).count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's definitely a more efficient way of doing this that only loops through once but this is fine.
public void addRead(final GATKRead read){ | ||
Utils.validate(reads.isEmpty() || moleculeID.getMoleculeNumber() == MoleculeID.getMoleculeNumberOfRead(read), | ||
String.format("Molecule number of the set and that of the new read don't match: set number = %d, read number = %d", moleculeID.getMoleculeNumber(), MoleculeID.getMoleculeNumberOfRead(read))); | ||
Utils.validate(interval.getContig().equals(read.getContig()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a interval.contigsMatch(read)
method that makes this slightly less verbose (and could be implemented slightly more efficiently than getContig.equals(getContig) in some cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
private void init(GATKRead read){ | ||
Utils.validate(reads.isEmpty(), String.format("Initializing a non-empty set")); | ||
moleculeID = new MoleculeID(read); | ||
interval = new SimpleInterval(read.getContig(), read.getStart(), read.getEnd()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just write this as new SimpleInterval(read)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
beautiful
I think inside |
@takutosato Why not in the |
(ie., override |
@droazen that is GENIUS |
5ce1412
to
44557b9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
… an input bam sorted by UMI. (#6512)
@lbergelson we've chatted about this about a month ago. I recall you wanted me to write a "ReadWalkerBase," which would be the parent class of ReadWalker and DuplicateSetWalker. I haven't done that just yet—in this code DuplicateSetWalker is a subclass of ReadWalker. I wanted to show you this code first before embarking on further refactoring.