Write Python code to generate a report on the statistical occurrences of various amino acids in a given string.
Your job is to complete the missing pieces and update the code wherever necessary to make all the unit tests work.
Completion of this task requires the use of Python 3
.
Note: Please do NOT modify any tests unless specifically told to do so.
Implement the clean_strands
method in app.dna_analyser.DNAAnalyser
class.
It should read the DNA strands from a given file (filename
argument) in which every line represents a single strand.
The method should return the list of valid strands only. A strand is valid when:
- its length is greater than 10,
- its length is less than 100,
- it contains only
A
,T
,G
andC
characters, - it's not an empty string,
- all characters are in uppercase.
When the number of clean strands is less than 3, the method should return an empty list.
Implement the create_strands
method in app.dna_analyser.DNAAnalyser
class.
This method receives a list of clean strands, computed by the clean_strands
method. Using this list, it should find overlapping elements and build one long strand which should be returned as a string.
Two strands overlap when last 3 characters of the first one match exactly 3 first characters of the second strand.
The following rules should be used for building the resulting string.
- All input strands must overlap.
- When two strands overlap they should be merged, e.g.
AAACCCAATTT
andTTTACACAGCT
should be merged toAAACCCAATTTACACAGCT
- the overlapping part should not be repeated. - One strand must overlap with another one only by the end of it - this is the starting strand.
- One strand must overlap with another one only by the begging of it - this is the ending strand.
- All other strands must overlap on both ends.
Please note, the starting and ending strands can be anywhere in the received list.
For the following input list
[AGTGGGGGGGGG, AAACCCAATTT, TTTACACAGCT, GCTGGGCCCAGT]
the method should return
AAACCCAATTTACACAGCTGGGCCCAGTGGGGGGGGG
where the strands would be matched and merged as shown below.
AAACCCAA[TTT] -> [TTT]ACACA[GCT] -> [GCT]GGGCCC[AGT] -> [AGT]GGGGGGGGG
The last part of this task is to actually generate the report. Please implement the get_amino_acids_report
method in the app.dna_analyser.DNAAnalyser
class.
This method should split the dna_sequence
argument into 3-characters long substrings. Each substring is a codon specifying an amino acid. Based on the data from the codon.tsv
file count amino acids occurrences in the given sequence.
The method should return a dictionary with amino acids as keys and their count as values.
For sequence AAATTTGGGAAA
and codon.tsv
file with following content:
AAA Lysine
GGG Glycine
TTT Phenylalanine
This method should return a dictionary like the one below.
{
'Lysine': 2 ,
'Phenylalanine': 1 ,
'Glycine': 1
}
You shouldn't modify the unit tests.
To execute the unit tests, use:
pip install -q -e . && python3 setup.py pytest