Refactor code for finding ORFs and improve variable naming

camilogarciabotero · Jul 8, 2024 · 6aeddf5
1 parent db91d60
commit 6aeddf5
Showing 1 changed file with 11 additions and 5 deletions.
diff --git a/docs/src/features.md b/docs/src/features.md
@@ -103,9 +103,17 @@ As mentioned above the `lors` calculates the log odds ratio of the ORF sequence
 Now we can analyse how the ORFs' scores are distributed as a function of their lengths, compared to random sequences.
 
 ```julia
-lambda = fasta2bioseq("test/data/NC_001416.1.fasta")[1]
+using FASTX, CairoMakie
 
-lambaorfs = findorfs(lambda, finder=NaiveFinder, minlen=100, scheme=lors)
+lambdafile = "test/data/NC_001416.1.fasta"
+
+# read the lambda genome as a `BioSequence`
+lambdaseq = open(FASTA.Reader, lambdafile) do reader
+    FASTX.sequence(LongDNA{4}, collect(reader)[1])  # the `do` block's value is returned by `open`
+end
+
+# find the ORFs in the lambda genome
+lambaorfs = findorfs(lambdaseq, finder=NaiveFinder, minlen=100, scheme=lors)
 
 lambdascores = score.(lambaorfs)
 lambdalengths = length.(lambaorfs)
@@ -121,8 +129,6 @@ randlengths = length.(vseqs)
 randscores = lors.(vseqs)
 
 ## plot the scores as a function of the lengths
-using CairoMakie
-
 f = Figure()
 ax = Axis(f[1, 1], xlabel="Length", ylabel="Log-odds ratio (Bits)")
 
@@ -150,4 +156,4 @@ f
 
 ![](assets/lors-lambda.png)
 
-What this plot shows is that the ORFs in the lambda genome have a higher scores than random sequences of the same length. The score is a measure of how likely a sequence given the coding model is compared to the non-coding model. In other words, the higher the score the more likely the sequence is coding. So, the plot shows that the ORFs in the lambda genome are more likely to be coding regions than random sequences. It also shows that the longer the ORF the higher the score, which is expected since longer ORFs are more likely to be coding regions than shorter ones.
+This plot shows that the ORFs in the lambda genome have higher scores than random sequences of the same length. The score measures how likely a sequence is under the coding model compared to the non-coding model. In other words, the higher the score, the more likely the sequence is coding. So the plot shows that the ORFs in the lambda genome are more likely to be coding regions than random sequences. It also shows that the longer the ORF, the higher the score, which is expected since longer ORFs are more likely to be coding regions than shorter ones.
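
Note on the score described in the paragraph above: a minimal sketch of the log-odds ratio, assuming that both the coding and the non-coding model are first-order Markov chains. The symbols $M_c$, $M_n$, $\pi$ and $a$ are notation introduced here for illustration and do not appear in the changed file.

```math
S(x) = \log_2 \frac{P(x \mid M_c)}{P(x \mid M_n)}
     = \log_2 \frac{\pi_c(x_1)}{\pi_n(x_1)}
     + \sum_{i=2}^{L} \log_2 \frac{a_c(x_{i-1}, x_i)}{a_n(x_{i-1}, x_i)}
```

Here $x = x_1 \dots x_L$ is the sequence, $\pi$ is the initial-state probability and $a$ is the transition probability of each model. A score of $S(x) > 0$ means the sequence is more probable under the coding model. Under this assumption the score is a sum of $L - 1$ per-transition terms, so if those terms are positive on average for true coding sequences, the total tends to grow with the ORF length, which is consistent with the trend described for the plot.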