因为目的是看外群中保留下来了wgd的两份拷贝,而目标的物种对中分别保留了不同的旁系同源基因,所以可以先根据基因家族中目标物种的基因数目进行筛选,要求:1.外群至少有一个基因;2.近缘种有至少两个基因;3.目标物种对各至少有一个基因。其次为了方便后续分析,将Orthogoups.txt
文件中的其他物种基因序列删除,只保留感兴趣的基因列表即可。最终获得两个文件:
FilteredOrthogroups.txt
文件就简单粗暴的把别的物种的基因删掉就行了,为了画树之后方便定根,建议采用4个物种:1.外群;2.共享wgd的近缘种;3.两个目标物种。
-
FilteredOrthogroups.list
OG0000000 OG0000017 ... OG0001117
-
FilteredOrthogroups.txt
OG0000085: Bcy|TJE003394 Bcy|TJE007772 Bgy|evm.model.Bruguiera_gymnorrhiza_PacBio_hic_scaffold_16.428
💡 记得根据提取的是蛋白质还是cds序列修改脚本38行的后缀。 一定要注意蛋白质序列和cds序列命名是否一致,尤其是phytozome上面下载的文件常常基因名后面有个.p之类的,切记去除!!
perl 01Ortho2Fa.pl FilteredOrthogroups.list FilteredOrthogroups.txt [Seq file]
mkdir pepdir
mkdir cdsdir
mv *pep pepdir
mv *cds cdsdir
01Ortho2Fa.pl
### This script is used to extract the corresponding protein or cds sequences of genes in the genefamilies.
### Change the postfix of output files in LINE 38 according to the type of sequences.
use 5.010;
use autodie;
use Bio::SeqIO;
say "Usage: perl $0 [Target Orthogroup List] [Orthogroups.txt] [Protein File]" if @ARGV<3;
open TARGETORTHO,"<$ARGV[0]";
open ORTHO,"<$ARGV[1]";
my $pepfile=$ARGV[2];
my %pep;
my $fa=Bio::SeqIO->new(-file=>$pepfile,-format=>'fasta');
while (my $seq_obj=$fa->next_seq){
my $pep_name=$seq_obj->display_name;
my $seq=$seq_obj->seq;
my ($spe,$gene)=split/\|/,$pep_name;
$pep_name="$gene"."_"."$spe";
$pep{$pep_name}=$seq;
}
my @TargetOrtho;
while (<TARGETORTHO>) {
s/\s+$//;
my $orth=(split/\s+/,$_)[0];
push @TargetOrtho,$orth;
}
close TARGETORTHO;
while (<ORTHO>) {
s/\s+$//;
my @eles=split/\s+/;
my $ortho=shift @eles;
$ortho=~s/://;
if ($ortho~~@TargetOrtho) {
open OUT,">$ortho.cds";
map {
my ($spe,$gene)=split/\|/,$_;
my $pep_name="$gene"."_"."$spe";
print OUT ">$pep_name\n$pep{$pep_name}\n";
} @eles;
close OUT;
}
}
close ORTHO;
💡 如果mafft文件行数较多,可以通过split命令进行分割,然后并行。
split -l 1000 [mafft.sh]
perl 02CreateMafftSh.pl pepdir mafft.sh mafftoutdir
bash mafft.sh
02CreateMafftSh.pl
### Add the path to Mafft software in Line 16 if Mafft is not added to environment variable.
use warnings;
use strict;
my $pepdir=shift;
my @files=<$pepdir/*pep>;
my $output_sh=shift;
open (O,">$output_sh");
my $outdir=shift;
system("mkdir $outdir");
foreach my $file (@files){
$file=~/OG(\d+)/;
my $sign=$1;
my $outfile="$outdir/mafft$1";
print O "mafft --maxiterate 1000 --localpair $file >$outfile\n";
}
close O;
perl 03CreatePal2nalSh.pl mafftoutdir cdsdir pal2naloutdir pal2nal.sh
## 同样的,如果行数比较多,可以用split的方法分割,然后并行。
03CreatePal2nalSh.pl
use strict;
my $pepdir=shift;
my @mafftfiles=<$pepdir/*>;
my $cdsdir=shift;
my $outdir=shift;
system ("mkdir $outdir");
my $output_sh=shift;
open(O,">$output_sh");
my $pal2alnpath="/public1/users/lisen/software/pal2nal.v14/pal2nal.pl";
foreach my $pepfile (@mafftfiles){
$pepfile=~/mafft(\d+)/;
my $sign=$1;
my $cdsfile="OG$sign".".cds";
print O "perl $pal2alnpath $pepfile $cdsdir/$cdsfile -output fasta >$outdir/pal2nal_$sign\n";
}
close O;
cd pal2naloutdir
perl 04Fa2Phy.pl
mkdir phys fas
mv *phy phys
mv pal* fas
04Fa2Phy.pl
use Bio::SeqIO;
my @files=glob("pal*");
foreach my $file (@files) {
my $out="$file.phy";
my %seq;
my $seq_num=0;
my $seq_length;
my $fa=Bio::SeqIO->new(-file=>$file,-format=>'fasta');
while (my $obj=$fa->next_seq) {
my $seq_name=$obj->display_name;
my $seq=$obj->seq;
$seq{$seq_name}=$seq;
$seq_num++;
$seq_length=length $seq;
}
open OUT,">$out";
print OUT "$seq_num $seq_length\n";
foreach my $seq_name (keys %seq) {
print OUT "$seq_name $seq{$seq_name}\n";
}
close OUT;
}
建议用v1.1,v0.9有时候会报错。56服务器是可以用v1.1的,而41不可以。如果出现likelihood太低的报错,可以试一下GTR+G model。
05CreateRaxmlngSh.pl
use warnings;
use strict;
my @files=glob("*phy");
L: foreach my $file (@files) {
(my $prefix=$file)=~s/\.phy//;
my $err="$prefix.err";
print "raxml-ng --all -msa $file --model GTR+I+G --prefix $prefix --seed 2 --threads 1 --bs-metric fbp,tbe --bs-trees 100 2>>$err\n";
}
cd phys
perl 05CreateRaxmlngSh.pl >raxml.sh
nohup bash raxml.sh &
## 同样的,如果行数比较多,可以用split的方法分割,然后并行。
echo ../species.tree >batch.txt
ls *bestTree >>batch.txt
nohup java -jar -Xmx8g -XX:ParallelGCThreads=4 ~/software/Notung-2.9.1.5.jar -b batch_tree.txt --root --speciestag postfix --edgeweights name --treeoutput nhx --nolosses --outputdir ../root/
这里只关心每个基因家族中目标的三个物种的串联复制情况,可以从gff文件进行筛选,要求如果两个基因之间相隔的基因不超过5个就认为是串联复制。