forked from gshattow/Mutation_Sets
Last updated: 29 April 2016
The purpose of this code is to fetch, parse, and analyze data from the 1000
Genomes Project. Because the files are stored on the 1000 Genomes server, it
does not make sense to store them locally as well (unless you have limited
internet access). So this code fetches and unzips the files, extracts the
necessary information, and deletes the local copy. It then cross-matches
the extracted data (which person has which mutations) with each mutation's
SIFT score (how bad it is). Finally, it performs a few analyses, including
plotting. A more detailed description and a how-to are below.
_________________________
Code in the repository:
---
README (this document, right here)
individuals.bash
sifting.bash
pair_overlap_mutations.py
frequency_overlap_mutations.py
Extra data in repository:
---
populations/1kg_POP.txt
for POP = ALL, AFR, AMR, EAS, EUR, SAS, GBR
for definitions of each population, see:
http://www.1000genomes.org/category/frequently-asked-questions/population
each txt file has a list of individual names/codes that are in that
population, with ALL including all of them.
columns.txt, which lists the IDs of all the individuals.
Important information:
1. Four directories need to be created:
- data
- code
- analysis_1
- analysis_2
2. Inside the 'data' directory, the '20130502' directory needs to be created:
- data/20130502
3. Inside the 'analysis_1' and 'analysis_2' directories, the 'output_no_sift' and 'plots_no_sift' directories need to be created:
- analysis_1/output_no_sift analysis_1/plots_no_sift
- analysis_2/output_no_sift analysis_2/plots_no_sift
4. All the scripts explained below need to be stored in the 'code' directory.
5. columns.txt needs to be stored in both the 'data' and 'code' directories.
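The directory layout listed above can be created in one go. The following is only a sketch: the helper name create_layout is hypothetical and not part of the repository, and it assumes all directories live under a single project root.

```python
from pathlib import Path

# Directories named in the list above, relative to the project root.
REQUIRED_DIRS = [
    "data/20130502",
    "code",
    "analysis_1/output_no_sift",
    "analysis_1/plots_no_sift",
    "analysis_2/output_no_sift",
    "analysis_2/plots_no_sift",
]


def create_layout(root):
    """Create every required directory under root (hypothetical helper).

    Parent directories are created as needed and existing ones are left alone.
    """
    root = Path(root)
    created = []
    for rel in REQUIRED_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_layout(".") from the project root would set up the tree. The scripts and columns.txt still have to be copied into place by hand.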
_________________________
Step 1:
./download_individuals.bash -c <Number Chromosome>
Step 2:
./individuals.bash -c <Number Chromosome>
Step 3:
./download_sifting.bash -c <Number Chromosome>
Step 4:
./sifting.bash -c <Number Chromosome>
Step 5:
python pair_overlap_mutations.py -c <Number Chromosome> -pop <POP: 0:ALL, 1:AFR, 2:AMR, 3:EAS, 4:EUR, 5:SAS, 6:GBR>
e.g. python pair_overlap_mutations.py -c 1 -pop 6
Step 6:
python frequency_overlap_mutations.py -c <Number Chromosome> -pop <POP: 0:ALL, 1:AFR, 2:AMR, 3:EAS, 4:EUR, 5:SAS, 6:GBR>
e.g. python frequency_overlap_mutations.py -c 1 -pop 6
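The six steps above can also be chained from a small driver. This is a sketch under assumptions: it is meant to be run from the 'code' directory, the bash scripts are executable, and 'python' is on the PATH. The names pipeline_commands and run_pipeline are hypothetical and not part of the repository.

```python
import subprocess

# The position in this list is the value passed to -pop (0:ALL ... 6:GBR).
POPULATIONS = ["ALL", "AFR", "AMR", "EAS", "EUR", "SAS", "GBR"]


def pipeline_commands(chromosome, pop):
    """Return the six commands from Steps 1-6 for one chromosome and population."""
    c = str(chromosome)
    pop_index = str(POPULATIONS.index(pop))
    return [
        ["./download_individuals.bash", "-c", c],
        ["./individuals.bash", "-c", c],
        ["./download_sifting.bash", "-c", c],
        ["./sifting.bash", "-c", c],
        ["python", "pair_overlap_mutations.py", "-c", c, "-pop", pop_index],
        ["python", "frequency_overlap_mutations.py", "-c", c, "-pop", pop_index],
    ]


def run_pipeline(chromosome, pop, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in pipeline_commands(chromosome, pop):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
```

With dry_run left at its default, run_pipeline(1, "GBR") only prints the commands, which is a cheap way to check the arguments before starting a run that may take hours.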
_________________________
download_individuals.bash
---
To use:
./download_individuals.bash -c <Number Chromosome>
download_individuals.bash downloads the compressed 1000 Genomes (Phase 3)
files, one chromosome at a time.
_________________________
individuals.bash
---
To use:
./individuals.bash -c <Number Chromosome>
---
individuals.bash decompresses and parses the files previously downloaded by the
download_individuals.bash script. These files list all of the variants in that
chromosome and which individuals have each one. Each individual is tested twice.
If both tests have values greater than 0, then the individual is said to have
that variant. The indices of these variants are then copied into files for each
individual. This has to be done for each chromosome, each of which can take from
a few hours to half a day (as run on an 8GB core). To run them all in succession,
use run_individuals.bash. It takes no arguments and runs chromosomes 1-22.
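The "tested twice" rule can be sketched as follows. This is not the code in individuals.bash. It assumes that the two test values for an individual appear as a pair such as 1|1 or 0|1, and that variants are indexed from 0 in file order. The function names and sample IDs are hypothetical.

```python
import re


def has_variant(genotype):
    """Return True when both values in a pair like '1|1' are greater than 0."""
    parts = re.split(r"[|/]", genotype.strip())
    if len(parts) != 2:
        return False  # not a pair of values: not counted in this sketch
    try:
        first, second = (int(p) for p in parts)
    except ValueError:
        return False  # non-numeric entries such as './.'
    return first > 0 and second > 0


def variants_per_individual(sample_ids, rows):
    """Map each individual to the indices of the variants they carry.

    rows holds one list of genotype strings per variant, aligned with
    sample_ids; the row position is used as the variant index.
    """
    carried = {sample: [] for sample in sample_ids}
    for index, genotypes in enumerate(rows):
        for sample, genotype in zip(sample_ids, genotypes):
            if has_variant(genotype):
                carried[sample].append(index)
    return carried
```

For example, with samples S1 and S2 and the rows ["1|1", "0|1"] and ["0|0", "1|1"], S1 is assigned variant 0 and S2 is assigned variant 1. The 0|1 entry does not count, because only one of the two values is greater than 0.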
_________________________
download_sifting.bash
---
To use:
./download_sifting.bash -c <Number Chromosome>
---
download_sifting.bash downloads, as compressed files, the SIFT scores of all
the variants as calculated by the VEP.
_________________________
sifting.bash
---
To use:
./sifting.bash -c <Number Chromosome>
---
sifting.bash finds the SIFT scores of all of the variants, as calculated by the
Variant Effect Predictor (more information: www.ensembl.org/info/docs/tools/vep/).
The VEP has been run on the 1000 Genomes data, but these files do not have the
results linked to the individual patients. sifting.bash decompresses the files
fetched by download_sifting.bash and extracts the release data for each chromosome.
Not every variant has a SIFT score, so this script pulls out
the variants that do. It copies the variant IDs (the index for this code set, the rs
number, and the ENSEMBL ID), the SIFT score, and the phenotype.
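The selection step can be illustrated with a small sketch. It does not reproduce the parsing done by sifting.bash. The record layout (the Variant tuple and its field names) is a hypothetical stand-in for the five values listed above, and the function name is an assumption.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    """Hypothetical record holding the values copied for each variant."""
    index: int             # index of the variant within this code set
    rs_id: str             # rs number
    ensembl_id: str        # ENSEMBL identifier
    sift: Optional[float]  # SIFT score, or None when the variant has none
    phenotype: str


def with_sift_score(variants):
    """Keep only the variants that have a SIFT score."""
    return [v for v in variants if v.sift is not None]
```

A variant with a score of 0.0 is kept, because the filter tests for a missing score rather than for a false value.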
_________________________
pair_overlap_mutations.py
---
To use:
python pair_overlap_mutations.py -c <Number Chromosome> -pop POP
---
The script measures the overlap in mutations among pairs of individuals by
population. Considering two people, if both have a mutation in a given
variant, they are considered to have an overlap. Individuals are paired
randomly without replacement, giving N/2 pairs, with N being the total number
of individuals. The script performs several analyses using different configurations:
a) modifying the number of individuals that are paired:
a.1) all individuals
a.2) half of the individuals
a.3) 100 random individuals
b) modifying the set of variants:
b.1) only the variants with a SIFT score less than 0.05
b.2) all the variants
This analysis is run 100 times to reduce any signal
from accidental pairings between family members.
These results are then plotted as a histogram of the number of pairs (y-axis),
and the pairs are also output as text files.
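The pairing and overlap count described above can be sketched as follows. This is not the code in pair_overlap_mutations.py. It assumes that each individual's mutations are available as a set of variant indices, and all function names are hypothetical.

```python
import random


def random_pairs(individuals, rng):
    """Pair individuals at random without replacement (N // 2 pairs)."""
    shuffled = list(individuals)
    rng.shuffle(shuffled)
    # Neighbours in the shuffled list form a pair; with an odd N the
    # last individual is left unpaired.
    return [(shuffled[i], shuffled[i + 1])
            for i in range(0, len(shuffled) - 1, 2)]


def pair_overlaps(mutations, rng):
    """Count the shared variants for each random pair.

    mutations maps an individual to the set of variant indices they carry.
    """
    pairs = random_pairs(sorted(mutations), rng)
    return [len(mutations[a] & mutations[b]) for a, b in pairs]


def repeated_overlaps(mutations, runs=100, seed=0):
    """Repeat the random pairing several times, as the README describes."""
    rng = random.Random(seed)
    return [pair_overlaps(mutations, rng) for _ in range(runs)]
```

Restricting mutations to variants with a SIFT score below 0.05 before calling repeated_overlaps would correspond to configuration b.1. Passing a subset of the individuals would correspond to a.2 or a.3.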
As an output, this script generates:
a) Text files (stored in the 'analysis_1/output_no_sift' directory)
a.1) individual_half_pairs_overlap_chr<NumChromosome>_s<SIFT_Level>_<POP>.txt
a.2) total_individual_pairs_overlap_chr<NumChromosome>_s<SIFT_Level>_<POP>.txt
a.3) 100_individual_overlap_chr<NumChromosome>_s<SIFT_Level>_<POP>.txt
a.4) random_mutations_individual<NumChromosome>_s<SIFT_Level>_<POP>_run_<NumRun>.txt
b) Plot figures (stored in the 'analysis_1/plots_no_sift' directory)
b.1) half_distribution_c<NumChromosome>_s<SIFT_Level>_<POP>.png
b.2) total_distribution_c<NumChromosome>_s<SIFT_Level>_<POP>.png
b.3) 100_distribution_c<NumChromosome>_s<SIFT_Level>_<POP>.png
b.4) colormap_distribution_c<NumChromosome>_s<SIFT_Level>_<POP>.png
c) Other files of interest (stored in the 'analysis_1/output_no_sift' directory)
c.1) map_variations<NumChromosome>_s<SIFT_Level>_<POP>.txt
c.2) mutation_index_array<NumChromosome>_s<SIFT_Level>_<POP>.txt
Note: This script works with the SNP variations (rs numbers),
and not with the gene IDs (ENSEMBL), since we are interested in
measuring the overlap of mutations (variations) of the genes.
_________________________
frequency_overlap_mutations.py
---
To use:
python frequency_overlap_mutations.py -c <Number Chromosome> -pop POP
The script measures the frequency of overlaps in mutations
by selecting a number of random individuals and considering all variants, without
taking into account their SIFT scores. For example, if variant 1, variant 20,
variant 42, and variant 80 are the only variants that present
3 overlaps among the individuals, we say that the frequency of 3 overlaps
among that group of individuals is 4 mutations.
This analysis is repeated 100 times.
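The counting in the example above can be sketched as follows. This is not the code in frequency_overlap_mutations.py. It assumes that "k overlaps" for a variant means that k of the selected individuals carry it, and that each individual's mutations are available as a set of variant indices. The function name is hypothetical.

```python
from collections import Counter


def overlap_frequencies(mutations):
    """Return {k: number of variants carried by exactly k individuals}.

    mutations maps an individual to the set of variant indices they carry.
    """
    carriers = Counter()
    for variants in mutations.values():
        carriers.update(variants)  # one count per individual carrying the variant
    return Counter(carriers.values())
```

With four variants (1, 20, 42, and 80) each carried by three individuals, overlap_frequencies returns a count of 4 for k = 3, which matches the "frequency of 3 overlaps is 4 mutations" wording used above.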
As an output, this script generates:
a) Text files (stored in the 'analysis_2/output_no_sift' directory)
a.1) Histogram_mutation_overlap_chr<NumChromosome>_s<SIFT_Level>_<POP>_<NumRun>.txt
b) Plot figures (stored in the 'analysis_2/plots_no_sift' directory)
b.1) Frequency_mutations<NumChromosome>_s<SIFT_Level>_<POP><NumRun>.png
c) Other files of interest (stored in the 'analysis_2/output_no_sift' directory)
c.1) map_variations<NumChromosome>_s<SIFT_Level>_<POP>.txt
c.2) mutation_index_array<NumChromosome>_s<SIFT_Level>_<POP>.txt
c.3) Mutation_overlap_chr<NumChromosome>_s<SIFT_Level>_<POP>_<NumRun>.txt
Note: This script works with the SNP variations (rs numbers),
and not with the gene IDs (ENSEMBL), since we are interested in
measuring the frequency of mutations (variations) of the genes.
---