% Crossbow: Parallel short read genotyping in the cloud
% Ben Langmead and Michael C. Schatz
% http://bowtie-bio.sf.net/crossbow
# What is Crossbow?
[Crossbow] is a scalable, portable, and automatic Cloud Computing tool for
finding SNPs from short read data. Crossbow employs [Bowtie] and a modified
version of [SOAPsnp] to perform the short read alignment and SNP calling
respectively. Crossbow is designed to be easy to run (a) in "the cloud" (in
this case, Amazon's [Elastic MapReduce] service), (b) on any [Hadoop] cluster,
or (c) on any single computer, without [Hadoop]. Crossbow exploits the
availability of multiple computers and processors where possible.
[Crossbow]: http://bowtie-bio.sf.net/crossbow
[Bowtie]: http://bowtie-bio.sf.net
[SOAPsnp]: http://soap.genomics.org.cn/soapsnp.html
[Elastic MapReduce]: http://aws.amazon.com/elasticmapreduce "Amazon Elastic MapReduce"
# A word of caution
Renting resources from [Amazon Web Services] (AKA [AWS]) costs money,
regardless of whether your experiment ultimately succeeds or fails. In some
cases, Crossbow or its documentation may be partially to blame for a failed
experiment. While we are happy to accept bug reports, we do not accept
responsibility for financial damage caused by these errors. Crossbow is
provided "as is" with no warranty. See `LICENSE` file.
[Amazon Web Services]: http://aws.amazon.com
[Amazon EC2]: http://aws.amazon.com/ec2
[Amazon S3]: http://aws.amazon.com/s3
[Amazon EMR]: http://aws.amazon.com/elasticmapreduce
[Amazon SimpleDB]: http://aws.amazon.com/simpledb
[AWS]: http://aws.amazon.com
# Crossbow modes and prerequisites
Crossbow can be run in four different ways.
1. **Via the [Crossbow web interface]**
In this case, the [Crossbow] code and the user interface are installed on EC2
web servers. Also, the computers running the Crossbow computation are rented
from Amazon, and the user must have [EC2], [EMR], [S3] and [SimpleDB]
accounts and must pay the [going rate] for the resources used. The user does
not need any special software besides a web browser and, in most cases, an
[S3 tool].
[Crossbow web interface]: http://bowtie-bio.sf.net/crossbow/ui.html
2. **On Amazon [Elastic MapReduce] via the command-line**
In this case, the Crossbow code is hosted by Amazon and the computers running
the Crossbow computation are rented from Amazon. However, the user must
install and run (a) the Crossbow scripts, which require [Perl] 5.6 or later,
(b) Amazon's `elastic-mapreduce` script, which requires Ruby 1.8 or later,
and (c) an [S3 tool]. The user must have [EC2], [EMR], [S3] and [SimpleDB]
accounts and must pay the [going rate] for the resources used.
3. **On a [Hadoop] cluster via the command-line**
In this case, the Crossbow code is hosted on your [Hadoop] cluster, as are
supporting tools: [Bowtie], [SOAPsnp], and possibly `fastq-dump`.
Supporting tools must be installed on all cluster nodes, but the Crossbow
scripts need only be installed on the master. Crossbow was tested with
[Hadoop] versions 0.20 and 0.20.205, and might also be compatible with other
versions newer than 0.20. Crossbow scripts require [Perl] 5.6 or later.
4. **On any computer via the command-line**
In this case, the Crossbow code and all supporting tools ([Bowtie],
[SOAPsnp], and possibly `fastq-dump`) must be installed on the computer
running Crossbow. Crossbow scripts require [Perl] 5.6 or later. The user
specifies the maximum number of CPUs that Crossbow should use at a time.
This mode does *not* require [Java] or [Hadoop].
[Amazon EMR]: http://aws.amazon.com/elasticmapreduce
[Elastic MapReduce]: http://aws.amazon.com/elasticmapreduce
[EMR]: http://aws.amazon.com/elasticmapreduce
[S3]: http://aws.amazon.com/s3
[EC2]: http://aws.amazon.com/ec2
[going rate]: http://aws.amazon.com/ec2/#pricing
[Elastic MapReduce web interface]: https://console.aws.amazon.com/elasticmapreduce/home
[AWS Console]: https://console.aws.amazon.com
[AWS console]: https://console.aws.amazon.com
[`elastic-mapreduce`]: http://aws.amazon.com/developertools/2264?_encoding=UTF8&jiveRedirect=1
[Java]: http://java.sun.com/
[Hadoop]: http://hadoop.apache.org/
[R]: http://www.r-project.org/
[Bioconductor]: http://www.bioconductor.org/
[Perl]: http://www.perl.org/get.html
# Preparing to run on Amazon Elastic MapReduce
Before running Crossbow on [EMR], you must have an [AWS] account with the
appropriate features enabled. You may also need to [install Amazon's
`elastic-mapreduce` tool]. In addition, you may want to install an [S3 tool],
though most users can simply use [Amazon's web interface for S3], which requires
no installation.
If you plan to run Crossbow exclusively on a single computer or on a [Hadoop]
cluster, you can skip this section.
[Amazon's web interface for S3]: https://console.aws.amazon.com/s3/home
1. Create an AWS account by navigating to the [AWS page]. Click "Sign Up Now"
in the upper right-hand corner and follow the instructions. You will be asked
to accept the [AWS Customer Agreement].
2. Sign up for [EC2] and [S3]. Navigate to the [Amazon EC2] page, click on
"Sign Up For Amazon EC2" and follow the instructions. This step requires you
to enter credit card information. Once this is complete, your AWS account
will be permitted to use [EC2] and [S3], which are required.
3. Sign up for [EMR]. Navigate to the [Elastic MapReduce] page, click on "Sign
up for Elastic MapReduce" and follow the instructions. Once this is complete,
your AWS account will be permitted to use [EMR], which is required.
4. Sign up for [SimpleDB]. With [SimpleDB] enabled, you have the option of
using the [AWS Console]'s [Job Flow Debugging] feature. This is a convenient
way to monitor your job's progress and diagnose errors.
5. *Optional*: Request an increase to your instance limit. By default, Amazon
allows you to allocate EC2 clusters with up to 20 instances (virtual
computers). To be permitted to work with more instances, fill in the form on
the [Request to Increase] page. You may have to speak to an Amazon
representative and/or wait several business days before your request is
granted.
To see a list of AWS services you've already signed up for, see your [Account
Activity] page. If "Amazon Elastic Compute Cloud", "Amazon Simple Storage
Service", "Amazon Elastic MapReduce" and "Amazon SimpleDB" all appear there, you
are ready to proceed.
Be sure to make a note of the various numbers and names associated with your
accounts, especially your Access Key ID, Secret Access Key, and your EC2 key
pair name. You will have to refer to these and other account details in the
future.
[AWS Customer Agreement]: http://aws.amazon.com/agreement/
[Request to Increase]: http://aws.amazon.com/contact-us/ec2-request/
[Job Flow Debugging]: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/DebuggingJobFlows.html
[SimpleDB]: http://aws.amazon.com/simpledb/
[Account Activity]: http://aws-portal.amazon.com/gp/aws/developer/account/index.html?ie=UTF8&action=activity-summary
## Installing Amazon's `elastic-mapreduce` tool
Read this section if you plan to run Crossbow on [Elastic MapReduce] via the
command-line tool. Skip this section if you are not using [EMR] or if you plan
to run exclusively via the [Crossbow web interface].
To install Amazon's `elastic-mapreduce` tool, follow the instructions in Amazon
Elastic MapReduce developer's guide for [How to Download and Install Ruby and
the Command Line Interface]. That document describes:
[How to Download and Install Ruby and the Command Line Interface]: http://aws.amazon.com/developertools/2264?_encoding=UTF8&jiveRedirect=1
1. Installing an appropriate version of [Ruby], if necessary.
2. Setting up an EC2 keypair, if necessary.
3. Setting up a credentials file, which is used by the `elastic-mapreduce` tool
for authentication.
For convenience, we suggest you name the credentials file `credentials.json`
and place it in the same directory with the `elastic-mapreduce` script.
Otherwise you will have to specify the credential file path with the
`--credentials` option each time you run `cb_emr`.
We strongly recommend using a version of the `elastic-mapreduce` Ruby script
released on or after December 8, 2011. This is when the script switched to
using Hadoop v0.20.205 by default, which is the preferred way of running Crossbow.
[Ruby]: http://www.ruby-lang.org/
[Setting up an EC2 keypair]: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?download_ruby.html
We also recommend that you add the directory containing the `elastic-mapreduce`
tool to your `PATH`. This allows Crossbow to locate it automatically.
Alternately, you can specify the path to the `elastic-mapreduce` tool via the
`--emr-script` option when running `cb_emr`.
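For example, assuming the tool was extracted to `~/elastic-mapreduce-cli` (an
illustrative path), the setup might look like:

    # Illustrative install location; adjust to wherever you extracted the tool
    export PATH=$HOME/elastic-mapreduce-cli:$PATH
    # Keeping credentials.json next to the script lets you omit --credentials
    ls $HOME/elastic-mapreduce-cli/credentials.json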
[AWS]: http://aws.amazon.com/ "Amazon Web Services"
[AWS page]: http://aws.amazon.com/ "Amazon Web Services"
[AWS Getting Started Guide]: http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/
## S3 tools
Running on [EMR] requires exchanging files via the cloud-based [S3] filesystem.
[S3] is organized as a collection of [S3 buckets] in a global namespace. [S3
charges] are incurred when transferring data to and from [S3] (but transfers
between [EC2] and [S3] are free), and a per-GB-per-month charge applies when
data is stored in [S3] over time.
To transfer files to and from [S3], use an S3 tool. Amazon's [AWS Console] has
an [S3 tab] that provides a friendly web-based interface to [S3], and doesn't
require any software installation. [s3cmd] is a very good command-line tool
that requires [Python] 2.4 or later. [S3Fox Organizer] is another GUI tool that
works as a [Firefox] extension. Other tools include [Cyberduck] (for Mac OS
10.6 or later) and [Bucket Explorer] (for Mac, Windows or Linux, but commercial
software).
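As a brief sketch (assuming [s3cmd] is installed, and substituting a bucket
name you own for `<YOUR-BUCKET>`), creating a bucket and copying a file to and
from [S3] might look like:

    s3cmd --configure              # one-time setup: enter your AWS keys
    s3cmd mb s3://<YOUR-BUCKET>    # create ("make") a new bucket
    s3cmd put small.manifest s3://<YOUR-BUCKET>/example/small.manifest
    s3cmd get s3://<YOUR-BUCKET>/example/small.manifest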
[S3]: http://aws.amazon.com/s3/
[S3 tab]: https://console.aws.amazon.com/s3/home
[s3cmd]: http://s3tools.org/s3cmd
[Python]: http://www.python.org/download/
[Firefox]: http://www.mozilla.com/firefox/
[S3 buckets]: http://docs.amazonwebservices.com/AmazonS3/latest/gsg/
[S3 bucket]: http://docs.amazonwebservices.com/AmazonS3/latest/gsg/
[S3 charges]: http://aws.amazon.com/s3/#pricing
[S3Fox Organizer]: http://www.s3fox.net/
[Cyberduck]: http://cyberduck.ch/
[Bucket Explorer]: http://www.bucketexplorer.com/
# Installing Crossbow
Crossbow consists of a set of [Perl] and shell scripts, plus supporting tools:
[Bowtie] and [SOAPsnp]. If you plan to run Crossbow via the [Crossbow web
interface] exclusively, there is nothing to install. Otherwise:
1. Download the desired version of Crossbow from the [sourceforge site]
2. [Extract the zip archive]
3. Set the `CROSSBOW_HOME` environment variable to point to the extracted
directory (containing `cb_emr`); a command-line sketch of steps 1-4 appears
after this list
4. *If you plan to run on a local computer or [Hadoop] cluster*:
If using Linux or Mac OS 10.6 or later, you likely don't have to install
[Bowtie] or [SOAPsnp], as Crossbow comes with compatible versions of both
pre-installed. Test this by running:
    $CROSSBOW_HOME/cb_local --test
If the install test passes, installation is complete.
If the install test indicates [Bowtie] is not installed, obtain or build a
`bowtie` binary v0.12.8 or higher and install it by setting the
`CROSSBOW_BOWTIE_HOME` environment variable to `bowtie`'s enclosing
directory. Alternately, add the enclosing directory to your `PATH` or
specify the full path to `bowtie` via the `--bowtie` option when running
Crossbow scripts.
If the install test indicates that [SOAPsnp] is not installed, build the
`soapsnp` binary using the sources and makefile in `CROSSBOW_HOME/soapsnp`.
You must have compiler tools such as GNU `make` and `g++` installed for this
to work. If you are using a Mac, you may need to install the [Apple
developer tools]. To build the `soapsnp` binary, run:
    make -C $CROSSBOW_HOME/soapsnp
Now install `soapsnp` by setting the `CROSSBOW_SOAPSNP_HOME` environment
variable to `soapsnp`'s enclosing directory. Alternately, add the enclosing
directory to your `PATH` or specify the full path to `soapsnp` via the
`--soapsnp` option when running Crossbow scripts.
5. *If you plan to run on a [Hadoop] cluster*, you may need to manually copy
the `bowtie` and `soapsnp` executables, and possibly also the `fastq-dump`
executable, to the same path on each of your [Hadoop] cluster nodes. You
can avoid this step by installing `bowtie`, `soapsnp` and `fastq-dump` on a
filesystem shared by all [Hadoop] nodes (e.g. an [NFS share]). You can also
skip this step if [Hadoop] is installed in [pseudo distributed] mode,
meaning that the cluster really consists of one node whose CPUs are treated
as distinct slaves.
[NFS share]: http://en.wikipedia.org/wiki/Network_File_System_(protocol)
[pseudo distributed]: http://hadoop.apache.org/common/docs/current/quickstart.html#PseudoDistributed
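As a minimal sketch of steps 1-4 above (the archive name and version number are
illustrative; substitute the version you actually downloaded):

    # Archive name is illustrative; use the version you downloaded
    unzip crossbow-1.2.1.zip
    export CROSSBOW_HOME=`pwd`/crossbow-1.2.1
    # Verify that the supporting tools can be located
    $CROSSBOW_HOME/cb_local --test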
## The SRA toolkit
The [Sequence Read Archive] (SRA) is a resource at the [National Center for
Biotechnology Information] (NCBI) for storing sequence data from modern
sequencing instruments. Sequence data underlying many studies, including very
large studies, can often be downloaded from this archive.
The SRA uses a special file format to store archived read data. These files end
in the extension `.sra`, and they can be specified as inputs to Crossbow's
preprocessing step in exactly the same way as [FASTQ] files.
However, if you plan to use `.sra` files as input to Crossbow in either
[Hadoop] mode or in single-computer mode, you must first install the [SRA
toolkit]'s `fastq-dump` tool appropriately. See the [SRA toolkit] page for
details about how to download and install.
When searching for the `fastq-dump` tool at runtime, Crossbow searches the
following places, in order (a short setup sketch follows the list):
1. The path specified in the `--fastq-dump` option
2. The directory specified in the `$CROSSBOW_SRATOOLKIT_HOME` environment
variable
3. The system `PATH`
[Sequence Read Archive]: http://www.ncbi.nlm.nih.gov/books/NBK47533/
[National Center for Biotechnology Information]: http://www.ncbi.nlm.nih.gov/
[SRA toolkit]: http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software
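For example, to use method 2 above, point the environment variable at the
directory containing `fastq-dump` (the install path shown is illustrative):

    # Illustrative path; adjust to wherever the SRA toolkit is installed
    export CROSSBOW_SRATOOLKIT_HOME=$HOME/sratoolkit/bin
    $CROSSBOW_SRATOOLKIT_HOME/fastq-dump --help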
# Running Crossbow
The commands for invoking Crossbow from the command line are:
`$CROSSBOW_HOME/cb_emr` (or just `cb_emr` if `$CROSSBOW_HOME` is in the `PATH`)
for running on [EMR]. See [Running Crossbow on EMR via the command line] for
details.
`$CROSSBOW_HOME/cb_hadoop` (or just `cb_hadoop` if `$CROSSBOW_HOME` is in the
`PATH`) for running on [Hadoop]. See [Running Crossbow on a Hadoop cluster via
the command line] for details.
`$CROSSBOW_HOME/cb_local` (or just `cb_local` if `$CROSSBOW_HOME` is in the
`PATH`) for running locally on a single computer. See [Running Crossbow on a
single computer via the command line] for details.
[Apple developer tools]: http://developer.apple.com/technologies/tools/
[NFS share]: http://en.wikipedia.org/wiki/Network_File_System_(protocol)
[pseudo distributed]: http://hadoop.apache.org/common/docs/current/quickstart.html#PseudoDistributed
[sourceforge site]: http://bowtie-bio.sf.net/crossbow
[Extract the zip archive]: http://en.wikipedia.org/wiki/ZIP_(file_format)
# Running Crossbow on EMR via the EMR web interface
## Prerequisites
1. Web browser
2. [EC2], [S3], [EMR], and [SimpleDB] accounts. To check which ones you've
already enabled, visit your [Account Activity] page.
3. A tool for browsing and exchanging files with [S3]
a. The [AWS Console]'s [S3 tab] is a good web-based tool that does not
require software installation
b. A good command line tool is [s3cmd]
c. A good GUI tool is [S3Fox Organizer], which is a Firefox Plugin
d. Others include [Cyberduck], [Bucket Explorer]
4. Basic knowledge regarding:
a. [What S3 is], [what an S3 bucket is], how to create one, how to upload a
file to an S3 bucket from your computer (see your S3 tool's documentation).
b. How much AWS resources [will cost you]
[Account Activity]: http://aws-portal.amazon.com/gp/aws/developer/account/index.html?ie=UTF8&action=activity-summary
[s3cmd]: http://s3tools.org/s3cmd
[S3Fox Organizer]: http://www.s3fox.net/
[Cyberduck]: http://cyberduck.ch/
[Bucket Explorer]: http://www.bucketexplorer.com/
[What S3 is]: http://aws.amazon.com/s3/
[what an S3 bucket is]: http://docs.amazonwebservices.com/AmazonS3/latest/gsg/
[will cost you]: http://aws.amazon.com/ec2/#pricing
## To run
1. *If the input reads have not yet been preprocessed by Crossbow* (i.e. input
is [FASTQ] or `.sra`), then first (a) prepare a [manifest file] with URLs
pointing to the read files, and (b) upload it to an [S3] bucket that you
own. See your [S3] tool's documentation for how to create a bucket and
upload a file to it. The URL for the [manifest file] will be the input URL
for your [EMR] job.
*If the input reads have already been preprocessed by Crossbow*, make a note
of the [S3] URL where they're located. This will be the input URL for
your [EMR] job.
2. *If you are using a pre-built reference jar*, make a note of its [S3] URL.
This will be the reference URL for your [EMR] job. See the [Crossbow
website] for a list of pre-built reference jars and their URLs.
*If you are not using a pre-built reference jar*, you may need to [build the
reference jars] and/or upload them to an [S3] bucket you own. See your [S3
tool]'s documentation for how to create a bucket and upload to it. The URL
for the main reference jar will be the reference URL for your [EMR] job.
[Crossbow website]: http://bowtie-bio.sf.net/crossbow
[`.sra`]: http://www.ncbi.nlm.nih.gov/books/NBK47540/
3. In a web browser, go to the [Crossbow web interface].
4. Fill in the form according to your job's parameters. We recommend filling in
and validating the "AWS ID" and "AWS Secret Key" fields first. Also, when
entering S3 URLs (e.g. "Input URL" and "Output URL"), we recommend validating
each entered URL by clicking the link below it. This avoids failed
jobs due to simple URL issues (e.g. non-existence of the "Input URL"). For
examples of how to fill in this form, see the [E. coli EMR] and [Mouse
chromosome 17 EMR] examples.
# Running Crossbow on EMR via the command line
## Prerequisites
1. [EC2], [S3], [EMR], and [SimpleDB] accounts. To check which ones you've
already enabled, visit your [Account Activity] page.
2. A tool for browsing and exchanging files with [S3]
a. The [AWS Console]'s [S3 tab] is a good web-based tool that does not
require software installation
b. A good command line tool is [s3cmd]
c. A good GUI tool is [S3Fox Organizer], which is a Firefox Plugin
d. Others include [Cyberduck], [Bucket Explorer]
3. Basic knowledge regarding:
a. [What S3 is], [what an S3 bucket is], how to create one, how to upload a
file to an S3 bucket from your computer (see your S3 tool's documentation).
b. How much AWS resources [will cost you]
[Account Activity]: http://aws-portal.amazon.com/gp/aws/developer/account/index.html?ie=UTF8&action=activity-summary
[s3cmd]: http://s3tools.org/s3cmd
[S3Fox Organizer]: http://www.s3fox.net/
[Cyberduck]: http://cyberduck.ch/
[Bucket Explorer]: http://www.bucketexplorer.com/
[What S3 is]: http://aws.amazon.com/s3/
[What an S3 bucket is]: http://docs.amazonwebservices.com/AmazonS3/latest/gsg/
[will cost you]: http://aws.amazon.com/ec2/#pricing
## To run
1. *If the input reads have not yet been preprocessed by Crossbow* (i.e. input
is [FASTQ] or `.sra`), then first (a) prepare a [manifest file] with URLs
pointing to the read files, and (b) upload it to an [S3] bucket that you
own. See your [S3] tool's documentation for how to create a bucket and
upload a file to it. The URL for the [manifest file] will be the input URL
for your [EMR] job.
*If the input reads have already been preprocessed by Crossbow*, make a note
of the [S3] URL where they're located. This will be the input URL for
your [EMR] job.
2. *If you are using a pre-built reference jar*, make a note of its [S3] URL.
This will be the reference URL for your [EMR] job. See the [Crossbow
website] for a list of pre-built reference jars and their URLs.
*If you are not using a pre-built reference jar*, you may need to [build the
reference jars] and/or upload them to an [S3] bucket you own. See your [S3
tool]'s documentation for how to create a bucket and upload to it. The URL
for the main reference jar will be the reference URL for your [EMR] job.
[Crossbow website]: http://bowtie-bio.sf.net/crossbow
3. Run `$CROSSBOW_HOME/cb_emr` with the desired options. Options that are unique
to [EMR] jobs are described in the following section. Options that apply to
all running modes are described in the [General Crossbow options] section.
For examples of how to run `$CROSSBOW_HOME/cb_emr` see the [E. coli EMR] and
[Mouse chromosome 17 EMR] examples.
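As a minimal sketch of such an invocation (the bucket name and reference jar
URL are illustrative; see the [E. coli EMR] example for a tested invocation):

    $CROSSBOW_HOME/cb_emr \
        --name "Crossbow-Ecoli" \
        --preprocess \
        --input=s3n://<YOUR-BUCKET>/example/e_coli/small.manifest \
        --output=s3n://<YOUR-BUCKET>/example/e_coli/output_small \
        --reference=s3n://crossbow-refs/e_coli.jar \
        --all-haploids \
        --instances 1 \
        --instance-type c1.xlarge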
## EMR-specific options
--reference <URL>
[S3] URL where the reference jar is located. URLs for pre-built reference jars
for some commonly studied species (including human and mouse) are available from
the [Crossbow web site]. Note that a [Myrna] reference jar is not the same as a
[Crossbow] reference jar. If your desired genome and/or SNP annotations are not
available in pre-built form, you will have to make your own reference jar and
upload it to one of your own S3 buckets (see [Reference jars]). This option
must be specified.
[Myrna]: http://bowtie-bio.sf.net/myrna
[Crossbow web site]: http://bowtie-bio.sf.net/crossbow
--input <URL>
[S3] URL where the input is located. If `--preprocess` or
`--just-preprocess` are specified, `<URL>` should point to a [manifest file].
Otherwise, `<URL>` should point to a directory containing preprocessed reads.
This option must be specified.
--output <URL>
[S3] URL where the output is to be deposited. If `--just-preprocess` is
specified, the output consists of the preprocessed reads. Otherwise, the output
consists of the SNP calls calculated by [SOAPsnp] for each chromosome in the
[Crossbow output format], organized as one file per chromosome. This option
must be specified.
--intermediate <URL>
[S3] URL where all intermediate results should be deposited. This can be
useful if you later want to resume the computation from partway through the
pipeline (e.g. after alignment but before SNP calling). By default,
intermediate results are stored in [HDFS] and disappear once the cluster is
terminated.
--preprocess-output <URL>
[S3] URL where the preprocessed reads should be stored. This can be useful if
you later want to run Crossbow on the same input reads without having to re-run
the preprocessing step (i.e. leaving `--preprocess` unspecified).
--credentials <path>
Local path to the credentials file set up by the user when the
`elastic-mapreduce` script was installed (see [Installing Amazon's
`elastic-mapreduce` tool]). Default: use `elastic-mapreduce`'s default (i.e.
the `credentials.json` file in the same directory as the `elastic-mapreduce`
script). If `--credentials` is not specified and the default `credentials.json`
file doesn't exist, `elastic-mapreduce` will abort with an error message.
--emr-script <path>
Local path to the `elastic-mapreduce` script. By default, Crossbow looks first
in the `$CROSSBOW_EMR_HOME` directory, then in the `PATH`.
--name <string>
Specify the name by which the job will be identified in the [AWS Console].
--stay-alive
By default, [EMR] will terminate the cluster as soon as (a) one of the stages
fails, or (b) the job completes successfully. Specify this option to force [EMR]
to keep the cluster alive in either case.
--instances <int>
Specify the number of instances (i.e. virtual computers, also called nodes) to
be allocated to your cluster. If set to 1, the single instance will function as both
[Hadoop] master and slave node. If set greater than 1, one instance will
function as a [Hadoop] master and the rest will function as [Hadoop] slaves. In
general, the greater the value of `<int>`, the faster the Crossbow computation
will complete. Consider the desired speed as well as the [going rate] when
choosing a value for `<int>`. Default: 1.
--instance-type <type>
Specify the type of [EC2] instance to use for the computation. See Amazon's
[list of available instance types] and be sure to specify the "API name" of the
desired type (e.g. `m1.small` or `c1.xlarge`). **The default of `c1.xlarge` is
strongly recommended** because it has an appropriate mix of computing power and
memory for a large breadth of problems. Choosing an instance type with less
than 5 GB of physical RAM can cause problems when the reference is large (e.g.
a mammalian genome). Stick to the default unless you're pretty sure the
specified instance type can handle your problem size.
[list of available instance types]: http://aws.amazon.com/ec2/instance-types/
[`<instance-type>`]: http://aws.amazon.com/ec2/instance-types/
--emr-args "<args>"
Pass the specified extra arguments to the `elastic-mapreduce` script. See
documentation for the `elastic-mapreduce` script for details.
--logs <URL>
Causes [EMR] to copy the log files to `<URL>`. Default: [EMR] writes logs to
the `logs` subdirectory of the `--output` URL. See also `--no-logs`.
--no-logs
By default, Crossbow causes [EMR] to copy all cluster log files to the `logs`
subdirectory of the `--output` URL (or another destination, if `--logs` is
specified). Specifying this option disables all copying of logs.
--no-emr-debug
Disables [Job Flow Debugging]. If this option is *not* specified, you must have
a [SimpleDB] account for [Job Flow Debugging] to work, and you will be subject
to additional [SimpleDB-related charges]. Those fees are typically small or
zero (depending on your account's [SimpleDB tier]).
[Job Flow Debugging]: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/DebuggingJobFlows.html
[SimpleDB]: http://aws.amazon.com/simpledb/
[SimpleDB-related charges]: http://aws.amazon.com/simpledb/#pricing
[SimpleDB tier]: http://aws.amazon.com/simpledb/#pricing
# Running Crossbow on a Hadoop cluster via the command line
## Prerequisites
1. Working installation of [Hadoop] v0.20.2 or v0.20.205. Other versions newer
than 0.20 might also work, but haven't been tested.
2. A `bowtie` v0.12.8 executable must exist at the same path on all cluster
nodes (including the master). That path must be specified via the
`--bowtie` option OR located in the directory specified
in the `CROSSBOW_BOWTIE_HOME` environment variable, OR in a subdirectory of
`$CROSSBOW_HOME/bin` OR in the `PATH` (Crossbow looks in that order).
`$CROSSBOW_HOME/bin` comes with pre-built Bowtie binaries for Linux and Mac
OS X 10.6 or later. An executable from that directory is used automatically
unless the platform is not Mac or Linux or unless overridden by
`--bowtie` or by defining `CROSSBOW_BOWTIE_HOME`.
3. A Crossbow-customized version of `soapsnp` v1.02 must be installed
at the same path on all cluster nodes (including the master). That
path must be specified via the `--soapsnp` option OR located in the
directory specified in the `CROSSBOW_SOAPSNP_HOME` environment
variable, OR in a subdirectory of `$CROSSBOW_HOME/bin` OR in the
`PATH` (Crossbow searches in that order). `$CROSSBOW_HOME/bin` comes
with pre-built SOAPsnp binaries for Linux and Mac OS X 10.6 or
later. An executable from that directory is used automatically
unless the platform is not Mac or Linux or unless overridden by
`--soapsnp` or by defining `CROSSBOW_SOAPSNP_HOME`.
4. If any of your inputs are in [Sequence Read Archive] format (i.e. end in
`.sra`), then the `fastq-dump` tool from the [SRA Toolkit] must be installed
at the same path on all cluster nodes. The path to the `fastq-dump` tool
must be specified via the `--fastq-dump` option, OR
`fastq-dump` must be located in the directory specified in the
`CROSSBOW_SRATOOLKIT_HOME` environment variable, OR `fastq-dump` must be
found in the `PATH` (Crossbow searches in that order).
5. Sufficient memory must be available on all [Hadoop] slave nodes to
hold the Bowtie index for the desired organism in addition to any
other loads placed on those nodes by [Hadoop] or other programs.
For mammalian genomes such as the human genome, this typically means
that slave nodes must have at least 5-6 GB of RAM.
## To run
Run `$CROSSBOW_HOME/cb_hadoop` with the desired options. Options that are
unique to [Hadoop] jobs are described in the following subsection. Options that
apply to all running modes are described in the [General Crossbow options]
subsection. To see example invocations of `$CROSSBOW_HOME/cb_hadoop` see the
[E. coli Hadoop] and [Mouse chromosome 17 Hadoop] examples.
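As a minimal sketch of such an invocation (the [HDFS] paths are illustrative):

    $CROSSBOW_HOME/cb_hadoop \
        --preprocess \
        --input=hdfs:///crossbow/example/e_coli/small.manifest \
        --output=hdfs:///crossbow/example/e_coli/output_small \
        --reference=hdfs:///crossbow-refs/e_coli.jar \
        --all-haploids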
## Hadoop-specific options
--reference <URL>
[HDFS] URL where the reference jar is located. Pre-built reference jars for
some commonly studied species (including human and mouse) are available from the
[Crossbow web site]; these can be downloaded and installed in HDFS using `hadoop
dfs` commands. If your desired genome and/or SNP annotations are not available
in pre-built form, you will have to make your own reference jars, install them
in HDFS, and specify their HDFS path here. This option must be specified.
[Crossbow web site]: http://bowtie-bio.sf.net/crossbow
[HDFS]: http://hadoop.apache.org/common/docs/current/hdfs_design.html
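For instance, a pre-built reference jar downloaded to the master node could be
installed in [HDFS] with commands like the following (the download URL and HDFS
path are illustrative; see the [Crossbow web site] for actual jar URLs):

    # URL is illustrative; see the Crossbow web site for actual jar URLs
    wget http://crossbow-refs.s3.amazonaws.com/e_coli.jar
    hadoop dfs -mkdir /crossbow-refs
    hadoop dfs -put e_coli.jar /crossbow-refs/e_coli.jar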
--input <URL>
[HDFS] URL where the input is located. If `--preprocess` or
`--just-preprocess` are specified, `<URL>` should point to a manifest file.
Otherwise, `<URL>` should point to a directory containing preprocessed reads.
This option must be specified.
--output <URL>
[HDFS] URL where the output is to be deposited. If `--just-preprocess` is
specified, the output consists of the preprocessed reads. Otherwise, the output
consists of the SNP calls calculated by SOAPsnp for each chromosome, organized
as one file per chromosome. This option must be specified.
--intermediate <URL>
[HDFS] URL where all intermediate results should be deposited. Default:
`hdfs:///crossbow/intermediate/<PID>`.
--preprocess-output <URL>
[HDFS] URL where the preprocessed reads should be stored. This can be useful if
you later want to run Crossbow on the same input reads without having to re-run
the preprocessing step (i.e. leaving `--preprocess` unspecified).
--bowtie <path>
Local path to the [Bowtie] binary Crossbow should use. `bowtie` must be
installed in this same directory on all [Hadoop] worker nodes. This overrides
all other ways that Crossbow searches for `bowtie`, including the
`CROSSBOW_BOWTIE_HOME` environment variable, the subdirectories of the
`$CROSSBOW_HOME/bin` directory, and the `PATH`.
--fastq-dump <path>
Path to the directory containing `fastq-dump`, which is part of the [SRA
Toolkit]. This overrides all other ways that Crossbow searches for
`fastq-dump`, including the `CROSSBOW_SRATOOLKIT_HOME` environment variable, the
subdirectories of the `$CROSSBOW_HOME/bin` directory, and the `PATH`.
--soapsnp <path>
Local path to the SOAPsnp executable to use when running the Call SNPs step.
`soapsnp` must be installed in this same directory on all [Hadoop] worker nodes.
This overrides all other ways that Crossbow searches for `soapsnp`, including
the `CROSSBOW_SOAPSNP_HOME` environment variable, the subdirectories of the
`$CROSSBOW_HOME/bin` directory, and the `PATH`.
# Running Crossbow on a single computer via the command line
## Prerequisites
1. A `bowtie` v0.12.8 executable must exist on the local computer. The
path to `bowtie` must be specified via the `--bowtie` option OR be located
in the directory specified in the `$CROSSBOW_BOWTIE_HOME` environment
variable, OR in a subdirectory of `$CROSSBOW_HOME/bin` OR in the `PATH`
(search proceeds in that order). `$CROSSBOW_HOME/bin` comes with
pre-built Bowtie binaries for Linux and Mac OS X 10.6 or later, so most
Mac and Linux users do not need to install Bowtie separately.
2. A Crossbow-customized version of `soapsnp` v1.02 must exist. The path
to `soapsnp` must be specified via the `--soapsnp` option OR be in
the directory specified in the `$CROSSBOW_SOAPSNP_HOME` environment
variable, OR in a subdirectory of `$CROSSBOW_HOME/bin` OR in the `PATH` (Crossbow searches in that order).
`$CROSSBOW_HOME/bin` comes with pre-built SOAPsnp binaries for Linux and
Mac OS X 10.6 or later. An executable from that directory is used
automatically unless the platform is not Mac or Linux or unless
overridden by `--soapsnp` or `$CROSSBOW_SOAPSNP_HOME`.
3. If any of your inputs are in [Sequence Read Archive] format (i.e. end in
`.sra`), then the `fastq-dump` tool from the [SRA Toolkit] must be installed
on the local computer. The path to the `fastq-dump` tool must be specified
via the `--fastq-dump` option, OR `fastq-dump` must be
located in the directory specified in the `CROSSBOW_SRATOOLKIT_HOME` environment
variable, OR `fastq-dump` must be found in the `PATH` (Crossbow searches in that
order).
4. Sufficient memory must be available on the local computer to hold one copy of
the Bowtie index for the desired organism *in addition* to all other running
workloads. For mammalian genomes such as the human genome, this typically
means that the local computer must have at least 5-6 GB of RAM.
## To run
Run `$CROSSBOW_HOME/cb_local` with the desired options. Options unique to local
jobs are described in the following subsection. Options that apply to all
running modes are described in the [General Crossbow options] subsection. To
see example invocations of `$CROSSBOW_HOME/cb_local` see the [E. coli local] and
[Mouse chromosome 17 local] examples.
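As a minimal sketch of such an invocation (the expanded reference jar path is
illustrative):

    $CROSSBOW_HOME/cb_local \
        --input=$CROSSBOW_HOME/example/e_coli/small.manifest \
        --preprocess \
        --reference=$HOME/crossbow-refs/e_coli \
        --output=output_small \
        --all-haploids \
        --cpus=4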
## Local-run-specific options
--reference <path>
Local path where the expanded reference jar is located. The specified path
should have an `index` subdirectory with a set of Bowtie index files, a
`sequences`
subdirectory with a set of FASTA files, a `snps` subdirectory with 0 or more
per-chromosome SNP description files, and a `cmap.txt` file. Pre-built
reference jars for some commonly studied species (including human and mouse) are
available from the [Crossbow web site]; these can be downloaded and expanded
into a directory with the appropriate structure using an `unzip` utility. If
your desired genome and/or SNP annotations are not available in pre-built form,
you will have to make your own reference jars and specify the appropriate path.
This option must be specified.
[Crossbow web site]: http://bowtie-bio.sf.net/crossbow
[HDFS]: http://hadoop.apache.org/common/docs/current/hdfs_design.html
[`unzip`]: http://en.wikipedia.org/wiki/Unzip
--input <path>
Local path where the input is located. If `--preprocess` or
`--just-preprocess` are specified, this should point to a [manifest file].
Otherwise, this should point to a directory containing preprocessed reads. This
option must be specified.
--output <path>
Local path where the output is to be deposited. If `--just-preprocess` is
specified, the output consists of the preprocessed reads. Otherwise, the output
consists of the SNP calls calculated by SOAPsnp for each chromosome, organized
as one file per chromosome. This option must be specified.
--intermediate <path>
Local path where all intermediate results should be kept temporarily (or
permanently, if `--keep-intermediates` or `--keep-all` are specified).
Default: `/tmp/crossbow/intermediate/<PID>`.
--preprocess-output <path>
Local path where the preprocessed reads should be stored. This can be useful if
you later want to run Crossbow on the same input reads without having to re-run
the preprocessing step (i.e. leaving `--preprocess` unspecified).
--keep-intermediates
Keep intermediate directories and files, i.e. the output from all stages prior
to the final stage. By default these files are deleted as soon as possible.
--keep-all
Keep all temporary files generated during the process of binning and sorting
data records and moving them from stage to stage, as well as all intermediate
results. By default these files are deleted as soon as possible.
--cpus <int>
The maximum number of processors to use at any given time during the job.
Crossbow will try to make maximal use of the processors allocated. Default: 1.
--max-sort-records <int>
Maximum number of records to be dispatched to the sort routine at one time when
sorting bins before each reduce step. For each child process, this number is
effectively divided by the number of CPUs used (`--cpus`). The default is
200000.
--max-sort-files <int>
Maximum number of files that can be opened at once by the sort routine when
sorting bins before each reduce step. For each child process, this number is
effectively divided by the number of CPUs used (`--cpus`). The default is 40.
--bowtie <path>
Path to the Bowtie executable to use when running the Align step. This
overrides all other ways that Crossbow searches for `bowtie`, including the
`CROSSBOW_BOWTIE_HOME` environment variable, the subdirectories of the
`$CROSSBOW_HOME/bin` directory, and the `PATH`.
--fastq-dump <path>
Path to the directory containing the programs in the [SRA toolkit], including
`fastq-dump`. This overrides all other ways that Crossbow searches for
`fastq-dump`, including the `CROSSBOW_SRATOOLKIT_HOME` environment variable, the
subdirectories of the `$CROSSBOW_HOME/bin` directory, and the `PATH`.
--soapsnp <path>
Path to the SOAPsnp executable to use when running the Call SNPs step. This
overrides all other ways that Crossbow searches for `soapsnp`, including the
`CROSSBOW_SOAPSNP_HOME` environment variable, the subdirectories of the
`$CROSSBOW_HOME/bin` directory, and the `PATH`.
# General Crossbow options
The following options can be specified regardless of what mode ([EMR],
[Hadoop] or local) Crossbow is run in.
--quality { phred33 | phred64 | solexa64 }
Treat all input reads as having the specified quality encoding. `phred33`
denotes the [Phred+33] or "Sanger" format whereby ASCII values 33-126 are used
to encode qualities on the [Phred scale]. `phred64` denotes the [Phred+64] or
"Illumina 1.3+" format whereby ASCII values 64-126 are used to encode qualities
on the [Phred scale]. `solexa64` denotes the [Solexa+64] or "Solexa/Illumina
1.0" format whereby ASCII values 59-126 are used to encode qualities on a
[log-odds scale] that includes values as low as -5. Default: `phred33`.
[Phred scale]: http://en.wikipedia.org/wiki/Phred_quality_score
[Phred+33]: http://en.wikipedia.org/wiki/FASTQ_format#Encoding
[Phred+64]: http://en.wikipedia.org/wiki/FASTQ_format#Encoding
[Solexa+64]: http://en.wikipedia.org/wiki/FASTQ_format#Encoding
[log-odds scale]: http://en.wikipedia.org/wiki/FASTQ_format#Variations
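For example, under `phred33` the ASCII character `I` (value 73) encodes quality
73 - 33 = 40, while under `phred64` the same quality 40 would be encoded by `h`
(value 104).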
--preprocess
The input path or URL refers to a [manifest file] rather than a directory of
preprocessed reads. The first step in the Crossbow computation will be to
preprocess the reads listed in the [manifest file] and store the preprocessed
reads in the intermediate directory or in the `--preprocess-output` directory if
it's specified. Default: off.
--just-preprocess
The input path or URL refers to a [manifest file] rather than a directory of
preprocessed reads. Crossbow will preprocess the reads listed in the [manifest
file] and store the preprocessed reads in the `--output` directory and quit.
Default: off.
--just-align
Instead of running the Crossbow pipeline all the way through to the end, run the
pipeline up to and including the align stage and store the results in the
`--output` URL. To resume the run later, use `--resume-align`.
--resume-align
Resume the Crossbow pipeline from just after the alignment stage. The
`--input` URL must point to an `--output` URL from a previous run using
`--just-align`.
--bowtie-args "<args>"
Pass the specified arguments to [Bowtie] for the Align stage. Default: `-M
1`. See the [Bowtie manual] for details on what options are available.
[`-M 1`]: http://bowtie-bio.sf.net/manual.shtml#bowtie-options-M
[Bowtie manual]: http://bowtie-bio.sf.net/manual.shtml
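For example, to have [Bowtie] report alignments with at most two mismatches
while keeping the default `-M 1` behavior (a sketch using standard Bowtie
options):

    --bowtie-args "-v 2 -M 1"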
--discard-reads <fraction>
Randomly discard a fraction of the input reads. E.g. specify `0.5` to discard
50%. This applies to all input reads regardless of type (paired vs. unpaired)
or length. This can be useful for debugging. Default: 0.0.
--discard-ref-bins <fraction>
Randomly discard a fraction of the reference bins prior to SNP calling. E.g.
specify `0.5` to discard 50% of the reference bins. This can be useful for
debugging. Default: 0.0.
--discard-all <fraction>
Equivalent to setting `--discard-reads` and `--discard-ref-bins` to
`<fraction>`. Default: 0.0.
--soapsnp-args "<args>"
Pass the specified arguments to [SOAPsnp] in the SNP calling stage. These
options are passed to SOAPsnp regardless of whether the reference sequence under
consideration is diploid or haploid. Default: `-2 -u -n -q`. See the [SOAPsnp
manual] for details on what options are available.
[SOAPsnp manual]: http://soap.genomics.org.cn/soapsnp.html
--soapsnp-hap-args "<args>"
Pass the specified arguments to [SOAPsnp] in the SNP calling stage when the
reference sequence under consideration is haploid. Default: `-r 0.0001`. See
the [SOAPsnp manual] for details on what options are available.
--soapsnp-dip-args "<args>"
Pass the specified arguments to [SOAPsnp] in the SNP calling stage when the
reference sequence under consideration is diploid. Default: `-r 0.00005 -e
0.0001`. See the [SOAPsnp manual] for details on what options are available.
--haploids <chromosome-list>
The chromosomes in the specified comma-separated list of names are treated as
haploid by SOAPsnp; the rest are treated as diploid. Default: all chromosomes
are treated as diploid.
--all-haploids
If specified, all chromosomes are treated as haploid by SOAPsnp.
--partition-len <int>
The bin size to use when binning alignments into partitions prior to SNP
calling. If load imbalance occurs in the SNP calling step (some tasks taking
far longer than others), try decreasing this. Default: 1,000,000.
--dry-run
Just generate a script containing the commands needed to launch the job, but
don't run it. The script's location will be printed so that you may run it
later.
--test
Instead of running Crossbow, just search for the supporting tools ([Bowtie] and
[SOAPsnp]) and report whether and how they were found. If running in Cloud Mode,
this just tests whether the `elastic-mapreduce` script is locatable and
runnable. Use this option to debug your local Crossbow installation.
--tempdir <path>
Local directory where temporary files (e.g. dynamically generated scripts)
should be deposited. Default: `/tmp/Crossbow/invoke.scripts`.
# Crossbow examples
The following subsections guide you step-by-step through examples included with
the Crossbow package. Because reads (and sometimes reference jars) must be
obtained over the Internet, running these examples requires an active Internet
connection.
## E. coli (small)
Data for this example is taken from the study by [Parkhomchuk et al].
[Parkhomchuk et al]: http://www.pnas.org/content/early/2009/11/19/0906681106.abstract
### EMR
#### Via web interface
Identify an [S3] bucket to hold the job's input and output. You may
need to create an [S3 bucket] for this purpose. See your [S3 tool]'s
documentation.
[S3 bucket]: http://docs.amazonwebservices.com/AmazonS3/latest/index.html?UsingBucket.html
Use an [S3 tool] to upload `$CROSSBOW_HOME/example/e_coli/small.manifest` to
the `example/e_coli` subdirectory in your bucket. You can do so with this
[s3cmd] command:
    s3cmd put $CROSSBOW_HOME/example/e_coli/small.manifest s3://<YOUR-BUCKET>/example/e_coli/
Direct your web browser to the [Crossbow web interface] and fill in the form as
below (substituting for `<YOUR-BUCKET>`):
1. For **AWS ID**, enter your AWS Access Key ID
2. For **AWS Secret Key**, enter your AWS Secret Access Key
3. *Optional*: For **AWS Keypair name**, enter the name of
your AWS keypair. This is only necessary if you would like to be
able to [ssh] into the [EMR] cluster while it runs.
4. *Optional*: Check that the AWS ID and Secret Key entered are
valid by clicking the "Check credentials..." link
5. For **Job name**, enter `Crossbow-Ecoli`
6. Make sure that **Job type** is set to "Crossbow"
7. For **Input URL**, enter
`s3n://<YOUR-BUCKET>/example/e_coli/small.manifest`, substituting
for `<YOUR-BUCKET>`
8. *Optional*: Check that the Input URL exists by clicking the
"Check that input URL exists..." link
9. For **Output URL**, enter
`s3n://<YOUR-BUCKET>/example/e_coli/output_small`, substituting for
`<YOUR-BUCKET>`
10. *Optional*: Check that the Output URL does not exist by
clicking the "Check that output URL doesn't exist..." link
11. For **Input type**, select "Manifest file"
12. For **Genome/Annotation**, select "E. coli" from the drop-down
menu
13. For **Chromosome ploidy**, select "All are haploid"
14. Click Submit
This job typically takes about 30 minutes on 1 `c1.xlarge` [EC2] node. See
[Monitoring your EMR jobs] for information on how to track job progress. To
download the results, use an [S3 tool] to retrieve the contents of the
`s3n://<YOUR-BUCKET>/example/e_coli/output_small` directory.
[ssh]: http://en.wikipedia.org/wiki/Secure_Shell
#### Via command line