Summary:
Software to extract sequences from a fasta or fastq file. It can also
filter sequences by a minimum or maximum length. Fast, written in C,
using the kseq.h library.
Pullseq Summary:
pullseq - extract sequences from a fasta/fastq file. This program is
fast, and can be useful in a variety of situations. You can use it to
extract sequences from one fasta/fastq file into a new file, given
either a list of header ids to include or a regular expression
pattern to match. Results can be included (default) or excluded,
and they can additionally be filtered with minimum / maximum sequence
lengths.
Additionally, it can convert from fastq to fasta or vice versa and
can change the length of the output sequence lines.
NOTE: pullseq prints to standard out, so you need to use redirection
(e.g. pullseq -i input.fasta -m 10 > output.fasta) to create output files.
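As a minimal sketch of the workflow described above, the commands below create a tiny fasta file and a list of header ids, then extract records by name and by minimum length, redirecting standard out into new files. The file names (example.fasta, names.txt, selected.fasta, long.fasta) and the sequence data are made up for illustration, and the names file is assumed to hold one header id per line. The pullseq calls run only if the binary is on your PATH.

```shell
# Use a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# A tiny fasta file with three records (hypothetical data).
printf '>seq1\nACGT\n>seq2\nACGTACGTACGT\n>seq3\nACGTACGT\n' > "$workdir/example.fasta"

# Header ids to select (assumed format: one id per line).
printf 'seq1\nseq3\n' > "$workdir/names.txt"

if command -v pullseq >/dev/null 2>&1; then
    # Select the records whose header ids are listed in names.txt.
    pullseq -i "$workdir/example.fasta" -n "$workdir/names.txt" > "$workdir/selected.fasta"
    # Select the records that pass a minimum length of 10.
    pullseq -i "$workdir/example.fasta" -m 10 > "$workdir/long.fasta"
else
    echo "pullseq not found on PATH; skipping extraction" >&2
fi
```

Because the output goes to standard out, the same commands can also be piped into other tools instead of being redirected to a file.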
Synopsis:
pullseq -i <input fasta/fastq file> -n <header names to select>
pullseq -i <input fasta/fastq file> -m <minimum sequence length>
pullseq -i <input fasta/fastq file> -g <regex name to match>
pullseq -i <input fasta/fastq file> -m <minimum sequence length> -a <max sequence length>
pullseq -i <input fasta/fastq file> -t
cat <names to select from STDIN> | pullseq -i <input fasta/fastq file> -N
Options:
-i, --input, Input fasta/fastq file (required)
-n, --names, File of header id names to search for
-N, --names_stdin, Use STDIN for header id names
-g, --regex, Regular expression to match (PERL compatible; always case-insensitive)
-m, --min, Minimum sequence length
-a, --max, Maximum sequence length
-l, --length, Sequence characters per line (default 50)
-c, --convert, Convert input to fastq/fasta (e.g. if input is fastq, output will be fasta)
-q, --quality, ASCII code to use for fasta->fastq quality conversions
-e, --excluded, Exclude the header id names in the list (-n)
-t, --count, Just count the possible output, but don't write it
-h, --help, Display this help and exit
-v, --verbose, Print extra details during the run
--version, Output version information and exit
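The options above can be combined in several ways. The following is a hedged sketch rather than tested output: the input file reads.fastq, the id list skip.txt, and the output file names are hypothetical, the id list is assumed to hold one header id per line, and whether -c can be used without a selection option is not stated in the help text, so treat that line as an assumption. The pullseq calls run only if the binary is on your PATH.

```shell
# Scratch directory with a two-record fastq file (hypothetical data).
workdir=$(mktemp -d)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n' > "$workdir/reads.fastq"

# Header ids to leave out (assumed format: one id per line).
printf 'read2\n' > "$workdir/skip.txt"

if command -v pullseq >/dev/null 2>&1; then
    # Convert fastq input to fasta (-c) with 60 sequence characters per line (-l).
    pullseq -i "$workdir/reads.fastq" -c -l 60 > "$workdir/reads.fasta"
    # Exclude (-e) the header ids listed in skip.txt (-n).
    pullseq -i "$workdir/reads.fastq" -n "$workdir/skip.txt" -e > "$workdir/kept.fastq"
    # Select records whose header matches a regular expression (-g).
    pullseq -i "$workdir/reads.fastq" -g 'read1' > "$workdir/matched.fastq"
    # Only count (-t) what a minimum-length filter (-m) would output.
    pullseq -i "$workdir/reads.fastq" -m 5 -t
else
    echo "pullseq not found on PATH; skipping examples" >&2
fi
```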
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Seqdiff Summary:
seqdiff - compare two fasta (or fastq) files to determine overlap of
sequences. This overlap can be at the sequence level (are two
sequences exactly the same in both files?) or at the header name
level (do two sequences contain the same header name between the two
files?).
Synopsis:
seqdiff -1 first_file.fa -2 second_file.fa
Usage:
seqdiff -1 <first input fasta/fastq file> -2 <second fasta/fastq file>
Options:
-1, --first, First sequence file (required)
-2, --second, Second sequence file (required)
-a, --a_output, File name for uniques from first file
-b, --b_output, File name for uniques from second file
-c, --c_output, File name for common entries
-d, --headers, Compare headers instead of sequences (default: false)
-s, --summary, Just show summary stats? (default: false)
-h, --help, Display this help and exit
-v, --verbose, Print extra details during the run
--version, Output version information and exit
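To illustrate the seqdiff options listed above, here is a minimal sketch that compares two small fasta files sharing one record. The file names and sequences are invented for the example, and the seqdiff calls run only if the binary is on your PATH.

```shell
# Scratch directory with two small fasta files (hypothetical data).
# Record "b" appears in both files with the same sequence.
workdir=$(mktemp -d)
printf '>a\nACGT\n>b\nGGCC\n' > "$workdir/first.fa"
printf '>b\nGGCC\n>c\nTTAA\n' > "$workdir/second.fa"

if command -v seqdiff >/dev/null 2>&1; then
    # Show only summary statistics for a sequence-level comparison.
    seqdiff -1 "$workdir/first.fa" -2 "$workdir/second.fa" -s
    # Write uniques from each file and the common entries to separate files.
    seqdiff -1 "$workdir/first.fa" -2 "$workdir/second.fa" \
        -a "$workdir/only_first.fa" \
        -b "$workdir/only_second.fa" \
        -c "$workdir/common.fa"
    # Compare header names instead of sequences.
    seqdiff -1 "$workdir/first.fa" -2 "$workdir/second.fa" -d -s
else
    echo "seqdiff not found on PATH; skipping comparison" >&2
fi
```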
REQUIREMENTS:
Pullseq/Seqdiff require a C compiler and have been tested to work with
either GCC or clang. They also require (and include) kseq.h (Heng
Li) and uthash.h (http://troydhanson.github.com/uthash/).
kseq.h also requires Zlib (so your linker should be able to handle
the '-lz' option). You can obtain zlib from http://www.zlib.net/
or commonly from your OS package manager (e.g. apt-get install
zlib1g-dev or emerge zlib).
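One way to check the zlib requirement before building is to compile and link a tiny test program with '-lz'. This is only a sketch: the file name zcheck.c is hypothetical, and it assumes your C compiler is available as cc.

```shell
# Write a minimal C program that calls into zlib.
workdir=$(mktemp -d)
printf '#include <zlib.h>\nint main(void) { return zlibVersion() == 0; }\n' > "$workdir/zcheck.c"

if command -v cc >/dev/null 2>&1; then
    # Compiling checks for zlib.h; linking with -lz checks for the library.
    if cc "$workdir/zcheck.c" -lz -o "$workdir/zcheck" 2>/dev/null; then
        echo "zlib header and library found"
    else
        echo "zlib not usable; install the zlib development package" >&2
    fi
else
    echo "no C compiler found as cc; skipping zlib check" >&2
fi
```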
NEW INSTALL:
Pullseq uses CMake, so you must have CMake installed on your system.
git clone https://github.com/bcthomas/pullseq.git
cd pullseq
mkdir build
cd build
cmake ..
make
This will build binaries in build/src/
> build/src/pullseq
> build/src/seqdiff
OLD INSTALL:
To install, do the following in a shell on your system...
From Git:
git clone https://github.com/bcthomas/pullseq.git # checkout the code using git
cd pullseq
./bootstrap # get set up for config/build after cloning
./configure # configure the application based on your system
make # will build the application
make install # will install in /usr/local by default
From a Release file (tar or zip):
tar xvf pullseq_version.tar.gz
cd pullseq_version
./autoconf # make sure configuration is set
./configure # configure the application based on your system
make # will build the application
make install # will install in /usr/local by default
NOTE: If you have PCRE (Perl-compatible regular expression library)
installed in a non-standard location (e.g. on a Mac using brew), the
./configure script will fail. You'll need to update your CFLAGS and
LDFLAGS environment variables to point to where your PCRE library
files were installed.
For example, on a Mac with pcre installed by brew, you can do this:
pcre-config --cflags
-I/usr/local/Cellar/pcre/8.39/include
Then add this to the CFLAGS environment variable and run the
configure command, like so...
export CFLAGS="-I/usr/local/Cellar/pcre/8.39/include"
./configure
If your pcre library is installed somewhere else, you just update
the CFLAGS env variable accordingly.
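Rather than copying the path by hand, both variables can be filled from pcre-config directly. This is a sketch under the assumption that pcre-config is on your PATH; the pcre_status variable is only a helper for this example. The --libs option prints the linker flags that correspond to the include path reported by --cflags.

```shell
if command -v pcre-config >/dev/null 2>&1; then
    # Append the PCRE include and linker flags to any existing settings.
    CFLAGS="${CFLAGS:-} $(pcre-config --cflags)"
    LDFLAGS="${LDFLAGS:-} $(pcre-config --libs)"
    export CFLAGS LDFLAGS
    echo "CFLAGS=$CFLAGS"
    echo "LDFLAGS=$LDFLAGS"
    pcre_status="found"
    # Next step: run ./configure from the pullseq source directory.
else
    echo "pcre-config not found; set CFLAGS and LDFLAGS manually" >&2
    pcre_status="missing"
fi
```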