This file is not yet complete
This document describes installation and usage of the LASTZ sequence alignment program. LASTZ is a replacement for BLASTZ, and is command-line compatible with BLASTZ's options. [Backward compatible but adds more.]
LASTZ -- Tool for (1) Pairwise DNA sequence alignment and (2) alignment scores inference.
Platform: This package was developed on a Macintosh OS X system, but should work on other Linux or Unix platforms with little change (if any). LASTZ was written in C and compiled with gcc. Some ancillary tools were written in Python, but only use modules available in typical Python installations.
Author: Bob Harris, <rsharris at bx dot psu dot edu>
Date: January XX, 2008
Since this is the first release of LASTZ, this section describes how LASTZ differs from BLASTZ.
The short-by-two error has been corrected in LASTZ.
You can produce textual alignment output by including the command line option
--format=text. Similarly, LASTZ can also create files in maf and axt format.
[Other stuff-- generic seeds, scoring inference etc. etc.]
If you have received the distribution as a packed archive, unpack the archive
by whatever means are appropriate for your computer. The result should be a
directory <somepath>/lastz-distrib-X.XX.XX that contains a src subdirectory
(and some others). You may find it convenient to remove the revision number
(-X.XX.XX) from the directory name.
Before building or installing any of the programs, you will need to do one
of two things. Either create the shell variable
$LASTZ = <somepath>/lastz-distrib-X.XX.XX and add
<somepath>/lastz-distrib-X.XX.XX to your $PATH, or edit
<somepath>/lastz-distrib-X.XX.XX/make-include.mak and change the definition
of installDir to some directory already in your path.
Then to build the LASTZ executable, from bash (or a similar command line
shell), do the commands below. This will build two executables (lastz and
lastz_D) and copy them into your installDir.
    cd <somepath>/lastz-distrib-X.XX.XX/src
    make
    make install
The two executables are the same program. lastz uses integer scores, while lastz_D uses floating-point scores.
A simple self test is included so you can test that the build succeeded. To run it, do this command:
    make test
If the test is successful, you will see no output from this command. Otherwise, you will see the differences between the expected output and the output of your build, plus a line that looks like this:
    make: *** [test] Error 1
[Discuss a few simple command lines here.] [--nogfextend --nogapped produces only the seed hits]
[move this to the end of the document.] When a sequence is being aligned to itself, the full alignment result will contain mirror-image copies of each alignment block. It is computationally wasteful to process both copies. LASTZ can address this problem in three different ways.
The first way is to simply give LASTZ the same sequence as target and query. In this case, LASTZ does not know that it is aligning a sequence to itself, and performs computation on both copies. A typical result would look like this:
The second way is to replace the query by the --self option.
LASTZ will save computation by only computing on one block of each mirror-image
pair. It still reports both copies in the output, as shown below. Note that
it leaves out the trivial self-alignment block along the diagonal.
The third way is to replace the query by --self and add the --nomirror
option. In this case LASTZ only reports one copy of each mirror-image pair,
as shown here:
If you are familiar with BLASTZ, you can run LASTZ the same as you ran BLASTZ, with the same options and input files. In addition to BLASTZ compatibility, LASTZ provides other options.
The general format of the lastz command line is
    lastz target_specifier [query_specifier] [options]
The square bracket symbols ([ and ]) indicate optional command-line elements.
Elements can appear in any order, the only constraint being that, if present,
the query_specifier must appear after the target_specifier.
The target_specifier and query_specifier are usually just the names of files
containing the two sequences to be aligned, either in FASTA or nib format.
They can also specify subsequences; details are given below.
The general format for options is --<name>[=<value>]. For BLASTZ
compatibility some options can be set with <letter>=<number>.
[How to prevent line break inside the double hyphen?]
Running the command lastz without specifiers or options gives a list of the
most commonly used options. Running lastz --all gives a list of all the
options.
A sequence specifier normally just indicates the file to be used in the alignment. However, a file's subrange and strand can be specified, as well as a masking file, a sequence selector and a nickname for the sequence. The format is
[nickname::]file_name[/select_name][{mask_file}][[start,end]][-]
<start>
and <end> are positions indicating a subrange. When present, the subrange
must be surrounded by square brackets. Subrange indices begin with 1 and are
inclusive. For example, [201,300] is a 100 bp subrange that skips the first
200 bp in the file. A minus sign suffix indicates that the reverse complement
of the subrange should be used, or of the whole file if no subrange is
included. Additionally, the reverse complement is used if the subrange has
<start> larger than <end>.
<select_name> is only valid for the 2BIT file format, and specifies the single
sequence from that file to use, rather than all sequences.
<nickname> is a name to use for this sequence in any output files.
{mask_file} specifies a file to use to mask bases. [Need to discuss masking file format in formats section.]
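As a rough illustration of the specifier format, the sketch below splits a specifier into its parts with a regular expression. It is not how LASTZ itself parses the command line; `parse_specifier` and `SPECIFIER` are hypothetical names, the `/select_name` field is left out for simplicity, and combining a minus suffix with a reversed subrange is simply treated as "reverse complement", which is an assumption.

```python
import re

# Pattern for [nickname::]file_name[{mask_file}][[start,end]][-].
# The /select_name field (2BIT only) is deliberately omitted from this sketch.
SPECIFIER = re.compile(
    r"^(?:(?P<nickname>[^:]+)::)?"
    r"(?P<file>[^{}\[\]]+?)"
    r"(?:\{(?P<mask>[^}]*)\})?"
    r"(?:\[(?P<start>\d+),(?P<end>\d+)\])?"
    r"(?P<minus>-)?$"
)

def parse_specifier(text):
    """Split a sequence specifier into its parts (hypothetical helper)."""
    match = SPECIFIER.match(text)
    if match is None:
        raise ValueError("unrecognized sequence specifier: " + text)
    parts = match.groupdict()
    reverse = parts.pop("minus") is not None
    start, end = parts.pop("start"), parts.pop("end")
    if start is not None:
        start, end = int(start), int(end)
        if start > end:
            # A start larger than end also requests the reverse complement.
            start, end = end, start
            reverse = True
        parts["length"] = end - start + 1  # indices are origin 1, inclusive
    parts.update(start=start, end=end, reverse=reverse)
    return parts

spec = parse_specifier("apple::malus.fa{mask.txt}[201,300]-")
print(spec["file"], spec["length"], spec["reverse"])
```

For the subrange [201,300] the computed length is 100, matching the example in the text.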
The --self option can be used, in place of the query specifier, to perform
self-alignment of the target sequence. Using --self, rather than specifying
the same file as target and query, lets LASTZ know that it can avoid redundant
computation on mirror-image copies of the same alignment blocks.
LASTZ gives the “user” many seeding choices. The intent is that these will be selected by some program (hence the quote marks around “user”), but they are available from the command line for any user.
N-mer match: A space-free seed can be specified by the length of the N-mer match required.
--seed=match(<length>)
General seed patterns: Any spaced seed pattern can be specified. The
pattern is a string of 1s, 0s and Ts, where a 1 indicates that a match is
required in that position, a 0 indicates a mismatch is allowed, and a T
indicates a transition mismatch is allowed.
--seed=<pattern>
The default seed is --seed=1110100110010101111, which is the 12 of 19 seed
used as the default in BLASTZ.
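To make the meaning of 1, 0 and T concrete, here is a minimal sketch that tests whether two equal-length words satisfy a seed pattern. It illustrates the pattern semantics only; the function names are hypothetical, LASTZ's internal hashing is not modeled, and the default allowance of one transition in a match position is ignored.

```python
# Transition substitutions: purine<->purine (A/G) and pyrimidine<->pyrimidine (C/T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

DEFAULT_SEED = "1110100110010101111"

def seed_weight(pattern):
    """Number of positions that must match exactly (the 1s)."""
    return pattern.count("1")

def is_seed_hit(pattern, target_word, query_word):
    """True if two words of the pattern's length satisfy the seed pattern."""
    if not (len(pattern) == len(target_word) == len(query_word)):
        raise ValueError("pattern and words must have the same length")
    for symbol, t, q in zip(pattern, target_word.upper(), query_word.upper()):
        if t == q:
            continue          # a match satisfies 1, 0 and T alike
        if symbol == "1":
            return False      # mismatch where a match is required
        if symbol == "T" and (t, q) not in TRANSITIONS:
            return False      # only transition mismatches are allowed here
    return True

print(seed_weight(DEFAULT_SEED), "of", len(DEFAULT_SEED))
```

Counting the 1s in the default pattern gives a weight of 12 over a span of 19 positions.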
Half-weight seed patterns: If a seed pattern consists of only 0s and Ts, it is implemented internally as a half-weight seed. These seeds can be used in conjunction with filtering on the number of matches or transitions in a seed hit. For example,
--seed=TTT0T00TT00T0T0TTTT --filter=2:15
implements the default seed pattern, requiring the twelve T positions to be
matches or transitions, requiring ten matches total (among the 19 positions),
and allowing not more than two transversions.
Additionally, --seed=half(<length>) can be used to specify a space-free
half-weight seed.
Half-weight seeds use much less memory internally. A half-weight seed uses the same amount of memory as a normal seed pattern half as long.
Single, double or no transitions: By default, any single match position
(a 1 in a spaced seed, or any position in an N-mer match) is allowed to be a
transition. --notransition disables this. Further, --transition=2 allows any
two match positions to be transitions.
Twin hit seeds: The sensitivity of the seed can be decreased by ignoring seed hits that don't have a second hit nearby.
--twins=[<min>:]<maxgap> requires two seed hits on the same diagonal. The
distance between the hits (the number of bases between the end of the first
hit and the beginning of the second) must be between <min> and <maxgap>. If
<min> is absent, zero is used (which means the twin seed hits may be adjacent
but not overlap). Negative values can be used. For example --twins=-5:10
means the twins can overlap by as much as 5 bases or can have as much as 10
bases between them.
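The gap test for twin hits can be sketched as below. This is only an illustration of the distance rule stated above; `is_twin` and its parameters are hypothetical names, and both hits are assumed to lie on the same diagonal already, so positions along one sequence are enough.

```python
def is_twin(first_start, second_start, seed_length, maxgap, mingap=0):
    """Check the --twins distance rule for two seed hits on one diagonal.

    The gap is the number of bases between the end of the first hit and
    the beginning of the second; a negative gap means the hits overlap.
    """
    gap = second_start - (first_start + seed_length)
    return mingap <= gap <= maxgap

# Equivalent of --twins=-5:10 with a 19-position seed:
print(is_twin(100, 114, 19, maxgap=10, mingap=-5))
```

With the default minimum of zero, adjacent hits qualify but overlapping hits do not.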
[Discuss CL options for seeds here.] [Add subsections for other option groups.]
[This is probably not up-to-date.]
--self |
The target sequence is also the query. |
--quantum |
The query sequence contains quantum DNA. |
--seed=<pattern> |
Use a pattern for seed/hit discovery. |
--seed=match(<length>) |
Use a word with no gaps instead of a seed pattern. |
--seed=half(<length>) |
Use space-free half-weight word instead of seed pattern. |
--[no]trans[ition][=2] |
Allow one or two transitions in a seed hit.
(by default a transition is allowed) |
--word=<bits> |
Set max bits for word hash; use this to trade time for
memory, eliminating thrashing for heavy seeds.
(default is 28 bits) |
--[no]filter=[<T>:]<M> |
Filter half-weight seed hits, requiring at least M
matches and allowing no more than T transversions.
(default is no filtering) |
--notwins |
Require just one seed hit. |
--twins=[<min>:]<maxgap> |
Require two nearby seed hits on the same diagonal.
(default is twins aren't required) |
--recoverhits |
Recover hash-collision seed hits.
(default is not to recover seed hits) |
--step=<length> |
Set step length. (default is 1) |
--both[strands] |
Search both strands. |
--plus[strand] |
Search + strand only (strand matching the query specifier). |
--minus[strand] |
Search - strand only (opposite strand of query specifier).
(by default both strands are searched) |
--[no]gfextend |
Perform gap-free extension of seed hits to HSPs.
(by default extension is performed) |
--[no]chain |
Perform chaining. |
--chain=<diag,anti> |
Perform chaining with given penalties for diagonal and
anti-diagonal.
(by default no chaining is performed) |
--[no]gapped |
Perform gapped alignment (instead of gap-free).
(by default gapped alignment is performed) |
--score[s]=<file> |
Read substitution scores from a file.
(default is HOXD70) |
--unitscore[s] |
Scores are +1/-1 for match/mismatch. |
--gap=<[open,]extend> |
Set gap open and extend penalties. (default is 400,30) |
--xdrop=<score> |
Set x-drop threshold. (default is 10*sub[A][A]) |
--ydrop=<score> |
Set y-drop threshold. (default is open + 300*extend) |
--infer[=<control>] |
Infer scores from the sequences, then use them to align the sequences. Parameters controlling the inference process are read from the control file (see the format in the file formats section) |
--inferonly[=<control>] |
Infer scores, but don't use them (requires --infscores). |
--infscores[=<file>] |
Write inferred scores to a file (or to stdout). |
--hspthresh=<score> |
Set threshold for high scoring pairs;
ungapped extensions scoring lower are discarded.
(default is 3000) |
--inner=<score> |
Set threshold for HSPs during interpolation.
(default is no interpolation) |
--gappedthresh=<score> |
Set threshold for gapped alignments;
gapped extensions scoring lower are discarded.
(default is to use same value as --hspthresh) |
--ball=<score> |
Set minimum score required of words ‘in’ a quantum ball. |
--[no]entropy |
Involve entropy in filtering high scoring pairs.
(default is “entropy”) |
--[no]mirror |
Report/use mirror image of all gap-free alignments.
(default is “mirror” for self-alignments only) |
--traceback=<bytes> |
Space for trace-back information.
(default is 80.0M) |
--masking=<count> |
Mask any position in target hit this many times.
Zero indicates no masking.
(default is no masking) |
--[no]census |
Count/report how many times each target base aligns.
(default is to not report census) |
--identity=[<min>..]<max> |
Filter alignments by percent identity, 0 ≤ min ≤ max ≤ 100;
blocks (or HSPs) outside the range are discarded.
[how to keep double hyphen from wrapping?]
(default is no identity filtering) |
--code=<file> |
Give quantum code for query sequence (only for display). |
--format=<type> |
Specify output format; one of lav, axt, maf, text,
lav+text, gfa, identity, infstats([<min>..]<max>).
<min> and <max> are identity filtering range.
(by default output format is LAV) |
--verbosity=<level> |
Set info level (0 is minimum, 10 is everything).
(default is 0) |
--[no]runtime |
Report runtime in the output file.
(default is to not report runtime) |
--tableonly[=count] |
Just produce the target position table, don't search for seeds. |
--[no]stats[=<file>] |
Show search statistics (or don't)
(only available in lastz_stats build) |
--help |
List all options. |
--short[cuts] |
List blastz-compatible shortcuts. |
[need more here.] Log-odds scores can be inferred from the sequences given. The resulting scoring set can be saved to a file or immediately used to align the sequences.
To infer scores, use the --infer or --inferonly option. The latter will stop
after inferring scores. The --infscores[=<file>] option causes the inferred
scoring set to be written to a file. Otherwise, the scoring set is written to
the header of the alignment output file (as a comment). As a last resort, if
alignment is not performed the scoring set is written to the console. The
file format is the same used to input scoring sets.
Inference is achieved by computing the probability of each of the 18 different alignment events (gap open, gap extend and 16 substitution events). The probabilities are estimated from alignments of the sequences. Of course, we don't have alignments, so we start by using a generic scoring set to generate alignments, infer scores, then realign, and so on, until scores “converge”. We perform ungapped alignments until the substitution scores converge, then perform gapped alignments (holding substitution scores constant) until gap scores converge.
Control options for the inference process can be provided in a control file
specified as part of the --infer or --inferonly options.
Usually it is undesirable to use all alignment blocks during inference. Blocks with a high rate of substitutions (low identity) are likely to be false positives. On the other hand, blocks with few substitutions (high identity) will be found regardless of what scores we use. Thus it is desirable to base inference only on statistics from a mid range of identity. By default, we use the middle 50% (that is, the 25th through 75th percentile of identity found in the alignment), but this can be changed in the control file.
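The idea of keeping only the mid range of identity can be sketched as follows. This is an illustration, not LASTZ's implementation: `select_mid_identity` is a hypothetical helper, blocks are represented as (name, percent identity) pairs, and the percentile cut-offs use a simple rank-based rule that may differ from the one LASTZ applies.

```python
def select_mid_identity(blocks, low_pct=25.0, high_pct=75.0):
    """Keep blocks whose identity rank falls between two percentiles.

    blocks is a list of (name, percent_identity) pairs.
    """
    ordered = sorted(blocks, key=lambda block: block[1])
    count = len(ordered)
    first = int(count * low_pct / 100.0)   # rank of the low percentile
    last = int(count * high_pct / 100.0)   # rank of the high percentile
    return ordered[first:last]

blocks = [("a", 50), ("b", 99), ("c", 70), ("d", 80),
          ("e", 60), ("f", 90), ("g", 95), ("h", 100)]
print([name for name, _ in select_mid_identity(blocks)])
```

With eight blocks and the default 25th to 75th percentile range, the four blocks in the middle of the identity ordering are kept.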
[check this.] LASTZ usually receives two sequences and a scores file as inputs, and produces an alignment file as output.
DNA sequences can be provided in FASTA, NIB or 2BIT format. These sequences can contain a series of A, C, G, T and N in upper or lower case. Lower case indicates repeat-masked bases, while N indicates unknown sequence (and is also often used as a separator). Additionally, a quantum DNA sequence can be provided as the query.
FASTA and 2BIT formats support more than one sequence within the same file. Files containing multiple sequences can only be used as the query file, not the target. However, an exception is made for 2BIT targets (see the select_name field above).
FASTA format stores DNA sequences as plain text. The first line should begin with a “>” followed by the name of the sequence. Remaining lines should contain DNA. They can be of any length.
If the file contains multiple sequences, each should start with the “>” header line.
NIB format stores a single unnamed DNA sequence, packed as two bases per byte. As of January, 2008, a spec for NIB files can be found at http://genome.ucsc.edu/FAQ/FAQformat#format8.
2BIT format stores multiple DNA sequences encoded as four bases per byte, with some additional information describing runs of masked bases or Ns. As of January, 2008, a spec for 2BIT files can be found at http://genome.ucsc.edu/FAQ/FAQformat#format7.
[Discuss Quantum DNA and quantum code files.]
[Discuss Scores Files.] [blastz score matrix files] [quantum] Here's an example:
    # This matches the default scoring set for blastz
    bad_score          = X:-1000  # used for sub['X'][*] and sub[*]['X']
    fill_score         = -100     # used when sub[*][*] not defined
    gap_open_penalty   = 400
    gap_extend_penalty = 30
          A     C     G     T
    A    91  -114   -31  -123
    C  -114   100  -125   -31
    G   -31  -125   100  -114
    T  -123   -31  -114    91
The score set consists of a substitution matrix and other settings. The other settings come first. Any line may contain a comment (# is the comment character).
<row>:<col>:<score>
Both <row> and <col> are optional.
       A     C     G     T
      91  -114   -31  -123
    -114   100  -125   -31
     -31  -125   100  -114
    -123   -31  -114    91
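A reader for score files of the shape shown in this section might look like the sketch below. It is based only on the examples here, not on a full specification: `parse_scores` is a hypothetical helper, settings are kept as uninterpreted strings, and rows without a label are assumed to appear in the same order as the column labels.

```python
EXAMPLE_SCORES = """\
# example scoring set
bad_score          = X:-1000  # used for sub['X'][*] and sub[*]['X']
fill_score         = -100     # used when sub[*][*] not defined
gap_open_penalty   = 400
gap_extend_penalty = 30
      A     C     G     T
A    91  -114   -31  -123
C  -114   100  -125   -31
G   -31  -125   100  -114
T  -123   -31  -114    91
"""

def parse_scores(text):
    """Return (settings, matrix) from score-file text (hypothetical helper)."""
    settings, matrix, columns = {}, {}, None
    rows_seen = 0
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        if "=" in line:                        # a "name = value" setting
            name, value = (part.strip() for part in line.split("=", 1))
            settings[name] = value
            continue
        fields = line.split()
        if columns is None:                    # first matrix line: column labels
            columns = fields
            continue
        if len(fields) == len(columns) + 1:    # row carries its own label
            row, scores = fields[0], fields[1:]
        else:                                  # assumed: rows follow column order
            row, scores = columns[rows_seen], fields
        matrix[row] = {col: int(score) for col, score in zip(columns, scores)}
        rows_seen += 1
    return settings, matrix

settings, matrix = parse_scores(EXAMPLE_SCORES)
print(settings["gap_open_penalty"], matrix["A"]["C"])
```

The same function accepts a matrix without row labels, such as the second layout shown above.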
[Need more information here.]
Here's an example:
    # base inference on alignments in the middle 50 percentile
    # by percent-identity
    min_identity = 25.0%   # 25th percentile
    max_identity = 75.0%   # 75th percentile

    # scale scores so max substitution will be 100 and only use
    # alignments scoring as well as 20 substitutions
    inference_scale  = 100                # score for max substitution
    hsp_threshold    = 20*inference_scale
    gapped_threshold = hsp_threshold

    # allow substitution score inference to iterate at most
    # 20 times; don't perform gap score inference-- instead
    # hardwire gap scores relative to max substitution
    max_sub_iterations = 20
    max_gap_iterations = 0
    gap_open_penalty   = 4*inference_scale
    gap_extend_penalty = 0.3*inference_scale
min_identity and max_identity specify the range of sequence identity upon
which inference is based. Only alignment blocks within this range contribute
to inference. If the value ends with a percent sign, the range is a
percentile of the values found in the overall alignment. Otherwise it is a
fixed percentage. For example, min_identity=70 and max_identity=90 indicates
that blocks with identity ranging from 70 to 90 percent will be used, while
min_identity=25% and max_identity=75% indicates that 50 percent of the blocks
will be used (the middle 50 percent).
inference_scale specifies a value for the largest substitution score (i.e.
the score for the best match). All other scores are scaled accordingly. If
this is set to none, scores are log-odds using base 2 logs.
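The relationship between log-odds scores and inference_scale can be sketched as below. This uses the textbook log-odds definition (base 2 log of the observed pair probability over the product of the background probabilities) and a plain linear rescale with rounding; whether LASTZ computes and rounds scores exactly this way is an assumption, and both function names are hypothetical.

```python
import math

def log_odds_scores(pair_prob, background):
    """Base 2 log-odds score for each substitution event (x, y)."""
    return {
        (x, y): math.log2(p / (background[x] * background[y]))
        for (x, y), p in pair_prob.items()
    }

def rescale(scores, inference_scale):
    """Scale scores linearly so the largest one equals inference_scale."""
    factor = inference_scale / max(scores.values())
    return {pair: round(value * factor) for pair, value in scores.items()}

# Toy probabilities covering two of the 16 substitution events,
# chosen for easy arithmetic rather than taken from real data.
background = {"A": 0.25, "C": 0.25}
pair_prob = {("A", "A"): 0.125, ("A", "C"): 0.03125}

raw = log_odds_scores(pair_prob, background)
print(rescale(raw, 100))
```

In this toy case the raw log-odds scores are +1 and -1 bits, which become 100 and -100 after scaling.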
hsp_threshold and gapped_threshold correspond to the command line --hspthresh
and --gappedthresh options (also known as K and L in BLASTZ lingo).
max_sub_iterations and max_gap_iterations specify limits on the number of
iterations that will be performed. For example, if you only want a
substitution scoring matrix, you can set max_gap_iterations=0.
gap_open_penalty and gap_extend_penalty correspond to the command line
--gap=<[open,]extend> option (also known as O and E in BLASTZ lingo). These
are the values used for the first iteration of gap-scoring inference.
step (which is not shown in the example above) corresponds to the command
line --step option (also known as Z in BLASTZ lingo). A large step, e.g.
step=100, could potentially speed up the inference process. Ideally, this
would base inference on a subsample of only one percent of the whole.
However, the subsample actually ends up larger than that and is biased toward
HSPs that are either longer or have a lower substitution rate. This happens
because subsampling occurs at the seed-hit level, and such HSPs generally have
more seed hits. Future versions of LASTZ may include a means to compensate
for this bias.
entropy (which is not shown in the example above) corresponds to the command
line --entropy option (also known as P in BLASTZ lingo). Legal values are
“on” or “off”. If on, sequence entropy is incorporated in the filtering of
high scoring pairs.
[Discuss LAV.] Here's a typical lav file:
    #:lav
    d {
      "lastz.v0.3 malus.fa aurantium.fa C=2 W=8 T=0
         A    C    G    T
        91 -114  -31 -123
      -114  100 -125  -31
       -31 -125  100 -114
      -123  -31 -114   91
      O = 400, E = 30, K = 3000, L = 3000, M = 0"
    }
    #:lav
    s {
      "malus.fa" 1 191411218 0 1
      "aurantium.fa" 1 90634903 0 1
    }
    h {
      "> apple"
      "> orange"
    }
    a {
      s 20643
      b 46566766 2083211
      e 46567353 2083795
      l 46566766 2083211 46566796 2083241 61
      l 46566797 2083245 46566814 2083262 78
      l 46566821 2083263 46567353 2083795 65
    }
    a {
      s 4233
      b 47246530 10635696
      e 47246660 10635826
      l 47246530 10635696 47246660 10635826 63
    }
    ... many more a-stanzas ...
    #:lav
    s {
      "malus.fa" 1 191411218 0 1
      "aurantium.fa-" 1 90634903 1 1
    }
    h {
      "> apple"
      "> orange (reverse complement)"
    }
    a {
      s 13897
      b 1005819 5352698
      e 1006099 5352978
      l 1005819 5352698 1006099 5352978 74
    }
    ... many more a-stanzas ...
    #:eof
A lav file primarily consists of a series of “stanzas”, each being a single
letter type code followed by a brace-delimited block. Additionally there are
lines containing #:lav which break the file into sections, and #:eof
indicating the end of the file. Programs that read lav files should consider
the file bad if the #:eof is absent (or if anything appears after it).
<filename> <start> <stop> <rev_comp_flag> <contig>
<start>
and <stop> are origin 1 and inclusive. Usually <start> is 1 and <stop> is the
length of the sequence. However, they can indicate any subsequence in the
file. <rev_comp_flag> is 1 if the sequence has been reverse-complemented by
LASTZ. <contig> is used when the file contains multiple sequences. The first
contig is 1, the second is 2, and so on. This is only valid for the second
sequence file.
    a {
      s 13916
      b 4886 21292
      e 5171 21537
      l 4886 21292 4899 21305 79
      l 4900 21308 4924 21332 92
      l 4925 21334 5024 21433 88
      l 5027 21434 5040 21447 100
      l 5086 21448 5117 21479 84
      l 5118 21484 5171 21537 87
    }
Indices in an a-stanza are origin 1 and inclusive, and are relative to the subsequences indicated in the most recent s-stanza. In the example below the alignment is of apple 1301..1400 to orange 2501..2600.
    s {
      "malus.fa" 1001 2000 0 1
      "aurantium.fa" 2001 5000 0 1
    }
    ...
    a {
      s 7321
      b 301 501
      e 400 600
      l 301 501 400 600 82
    }
For reverse-complemented sequences, indices are counted from the end of the subsequence. So the example below represents an alignment of apple 1301..1400 to the reverse complement of orange 90632304..90632403. In detail, aurantium.fa contains 90634903 bp (this information is not available in the lav file), which the s-stanza indicates should be read as reverse-complement. Only bp 2001..5000 of the reverse-complement are read, which correspond to 90629904..90632903 in the unreversed sequence. Within this 3000 bp subsequence, the alignment is at 501..600, or 2501..2600 along the reversed sequence, or 90632304..90632403 in the unreversed sequence.
    s {
      "malus.fa" 1001 2000 0 1
      "aurantium.fa-" 2001 5000 1 1
    }
    ...
    a {
      s 7321
      b 301 501
      e 400 600
      l 301 501 400 600 82
    }
The fifth numeric column in an a-stanza's l line is the match percentage value (often called “percent identity” or “percent id”). This is used by viewer tools such as laj and pipmaker (available at http://www.bx.psu.edu/miller_lab).
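The index arithmetic for a-stanzas described above can be written out as a short sketch. `lav_to_forward` is a hypothetical helper, not part of LASTZ; it assumes the caller already knows the full sequence length, which is not recorded in the lav file.

```python
def lav_to_forward(a_start, a_end, sub_start, seq_length, rev_comp):
    """Convert a-stanza indices to positions on the unreversed sequence.

    a_start, a_end : origin-1 inclusive indices from the a-stanza
    sub_start      : <start> of the subsequence in the latest s-stanza
    seq_length     : full length of the sequence in the file
    rev_comp       : True if the s-stanza marks the sequence as
                     reverse-complemented
    """
    lo = sub_start + a_start - 1
    hi = sub_start + a_end - 1
    if rev_comp:
        # Positions were counted along the reversed sequence; flip them.
        lo, hi = seq_length + 1 - hi, seq_length + 1 - lo
    return lo, hi

print(lav_to_forward(301, 400, 1001, 191411218, False))  # apple
print(lav_to_forward(501, 600, 2001, 90634903, True))    # orange, reversed
```

These two calls reproduce the worked example: apple 1301..1400 and orange 90632304..90632403.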
    x {
      n <count>
    }
A single m-stanza is then included in the final section, and looks like this:
    m {
      x <start> <end>
      x <start> <end>
      x <start> <end>
      x <start> <end>
      ...
    }
Each line describes an interval in which the positions occur in at least as
many alignments as specified by the --masking=<count> option.
Using the --census option, you will get a Census-stanza. 1, 2, 3, ... are
positions in the sequence (sequence 1). The counts indicate the number of
times the corresponding position appears in an alignment.
    Census {
      1 <count>
      2 <count>
      ...
    }
[Discuss GFA.]
[Discuss MAF.]
[Discuss AXT.]
[Discuss Textual alignments.] [ Warn that we may change the textual format in future versions of LASTZ, it is only provided for eyeballing alignment blocks, not for programs to read. Programs are better off reading Lav, Maf, or Axt formats.]
LASTZ source code includes support for other output formats which are intended mainly for the convenience of the developers. If you have specific questions, please contact us.
[Provide guidance on what params to use when.]