This README file has not undergone internal review and should be considered as preliminary. The information herein is believed to be correct, but (except for this statement) may in fact be incorrect.
This document describes installation and usage of the LASTZ sequence alignment program. LASTZ is a drop-in replacement for BLASTZ, and is backward compatible with BLASTZ's command-line syntax. That is, it supports all of BLASTZ's options but also has additional ones, and may produce slightly different alignment results.
LASTZ: A tool for (1) aligning two DNA sequences, and (2) inferring appropriate scoring parameters automatically.
Platform: This package was developed on a Macintosh OS X system, but should work on other Linux or Unix platforms with little change (if any). LASTZ is written in C and compiled with gcc. Some ancillary tools are written in Python, but only use modules available in typical Python installations.
Author: Bob Harris <rsharris at bx dot psu dot edu>
Date: January 2009
If you have received the distribution as a packed archive, unpack it by whatever means are appropriate for your computer. The result should be a directory <somepath>/lastz-distrib-X.XX.XX that contains a src subdirectory (and some others). You may find it convenient to remove the revision number (-X.XX.XX) from the directory name.
Before building or installing any of the programs, you will need to tell the installer where to put the executable, either by setting the shell variable $LASTZ_INSTALL, or by editing the make-include.mak file to set the definition of installDir. Also, be sure to add the directory you choose to your $PATH.
Then to build the LASTZ executable, enter the following commands from bash (or a similar command-line shell). This will build two executables (lastz and lastz_D) and copy them into your installDir.

    cd <somepath>/lastz-distrib-X.XX.XX/src
    make
    make install

The two executables are basically the same program; the only difference is that lastz uses integer scores, while lastz_D uses floating-point scores.
A simple self test is included so you can test whether the build succeeded. To run it, enter the following command:

    make test

If the test is successful, you will see no output from this command. Otherwise, you will see the differences between the expected output and the output of your build, plus a line that looks like this:

    make: *** [test] Error 1
Comparing a human chromosome and a chicken chromosome
It is often adequate to use a lower sensitivity level than is achieved with LASTZ's defaults. For example, to compare two complete chromosomes, even for species as distant as human and chicken, the alignment landscape is evident even at very low sensitivity settings. This can speed up the alignment process considerably.
This example compares human chromosome 4 to chicken chromosome 4. These sequences can be found in the downloads section of the UCSC Genome Browser. They are 191 and 90 megabases, respectively. To run a quick low-sensitivity alignment of these sequences, we use this command:

    lastz hg18.chr4.fa galGal3.chr4.fa \
      --notransition --step=20 --nochain --nogapped \
      --maf > hg18_4_vs_galGal3_4.maf

This runs in about two and a half minutes on a 2GHz workstation, requiring only 400 MB of RAM. Figure 1 shows the results. LASTZ output, in MAF format, can also be browsed with the GMAJ interactive viewer for multiple alignments, available from Penn State's Miller Lab.

Using --notransition lowers seeding sensitivity and reduces runtime (by a factor of about 10 in this case). --step=20 also lowers seeding sensitivity, reducing runtime and also reducing memory consumption (by a factor of about 3.3 in this case). --nochain skips the syntenic chaining stage, and --nogapped eliminates the computation of gapped alignments. The complete alignment using default settings takes 4.5 hours (and uses 1.3 GB of RAM) on a machine with enough RAM, running at 2.83GHz.
Figure 1
Aligning shotgun reads to a human chromosome
This example compares a simulated set of primate shotgun reads to human chromosome 21. The chromosome can be found in the downloads section of the UCSC Genome Browser (it is about 47 megabases). The reads are part of the LASTZ distribution, in test_data/fake_chimp_reads.2bit, and consist of ten thousand 50-base reads.
To see where these reads map on the human chromosome, we use this command:

    lastz hg18.chr21.fa[unmask] fake_chimp_reads.2bit \
      --step=10 --seed=match12 --notrans --exact=20 \
      --match=1,5 --ambiguousn --nochain \
      --coverage=90 --identity=95 \
      --format=general:name1,start1,length1,name2,strand2 \
      > hg18_4_vs_reads.dat

Attaching unmask to the chromosome filename tells LASTZ to ignore masking information and treat repeats the same as any other part of the chromosome.
Since we know the two species are close, we want to reduce sensitivity. Using --step=10, we will only be looking for seeds at every 10th base. We change from the default seed, using --seed=match12 and --notransition so that our seeds will be exact matches of 12 bases. Instead of the default x-drop extension we use --exact=20 so that a 20 base exact match is required to qualify for gapped extension.
We replace the default score set, which is for more distant species, with the stricter --match=1,5. This scores matching bases as +1 and mismatches as -5. We also use --ambiguousn so that Ns will be scored appropriately (see the discussion in Non-ACGT Characters).
We turn off chaining with --nochain since there is little need to chain alignments within a read.

We are only interested in alignments that involve nearly an entire read, and since the species are close we don't want alignments with low identity. So we use --coverage=90 and --identity=95.
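
To make the two filters concrete, here is a minimal Python sketch of the definitions given for --identity and --coverage in the options table (identity is the percentage of aligned bases that are matches; coverage is the percentage of the shorter sequence that is aligned). The function names are hypothetical, and the way LASTZ itself counts aligned bases (for example, around gaps) is not spelled out here, so treat this as an approximation rather than the program's actual computation.

```python
def percent_identity(matches, aligned_bases):
    """Matches as a percentage of the aligned bases."""
    return 100.0 * matches / aligned_bases


def percent_coverage(aligned_bases, length1, length2):
    """Aligned bases as a percentage of the shorter of the two sequences."""
    return 100.0 * aligned_bases / min(length1, length2)


def passes_filters(matches, aligned_bases, length1, length2,
                   min_identity=95.0, min_coverage=90.0):
    """Mimic --identity=95 --coverage=90 for a single alignment block."""
    identity = percent_identity(matches, aligned_bases)
    coverage = percent_coverage(aligned_bases, length1, length2)
    return identity >= min_identity and coverage >= min_coverage


# A 50-base read against a 47-megabase chromosome: the read is the shorter
# sequence, so coverage is measured against the read length.
print(passes_filters(matches=48, aligned_bases=50,
                     length1=47_000_000, length2=50))
```

With these thresholds a read must align over at least 45 of its 50 bases, and at least 95% of the aligned bases must match.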
For output, we are only interested in where the reads align. So we use --format=general and specify that we want the position on the chromosome (name1, start1, length1) and the read and orientation (name2, strand2). This creates a tab-delimited output file with one line per alignment block, a format that is well-suited for downstream processing by other programs. For example, to count the number of different reads that we've mapped, we can use this command:
cat hg18_4_vs_reads.dat | grep -v "#" | awk '{print $4}' | sort -u | wc
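
The same count can be made in Python, which may be more convenient if further processing is planned. This is only a sketch: the function name is hypothetical, the sample lines are made up for illustration, and it assumes (as the shell pipeline above does) that header lines contain a "#" and that the read name is the fourth whitespace-separated field.

```python
import io


def count_distinct_reads(lines, name_column=3):
    """Count distinct read names in --format=general output.

    Lines containing '#' are skipped, mirroring `grep -v "#"`.
    name_column is zero-based, so 3 selects the fourth field (name2).
    """
    names = set()
    for line in lines:
        if "#" in line:
            continue
        fields = line.split()
        if len(fields) > name_column:
            names.add(fields[name_column])
    return len(names)


# Made-up sample in the layout name1, start1, length1, name2, strand2.
sample = io.StringIO(
    "#name1\tstart1\tlength1\tname2\tstrand2\n"
    "chr21\t1000\t50\tread1\t+\n"
    "chr21\t5000\t50\tread2\t-\n"
    "chr21\t9000\t50\tread1\t+\n"
)
print(count_distinct_reads(sample))  # prints 2
```

To run it on the real output, pass an open file instead of the sample, for example `count_distinct_reads(open("hg18_4_vs_reads.dat"))`.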
Seeds, HSPs, Gapped Alignments, Chaining
This example demonstrates the primary alignment processing stages, using the α-globin regions of cow and human. The α-globin data is included in the LASTZ distribution in test_data/aglobin.2bit. The file contains a 70K base pair segment of human DNA and a 66K base pair segment of cow DNA. We will follow this example through seeding, gap-free extension, chaining and gapped extension.
Figure 2(a) shows the result of the seeding on a small window (3K bp) in the middle of these regions. Seeds are short near-matches; in this case each seed is 19 bp and could have as many as 8 mismatches. There are 338 seeds in this window, but regions where there are many seeds are indistinguishable from line segments.
Figure 2(b) shows high-scoring segment pairs, the result of gap-free extension upon the seeds. There are 11 HSPs (only 10 are apparent in the figure, but one of those is split by a 1 bp shift to the next diagonal). Note that many seeds were discarded because their extensions were low scoring or overlapped.
Figure 2(c) shows the alignment blocks resulting from gapped extension of the HSPs. There are four alignment blocks.
Figure 2(d) zooms out and shows the HSPs for the full sequences. The red box indicates the small region shown in the earlier figures. Figure 2(e) shows the gapped alignment blocks. Figure 2(f) demonstrates how chaining reduces the alignment blocks to a single syntenic alignment. Note that one can already tell quite a bit about how the sequences align just from looking at the HSPs.
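
The seed test used in Figure 2(a) can be sketched in Python. This is an illustration only, assuming the default 12of19 pattern (1110100110010101111) with one transition allowed; the function name and the direct word-by-word comparison are hypothetical, and LASTZ itself finds seeds through its word index rather than by comparing words pairwise.

```python
SEED_12OF19 = "1110100110010101111"

# Transitions are purine-purine or pyrimidine-pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def is_seed_hit(target_word, query_word, pattern=SEED_12OF19,
                max_transitions=1):
    """Return True if the two words satisfy the spaced seed pattern.

    Positions marked '1' must match, except that up to max_transitions of
    them may be transitions instead. Positions marked '0' are ignored.
    """
    if not (len(target_word) == len(query_word) == len(pattern)):
        raise ValueError("words must be the same length as the pattern")
    transitions = 0
    for t, q, p in zip(target_word, query_word, pattern):
        if p != "1" or t == q:
            continue
        if (t, q) not in TRANSITIONS:
            return False          # a transversion at a required position
        transitions += 1
        if transitions > max_transitions:
            return False
    return True


word = "ACGTACGTACGTACGTACG"
print(is_seed_hit(word, "ACGAACGTACGTACGTACG"))  # mismatch at a '0' position
print(is_seed_hit(word, "GCGTACGTACGTACGTACG"))  # one transition at a '1' position
```

The pattern has 7 positions marked 0, so with one transition allowed a seed can contain as many as 8 mismatches, which is the figure quoted above.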
Figure 2(a): lastz aglobin.2bit/human[34000..37000] aglobin.2bit/cow[35000..38000] --nogfextend --nochain --nogapped
Figure 2(b): lastz aglobin.2bit/human[34000..37000] aglobin.2bit/cow[35000..38000] --gfextend --nochain --nogapped
Figure 2(c): lastz aglobin.2bit/human[34000..37000] aglobin.2bit/cow[35000..38000] --gfextend --nochain --gapped
Figure 2(d): lastz aglobin.2bit/human aglobin.2bit/cow --gfextend --nochain --nogapped
Figure 2(e): lastz aglobin.2bit/human aglobin.2bit/cow --gfextend --nochain --gapped
Figure 2(f): lastz aglobin.2bit/human aglobin.2bit/cow --gfextend --chain --gapped
Aligning a sequence with itself
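When a sequence is aligned to itself, the full result will contain mirror-image copies of each alignment block. It is computationally wasteful to process both copies. LASTZ can handle this situation in four different ways.

Specify the same sequence as both the target and the query. LASTZ treats this as an ordinary alignment, so both copies of each mirror-image pair are computed and reported, along with the trivial self-alignment block along the main diagonal.

Specify the same sequence as both the target and the query, and add the --notrivial option. This is the same as above, except that the trivial self-alignment block is left out of the output.

Use the --self option in place of the query sequence. LASTZ will save work by computing with only one block of each mirror-image pair. It still reports both copies in the output, but note that it leaves out the trivial self-alignment block along the main diagonal.

Use --self in place of the query and also add the --nomirror option. In this case LASTZ reports only one copy of each mirror-image pair, as well as omitting the trivial block.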
In the following figure, we suppose we have a sequence with repeated motifs, in the order α1 β1 γ1 β2 δ1 α2 δ2' γ2. That is, α1 and α2 are ancient duplications, as are β1 and β2, and γ1 and γ2. δ2 is an inversion, a reverse-complement duplicate of δ1.
Figure 3(a): lastz target target
Figure 3(b): lastz target target --notrivial
Figure 3(c): lastz target --self
Figure 3(d): lastz target --self --nomirror
LASTZ is optimized to preprocess one sequence or set of sequences (which we collectively call the target) and then align several queries to it. The general flow of the program is like a pipeline. The output of one stage is the input to the next. The user can choose to skip most stages via command-line options. Stages that are skipped pass their input along to the next stage. Two of the stages, scoring inference and interpolation, are special in that they perform a miniature copy of the pipeline within them.
The general flow is as follows. We read the target sequence(s) into memory, and use that to build an index table that will allow us to quickly map any word in the target to the positions containing that word. Then we read each query sequence in turn, processing them independently. We look up each word in the query and use the index table to find matches, called seeds, in the target. The seeds are extended to longer matches called HSPs (an acronym for high-scoring segment pairs) and filtered based on score. HSPs are chained into the highest-scoring set of syntenic alignments. The remaining HSPs are reduced to single locations (points in the DP matrix) called anchors. The anchors are then extended to local alignments (which may contain gaps), and again filtered by score. We then perform back-end filtering to discard alignment blocks without certain statistical traits. Then we interpolate: we repeat the entire process at a higher sensitivity in the holes between the alignment blocks. Finally, we write the alignment information to a file.
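
The first two steps (indexing the target, then looking up query words) can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions: it uses exact-match words only (in the spirit of --seed=match<length>), a plain dictionary instead of LASTZ's packed word table, and hypothetical function names. The step parameter plays the role described for --step, the offset between successive target words.

```python
from collections import defaultdict


def build_word_index(target, word_length, step=1):
    """Map each word in the target to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(0, len(target) - word_length + 1, step):
        index[target[pos:pos + word_length]].append(pos)
    return index


def find_seeds(index, query, word_length):
    """Return (target_pos, query_pos) pairs for every query word in the index."""
    seeds = []
    for qpos in range(len(query) - word_length + 1):
        word = query[qpos:qpos + word_length]
        for tpos in index.get(word, ()):
            seeds.append((tpos, qpos))
    return seeds


index = build_word_index("ACGTACGT", word_length=4)
print(find_seeds(index, "TACGT", word_length=4))
# prints [(3, 0), (0, 1), (4, 1)]
```

Because the index is built once and then reused, aligning several queries against the same target only repeats the lookup step, which is the usage pattern LASTZ is optimized for.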
An additional stage not described above is scoring inference. This is not usually performed; in typical use it is performed only when sequences for two new species are acquired, to create a scoring file that is used for subsequent alignments of those species.
If you are familiar with BLASTZ, you can run LASTZ the same way you ran BLASTZ, with the same options and input files. In addition to this BLASTZ compatibility, LASTZ provides other options.
The general format of the LASTZ command line is
lastz <target> [<query>] [<options>]
The angle brackets <> indicate meta-syntactic variables that should be replaced with your values, while the square ones [] indicate elements that are optional. Spaces separate fields on the command line; a field that needs to contain a space (e.g. within a file name) must be enclosed in double quotes "". Elements can appear in any order, the only constraint being that, if present, the <query> must appear after the <target>.

Output is generally written to stdout, unless specified otherwise for a particular option.

The <target> and <query> are usually just the names of files containing the two sequences to be aligned, in either FASTA, Nib, or 2Bit format. However they can also specify pre-processing actions such as selecting a subsequence from the file; see Sequence Specifiers for details. With certain options such as --self the <query> is not needed; otherwise if it is left unspecified the query sequence is read from stdin.
A special case is made when the --targetcapsule=<file> option is used. In this case, no target sequence file should be listed, since the target sequence is embedded within the capsule file.

For options, the general format is --<keyword> or --<keyword>=<value>, but for BLASTZ compatibility some options also have an alternative syntax <letter>=<number>.
(Be careful when copying options from the tables below, as some of the hyphens
here are special characters to avoid awkward line wrapping in certain web
browsers. If you have trouble, replace the pasted hyphens with ordinary typed
ones on your command line.)
Running the command lastz without any arguments prints a help message with the most commonly used options, while running lastz --help lists all of the options.
Option | BLASTZ equivalent | Meaning
Where To Look
--strand=both | B=2 | Search both strands.
--strand=plus | B=0 | Search the forward strand only (the one corresponding to the query specifier).
--strand=minus | B=-1 | Search only the reverse complement of the query specifier.
--self | | Perform a self-alignment: the target sequence is also the query. Computation is more efficient than it would be without this option, since only one of each mirror-image pair of alignment blocks is processed (redundant mirror-image alignment blocks are omitted). Moreover, the trivial self-alignment block along the main diagonal is omitted from the output.
--nomirror | | Inhibit the output of mirror-image alignments. Output consists of only one copy of each meaningful alignment block in a self-alignment. This is only allowed with the --self option.
--masking=<count> | M=<count> | Dynamically mask the target sequence by excluding any positions that appear in too many alignments from further consideration for seeds. Specifically, a cumulative count is maintained of the number of times each location occurs in an alignment block. After each query sequence is processed, any locations that have occurred in at least <count> alignment blocks are masked.
Use of this option requires one byte of memory for each target location. The
maximum value allowed for |
|||||||||||||||||||||||||||
Defaults: | | By default both strands are searched, the target is assumed to be different from the query, and dynamic masking is not performed.
If |
||||||||||||||||||||||||||||
Seeding
--seed=12of19 | T=1 or T=2 | Seeds require a 19 bp word with matches in 12 specific positions (1110100110010101111).
--seed=14of22 | T=3 or T=4 | Seeds require a 22 bp word with matches in 14 specific positions (1110101100110010101111).
--seed=match<length> | W=<length> | Seeds require a <length> bp word with matches in all positions.
--seed=half<length> | | Seeds require a <length> bp word with matches or transitions in all positions.
--seed=<pattern> | | Specifies an arbitrary pattern of 1s, 0s, and Ts for seed discovery.
--transition | T=1 or T=3 | In each seed, allow any one match position to be a transition instead.
--transition=2 | | In each seed, allow any two match positions to be transitions instead.
--notransition | T=2 or T=4 | Don't allow any match positions in seeds to be satisfied by transitions.
--step=<offset> | Z=<offset> | Offset between the starting positions of successive target words considered for potential seeds.
--maxwordcount=<limit> | | Words occurring more often than <limit> in the target are not eligible for seeds. Specifically, after the target word table is built, any words exceeding this count are removed from the table.
--twins=[<minsep>..]<maxsep> | | Require two nearby seeds on the same diagonal, separated by a number of bases in the given range. See seeding stage for more information.
--notwins | | Allow single, isolated seeds.
--recoverseeds | | Recover seeds that are lost in hash collisions. This will slow the alignment process considerably, and usually does not improve the results significantly.
--norecoverseeds | | Use a hashing mechanism for seeds that greatly improves memory usage, at the expense of missing some seeds. Note that missing seeds usually does not mean missing alignments, since most alignable regions have many seed hits.
--filter=[<transv>,]<matches> | | Filter the resulting seeds, requiring at least <matches> exact matches and allowing no more than <transv> transversions. Currently, this option is only valid for half-weight seeds.
--nofilter | | Don't filter seeds.
--anchors=<anchor_file> | | Read anchors from a file, instead of discovering them via seeding. This replaces any other options related to seeding or gap-free extension. The entire seeding and gap-free extension stages are skipped, and processing begins with the chaining stage. See anchoring for more information.
--targetcapsule=<file> | | The target seed word position table and the step offset (as well as the target sequence) are read from the specified file.
If this is used,
|
||||||||||||||||||||||||||||
Defaults: | | By default the 12-of-19 seed is used, one transition is allowed, a step of 1 is used, no words are removed from the target word table, twins are not required, hash collisions are not recovered, and the hits are not filtered.
HSPs (Gap-Free Extension)
--gfextend | | Perform gap-free extension of seeds to HSPs (high scoring segment pairs), according to the other options in this section.
--nogfextend | | Skip the gap-free extension stage.
--hspthresh=<score> | K=<score> | Use x-drop extension with the indicated threshold; HSPs scoring lower are discarded.
--hspthresh=top<basecount> | | Use x-drop extension with an adaptive scoring threshold. The scoring threshold is chosen to limit the number of target bases in HSPs to about <basecount>. See the adaptive HSP threshold case study for more information.
--hspthresh=top<percentage>[%] | | Use x-drop extension with an adaptive scoring threshold. The scoring threshold is chosen to limit the number of target bases in HSPs to about <percentage> percent of the target. See the adaptive HSP threshold case study for more information.
--xdrop=<dropoff> | X=<dropoff> | Set the x-drop extension termination threshold. This determines the endpoints of each gap-free segment. The segment ends when a segment scoring worse than this threshold is encountered (i.e., worse than −<dropoff>).
This option is only valid if one of the |
|||||||||||||||||||||||||||
--exact=<length> | | Find HSPs using the exact match extension method with the given length threshold, instead of using the x-drop method.
--entropy | P=1 | Adjust for entropy when qualifying HSPs. Those that score slightly above the HSP threshold are adjusted downward according to the entropy of their nucleotides, and any that then fall below the threshold are discarded.
--entropy=report | P=2 | Adjust for entropy when qualifying HSPs, and report (to stderr) any HSPs that are discarded as a result.
--noentropy | P=0 | Don't adjust for entropy when qualifying HSPs.
Defaults: | | By default seeds are extended to HSPs using x-drop extension, with entropy adjustment.
If
If |
||||||||||||||||||||||||||||
Chaining
--chain | C=1 or C=2 | Perform chaining of HSPs.
--chain=<diag>,<anti> | G=<diag> R=<anti> | Perform chaining with the given penalties for diagonal and anti-diagonal in the dynamic programming matrix.
--nochain | C=0 or C=3 | Skip the chaining stage.
Defaults: | | By default the chaining stage is skipped.
Gapped Extension
--gapped | C=0 or C=2 | Perform gapped extension of anchors.
--nogapped | C=1 or C=3 | Skip the gapped extension and interpolation stages.
--gappedthresh=<score> | L=<score> | Set the threshold for gapped extension; alignments scoring lower than <score> are discarded.
--ydrop=<dropoff> | Y=<dropoff> | Set the gapped extension termination threshold. This determines the local region around each anchor in which we perform gapped extension. Extension (of each anchor) ends when all remaining sub-alignment possibilities (paths in the dynamic programming matrix) score worse than this threshold (i.e., worse than −<dropoff>).
Defaults: | | By default gapped extension is performed.
If
If |
||||||||||||||||||||||||||||
Back-end Filtering
--identity=<min>[..<max>] | | Filter alignments by percent identity, 0 ≤ min ≤ max ≤ 100. Identity is the fraction of aligned bases that are matches, expressed as a percentage. Alignment blocks (or HSPs if gapped extension is not being performed) outside the given range are discarded.
--coverage=<min>[..<max>] | | Filter alignments by the percentage of the shorter sequence that is covered, 0 ≤ min ≤ max ≤ 100. Coverage is the fraction of bases in the shorter sequence that are aligned, expressed as a percentage. Alignment blocks (or HSPs if gapped extension is not being performed) outside the given range are discarded.
--notrivial | | Do not output a trivial self-alignment block if the target and query are identical. This only applies when a query sequence matches the entire target file. This option is not allowed if the target is comprised of multiple sequences.
Note that using |
||||||||||||||||||||||||||||
Defaults: | | By default no back-end filtering is performed and the trivial block is included if the sequences happen to be identical.
Interpolation
--inner=<score> | H=<score> | Perform additional alignment between the gapped alignment blocks, using (presumably) more sensitive alignment parameters. This is only valid if gapped extension is performed. Another complete alignment round (seeding, HSP, chaining, and gapped stages) is performed in the small areas between the alignment blocks found in the main gapped extension stage. Seeding for this alignment requires a 7 bp match with no transitions, and uses a scoring threshold of <score>. See the interpolation case study for more information.
Defaults: | | By default interpolation is not performed.
Scoring
--scores=<score_file> | Q=<file> | Read the substitution and gap scores (and possibly other score-related options) from a file.
--match=<reward>[,<penalty>] | | Set the score values for a match (+<reward>) and mismatch (−<penalty>). When <penalty> is not specified it is the same as <reward>.
Note that specifying |
||||||||||||||||||||||||||||
--gap=[<open>,]<extend> | O=<open> E=<extend> | Set the score penalties for opening and extending a gap. This option is only valid if gapped extension is being performed.
--ambiguousn | | Treat each N in the input sequences as an ambiguous nucleotide. Substitutions with N are scored as zero, instead of using the fill_score value from the score file (which is -100 by default). See Non-ACGT Characters for a more thorough discussion.
--infer[=<control_file>] | | Infer substitution and gap scores from the sequences, then use them to align the sequences. Parameters controlling the inference process are read from the control file.
--inferonly[=<control_file>] | | Infer substitution and gap scores, but don't perform the final alignment (requires --infscores).
--infscores[=<output_file>] | | Report the inferred scores to the specified file (or to stdout).
Defaults: | | By default the HOXD70 substitution scores are used (see [Chiaromonte 2002] for a description of how this scoring matrix was created).
Default gap penalties are determined as follows. If
By default, a run of |
||||||||||||||||||||||||||||
Output
--output=<file> | | Specify a file to write alignments to. If this option is not used, alignments are written to stdout.
--format=<type> | | Specify the output format: lav, lav+text, axt, axt+, maf, maf+, maf-, cigar, rdotplot, text, general, or general:<fields>.
--census[=<output_file>] | c=1 | Count and report how many times each target base aligns, up to 255. Even bases aligning to gaps are counted. Requires one byte of memory for each target location.
For any of the |
|||||||||||||||||||||||||||
--census16[=<output_file>] | | Count and report how many times each target base aligns, up to ≈65 thousand. Requires two bytes of memory for each target location.
--census32[=<output_file>] | | Count and report how many times each target base aligns, up to ≈4 billion. Requires four bytes of memory for each target location.
--nocensus | c=0 | Do not report a census of aligning bases.
--tableonly | | Just write out the target word position table and quit; don't search for seeds.
--tableonly=count | | Just write out the target word count table and quit; don't search for seeds.
--writecapsule=<file> | | Just write out a target capsule file and quit; don't search for seeds. The capsule file contains the target sequence, the target seed word position table, the step offset and other related information.
Defaults: | | By default output is in lav format, neither the word position nor count table is written out, and no census is reported.
Quantum DNA
--ball=<score> | | Set the quantum seeding threshold, the minimum score required of a word to be considered "in" the quantum seeding ball.
--ball=<percentage>% | | Set the minimum score required of a word to be considered "in" the quantum seeding ball, as a percentage of the maximum word score possible.
Defaults: | | The default is to assume that the query is an ordinary DNA sequence, not quantum DNA.
If |
||||||||||||||||||||||||||||
Housekeeping
--traceback=<bytes> | m=<bytes> | Set the amount of memory to allocate (in RAM) for trace-back information during the gapped extension stage. <bytes> may contain a K or M unit suffix if desired (indicating a multiplier of 1,024 or 1,048,576, respectively). For example, --traceback=80.0M is the same as --traceback=83886080.
--word=<bits> | | Set the maximum number of bits for the word hash. Use this to spend less memory (in exchange for more time) and thereby avoid thrashing for heavy seeds.
Defaults: | | The default traceback space is 80.0M, and the default word hash is 28 bits.
Help
--version | | Report the program version and quit.
--help | | List all options.
--help=files | | Describe the syntax for sequence specifiers.
--help=formats | | Describe the available output formats.
--help=shortcuts | | List BLASTZ-compatible shortcuts.
--help=yasra | | List Yasra-specific shortcuts.
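
As a worked example of the unit suffixes accepted by --traceback in the Housekeeping section of the table above, the following Python sketch converts a size string to a byte count. The function name is hypothetical, and it assumes that only the uppercase K and M suffixes mentioned in the table are used; it is not taken from the LASTZ source.

```python
def parse_size(text):
    """Convert a size such as '80.0M', '64K' or '83886080' to a byte count."""
    multipliers = {"K": 1024, "M": 1024 * 1024}
    suffix = text[-1:]
    if suffix in multipliers:
        return int(float(text[:-1]) * multipliers[suffix])
    return int(text)


print(parse_size("80.0M"))  # prints 83886080
```

This reproduces the equivalence stated in the table: 80.0M is 80 × 1,048,576 = 83,886,080 bytes.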
There are several shortcut options to support the Yasra mapping assembler. These provide canned sets of option settings that work well for aligning an assembled reference sequence (as the target) with a set of shotgun reads (as the query). They are selected based on the expected level of identity between the sequences. For example, --yasra90 should be used when we expect 90% identity. The --yasraXXshort options are appropriate when the reads are very short (less than 50 bp).
Option | Equivalent
--yasra98 | T=2 Z=20 --match=1,6 O=8 E=1 Y=20 K=22 L=30 --identity=98
--yasra95 | T=2 Z=20 --match=1,5 O=8 E=1 Y=20 K=22 L=30 --identity=95
--yasra90 | T=2 Z=20 --match=1,5 O=6 E=1 Y=20 K=22 L=30 --identity=90
--yasra85 | T=2 --match=1,2 O=4 E=1 Y=20 K=22 L=30 --identity=85
--yasra75 | T=2 --match=1,1 O=3 E=1 Y=20 K=22 L=30 --identity=75
--yasra95short | T=2 --match=1,7 O=6 E=1 Y=14 K=10 L=14 --identity=95
--yasra85short | T=2 --match=1,3 O=4 E=1 Y=14 K=11 L=14 --identity=85
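
If you want to see, or adjust, exactly which settings a shortcut stands for, the table can be written out as a small lookup. The Python sketch below is purely illustrative: the dictionary and function names are hypothetical, the option strings are copied from the table above, and nothing here claims to reflect how LASTZ expands the shortcuts internally.

```python
# Option sets copied from the Yasra shortcut table; names are hypothetical.
YASRA_SHORTCUTS = {
    "--yasra98": "T=2 Z=20 --match=1,6 O=8 E=1 Y=20 K=22 L=30 --identity=98",
    "--yasra95": "T=2 Z=20 --match=1,5 O=8 E=1 Y=20 K=22 L=30 --identity=95",
    "--yasra90": "T=2 Z=20 --match=1,5 O=6 E=1 Y=20 K=22 L=30 --identity=90",
    "--yasra85": "T=2 --match=1,2 O=4 E=1 Y=20 K=22 L=30 --identity=85",
    "--yasra75": "T=2 --match=1,1 O=3 E=1 Y=20 K=22 L=30 --identity=75",
    "--yasra95short": "T=2 --match=1,7 O=6 E=1 Y=14 K=10 L=14 --identity=95",
    "--yasra85short": "T=2 --match=1,3 O=4 E=1 Y=14 K=11 L=14 --identity=85",
}


def expand_shortcuts(arguments):
    """Replace any Yasra shortcut in an argument list with its equivalent options."""
    expanded = []
    for argument in arguments:
        if argument in YASRA_SHORTCUTS:
            expanded.extend(YASRA_SHORTCUTS[argument].split())
        else:
            expanded.append(argument)
    return expanded


print(" ".join(expand_shortcuts(["lastz", "reference.fa", "reads.fa", "--yasra90"])))
```

The file names reference.fa and reads.fa are placeholders. Writing the options out this way makes it easy to start from a shortcut and then change a single setting, such as the identity threshold.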
A target or query sequence specifier normally just indicates a file to be used in the alignment; however various pre-processing actions can also be specified. These are performed as the sequences are read from the file, and may include selecting a particular sequence and/or subrange, masking, adjusting sequence names, etc.
The format of a sequence specifier is
<file_name>[[<actions>]]*
The <file_name> field is required; the actions list is optional. Note that the <actions> are enclosed in literal square brackets (in addition to the ones that just indicate they are optional), and are a comma-separated list (with no spaces), e.g. [action1,action2,...]. The asterisk indicates that several action lists can be appended; they are treated the same as if they were in a single list.
Note that the actions apply to every sequence in the file. For example, if you include the revcomp action, every sequence in the file will be reverse-complemented. And if you specify a subrange of, say, [100..], you will skip the first 99 bp in every sequence.
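
Because subrange indices begin with 1 and are inclusive, they differ from the zero-based, half-open slices used in many programming languages. The following Python sketch shows the conversion for the usual [<start>]..[<end>] form only; the function names are hypothetical and the other subrange syntaxes are not handled.

```python
def parse_subrange(spec, sequence_length):
    """Convert a 1-based inclusive 'start..end' subrange to Python slice bounds.

    Either side may be omitted, in which case the start or end of the
    sequence is used.
    """
    start_text, end_text = spec.split("..")
    start = int(start_text) if start_text else 1
    end = int(end_text) if end_text else sequence_length
    if not (1 <= start <= end <= sequence_length):
        raise ValueError("subrange %r is outside 1..%d" % (spec, sequence_length))
    return start - 1, end


def apply_subrange(sequence, spec):
    """Return the part of the sequence selected by the subrange."""
    low, high = parse_subrange(spec, len(sequence))
    return sequence[low:high]


print(parse_subrange("100..", 500))        # prints (99, 500)
print(apply_subrange("ACGTACGT", "3..5"))  # prints GTA
```

The first call shows why [100..] skips exactly 99 bases: position 100 is the first base kept, which is index 99 when counting from zero.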
The following actions are supported:
Action | Meaning
<subrange> | Only a subrange of the sequence is processed. The usual form of a subrange is [<start>]..[<end>]. Either <start> or <end> can be omitted, in which case the start or end of the sequence is used. Subrange indices begin with 1 and are inclusive. For example, 201..300 is a 100 bp subrange that skips the first 200 bp in the sequence.
For BLASTZ compatibility, the alternative syntax
Another useful syntax for this is
Yet another useful syntax for this is
Additionally, if a subrange has Note that subrange positions are always measured from the start of the sequence provided in the file, even if the sequence is being reverse complemented. |
revcomp |
The reverse complement of the sequence is used instead of the sequence itself. |
multiple |
The file's sequences are internally treated as a single sequence. This action is required when the target (not the query) consists of multiple sequences. |
subset=<names_file> |
The name of a file containing a list of desired
sequence names; only these sequences will be processed. The names can be
piped in by specifying /dev/stdin as the file. This action is
only valid for FASTA or 2Bit
sequence files.
|
unmask |
Convert any lowercase bases to uppercase. Lowercase bases usually indicate instances of biological repeats, and are excluded from the seeding stage of the alignment process. |
xmask=<mask_file> |
Mask the segments specified in
<mask_file> by replacing them
with X s. See Non-ACGT Characters for
information on how X s affect the alignment.
|
nmask=<mask_file> |
Mask the segments specified in
<mask_file> by replacing them
with N s. See Non-ACGT Characters for
information on how N s affect the alignment.
|
nickname=<name> |
Ignore any sequence names in the input file, instead using
<name> in the output. See Sequence Name
Mangling for more information on how the name used for output is derived.
|
nameparse=full |
Report full sequence names in the output, instead of short names. As described in Sequence Name Mangling, LASTZ normally shortens FASTA and 2Bit sequence names in an attempt to include only the distinguishing core of the name. This action is provided in case LASTZ's choice of names is not helpful. This action is only valid for FASTA or 2Bit sequence files. |
nameparse=alphanum |
Extract the first word from the sequence header line, discarding any directory information and keeping only an alphanumeric string. See Sequence Name Mangling for more information on how the name used for output is derived. This action is only valid for FASTA sequence files. |
nameparse=tag:<marker> |
Use the specified marker to extract a short name from the sequence header
line. For example, nameparse=tag:foo: will look for the string
foo: in the header line, and copy the name from the text
following that, up to the next non-alphanumeric character.
See Sequence Name Mangling for more information on
how the name used for output is derived.
This action is only valid for FASTA or
2Bit sequence files.
|
quantum |
The sequence contains quantum DNA. |
quantum=<code_file> |
The sequence contains quantum DNA corresponding to the specified <code_file>. The code file sets nucleotide probabilities corresponding to the quantum alphabet. This is only used to augment the display of alignments in the text output format. |
In addition to the sequence specifier syntax shown above, LASTZ supports a more complicated syntax. This is to maintain compatibility with BLASTZ and early versions of LASTZ. All of the functionality described here can be performed using the newer syntax above.
The complete format of a sequence specifier is
[<nickname>::]<file_name>[/<select_name>][{<mask_file>}][[<actions>]][-]
As with the simpler syntax, the <file_name>
field is
required; all other fields are optional. The <file_name>
and <actions>
fields have the same meaning as in the simpler
syntax.
<nickname>::
is equivalent to the <name>
field in the nickname=<name>
action.
/<select_name>
is only valid for the
2Bit file format, and only when the file name ends with
".2bit". It specifies a single sequence from the file to use, rather than all
sequences. This is similar to the subset=<names_file>
action, except that here a single sequence name is given instead of a file of
names. Note that the name must match the mangled
sequence name extracted from the file.
{<mask_file>}
is identical to the
xmask=<mask_file>
action.
A -
(minus sign) is equivalent to the revcomp
action
(but if both are included they cancel).
LASTZ typically receives two sequences and possibly a score file or inference control file as inputs, and produces an alignment file as output.
DNA sequences can be provided in FASTA, Nib, or 2Bit format. These sequences
contain a series of A, C, G, T, and N characters in upper or lower case. Lower
case indicates repeat-masked bases, while Ns represent unknown bases
(if the --ambiguousn option is specified).
By default, a run of Ns (or Xs) is used to separate
sequences that have been catenated together for processing.
See Non-ACGT Characters for a discussion of the use of Ns (or Xs).
As an alternative to DNA sequence, quantum DNA using an
abstract alphabet can be provided as the query (but not as the target).
FASTA and 2Bit formats support more than one sequence within the same file.
Files containing multiple sequences can normally only be used as the query file,
not as the target. However, the subset
action allows one or more sequences to be selected from a file, and the
multiple
action allows more than one
sequence to be given as the target.
FASTA format stores DNA sequences as plain text. The first line begins with
a >
followed by the name of the sequence, and all subsequent
lines contain nucleotide characters. The lines can be of any length.
If the file contains multiple sequences, each should start with its own
>
header line.
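A minimal Python sketch of reading this layout follows. It is an illustration only (the name `read_fasta` is hypothetical), and it keeps the full header text rather than applying any of LASTZ's name shortening.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header line closes the previous sequence, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines may be of any length; join them as-is.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)
```

Case is preserved, so lower case (repeat-masked) bases remain distinguishable from upper case ones.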
NCBI FASTA specification
It has become common for suppliers of FASTA files to pack a plethora of additional information into a sequence's header line. This extra information can create difficulties for many sequence processing tools. For example, headers often contain spaces but file formats such as MAF do not allow spaces in sequence names. To compensate for this, LASTZ provides several options for extracting a concise name from sequence headers; see Sequence Name Mangling for details.
Nib format stores a single unnamed DNA sequence, packed as two bases per byte. UCSC Nib specification
2Bit format stores multiple DNA sequences, encoded as four bases per byte with
some additional information describing runs of masked bases or N
s.
UCSC 2Bit specification
Sequence names in 2Bit files have all the same problems as in FASTA files, so Sequence Name Mangling applies to these files as well.
A quantum DNA file describes a single sequence of "quantum" DNA, which uses
an abstract, user-defined alphabet. Each position in the sequence is a byte
with a value in the range 0x01
..0xFF
, which can
represent an ambiguity code, amino acid, or any other meaning you desire.
LASTZ does not try to interpret these in any way; it just aligns them as
abstract symbols corresponding to columns in the scoring matrix. Note that
the value 0x00
is prohibited.
The file itself is stored in a binary format described by the table below. It can be written on either a big-endian or little-endian machine; LASTZ determines the byte order of multi-byte fields by examining the magic number at the start of the file.
File Offset | Data | Meaning |
0x00 |
C4 B4 71 97
—or— 97 71 B4 C4
|
Magic number indicating big-endian byte order.
Magic number indicating little-endian byte order. |
0x04 |
00 00 02 00 |
File conforms to version 2.0 of the Quantum DNA file format. |
0x08 |
00 00 00 14 |
Header length in bytes, including this field through the all-zero field. |
0x0C |
xx xx xx xx |
SOFF : offset (from file start) to data sequence.
|
0x10 |
xx xx xx xx |
NOFF : offset (from file start) to name; 0 indicates no name. |
0x14 |
xx xx xx xx |
SLEN : length of data sequence. |
0x18 |
00 00 00 00 |
Must be zero. |
NOFF |
… | Name: a zero-terminated ASCII string. |
SOFF |
… | Data sequence: a series of SLEN bytes, each of which
is one quantum symbol in the sequence. |
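The table above can be exercised with a short Python sketch that writes and reads the header using the `struct` module. The function names are hypothetical, and placing the name immediately after the header (at offset 0x1C) is an assumption of this sketch rather than a requirement of the format.

```python
import struct

MAGIC = 0xC4B47197      # stored as C4 B4 71 97 when written big-endian
VERSION = 0x00000200    # version 2.0
HEADER_LEN = 0x14       # five 4-byte fields, offsets 0x08 through 0x18

def write_qdna(symbols, name=None, byte_order=">"):
    """Return the bytes of a quantum DNA file holding `symbols`."""
    if 0 in symbols:
        raise ValueError("symbol 0x00 is prohibited")
    name_bytes = b"" if name is None else name.encode("ascii") + b"\x00"
    noff = 0x1C if name is not None else 0      # 0 means "no name"
    soff = 0x1C + len(name_bytes)
    header = struct.pack(byte_order + "7I", MAGIC, VERSION, HEADER_LEN,
                         soff, noff, len(symbols), 0)
    return header + name_bytes + bytes(symbols)

def read_qdna(data):
    """Return (name, symbols) parsed from quantum DNA file bytes."""
    # The magic number tells us which byte order the writer used.
    if data[:4] == bytes.fromhex("C4B47197"):
        byte_order = ">"
    elif data[:4] == bytes.fromhex("9771B4C4"):
        byte_order = "<"
    else:
        raise ValueError("bad magic number")
    _, _, _, soff, noff, slen, _ = struct.unpack_from(byte_order + "7I", data)
    name = None
    if noff != 0:
        end = data.index(b"\x00", noff)         # name is zero-terminated
        name = data[noff:end].decode("ascii")
    return name, data[soff:soff + slen]
```

Round-tripping a short sequence through `write_qdna` and `read_qdna` in both byte orders is a quick way to check a reader against the table.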
A quantum code file defines a mapping from quantum symbols to a vector of
values for the four nucleotides A
, C
, G
, and T
. Usually these indicate the nucleotide probability
distribution for each symbol in the quantum alphabet. However, LASTZ doesn't
interpret the values, and only uses them to augment the display of
alignments in the text output format.
Each line in the file gives the mapping for one symbol. Lines beginning
with a #
are considered to be comments and are ignored, as are
blank lines. Data lines have five columns, separated by whitespace. The first
field contains the symbol, either as a single character or two hexadecimal
digits, while the remaining
four fields contain floating-point values for A
,
C
, G
, and
T
, respectively.
Here is an example.
    # sym p(A|sym) p(C|sym) p(G|sym) p(T|sym)
    01 0.125041 0.080147 0.100723 0.694088
    02 0.111162 0.053299 0.025790 0.809749
    03 0.065313 0.007030 0.004978 0.922679
    ... more rows here ...
    FF 0.209476 0.014365 0.755682 0.020477
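A code file of this shape could be parsed along the following lines. This is a hedged Python sketch; the name `parse_code_file` is hypothetical, and treating a one-character symbol as its character code is an assumption about how the two symbol forms relate.

```python
def parse_code_file(lines):
    """Map each quantum symbol (as an int) to its (A, C, G, T) values."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        fields = line.split()
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns: {line!r}")
        sym = fields[0]
        # One character is taken literally; two characters are hex digits.
        code = ord(sym) if len(sym) == 1 else int(sym, 16)
        table[code] = tuple(float(v) for v in fields[1:])
    return table
```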
This file is used with the subset
action to select particular sequences for processing. It consists
of one sequence name per line. Lines beginning with a #
are
considered to be comments and are ignored, as are blank lines. Only the first
whitespace-delimited word in any
line is read as the name; the rest of the line is ignored.
Note that the names must appear in the same order as they appear in the corresponding sequence file, and must match the mangled name extracted from that file.
A masking file for LASTZ consists of one interval per line, without
sequence names. Lines beginning with a #
are considered to be
comments and are ignored, as are blank lines. Only the first two
whitespace-delimited words
in any line are interpreted as the interval; the rest of the line is ignored.
Each interval describes a region to be masked, and consists of
    <start> <end>
Locations are one-based (a.k.a. origin-one), and inclusive on both ends. Note that if the sequence is reverse complemented (e.g. via the revcomp action), the masking
intervals are relative to the reverse strand.
Here is an example. If the target sequence is hg18.chr1, this would mask the 5' UTRs from several genes. Note that the third column is neither required nor interpreted by LASTZ, and acts as a comment.
    884484 884542 NM_015658
    885830 885936 NM_198317
    891740 891774 NM_032129
    925217 925333 NM_021170
    938742 938816 NM_005101
    945366 945415 NM_198576
    1016787 1016808 NM_001114103
    1017234 1017346 NM_001114103
    1041303 1041486 NM_001114103
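The effect of a masking file on a sequence can be sketched in Python as follows. The function names are hypothetical and this is not LASTZ's implementation; it only illustrates the one-based, inclusive interval convention and the fact that trailing words are ignored.

```python
def parse_mask_intervals(lines):
    """Return a list of (start, end) pairs, one-based and inclusive."""
    intervals = []
    for line in lines:
        words = line.split()
        if not words or words[0].startswith("#"):
            continue                      # blank line or comment
        # Only the first two words matter; anything after is ignored.
        intervals.append((int(words[0]), int(words[1])))
    return intervals

def apply_mask(sequence, intervals, mask_char="X"):
    """Replace every base inside an interval with mask_char."""
    seq = list(sequence)
    for start, end in intervals:
        lo = max(start, 1)
        hi = min(end, len(seq))
        for i in range(lo - 1, hi):       # convert to zero-based indexing
            seq[i] = mask_char
    return "".join(seq)
```

Passing `mask_char="N"` mimics the nmask action instead of xmask.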
An anchor file describes a list of segments representing gap-free alignments.
This is either produced by LASTZ as a result of the
gap-free extension stage,
or supplied by the user via the
--anchors
option, which causes LASTZ to skip the stages up to that point.
Then LASTZ reduces the segments to highest-scoring peaks, and uses these peaks
to guide the gapped extension stage.
The file contains two intervals per line (called a "segment"), one from the
target and one from the query, with sequence names. Lines beginning with a
#
are considered to be comments and are ignored, as are blank
lines. #
can also be used to put comments at the end of lines.
Each line looks like
    <name1> <start1> <end1> <name2> <start2> <end2> <strand> [<score>] [#<comment>]
where <name1>, etc. correspond to the target sequence and <name2>, etc. correspond to the query. Fields are delimited by whitespace.
Locations are one-based and inclusive on both ends (thus the interval "154 228" has length 75 and is preceded by 153 bases in its sequence). Negative strand intervals are measured from the 5' end of the query's negative strand (corresponding to the rightmost end of the given query sequence). All target intervals are on the positive strand. The two intervals must have the same length (since these alignments are gap-free). Segments without scores are given a score of zero.
Query sequence names must appear in the same order as they do in the query file.
For each query sequence, all positive strand intervals must appear before any
negative strand intervals. Sequence names for the target may appear in any
order, and are only meaningful if the
multiple
action is used; otherwise
they are ignored. Intervals with names not found in the target or query are
not allowed. In cases where sequence names are either unknown or of no
importance (e.g. when all sequences in the file have the same name), a
*
can be used as a generic sequence name.
Here is an example.
    R36QBXA37A3EQH 151 225 Q81JBBY19D81JM 14 88 + 6875
    R36QBXA37D4L6V 26 100 Q81JBBY19D81JM 10 84 + 6808
    R36QBXA37EVLNU 19 93 Q81JBBY19D81JM 7 81 + 6842
    R36QBXA37CEBPD 8 81 Q81JBBY19D81JM 9 82 + 7108
    R36QBXA37BLO6X 132 205 Q81JBBY19D81JM 11 84 - 7339
    R36QBXA37A2W3P 162 214 Q81JBBY19D81JM 2 54 - 5024
    R36QBXA37A9395 62 136 Q81JBBY19A323K 18 92 + 7231
    R36QBXA37DNC74 18 82 Q81JBBY19A323K 2 66 + 6418
    R36QBXA37CTR26 83 167 Q81JBBY19ASA7F 19 103 + 8034
    R36QBXA37C2TAC 95 181 Q81JBBY19ASA7F 15 101 + 8272
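A segment line can be checked against the rules above with a few lines of Python. This sketch uses hypothetical names (`Segment`, `parse_segment`) and assumes integer scores; it does not verify the ordering rules for query names and strands.

```python
from collections import namedtuple

Segment = namedtuple(
    "Segment", "name1 start1 end1 name2 start2 end2 strand score")

def parse_segment(line):
    """Parse one segment line; return None for blank or comment-only lines."""
    fields = line.split("#", 1)[0].split()   # drop any trailing comment
    if not fields:
        return None
    if len(fields) not in (7, 8):
        raise ValueError(f"expected 7 or 8 fields: {line!r}")
    name1, start1, end1, name2, start2, end2, strand = fields[:7]
    if strand not in ("+", "-"):
        raise ValueError(f"bad strand: {strand!r}")
    # Segments without a score are given a score of zero.
    score = int(fields[7]) if len(fields) == 8 else 0
    seg = Segment(name1, int(start1), int(end1),
                  name2, int(start2), int(end2), strand, score)
    # Gap-free segments must cover the same length in both sequences.
    if seg.end1 - seg.start1 != seg.end2 - seg.start2:
        raise ValueError(f"intervals differ in length: {line!r}")
    return seg
```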
A target capsule file is, essentially, a memory dump of several internal data structures related to the target sequence and the target seed word position table. At the current time, the authors don't wish to make this file format public.
The score set consists of a substitution matrix and other settings. The
other settings come first and are individually explained in the
table below.
All settings are optional, and most have exact correspondence to command-line
options, and the same defaults (unless otherwise specified in the table).
Command-line settings always override settings in this file. Any line may
end with a comment (#
is the comment character).
In the matrix, rows correspond to characters in the target sequence while
columns correspond to characters in the query. Matrix labels can either be
single characters or two-digit hexadecimal values in the range
01
..FF
(the value 00
is not allowed).
The rows and columns of the matrix need not have the same set of labels, so
for example, a matrix might describe scoring between the 4-letter DNA alphabet
and the 15-letter ambiguity alphabet. Any labels other than
A
, C
,
G
, and T
are treated as
quantum DNA.
Score values can be floating-point if the lastz_D
version of the
executable is used instead of lastz
.
Here is an example:
    # This matches the default scoring set for BLASTZ
    bad_score = X:-1000 # used for sub['X'][*] and sub[*]['X']
    fill_score = -100 # used when sub[*][*] is not defined
    gap_open_penalty = 400
    gap_extend_penalty = 30
         A    C    G    T
    A   91 -114  -31 -123
    C -114  100 -125  -31
    G  -31 -125  100 -114
    T -123  -31 -114   91
BLASTZ score files are also accepted. These only contain a substitution matrix, and row labels must be absent (they are assumed to be the same as the column labels). No other settings are allowed.
       A    C    G    T
      91 -114  -31 -123
    -114  100 -125  -31
     -31 -125  100 -114
    -123  -31 -114   91
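Both layouts can be read by one small routine. The Python below is a sketch only: `parse_score_file` is a hypothetical name, settings are returned as uninterpreted strings, scores are assumed to be integers, and a line is treated as a setting simply because it contains an equals sign.

```python
def parse_score_file(lines):
    """Return (settings, matrix) where matrix[row][col] is a score."""
    settings, matrix = {}, {}
    columns, row_index = None, 0
    for line in lines:
        line = line.split("#", 1)[0].strip()     # strip trailing comments
        if not line:
            continue
        if "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            settings[key] = value
            continue
        fields = line.split()
        if columns is None:
            columns = fields                      # first matrix line: labels
            continue
        if len(fields) == len(columns) + 1:
            row, scores = fields[0], fields[1:]   # row label is present
        elif len(fields) == len(columns):
            # BLASTZ style: row labels are absent and follow column order.
            row, scores = columns[row_index], fields
        else:
            raise ValueError(f"unexpected matrix row: {line!r}")
        matrix[row] = {col: int(s) for col, s in zip(columns, scores)}
        row_index += 1
    return settings, matrix
```

Rows index the target character and columns the query character, matching the description above.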
Keyword | Setting | Meaning |
bad_score |
<score>
<col>:<score>
<row>:<col>:<score>
|
This score is used to fill a single row and column of the scoring matrix, so
that any occurrences of the corresponding character(s) are severely penalized.
The <row> and <col> fields are character codes (as explained in the
section above). If <row> is absent the <col> value is used for
<row> also; if both are absent X is assumed.
The default value is X:-1000 .
There is no corresponding command-line option.
|
fill_score |
<score> |
This is used as a default for all cells of the scoring matrix that are not
otherwise set.
This is the score used for N s (unless --ambiguousn is
specified on the command line).
The default value is |
gap_open_penalty |
<score> |
This is identical to the <open>
field of the --gap=[<open>,]<extend>
command line option.
|
gap_extend_penalty |
<score> |
This is identical to the <extend>
field of the --gap=[<open>,]<extend>
command line option.
|
hsp_threshold |
<score> |
This is identical to the --hspthresh command line option.
|
gapped_threshold |
<score> |
This is identical to the --gappedthresh command line option.
|
x_drop |
<score> |
This is identical to the --xdrop command line option.
|
y_drop |
<score> |
This is identical to the --ydrop command line option.
|
ball |
<score>
<percentage>% |
This is identical to the --ball command line option.
|
step |
<offset> |
This is identical to the --step command line option.
|
seed |
<strategy> |
This corresponds to the --seed and --transition
command line options. <strategy> must be one of the
following, with no spaces:
12of19,transition
12of19,notransition
14of22,transition
14of22,notransition
|
When LASTZ is asked to infer substitution and gap scores from the input sequences, this file is used to set parameters that control the inference process.
Here is an example:
    # base the inference on alignments in the middle 50 percentile
    # by percent identity
    min_identity = 25.0% # 25th percentile
    max_identity = 75.0% # 75th percentile

    # scale scores so max substitution score will be 100, and only
    # use alignments scoring no worse than 20 substitutions
    inference_scale = 100 # score for max substitution
    hsp_threshold = 20*inference_scale
    gapped_threshold = hsp_threshold

    # allow substitution score inference to iterate at most
    # 20 times; don't perform gap score inference-- instead
    # hardwire gap scores relative to max substitution
    max_sub_iterations = 20
    max_gap_iterations = 0
    gap_open_penalty = 4*inference_scale
    gap_extend_penalty = 0.3*inference_scale

    # use all seedword positions (don't sample)
    step = 1

    # adjust for entropy when qualifying HSPs
    entropy = on
min_identity
and max_identity
specify the range of
sequence identity upon which inference is based. Only alignment blocks within
this range contribute to the inference. If the value ends with a percent sign,
the range is a percentile of the values found in the overall alignment;
otherwise it is a fixed percent identity value. For example,
min_identity=70
and max_identity=90
indicates that
blocks with identity ranging from 70 to 90 percent will be used, while
min_identity=25%
and max_identity=75%
indicates that
half of the blocks will be used (the middle 50 percentile).
inference_scale
specifies a value for the largest substitution
score (i.e. the score for the best match). All other scores are scaled
accordingly, in a linear fashion. If this is set to none
, the
scores will be log-odds using base 2 logarithms.
hsp_threshold
and gapped_threshold
correspond to
the command line --hspthresh
and --gappedthresh
options (also known as K
and L
in BLASTZ parlance).
max_sub_iterations
and max_gap_iterations
specify
limits on the number of inference iterations that will be performed. For
example, if you only want a substitution scoring matrix, you can set
max_gap_iterations=0
.
gap_open_penalty
and gap_extend_penalty
correspond to
the command line --gap=[<open>,]<extend>
option (also
known as O
and E
in BLASTZ parlance). These are used
for the first iteration of gap-scoring inference.
step
corresponds to the
command line --step
option (also known as Z
in BLASTZ
parlance). A large step, e.g. step=100
, could potentially speed up
the inference process. Ideally, this would base the inference on a sample of
only one percent of the whole. However, the sample actually ends up larger
than that and is biased toward HSPs that are either longer or have a lower
substitution rate. This happens because sampling occurs at the seed level,
and such HSPs generally have more seeds. Future versions of LASTZ may
include a means to compensate for this bias.
entropy
corresponds to the command line --entropy
option (also known as P
in BLASTZ parlance). Legal values are
on
or off
. If on, sequence entropy is incorporated
when filtering HSPs.
Note that these parameters apply to the inference process only. If the corresponding command line options are also set, those will apply for the final, "real" alignment stages (overriding the inferred scores if there is a conflict), but will not affect the inference itself.
LAV is the format produced by BLASTZ, and is the default. It reports the alignment blocks grouped by "contig" and strand, and describes them by listing the coordinates of gap-free segments. This format is compact because it does not include the nucleotides, but consequently interpretation usually requires access to the original sequence files, and it is not easy for humans to read. PSU LAV specification
The option --format=lav+text
adds textual output for each
alignment block (in the same format as the --format=text
option),
intermixed with the LAV format. Such files are unlikely to be
recognized by any LAV-reading program.
AXT is a pairwise alignment format popular at UCSC and PSU. UCSC AXT specification
The option --format=axt+
reports additional statistics with each
block, in the form of comments. The exact content of these comment lines may
change in future releases of LASTZ.
MAF is a multiple alignment format developed at UCSC. The MAF files produced by LASTZ have exactly two sequences per block: the first row always comes from the target sequence, and the second from the query. UCSC MAF specification
The option --format=maf+
reports additional statistics with each
block, in the form of comments. The exact content of these comment lines may
change in future releases of LASTZ.
The option --format=maf-
suppresses the MAF header and any
comments. This makes it suitable for concatenating output from multiple runs.
CIGAR is a pairwise alignment format that describes alignment blocks in a run-length format. Ensembl CIGAR specification
For the R dotplot output format, LASTZ writes the alignment blocks in a format that can easily be plotted using the plot command in the R statistical package. Alignments are reduced to a series of ungapped segments, and each is written as three lines as shown below.
    <target_name> <query_name>
    <segment1_target_start1> <segment1_query_start2>
    <segment1_target_end1> <segment1_query_end2>
    NA NA
    <segment2_target_start1> <segment2_query_start2>
    <segment2_target_end1> <segment2_query_end2>
    NA NA
    ...
This file can then be plotted in R with these commands:
    dots = read.table("your_file",header=T)
    plot(dots,type="l")
Human-Readable Text (alignment output)
This textual output is intended to be read by people rather than programs. Each alignment block is displayed with gap characters and a row of match/transition characters, and lines are wrapped at a reasonable width to allow printing to paper. The exact format of this output may change in future releases of LASTZ, so programs are better off reading more stable formats like LAV, AXT, or MAF.
General Output (alignment output)
The General format is a tab-delimited table with one line per alignment block and configurable columns. This format is well-suited for use with spreadsheets and the R statistical package.
The format for this option is:
    --format=general[:<fields>]
where <fields> is a comma-separated list of field names in
the desired order, with no spaces. If this list is absent, the following
fields are printed, in this order:
score, name1, strand1, size1, zstart1, end1, name2, strand2, size2, zstart2, end2, identity, coverage.
The recognized field names are shown below. Positions (start and end fields)
are counted from the 5' end of the aligning strand,
unless otherwise indicated in the table.
Field | Meaning |
score | Alignment's score. |
name1 | Name of the target sequence. |
strand1 | Target sequence strand. |
size1 | Size of the entire target sequence. |
start1 | Alignment starting position in the target, origin-one. |
zstart1 | Alignment starting position in the target, origin-zero. |
end1 | Alignment ending position in the target. |
length1 | Length of alignment in the target. |
text1 | Aligned characters in the target, including gap characters. |
name2 | Name of the query sequence. |
strand2 | Query sequence strand. |
size2 | Size of the entire query sequence. |
start2 | Alignment starting position in the query, origin-one. |
zstart2 | Alignment starting position in the query, origin-zero. |
end2 | Alignment ending position in the query. |
start2+ | Alignment starting position in the query, along positive strand (regardless of query sequence strand), origin-one. |
zstart2+ | Alignment starting position in the query, along positive strand (regardless of query sequence strand), origin-zero. |
end2+ | Alignment ending position in the query, along positive strand (regardless of query sequence strand). |
length2 | Length of alignment in the query. |
text2 | Aligned characters in the query, including gap characters. |
diff |
Differences between what would be written
for text1 and
text2 . Identical bases are
written as . , transitions
as : , transversions as
X , and gaps as
- . |
cigar | CIGAR-type format of alignment text. |
identity |
The fraction of aligned bases
that are matches. This is written as
two fields. The first field is a
fraction, written as
<n>/<d> .
The second field contains the same
value, computed as a percentage. |
coverage |
The fraction of the shorter sequence
covered by the alignment. This is
written as two fields. The first field
is a fraction,
written as <n>/<d> .
The second field contains the same value,
computed as a percentage. |
gaprate |
The rate of gaps (also called indels) in
the alignment. This is written as two
fields. The first field is a fraction,
written as <n>/<d> ,
with the denominator being the number of
non-gapped aligned base pairs (which is
the same in both sequences) and the
numerator being the number of bases
aligned to gaps. The second field
contains the same value, computed as a
percentage. |
diagonal | The diagonal of the start of the alignment
in the dynamic programming matrix,
start1-start2 . |
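Because identity, coverage and gaprate each occupy two columns, a reader has to account for them when mapping columns to field names. The Python sketch below shows one way to do that; the function name and the sample values in the usage note are hypothetical, and values are left as strings.

```python
DEFAULT_FIELDS = ("score", "name1", "strand1", "size1", "zstart1", "end1",
                  "name2", "strand2", "size2", "zstart2", "end2",
                  "identity", "coverage")
TWO_COLUMN_FIELDS = {"identity", "coverage", "gaprate"}

def parse_general_line(line, fields=DEFAULT_FIELDS):
    """Map one tab-delimited line of General output to a dict by field name."""
    columns = line.rstrip("\n").split("\t")
    record, i = {}, 0
    for field in fields:
        if field in TWO_COLUMN_FIELDS:
            # Written as a fraction "<n>/<d>" followed by a percentage.
            record[field] = (columns[i], columns[i + 1])
            i += 2
        else:
            record[field] = columns[i]
            i += 1
    if i != len(columns):
        raise ValueError("column count does not match the field list")
    return record
```

With the default field list, a line therefore has 15 columns: 11 single-column fields plus two columns each for identity and coverage.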
LASTZ includes support for other output formats which are intended mainly for the convenience of the developers. If you have specific questions, please contact us.
The target sequence (or sequences) is read and kept in RAM throughout the run of the program. Actions such as masking, unmasking or reverse complement are applied. Query sequences are not read until just before the seeding stage. Queries are processed one at a time, in order. For each query, the seeding through output stages are performed, comparing the query to the target. Then the same stages are performed to compare the reverse complement of the query to the target.
This scoring inference is not normally performed. As described in Inferring Score Sets, LASTZ can repeatedly perform the complete alignment process on the target and query to derive a suitable scoring set. This will usually be too time-consuming to perform individually for every pair of sequences being aligned. The typical application is to use it once on some sample sequences from the species being compared, save the score file, then use that score file for subsequent alignments.
A preprocessing pass parses the target sequence(s) into overlapping seed words of some constant length. Each word is converted to a number, called the packed seed word, according to the seed pattern (discussed in more detail in the Seeds section). These (seed, position) pairs are collected into the seed word position table. Conceptually, the table is a mapping from a packed seed word to a list of the target sequence positions where that seed word occurs.
The table is one of the major space requirements of the program. Both time and
memory required for seeding can be decreased by using sparse spacing.
--step=<offset>
sets a step size. Instead of storing
a seed word for every position, positions are stored only for multiples of the
step size. Large step sizes (e.g. --step=100
) incur a loss of
sensitivity, at least at the level of seeds. However, to discover any
gapped alignment block we only need to discover one seed (of many) in that
alignment, so the actual sensitivity loss is small in most cases. Section 6.2
of [Harris 2007] discusses some experimental results
of the effect of the step size on the end result.
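The idea of the position table and the step size can be sketched in Python. This is an illustration under simplifying assumptions, not LASTZ's data structure: it uses an exact-match (space-free) seed, packs two bits per base, uses zero-based positions, and simply skips any word containing a character other than upper case A, C, G or T. The names are hypothetical.

```python
from collections import defaultdict

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_word(word):
    """Pack a DNA word into an integer, two bits per base (None if unusable)."""
    packed = 0
    for base in word:
        if base not in BASE_CODE:
            return None
        packed = (packed << 2) | BASE_CODE[base]
    return packed

def build_position_table(target, word_len=12, step=1):
    """Map packed seed word -> list of target positions where it starts."""
    table = defaultdict(list)
    # Only positions that are multiples of the step size are stored.
    for pos in range(0, len(target) - word_len + 1, step):
        packed = pack_word(target[pos:pos + word_len])
        if packed is not None:
            table[packed].append(pos)
    return table
```

With `step=100` the table holds roughly one percent of the entries that `step=1` would, which is where the time and memory savings come from.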
The presence of biological repeats in the target and query can also be addressed
during the building of the index table. A large number of repeats can adversely
affect the speed of the program, by increasing the number of false alignments
the program considers in the early stages. LASTZ has three techniques for
dealing with repeats.
First, bases in the target and query sequences can be marked as repeats, by
using lower case. Seed words containing lower case bases are left out of the
seed word position table, so they do not participate in the seeding stage.
Second, if repeat locations are not known, the option
--maxwordcount=<limit>
can be used to remove highly
occurring seed words from the table. Third, dynamic masking (--masking=<count>
)
can be used to mask target positions that occur in many alignments.
Seeds are short near-matches between target and query sequences. "Short" typically means less than 20 bp. Early alignment programs used exact matches (e.g. of length 12) as seeds, but more recent programs have used spaced seeds. This is described in more detail in the Seeds section. For the purposes of this section, a seed can be thought of as a 12-mer exact match.
To locate seeds, the query sequence is parsed into seed words the same way the target is. Each packed seed word is used as an index into the seed word position table to find the target positions that have a seed match for this query position. Lower case bases do not participate in the seeding stage; query seed words containing lower case bases are not used, so repeats do not give rise to seeds.
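The lookup can be sketched in Python as follows. As a simplification this sketch keys the table on the word text itself rather than a packed number, uses exact-match seeds and zero-based positions, and the names are hypothetical.

```python
from collections import defaultdict

def is_usable(word):
    """A word takes part in seeding only if it is all upper case ACGT."""
    return all(base in "ACGT" for base in word)

def find_seed_hits(target, query, word_len=12):
    """Return (target_pos, query_pos) pairs where the seed words match."""
    table = defaultdict(list)
    for t in range(len(target) - word_len + 1):
        word = target[t:t + word_len]
        if is_usable(word):
            table[word].append(t)
    hits = []
    for q in range(len(query) - word_len + 1):
        word = query[q:q + word_len]
        if is_usable(word):
            # Every target position stored for this word is a seed hit.
            hits.extend((t, q) for t in table.get(word, ()))
    return hits
```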
For quantum alignment it is not possible to do a direct lookup into the seed
word position table. The position table is for DNA-words (words consisting of
A
, C
,
G
and T
), whereas the query
consists of symbols from an arbitrary alphabet. The quantum sequence is parsed
into seed words as before, but instead of a direct lookup, each word, called a
q-word, is first converted to a quantum seeding ball of those
DNA-words that are most similar to it. Similarity is determined by the scoring
matrix; all words with a combined substitution score above the quantum seeding
threshold (--ball=<score>
) are considered to be
in the ball. Then each word in the ball is looked up in the position table,
with all such words considered to be seed matches for the q-word.
The ball scoring threshold can also be set as a percentage of the maximum word
score possible. This is the weight of the seed multiplied by the maximum
substitution score. If a space-free seed is used, the weight is the same as the
length. If a spaced seed is used the weight is ≈ the number of
1
positions.
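For a short word, the seeding ball can be enumerated by brute force, as in the Python sketch below. This is only meant to illustrate the definition: the names are hypothetical, the matrix layout `score[dna_base][quantum_symbol]` follows the row/column convention of the score file, and treating a word whose score equals the threshold as inside the ball is an assumption.

```python
from itertools import product

def seeding_ball(q_word, score, threshold):
    """Return the DNA words whose summed score against q_word meets threshold."""
    ball = []
    for letters in product("ACGT", repeat=len(q_word)):
        total = sum(score[base][sym] for base, sym in zip(letters, q_word))
        if total >= threshold:
            ball.append("".join(letters))
    return ball

def threshold_from_percent(percent, weight, score):
    """Convert a percentage into a score threshold.

    The maximum possible word score is the seed weight multiplied by the
    largest substitution score in the matrix.
    """
    max_sub = max(max(row.values()) for row in score.values())
    return percent / 100.0 * weight * max_sub
```

Enumerating all 4^n DNA words is only practical for small n; it is used here for clarity, not as a suggestion of how the ball is actually computed.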
As each seed is found, it is extended without allowing gaps to determine whether it is part of a high-scoring segment pair (HSP). The seed is extended along its DP-matrix diagonal in both directions according to an extension rule, either exact match or x-drop.
Exact match extension (--exact=<length>
) simply extends the
seed until a mismatch is found. If the resulting length is long enough, the
extended seed is kept as an HSP for further processing. Exact match extension
is most useful when target and query are expected to be very similar, e.g. when
aligning short reads to a similar reference genome.
In x-drop extension
(--xdrop=<dropoff>
), as we extend in each direction, we keep
a running score of the extended match according to the substitution scoring
matrix. The extension is stopped when the score drops off more than the given
threshold, that is, when the difference between the maximum score and the
current score is more than that threshold. The extension is then trimmed to
the highest scoring point. If the combined score of the seed plus extensions
meets the ungapped alignment score threshold (K) it is an HSP and is kept for
further processing. Matches that do not meet the score threshold are discarded.
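The x-drop rule can be written out as a short Python sketch. It is an illustration of the description above rather than LASTZ's code; the function names, the zero-based positions, and the `score[target_base][query_base]` matrix layout are assumptions.

```python
def extend_one_side(target, query, t, q, step, score, xdrop):
    """Extend from (t, q) in direction step (+1 or -1).

    Returns (best_score, best_length): the extension trimmed back to its
    highest scoring point.
    """
    running = best = 0
    length = best_len = 0
    while 0 <= t < len(target) and 0 <= q < len(query):
        running += score[target[t]][query[q]]
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > xdrop:
            break                     # dropped too far below the maximum
        t += step
        q += step
    return best, best_len

def xdrop_extend(target, query, t_start, q_start, seed_len, score, xdrop):
    """Extend a seed both ways; return (t_start, q_start, length, score)."""
    seed_score = sum(score[target[t_start + i]][query[q_start + i]]
                     for i in range(seed_len))
    right_score, right_len = extend_one_side(
        target, query, t_start + seed_len, q_start + seed_len, 1, score, xdrop)
    left_score, left_len = extend_one_side(
        target, query, t_start - 1, q_start - 1, -1, score, xdrop)
    return (t_start - left_len, q_start - left_len,
            left_len + seed_len + right_len,
            left_score + seed_score + right_score)
```

The caller would then compare the returned score against the ungapped threshold (K) to decide whether the extended seed is kept as an HSP.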
An additional filtering step eliminates hits with low entropy.
Diagonal Hashing
LASTZ includes a time and space optimization that deals with multiple seeds in the same HSP. The number of seeds in an HSP is generally proportional to both the length of the HSP and the similarity of the sequences being compared. For long HSPs or very similar sequences, performing extension over and over for many seeds in the same HSP would adversely affect the run time. To prevent this, LASTZ maintains a diagonal extent table that tracks the latest seed extension on each diagonal. As new seeds "arrive", if they overlap an earlier extension, they are simply ignored. While this saves time, a direct implementation could require a lot of space. For two human chromosomes of size 250M bp, the DP matrix has 500 million diagonals, and storing one position for each diagonal (at four bytes per position) would require 2G bytes. To save space, LASTZ hashes diagonals to 16-bit values and tracks extension only by the hash value. While this saves space, it results in a minuscule loss of sensitivity: LASTZ may miss some seeds due to hash collisions. Using --recoverseeds will recover these lost seeds, but will slow the program significantly. Moreover, since most true alignments contain many HSPs, with many seeds in each HSP, the vast majority of lost seeds have no effect on the final results.
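The trade-off can be illustrated with a small Python sketch of a hashed extent table. The class name is hypothetical, and using the low 16 bits of the diagonal as the hash is an assumption made for illustration; the text above does not specify the hash function LASTZ uses.

```python
class DiagonalExtents:
    """Track the furthest extension seen per hashed diagonal."""

    def __init__(self, bits=16):
        self.mask = (1 << bits) - 1
        # One entry per hash value instead of one per diagonal.
        self.extent = [0] * (1 << bits)

    def _slot(self, t_pos, q_pos):
        return (t_pos - q_pos) & self.mask

    def is_covered(self, t_pos, q_pos):
        """True if a seed here would be ignored as already extended."""
        return t_pos < self.extent[self._slot(t_pos, q_pos)]

    def record(self, t_pos, q_pos, t_end):
        """Remember that an extension on this diagonal reached t_end."""
        slot = self._slot(t_pos, q_pos)
        self.extent[slot] = max(self.extent[slot], t_end)
```

Two diagonals that differ by a multiple of 65536 share a slot in this sketch, so a seed on one can be wrongly treated as covered by an extension on the other; that is the kind of lost seed the paragraph above describes.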
The chaining stage finds the highest scoring series of HSPs in which each HSP begins strictly before the start of the next. All HSPs not on this chain are discarded. This is useful when elements are known to be in the same relative order in the query as in the target. Figure 4(a) shows an alignment without chaining. Figure 4(b) shows the same alignment with chaining.
Figure 4(a): lastz target query --nochain
Figure 4(b): lastz target query --chain
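A simplified version of chaining can be written as a quadratic-time dynamic program. The sketch below is an assumption-laden illustration: it represents each HSP only by its start positions and score, requires both starts to increase strictly along the chain, and applies no penalty for the space between HSPs, which the real chaining stage may do.

```python
def chain_hsps(hsps):
    """hsps: list of (target_start, query_start, score) tuples.

    Returns the highest scoring chain, as a list in target order.
    """
    order = sorted(hsps)
    if not order:
        return []
    best = [h[2] for h in order]   # best chain score ending at each HSP
    prev = [None] * len(order)     # predecessor of each HSP in that chain
    for i, (ti, qi, si) in enumerate(order):
        for j in range(i):
            tj, qj, _ = order[j]
            if tj < ti and qj < qi and best[j] + si > best[i]:
                best[i], prev[i] = best[j] + si, j
    i = max(range(len(order)), key=best.__getitem__)
    chain = []
    while i is not None:
        chain.append(order[i])
        i = prev[i]
    return chain[::-1]
```

HSPs that are out of order relative to the chain (for example, an inverted or translocated match) are dropped, which is why chaining is only appropriate when the elements are expected to be collinear.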
Figure 5(a) demonstrates the relationship of seeds, HSPs and anchors. Heavy lines are seeds. Seeds are extended to create HSPs (thin lines). Seeds with no HSP shown (gray lines) had low scoring extensions. Blue dots are anchors.
Figure 5(a)
Figure 5(b)
The anchors are then processed in order of the score of their HSP (highest score first). One-sided extension is performed in both directions from the anchor point, the two resulting alignments are joined at the anchor, and if the score meets the gapped alignment score threshold it becomes an alignment in the output file. One-sided extension is computed using a typical dynamic-programming recurrence for affine gap alignment (e.g. [Myers 1989] or [Gusfield 1997]), beginning at the anchor and ending at the highest scoring point. The portion of the DP matrix considered is reduced by disallowing low-scoring segments (see [Zhang 1998]); wherever the score drops further below the maximum than the y-drop threshold, the DP matrix is truncated and no further cells are computed along that row.
Figure 5(b) shows the relationship of anchors and gapped extension. Anchors (blue dots) are single points. These are extended to form gapped alignments. The anchor shown without an extension had low scoring extensions which were discarded.
Figure 6 shows the operation of y-drop extension in more detail. Extension is performed in two directions from an anchor (in this example, to the upper right and lower left). The gray region is the portion of the DP matrix explored by extension. The boundary of the region consists of the points where the score has dropped from the maximum by more than the y-drop threshold.
Figure 6
Whatever alignment blocks have made it through the above gauntlet are then
subjected to identity and coverage filtering (as specified by
--identity=<min>[..<max>]
or
--coverage=<min>[..<max>]
). Blocks that do not meet
the specified range for each feature are discarded.
Identity is the fraction of aligned bases (excluding gaps) that are matches. The numerator is the number of matches in the alignment. The denominator is the number of matches plus the number of mismatches. Gaps play no part in the computation of identity.
Coverage is the fraction of bases in the shorter sequence that are aligned (excluding gaps). The numerator is the number of bases in the alignment (matches plus mismatches). The denominator is the length of the shorter sequence. Gaps play no part in the computation of coverage.
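The two definitions can be made concrete with a short sketch. It assumes the alignment block is given as two equal-length rows with '-' marking gaps; the function name and this representation are hypothetical, not a LASTZ output format.

```python
def identity_and_coverage(target_row, query_row, target_len, query_len):
    """Return (identity, coverage) for one alignment block.

    target_len and query_len are the full sequence lengths, not the
    lengths of the aligned rows.
    """
    matches = mismatches = 0
    for t, q in zip(target_row, query_row):
        if t == "-" or q == "-":
            continue  # gap columns play no part in either measure
        if t == q:
            matches += 1
        else:
            mismatches += 1
    aligned = matches + mismatches
    identity = matches / aligned if aligned else 0.0
    coverage = aligned / min(target_len, query_len)
    return identity, coverage
```

For example, a block with 7 matches, 1 mismatch and 2 gap columns, where the shorter sequence is 16 bases long, has identity 7/8 and coverage 8/16.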
Once the above stages have been performed, it is not uncommon to have regions
leftover in which no alignment has been found. In the interpolation stage
(activated by --inner=<score>
) we repeat the seeding through
gapped extension stages, in these leftover regions, at a higher sensitivity.
We use a seed of lower weight (7 bp exact match) and a lower scoring threshold.
Using such high sensitivity from the outset would be computationally
prohibitive (due to the excessive number of false, low-scoring matches), but is
feasible on the smaller, leftover regions.
Figure 7 shows the operation in more detail. The alignment blocks resulting from gapped extension are shown in 7(a) as squiggly lines. After interpolation, in 7(b), additional alignment blocks have been discovered, shown as dashed squiggly lines.
Figure 7(a): before interpolation
Figure 7(b): after interpolation
Whatever alignment blocks were found are written to
stdout
in whatever format has been chosen. There is
no particular order to the alignment blocks (e.g. they are not sorted
by score or position).
The biological research community has established several competing standards
describing intervals on a strand of DNA. Different programs often use different
standards. Since LASTZ
supports several input and output formats,
it is inevitable that it uses more than one way of describing an interval. We
describe the different ways here.
For the following examples, we will assume we have a 50 nucleotide strand of DNA consisting of
    5' >>> CGACCTTACGATTACCTACTTAACACGTAAACTGAGGGATCAAAAGGAAA >>> 3'
Note that since this is DNA it is natural to consider that it has 5' and 3' ends. Consider the subsequence
ATTACCTA
so we can
discuss how to describe the interval it occupies. There are two commonly used
ways to do this. Both count from 5' to 3' (left to right). One way,
origin-one, starts counting from one. The other way, origin-zero,
starts counting from zero. So in origin-one, ATTACCTA
begins at position 11, while in origin-zero it begins at position 10.
To describe the ending position, there are also two commonly-used ways. One
way is closed, in which the position of the last nucleotide is given.
The other is half-open, in which the position following the last
nucleotide is given. In origin-one, closed, ATTACCTA
is
said to be at the interval (11,18). In origin-zero, half-open,
ATTACCTA
is said to be at the interval (10,18). Notice that only
the first number changes between these two paradigms; the second number is
the same.
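The relationship between the two conventions is simple enough to state as code. The helper names below are hypothetical. Python slices happen to use origin-zero, half-open intervals, which makes the check easy.

```python
def one_closed_to_zero_half_open(start, end):
    """(11, 18) in origin-one, closed becomes (10, 18)."""
    return start - 1, end

def zero_half_open_to_one_closed(start, end):
    """(10, 18) in origin-zero, half-open becomes (11, 18)."""
    return start + 1, end

seq = "CGACCTTACGATTACCTACTTAACACGTAAACTGAGGGATCAAAAGGAAA"
start, end = one_closed_to_zero_half_open(11, 18)
subsequence = seq[start:end]   # Python slicing is origin-zero, half-open
```

In both conventions the interval length is easy to recover: end - start + 1 for origin-one, closed, and end - start for origin-zero, half-open.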
We also inherently think of this as being double stranded DNA, which would look like this:
    complement strand: 3' <<< GCTGGAATGCTAATGGATGAATTGTGCATTTGACTCCCTAGTTTTCCTTT <<< 5'
    forward strand:    5' >>> CGACCTTACGATTACCTACTTAACACGTAAACTGAGGGATCAAAAGGAAA >>> 3'
In some cases it makes sense to refer to the interval along the complement strand. For example, if the above sequence was a query and the target contained
TAGGTAAT
, how should the query position of an alignment of those
two be described? One way would be to still refer to the interval along the
forward strand (which we also call the plus or positive strand),
and just indicate that it was on the reverse-complement. We call this
"counting along the forward strand." Another way is to count from the other
end, from the 5' end of the complement strand (which we also call the
reverse, minus or negative strand). We call this
"counting along the reverse strand," and for clarity we might add "from its 5'
end." In the example, if we were using origin-one, closed counting, we
would say that TAGGTAAT
occurs at (33,40) along the complement
strand.
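Converting between the two ways of counting only requires the sequence length. The sketch below uses origin-one, closed intervals, as in the example; the function names are hypothetical.

```python
def forward_to_reverse(start, end, seq_len):
    """Convert an origin-one, closed interval counted along the forward
    strand into the same interval counted along the reverse strand from
    its 5' end. Applying the function twice returns the original interval.
    """
    return seq_len - end + 1, seq_len - start + 1

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

seq = "CGACCTTACGATTACCTACTTAACACGTAAACTGAGGGATCAAAAGGAAA"
rev_start, rev_end = forward_to_reverse(11, 18, len(seq))
# Slice the reverse strand using the origin-zero, half-open equivalent.
on_reverse = reverse_complement(seq)[rev_start - 1:rev_end]
```

Here the forward interval (11,18) becomes (33,40), and the bases found there on the reverse strand are TAGGTAAT.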
Often the names in the input sequence files are inconvenient for downstream processing, or create problems with certain output formats. This process is further complicated by the fact that some input formats (most notably nib) do not contain sequence names, so in those cases a name must be derived from the filename. LASTZ provides several choices for how it should derive names for the input sequences.
Internally, LASTZ handles this in two stages. First, it creates a full header for the sequence. If the input format provides a name or header, this becomes the full header. Otherwise, the full header is constructed from the file name.
In the second stage, LASTZ shortens the full header to a name.
By default, LASTZ uses the first word (of anything other than whitespace,
vertical bar, or colon)
as the sequence name. Since this might possibly be a file name and contain a
long path prefix, any path prefix is removed, and commonly used file
extension suffixes are also removed
(.nib
, .2bit
,
.fa
and .fasta
).
Thus the name
~someuser/human/hg18/chr1.nib
is shortened to
chr1
.
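The default shortening rule can be approximated in a few lines. This is a sketch of the rule as described here, not the LASTZ code; the function name is hypothetical, and the header is assumed to be passed without the leading ">" of a FASTA header line.

```python
import re

def default_name(full_header):
    """Shorten a full header to a sequence name using the default rule."""
    # First word: everything up to whitespace, a vertical bar, or a colon.
    word = re.split(r"[\s|:]", full_header.strip(), maxsplit=1)[0]
    # Remove any path prefix.
    word = word.rsplit("/", 1)[-1]
    # Remove one commonly used file extension suffix, if present.
    for ext in (".nib", ".2bit", ".fasta", ".fa"):
        if word.endswith(ext):
            return word[:-len(ext)]
    return word
```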
The action
[nameparse=alphanum]
changes how the first word is
determined. It is terminated by any character other than alphabetic, numeric or
underscore. Path prefixes are still removed, but file extension suffixes are
not.
This default shortening is often adequate. For example, consider the following
FASTA file. By default, the names would be 000007_3133_3729
and
000015_3231_1315
.
    >000007_3133_3729 length=142 uaccno=FX9DQEU13H5YZN
    ACCCGAAAGAGAAACAGCTTCCCCCCCTGTCCCGAGGGATATCAAGTAGTTTGTTGGCTA
    GGCTGATATTGGGGCCTTCCGCTAGAGTCGGCGCCCGCGCCTACGAGTCCCCCCCACCCC
    CCACCCCCACAGCGGGTTATCC
    >000015_3231_1315 length=190 uaccno=FX9DQEU13HUTXE
    TTGTTGAGTCGGATGAGAATAGCAAGTGCAGTCAACGGCAATGTGCTGGGTTAGTACAAC
    ...
In that example, however, the user may find it more convenient to use the accession numbers. To accomplish this, she can use the file action
[nameparse=tag:uaccno=]
. LASTZ will find the tag
string uaccno=
in each header, and read the name
from the characters that follow it, up to the first character that is
not alphabetic, numeric or an underscore. In this case the sequence names would
be FX9DQEU13H5YZN
and FX9DQEU13HUTXE
.
Now consider this FASTA file:
    >gi|197102135|ref|NM_001133512.1| Pongo abelii ...
    GCGCGCGTCTCCGTCAGTGTACCTTCTAGTCCCGCCATGGCCGCTCTCACCCGGGACCCC
    CAGTTCCAGAAGCTGCAGCAATGGTACCGCGAGCACGGCTCCGAGCTGAACCTGCGCCGC
    ...
    >gi|169213872|ref|XM_001716177.1| PREDICTED: Homo sapiens ...
    ATGTCTGAGGAGGTAGGATTTGATGCAGGAGGGAGGATCTGGTGCACTTATAAGGATCTG
    GGTCTGTCAGTGTCAGAGAAGGTAGGATCTGGCCCTGGTATGAGGATCTGGATCTGTCAG
    ...
    >gi|34784771|gb|BC006342.2| Homo sapiens ...
    GGGTGGGAGGACGCTACTCGCTGTGGTCGGCCATCGGACTCTCCATTGCCCTGCACGTGG
    GTTTTGACAACTTCGAGCAGCTGCTCTCGGGGGCTCACTGGATGGACCAGCACTTCCGCA
    ...
In this case the default action fails (all sequences would be named
gi
). The action
[nameparse=tag:gi|]
gives us the names
197102135
, 169213872
and 34784771
.
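The tag-based rule can be sketched the same way. The function name is hypothetical, and returning None when the tag is absent is our assumption; this document does not say what LASTZ does in that case.

```python
import re

def tag_name(full_header, tag):
    """Return the name that follows tag in the header, or None if absent.

    The name runs up to the first character that is not alphabetic,
    numeric or an underscore.
    """
    start = full_header.find(tag)
    if start < 0:
        return None
    return re.match(r"[A-Za-z0-9_]*", full_header[start + len(tag):]).group(0)
```

With the tag uaccno= the first example file yields the accession numbers, and with the tag gi| the second file yields the gi numbers.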
Sometimes it is more convenient just to assign a specific name. This can be
done with the
[nickname=<name>]
action.
For example, using the target and query file specifiers
~someuser/human/hg18/chr1.nib[nickname=human]
~someuser/human/ponAbe2/chr1.nib[nickname=orang]
,
the alignments will show the sequences as human
and
orang
rather than calling them both
chr1
.
The user may want to do away with name mangling entirely, in which case she can
use the action [nameparse=full]
. This uses the
full header as the sequence name. Note that if that contains spaces, the
resulting alignment files will probably fail downstream parsing.
These sequence naming alternatives are mutually exclusive; only one can be used at a time.
If the [subset=<names_file>]
action is used, the names in
the <names_file>
have to match mangled names.
For FASTA files, more complicated name mangling can be performed using other
unix command-line tools. In the example above, we could pipe the input through
sed
a couple of times to shorten each name to the NCBI accession numbers
NM_001133512.1
, XM_001716177.1
and BC006342.2
.
    cat query_file.fa \
      | sed "s/>.*ref|/>/g" \
      | sed "s/>.*gb|/>/g" \
      | lastz target /dev/stdin
Seeds are short near-matches between the target and query sequences, where "short" typically means less than 20 bp. Early alignment programs used exact matches (e.g. of length 12) as seeds, but spaced seeds can improve sensitivity when the sequences are evolutionarily diverged.
A spaced seed pattern is a list of positions, in a short word, where
a seed may contain mismatches. For example, consider the seed pattern
1100101111
. A 1
indicates a match is
required in this position, and a 0
indicates a mismatch is allowed
(effectively it is a "don't care" position). As the example below shows, using
this seed pattern, the seed word GTAGCTTCAC
hits twice in the
sequence ACGTGACATCACACATGGCGACGTCGCTTCACTGG
.
    target:     ACGTGACATCACACATGGCGACGTCGCTTCACTGG
    (mis)match:   ..::.X....          ..X.......
    query:        GTAGCTTCAC          GTAGCTTCAC
    pattern:      1100101111          1100101111
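The example above can be checked with a brute-force scan. The sketch below only handles patterns of 1s and 0s and requires exact matches at the 1 positions; it ignores any allowance for transitions, and it scans the target directly, whereas a real aligner would look seed words up in a position table.

```python
def seed_hits(target, word, pattern):
    """Return the origin-zero target positions where word hits under pattern."""
    required = [i for i, p in enumerate(pattern) if p == "1"]
    span = len(pattern)
    return [
        pos
        for pos in range(len(target) - span + 1)
        if all(target[pos + i] == word[i] for i in required)
    ]

hits = seed_hits("ACGTGACATCACACATGGCGACGTCGCTTCACTGG", "GTAGCTTCAC", "1100101111")
```

The two hits are at origin-zero positions 2 and 22 of the target.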
Spaced seeds have been shown to be more sensitive than exact match seeds, with little change in specificity. This is most advantageous when the sequences have lower similarity, such as human vs. mouse or chicken. Which seed pattern is best depends on the sequences being compared. See [Buhler 2003] for a discussion of spaced seeds and how to design them.
LASTZ gives the "user" many seeding choices. The intent is that these will be selected by some program (hence the quote marks around "user"), but they are available from the command line for any user.
N-mer match: A space-free seed can be specified by the length of the N-mer match required.
--seed=match<length>
General seed patterns:
Any spaced seed pattern can be specified. The pattern is a string of
1
s, 0
s, and T
s, where a 1
indicates that a match is required in that position, a 0
indicates
that a mismatch is allowed, and a T
indicates that a mismatch is
allowed only if it is a transition (A↔G or C↔T).
    --seed=<pattern>
The default seed is
--seed=1110100110010101111
, which is the same
12-of-19 seed used as the default in BLASTZ.
Half-weight seed patterns:
If a seed pattern consists of only 0
s and T
s, it is
implemented internally as a half-weight seed, which uses much less memory
(a half-weight seed uses the same amount of memory as a normal seed pattern
half as long). These patterns can be used in conjunction with filtering on the
number of matches and transversions in a seed. For example,
    --seed=TTT0T00TT00T0T0TTTT --filter=2:15
specifies the same pattern as the default seed, but allows the twelve
T
positions to be matches or transitions, requires at least
fifteen matches total (among the 19 positions), and allows at most two
transversions.
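To make the interaction of the pattern and the filter concrete, the following sketch tests one pair of words against a pattern of 1s, 0s and Ts, with optional limits on transversions and matches. It is an illustration under assumptions: the function names are hypothetical, the words are assumed to contain only uppercase A, C, G and T, and the two limits are taken in the order shown in the example (maximum transversions, then minimum matches).

```python
PURINES = set("AG")

def classify(a, b):
    """Classify one column as a match, transition (A<->G, C<->T) or transversion."""
    if a == b:
        return "match"
    if (a in PURINES) == (b in PURINES):
        return "transition"
    return "transversion"

def passes_seed(target_word, query_word, pattern,
                max_transversions=None, min_matches=None):
    kinds = [classify(t, q) for t, q in zip(target_word, query_word)]
    for kind, p in zip(kinds, pattern):
        if p == "1" and kind != "match":
            return False
        if p == "T" and kind == "transversion":
            return False
    if max_transversions is not None and kinds.count("transversion") > max_transversions:
        return False
    if min_matches is not None and kinds.count("match") < min_matches:
        return False
    return True
```

Under this sketch a word pair with three transversions, all in 0 positions, satisfies the pattern TTT0T00TT00T0T0TTTT on its own but is rejected once the limits 2 and 15 are applied.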
Additionally, --seed=half<length>
can be used as shorthand
to specify a space-free half-weight seed (i.e., all T
s).
Single, double, or no transitions:
By default, one match position (a 1
in a spaced seed, or any
position in an N-mer match) is allowed to be a transition instead of a true
match. --notransition
disables this. Alternatively,
--transition=2
allows any two match positions to be
transitions.
Twin hit seeds: The sensitivity of the seed can be decreased by ignoring seeds that don't have a second hit nearby, i.e. by requiring two seeds on the same diagonal.
    --twins=[<minsep>:]<maxsep>
The distance between the hits (the number of bases between the end of the first hit and the beginning of the second) must be at least
<minsep>
but not more than <maxsep>
.
If <minsep>
is omitted, zero is used (which means the
twin seeds may be adjacent but not overlap). Negative values can
be used; for example --twins=-5:10
means the twins can overlap
by as much as 5 bases or can have as much as 10 bases between them.
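The separation test can be written directly from that description. In this sketch (hypothetical function name) both hits are assumed to lie on the same diagonal, and first_end is the position just past the last base of the first hit, so adjacent hits have a separation of zero.

```python
def is_twin(first_end, second_start, maxsep, minsep=0):
    """True if the second hit is an acceptable twin of the first.

    A negative separation means the two hits overlap by that many bases.
    """
    separation = second_start - first_end
    return minsep <= separation <= maxsep
```

For example, with the equivalent of --twins=-5:10, a second hit that overlaps the first by 5 bases is accepted, while one that starts 11 bases after the first ends is not.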
Scoring inference is described in greater detail in [Harris 2007]. In this section we give a brief overview of the process.
Log-odds scores can be inferred from the given sequences, and the resulting scoring set can be saved to a file or immediately used to align the sequences.
Inference is achieved by computing the probability of each of the 18 different alignment events (gap open, gap extend, and 16 substitutions). These probabilities are estimated from alignments of the sequences. Of course, at first we don't have alignments, so we start by using a generic scoring set to create alignments, infer scores from those, then realign, and so on, until the scores stabilize or "converge". Ungapped alignments are performed until the substitution scores converge, then gapped alignments are performed (holding the substitution scores constant) until the gap scores converge.
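For the substitution part, one step of such an iteration amounts to turning observed column counts into log-odds scores. The sketch below shows the generic log-odds formula, score(x,y) = log2(p(x,y) / (p(x) p(y))), and is not the procedure from [Harris 2007]; in particular, estimating the background frequencies from the aligned columns themselves is our assumption, and no scaling or rounding is applied.

```python
import math
from collections import Counter

def log_odds_scores(pair_counts):
    """pair_counts maps (target_base, query_base) to the number of aligned
    columns with that pair. Returns a dict of log-odds scores in bits."""
    total = sum(pair_counts.values())
    t_freq, q_freq = Counter(), Counter()
    for (t, q), n in pair_counts.items():
        t_freq[t] += n / total
        q_freq[q] += n / total
    return {
        (t, q): math.log2((n / total) / (t_freq[t] * q_freq[q]))
        for (t, q), n in pair_counts.items()
        if n > 0  # pairs never observed have no finite log-odds score
    }
```

Pairs seen more often than chance would predict get positive scores and pairs seen less often get negative scores. Realigning with the new scores and recounting gives the next iteration.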
To infer scores, use the --infer
or --inferonly
options. (The latter will stop after inferring scores, without performing
the final alignment.) Settings for the inference process can be specified in
a control file included with these options.
The --infscores[=<output_file>]
option causes the inferred
scoring set to be written out to a separate file. If no
<output_file>
is specified, it is written to the header
of the alignment output file, as a comment. As a last resort, if no alignment
is performed the scoring set is written to stdout.
The scores are written in the same format
used to input scoring sets.
Usually it is undesirable to use all alignment blocks for inference. Blocks with a high substitution rate (low identity) are likely to be false positives. On the other hand, blocks with few substitutions (high identity) will be found regardless of what scores are used. Thus it is desirable to base the inference only on statistics from a mid-range of identity. By default the middle 50% is used (that is, the 25th through 75th percentile of identity found in the alignment), but this can be changed in the control file.
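Selecting the mid-range can be sketched as a sort followed by a slice. How the percentiles are computed exactly (for instance, whether they are weighted by block length) is not specified here, so the version below, which simply ranks blocks by identity, should be read as one plausible interpretation with hypothetical names.

```python
def mid_identity_blocks(blocks, low=0.25, high=0.75):
    """blocks: list of (identity, block) pairs.

    Returns the pairs whose rank by identity falls between the low and
    high percentiles.
    """
    ranked = sorted(blocks, key=lambda pair: pair[0])
    first = int(len(ranked) * low)
    last = int(len(ranked) * high)
    return ranked[first:last]
```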
Since score inference is an iterative process, greater accuracy can be
achieved by using the floating-point scoring version of LASTZ
(lastz_D
). Moreover, the technique used to infer gap
scores has not been shown to create good scores. Thus the author recommends
that users only use scoring inference for substitution scores. To enforce
these recommendations, the scoring inference code is blocked from operation in
the integer scoring version of LASTZ (lastz
) and gap
score inference is blocked in both versions. Special build options are
available to defeat the blocks; contact the author if you are interested.
The handling of characters other than A
,
C
, G
and
T
in sequences that are supposed to represent
DNA is problematic. Many database sequences contain
N
s to represent bases for which the actual nucleotide
is not known (at least, not known with any level of confidence). Unfortunately,
there is also a tradition of using strings of X
s
or N
s to splice together multiple sequences to gain
efficiency when dealing with programs that were limited to operating on a
single sequence.
Although this splicing was useful in BLASTZ, it is no longer needed for LASTZ.
Since LASTZ can handle multiple target sequences (with the
multiple
action),
we would prefer users not resort to splicing.
However, users that are replacing BLASTZ with LASTZ in a pipeline may be
using splicing. So LASTZ's default interpretation of non-ACGT characters is
the same as BLASTZ's—
X
s are excluded from the alignment seeding stage, and are so
severely penalized by alignment scoring that they will not normally appear in
any alignment. N
s are also excluded from seeding, and
are penalized about the same as a transversion mismatch.
Specifically, any substitution with X
is
scored as -1000
, and any substitution with anything
else (other than A
, C
,
G
or T
) is scored as
-100
.
Note that the user has to put "enough"
X
s or N
s between sequences
so that no alignment block will cross the splice. This is problematic, since
gap-scoring is only dependent on the length of the gap and not on the characters
in the gap. So if a gap the same length as the splice length would not be
penalized more than the y-drop setting, the alignment may hop the splice.
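A rough way to choose "enough" is to find the shortest gap whose penalty exceeds the y-drop threshold. The sketch below assumes a gap of length n is penalized as gap_open + n * gap_extend. That formula and the function name are assumptions on our part, so check them against the gap scoring actually in use.

```python
def min_splice_length(gap_open, gap_extend, ydrop):
    """Smallest gap length whose penalty is greater than ydrop."""
    length = 1
    while gap_open + gap_extend * length <= ydrop:
        length += 1
    return length
```

With hypothetical values gap_open=400, gap_extend=30 and ydrop=9400, this gives 301, so a splice shorter than that could be hopped by a gap.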
This default treatment of non-ACGT characters is inappropriate when the
sequences contain N
s to represent ambiguous bases.
To handle this case, LASTZ provides the --ambiguousn
option.
Substitutions with N
are scored as zero.
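The two treatments of non-ACGT characters can be summarized in one scoring function. This is a sketch with hypothetical names; the matrix argument is any table of ACGT substitution scores, and the order of the checks (X before N) and the treatment of other non-ACGT characters under --ambiguousn are our assumptions.

```python
ACGT = set("ACGT")

def substitution_score(t, q, matrix, ambiguous_n=False):
    """Score one aligned pair of characters (assumed uppercase)."""
    if t == "X" or q == "X":
        return -1000      # severe penalty, keeps X out of alignments
    if ambiguous_n and (t == "N" or q == "N"):
        return 0          # N treated as an ambiguous base
    if t not in ACGT or q not in ACGT:
        return -100       # default penalty for other non-ACGT characters
    return matrix[t][q]

# Hypothetical unit scores, used only to exercise the function.
UNIT = {a: {b: (1 if a == b else -1) for b in "ACGT"} for a in "ACGT"}
```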
In either case, non-ACGT characters are ignored during the seeding stage. Only
seed words that consist entirely of A
, C
, G
or T
are involved in seeding, even if the non-ACGT characters occur in don't-care
positions in the seed pattern.
The score values described above can be changed if a scoring file is specified.
The -1000
score is called
bad_score
and the -100
score
is called fill_score
. Further, which character is
considered "bad" (by default this is X
) can also be
specified in the score file, and can actually be different between the target
and query. The latter capability is necessary for dealing with quantum
sequences. Throughout this document, when we refer to the character
X
, we really mean the character specified to get the
bad_score
(which defaults to
X
).
Target capsule files are provided to improve run-time memory utilization when multiple cores on the same computer are running LASTZ with the same target sequence. They allow the lion's share of the large internal data structures to be shared between the processes. This allows more copies of LASTZ to be run simultaneously without overrunning physical memory. This can improve the throughput, for example, for mapping a large set of reads to a single (large) reference sequence.
To create a capsule file, use a command like this:
    lastz <target> --writecapsule=<file> [<seeding_options>]
Applicable seeding options are
--seed
,
--step
,
--maxwordcount
and --word
.
To use the capsule file, use a command like this:
    lastz --targetcapsule=<file> <query> [<other_options>]
No additional effort on the part of the user is required to handle sharing of the capsule data between separate runs. Nearly all options are allowed. However, the seeding options
--seed
,
--step
,
--maxwordcount
and --word
are not allowed, since these (or their byproducts) are stored in the capsule
file. Further, --masking
is not allowed, because it
would require modifying both the target sequence and the target seed word
position table, which are contained in the capsule.
Internally LASTZ asks the operating system to directly map the capsule file
into the running program's memory space. Multiple running instances can map
the same file; each instance will have its own virtual addresses for the
capsule data, but the physical memory is shared. There is no requirement that
more than one instance is actually using the capsule simultaneously. Running
a single copy of lastz with --targetcapsule
will work fine, and
in fact there may be a small speed improvement as compared to running the same
alignment without a capsule.
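The sharing mechanism is the operating system's ordinary file mapping, which can be demonstrated from Python's standard mmap module. This sketch only illustrates read-only mapping of a file; it does not read or interpret the capsule format, and the function name is hypothetical.

```python
import mmap

def map_read_only(path):
    """Map a (non-empty) file into memory for reading.

    Several processes that map the same file read-only are backed by the
    same physical pages, although each gets its own virtual addresses.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after
        # the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

The returned object can be sliced like a bytes object, and the pages are only loaded as they are touched.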
The downside of this technique is that the capsule files are very large and are also machine-dependent. For example, for human chromosome 1, the file is about 1.4 gigabytes. Note that attempts to run a capsule built on a mismatched computer are detected and rejected.
[Buhler 2003] "Designing seeds for similarity search in genomic DNA." Buhler J, Keich U, Sun Y. Proc. 7th Annual International Conference on Research in Computational Molecular Biology (RECOMB '03), pp. 67-75.
[Chiaromonte 2002] "Scoring Pairwise Genomic Sequence Alignments." Chiaromonte F, Yap VB, Miller W. Pacific Symposium on Biocomputing (2002), vol. 7, pp. 115-126.
[Gusfield 1997] "Algorithms on Strings, Trees and Sequences." Gusfield D. 1997. pp. 244.
[Harris 2007] "Improved Pairwise Alignment of Genomic DNA." Harris RS. PhD Thesis, Pennsylvania State University, 2007.
[Myers 1989] "Approximate Matching of Regular Expressions." Myers EW and Miller W. Bull. Math. Biol. 51 (1989), pp. 5-37.
[Zhang 1998] "Alignments without low-scoring regions." Zhang Z, Berman P, Miller W. Journal of Computational Biology 5:197-210 (1998).
Release | Date | Changes |
1.0.1 | Jul/28/2008 | Initial release. |
1.0.5 | Aug/2/2008 | Fixed a bug that in some cases caused a bus error when interpolated alignments (e.g. --inner=…) were used with multiple queries. |
| | Added xmask=<file> and nmask=<file> file masking actions. |
1.0.21 | Sep/9/2008 | Fixed a bug involving the default value for --gappedthresh (a.k.a. L) when --exact is used. The bug caused the gapped threshold to be inordinately low, allowing undesirable alignment blocks to make it to the output file. |
| | Fixed a bug whereby Xs and Ns were treated as desirable substitutions when unit scores (e.g. --match=…) were used. |
| | Re-implemented --twins=…. The previous implementation improperly truncated the left-extension of HSPs. The new implementation is slower and uses more memory. |
| | Added --census=<file>. The census counts the number of times each base in the target sequence is part of an alignment block. Previously, --census produced a census only if the output format was LAV (the census is a special stanza in a LAV file). Otherwise the option was ignored. Now, if a file is specified a census is written to that file. The format of lines in the census is <name> <position> <count>. The position is one-based, and the count is limited to 255. In situations where 255 is too limiting, |
| | Added --format=<differences>, to support Galaxy. All differences (gaps and runs of mismatches) are reported, one per line. |
| | Added --anchors=<file>, giving the user the ability to bypass the seeding and anchoring stage. |
| | Changed default gap penalties for unit scores (e.g. --match=…) to be relative to mismatch score (instead of match score). |
| | Made the <start>#<length> file subrange action better at checking errors, and also allowed <length> to use units such as M and K. |
| | Sped up program exit by no longer freeing dynamically allocated memory. |
1.1.0 | Dec/5/2008 | Improved x-drop extension to better handle suboptimal HSPs. Left-extension now starts at the right end of the seed (rather than the left end). This reduces the chance that the extended region (the combination of left and right extensions) will score less than some subinterval. |
| | Changed coverage filtering so that it is relative to whichever sequence is shortest. Previously it was always relative to the query. |
| | Changed defaults for xdrop and ydrop when --match scoring is used. |
| | Interpolation now uses the xdrop value from the main alignments. Previously it used the ydrop value to match BLASTZ, but we have decided that was a bug in BLASTZ. |
| | Added general output format. |
| | Added --maxwordcount. |
| | Added --notrivial. |
| | Corrected problem with --subset action, which wasn't using mangled sequence names. |
| | Fixed problem in writing LAV m- and x-stanzas. |
| | Blocked the use of scoring inference in the integer build, and blocked gap scoring inference in all builds. |
| | Changed much of the syntax for options and actions. The newer syntax is clearer and more consistent than the older. The older is still supported by the program so that existing scripts will still work, but it is not documented. |
| | Changed reporting of duplicated options from can't understand "<option>" to duplicated or conflicting option "<option>". |
| | Added --format=rdotplot option. |
1.1.25 | Feb/5/2009 | Fixed a bug that caused some gapped extensions to be terminated prematurely. In some cases this also allowed a nearby low-scoring alignment to "piggyback" onto the remainder of a terminated alignment, gaining enough in score to pass the score threshold. |
| | Added support for target capsule files. |
| | Added support for --format=cigar. |
| | Added the <center>^<length> sequence interval specifier. |
| | Corrected the behavior of --exact regarding lower-case and non-ACGT characters. --exact now considers, e.g., a lowercase A to be a match for an uppercase A. Further, any non-ACGT characters now stop the match. |
| | Improved detection and reporting of memory allocation overflow. Two problems were fixed as part of this: (1) allocation of single blocks larger than 2G was being rejected even on platforms that could support larger blocks, and (2) an allocation overflow problem which could cause a segfault for target sequences longer than about 1G (these require allocation of a block greater than 4G). |
| | Changed the behavior when encountering an empty sequence (in a file with many sequences). Previously this was reported as an error, and the program halted. Now it is reported as a warning (to stderr), and the program continues. |
| | Added --output option. In some batch systems, it is difficult to redirect stdout into a file. So this option allows the user to do it directly. |