This document describes installation and usage of the LASTZ sequence alignment program. LASTZ is a drop-in replacement for BLASTZ, and is backward compatible with BLASTZ's command-line options. That is, it supports all of BLASTZ's options but also has additional ones, and may produce slightly different alignment results.
LASTZ: | A tool for (1) aligning two DNA sequences, and (2) inferring appropriate scoring parameters automatically.
Platform: | This package was developed on a Macintosh OS X system, but should work on Linux or other Unix platforms with little or no change. LASTZ is written in C and compiled with gcc. Some ancillary tools are written in Python, but use only modules available in typical Python installations.
Author: | Bob Harris <rsharris at bx dot psu dot edu>
Date: | July 2008
This is a preliminary document, covering installation, common options, and support for Yasra. A more detailed document describing additional features is forthcoming.
If you have received the distribution as a packed archive, unpack it by whatever means are appropriate for your computer. The result should be a directory <somepath>/lastz-distrib-X.XX.XX that contains a src subdirectory (and some others). You may find it convenient to remove the revision number (-X.XX.XX) from the directory name.
Before building or installing any of the programs, you will need to tell the installer where to put the executables, either by setting the shell variable $LASTZ_INSTALL, or by editing the make-include.mak file to set the definition of installDir. Also, be sure to add the directory you choose to your $PATH.
Then to build the LASTZ executable, enter the following commands from bash (or a similar command-line shell). This will build two executables (lastz and lastz_D) and copy them into your installDir.
cd <somepath>/lastz-distrib-X.XX.XX/src
make
make install
The two executables are basically the same program; the only difference is that lastz uses integer scores, while lastz_D uses floating-point scores.
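As a quick check after make install, the following Python sketch reports whether the two executables can be found on your $PATH. The function name find_lastz_executables and its search_path parameter are illustrative assumptions, not part of the LASTZ distribution.

```python
import shutil


def find_lastz_executables(search_path=None):
    """Return {name: full path or None} for the two executables built by make.

    search_path is a PATH-style string; None means "use the current $PATH".
    (Hypothetical helper, not shipped with LASTZ.)
    """
    return {
        name: shutil.which(name, path=search_path)
        for name in ("lastz", "lastz_D")
    }
```

Calling find_lastz_executables() with no arguments searches your current $PATH; a value of None for either name suggests that installDir has not yet been added to $PATH.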
A simple self test is included so you can test whether the build succeeded. To run it, enter the following command:
make test
If the test is successful, you will see no output from this command. Otherwise, you will see the differences between the expected output and the output of your build, plus a line that looks like this:
make: *** [test] Error 1
Aligning a human chromosome to a chicken chromosome
To run a quick low-sensitivity alignment of these sequences:
lastz hg18.chr4.fa galGal3.chr4.fa C=3 T=2 Z=10 --maf > hg18_4.galGal3_4.maf
Comparing shotgun reads to a human chromosome
lastz hg18.chr22.fa reads.fa --yasra98 --maf > hg18_22.reads.maf
If you are familiar with BLASTZ, you can run LASTZ the same way you ran BLASTZ, with the same options and input files. In addition to this BLASTZ compatibility, LASTZ provides other options.
The general format of the LASTZ command line is
lastz <target_file> <query_file> [<options>]
The angle brackets <> indicate meta-syntactic variables that should be replaced with your values, while the square brackets [] indicate elements that are optional. Elements can appear in any order; the only constraint is that the query_file must appear after the target_file.
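For scripted runs, the command-line layout above can be assembled programmatically. The sketch below is one way to do it in Python; build_lastz_command and run_lastz are hypothetical helper names, and the sketch simply places the two files first, which satisfies the target-before-query constraint.

```python
import subprocess


def build_lastz_command(target_file, query_file, options=()):
    """Return an argv list of the form: lastz <target_file> <query_file> [<options>]."""
    if not target_file or not query_file:
        raise ValueError("both a target file and a query file are required")
    # The target is placed before the query, as the command line requires.
    return ["lastz", target_file, query_file, *options]


def run_lastz(command, output_path):
    """Run the command and redirect its standard output to output_path."""
    with open(output_path, "w") as out:
        subprocess.run(command, stdout=out, check=True)
```

For example, build_lastz_command("hg18.chr4.fa", "galGal3.chr4.fa", ["C=3", "T=2", "Z=10", "--maf"]) reproduces the human-versus-chicken command shown earlier, and run_lastz plays the role of the shell's > redirection.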
The target_file and query_file are usually just the names of files containing the two sequences to be aligned, in FASTA, Nib, or 2Bit format. However, they can also specify subsequences from these files; running
lastz --help=files
gives a description of the available filename modifiers.
For options, the general format is --<keyword> or --<keyword>=<value>, but for BLASTZ compatibility some options also have an alternative syntax <letter>=<number>.
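To make the two option syntaxes concrete, here is a small Python sketch that classifies a command-line token as a long option or a BLASTZ-style shortcut. It is an illustration only and does not reproduce LASTZ's own parser; the function name and the returned labels are assumptions. The shortcut pattern accepts any value, not just a number, so that shortcuts taking a file name are covered too.

```python
import re

# --<keyword> or --<keyword>=<value>
LONG_OPTION = re.compile(r"^--([A-Za-z]\w*)(?:=(.*))?$")
# <letter>=<value>, the BLASTZ-compatible form
SHORTCUT = re.compile(r"^([A-Za-z])=(.+)$")


def classify_option(token):
    """Return (kind, name, value) where kind is 'long', 'blastz', or 'unknown'."""
    m = LONG_OPTION.match(token)
    if m:
        return ("long", m.group(1), m.group(2))
    m = SHORTCUT.match(token)
    if m:
        return ("blastz", m.group(1), m.group(2))
    return ("unknown", token, None)
```

With this sketch, "--step=10" and "Z=10" are recognized as the long and shortcut spellings respectively, while a plain file name falls through as "unknown".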
Running the command lastz without any arguments prints a help message with the most commonly used options, while running
lastz --help
lists all of the options.
Option | BLASTZ equivalent | Meaning
--strand=both | B=2 | Search both strands.
--strand=plus | B=0 | Search forward strand only (the one in the query file).
--strand=minus | B=-1 | Search the reverse complement strand only (opposite of the query file).
(By default both strands are searched.)
--seed=12of19 | T=1 or T=2 | Seed hits require a 19 bp word with matches in 12 specific positions.
--seed=14of22 | T=3 or T=4 | Seed hits require a 22 bp word with matches in 14 specific positions.
--seed=match<n> | W=<n> | Seed hits require an n bp word with matches in all positions.
--transition | T=1 or T=3 | Allow one transition in each seed hit.
--transition=2 | | Allow two transitions in a seed hit.
--notransition | T=2 or T=4 | Don't allow any transitions in seed hits.
(By default the 12-of-19 seed is used, and one transition is allowed.)
--step=<n> | Z=<n> | Number of bases between the start of each target word considered for a seed match.
(By default a step of 1 is used.)
--gfextend | | Perform gap-free extension of seed hits to HSPs (high scoring segment pairs).
--nogfextend | | Don't extend seed hits to HSPs.
--chain | C=1 or C=2 | Perform chaining of HSPs.
--nochain | C=0 or C=3 | Don't perform chaining of HSPs.
--gapped | C=0 or C=2 | Perform gapped alignment (instead of gap-free).
--nogapped | C=1 or C=3 | Perform gap-free alignment.
(By default seed hits are extended to HSPs and gapped alignment is performed, without chaining.)
--scores=<file> | Q=<file> | Read substitution scores from a file.
--match=<reward>,<penalty> | | Set the score values for a match (+<reward>) and mismatch (-<penalty>).
(By default HOXD70 scores are used.)
--gap=<[open,]extend> | O=<score> E=<score> | Set the score penalties for opening and extending a gap.
(Default is 400 for gap open, 30 for gap extend.)
--xdrop=<score> | X=<score> | Set the x-drop threshold.
(Default is 10 times the A-vs.-A substitution score.)
--ydrop=<score> | Y=<score> | Set the y-drop threshold.
(Default is the score of a 300 bp gap.)
--hspthresh=<score> | K=<score> | Set the threshold for HSPs; ungapped extensions scoring lower are discarded.
(Default is 3000.)
--gappedthresh=<score> | L=<score> | Set the threshold for gapped alignments; gapped extensions scoring lower are discarded.
(Default is to use the same value as --hspthresh.)
--inner=<score> | H=<score> | Set the threshold for HSPs during interpolation.
(Default is to not perform interpolation.)
--entropy | P=1 | Involve entropy when filtering HSPs.
--noentropy | P=0 | Don't involve entropy when filtering HSPs.
(Default is to involve entropy.)
--traceback=<bytes> | m=<bytes> | Space to allocate (in RAM) for trace-back information.
(Default is 80.0M.)
--identity=<min>[..<max>] | | Filter alignments by percent identity, 0 ≤ min ≤ max ≤ 100; alignment blocks outside the range are discarded.
(Default is to not perform identity filtering.)
--format=<type> | | Specify the output format, one of: lav, axt, maf, or text.
(By default output is in LAV format.)
--help | | List all options.
--help=files | | List information about file specifiers.
--help=shortcuts | | List BLASTZ-compatible shortcuts.
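The T and C rows of the table each combine two independent choices (seed pattern plus transitions, and chaining plus gapped extension). The Python sketch below spells out that mapping as read from the table; it is a reading aid under that assumption, not code from LASTZ, and translate_shortcut is a hypothetical name.

```python
# BLASTZ T value -> seed pattern and transition setting, per the table above.
SEED_BY_T = {
    1: ["--seed=12of19", "--transition"],
    2: ["--seed=12of19", "--notransition"],
    3: ["--seed=14of22", "--transition"],
    4: ["--seed=14of22", "--notransition"],
}

# BLASTZ C value -> chaining and gapped-alignment setting, per the table above.
CHAIN_BY_C = {
    0: ["--nochain", "--gapped"],
    1: ["--chain", "--nogapped"],
    2: ["--chain", "--gapped"],
    3: ["--nochain", "--nogapped"],
}


def translate_shortcut(token):
    """Translate a T=<n> or C=<n> shortcut into the equivalent long options."""
    letter, _, value = token.partition("=")
    table = {"T": SEED_BY_T, "C": CHAIN_BY_C}.get(letter)
    if table is None:
        raise ValueError(f"only T and C shortcuts are handled here: {token!r}")
    try:
        return list(table[int(value)])
    except (KeyError, ValueError):
        raise ValueError(f"unsupported value in shortcut: {token!r}") from None
```

Read this way, the C=3 T=2 settings used in the quick human-versus-chicken example correspond to no chaining, gap-free alignment, and a 12-of-19 seed with no transitions.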
There are several options to support the Yasra mapping assembler. These provide canned sets of option settings that work well for aligning an assembled reference sequence (as the target) with a set of shotgun reads (as the query). They are selected based on the expected level of identity between the sequences. For example, --yasra90 should be used when we expect 90% identity. The --yasraXXshort options are appropriate when the reads are very short (less than 50 bp).
Option | Equivalent
--yasra98 | T=2 Z=20 --match=1,6 O=8 E=1 Y=20 K=22 L=30 --identity=98
--yasra95 | T=2 Z=20 --match=1,5 O=8 E=1 Y=20 K=22 L=30 --identity=95
--yasra90 | T=2 Z=20 --match=1,5 O=6 E=1 Y=20 K=22 L=30 --identity=90
--yasra85 | T=2 --match=1,2 O=4 E=1 Y=20 K=22 L=30 --identity=85
--yasra75 | T=2 --match=1,1 O=3 E=1 Y=20 K=22 L=30 --identity=75
--yasra95short | T=2 --match=1,7 O=6 E=1 Y=14 K=10 L=14 --identity=95
--yasra85short | T=2 --match=1,3 O=4 E=1 Y=14 K=11 L=14 --identity=85
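If you want to see or adjust what a Yasra preset stands for, the table can be captured as a lookup. The Python sketch below copies the equivalents from the table and expands any preset found in an option list; expand_presets is a hypothetical helper, and LASTZ itself performs this expansion internally when given the preset option.

```python
# Preset -> equivalent option settings, copied from the table above.
YASRA_PRESETS = {
    "--yasra98": "T=2 Z=20 --match=1,6 O=8 E=1 Y=20 K=22 L=30 --identity=98",
    "--yasra95": "T=2 Z=20 --match=1,5 O=8 E=1 Y=20 K=22 L=30 --identity=95",
    "--yasra90": "T=2 Z=20 --match=1,5 O=6 E=1 Y=20 K=22 L=30 --identity=90",
    "--yasra85": "T=2 --match=1,2 O=4 E=1 Y=20 K=22 L=30 --identity=85",
    "--yasra75": "T=2 --match=1,1 O=3 E=1 Y=20 K=22 L=30 --identity=75",
    "--yasra95short": "T=2 --match=1,7 O=6 E=1 Y=14 K=10 L=14 --identity=95",
    "--yasra85short": "T=2 --match=1,3 O=4 E=1 Y=14 K=11 L=14 --identity=85",
}


def expand_presets(options):
    """Replace each Yasra preset in options with its equivalent settings."""
    expanded = []
    for option in options:
        if option in YASRA_PRESETS:
            expanded.extend(YASRA_PRESETS[option].split())
        else:
            expanded.append(option)
    return expanded
```

Expanding a preset this way gives a starting point for tuning a single setting (for example, a different step) while keeping the rest of the canned values.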
Release | Date | Changes
1.0.1 | Jul/28/2008 | Initial release.
1.0.5 | Aug/2/2008 | Fixed a bug that in some cases caused a bus error when interpolated alignments (e.g. --inner=...) were used with multiple queries.
| | Added xmask=<file> and nmask=<file> file masking actions.
1.0.21 | Sep/9/2008 | Fixed a bug involving the default value for --gappedthresh (a.k.a. L) when --exact is used. The bug caused the gapped threshold to be inordinately low, allowing undesirable alignment blocks to make it to the output file.
| | Fixed a bug whereby Xs and Ns were treated as desirable substitutions when unit scores (e.g. --match=...) were used.
| | Re-implemented the --twins=... option. The previous implementation improperly truncated the left-extension of HSPs. The new implementation is slower and uses more memory.
| | Added --census=<file>. The census counts the number of times each base in the target sequence is part of an alignment block. Previously, --census produced a census only if the output format was lav (the census is a special stanza in a lav file); otherwise the option was ignored. Now, if a file is specified, a census is written to that file. The format of lines in the census is <name> <position> <count>. The position is one-based, and the count is limited to 255. In situations where a limit of 255 is too limiting,
| | Added --format=<differences>, to support Galaxy. All differences (gaps and runs of mismatches) are reported, one per line.
| | Added --anchors=<file>, giving the user the ability to bypass the seeding and anchoring stage.
| | Changed default gap penalties for unit scores (e.g. --match=...) to be relative to mismatch score (instead of match score).
| | Made the <start>#<length> file subrange action better at checking errors, and also allowed <length> to use units such as M and K.
| | Sped up program exit by no longer freeing up dynamically allocated memory.
1.0.30 | Sep/26/2008 | Improved the heuristic used for locating high-scoring gap-free matches. In certain cases the previous heuristic included a poor-scoring end in the match, lowering the match's overall score, and causing some matches to be unjustly discarded. Some brief tests suggest that in chromosome-to-chromosome alignments the problem rate was very low (less than ≈ 1 in 25,000 matches).
| | Changed coverage filtering (--coverage=) so that it is relative to whichever sequence is shorter. Previously it was always relative to the second sequence file.
| | Changed defaults for --xdrop and --ydrop when unit scoring is used (e.g. for --match or --unitscores). The new defaults, for this case only, are xdrop=10*sqrt(mismatch_penalty)
| | Fixed a parsing bug involving file subrange actions when they were not at the end of the action list.
| | Corrected a bus error caused in some cases when --tableonly was used.
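The census file described in the 1.0.21 notes has one <name> <position> <count> record per line. As a minimal sketch of reading such a file in Python, assuming the three fields are separated by whitespace (the release notes do not state the separator) and using the hypothetical name parse_census:

```python
def parse_census(lines):
    """Map (sequence name, one-based position) to the census count.

    Assumes whitespace-separated fields: <name> <position> <count>.
    Blank lines are skipped.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, position, count = line.split()
        counts[(name, int(position))] = int(count)
    return counts
```

An open file object can be passed directly, for example parse_census(open("census.txt")), where census.txt stands for whatever file name was given to the census option.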