LASTZ Release 1.0.21 built Sep/9/2008

Introduction

This document describes installation and usage of the LASTZ sequence alignment program. LASTZ is a drop-in replacement for BLASTZ, and is backward compatible with BLASTZ's command-line options. That is, it supports all of BLASTZ's options but also has additional ones, and may produce slightly different alignment results.

LASTZ: A tool for (1) aligning two DNA sequences, and (2) inferring appropriate scoring parameters automatically.
Platform: This package was developed on a Macintosh OS X system, but should work on Linux and other Unix platforms with little or no change. LASTZ is written in C and compiled with gcc. Some ancillary tools are written in Python, but use only modules available in typical Python installations.
Author: Bob Harris  <rsharris at bx dot psu dot edu>
Date: July 2008

This is a preliminary document, covering installation, common options, and support for Yasra. A more detailed document describing additional features is forthcoming.

Installation

If you have received the distribution as a packed archive, unpack it by whatever means are appropriate for your computer. The result should be a directory <somepath>/lastz‑distrib‑X.XX.XX that contains a src subdirectory (and some others). You may find it convenient to remove the revision number (‑X.XX.XX) from the directory name.

Before building or installing any of the programs, you will need to tell the installer where to put the executable, either by setting the shell variable $LASTZ_INSTALL, or by editing the make‑include.mak file to set the definition of installDir. Also, be sure to add the directory you choose to your $PATH.
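
As a sketch of the setup just described, the following bash commands set the install location and add it to the search path. The directory $HOME/bin is only an illustrative choice, not something the distribution requires. The variable is exported so that make can see it.

```shell
# Illustrative choice of install directory (an assumption; any directory
# you can write to should do).
LASTZ_INSTALL="$HOME/bin"
export LASTZ_INSTALL    # exported so that "make install" can read it

# Add the directory to PATH unless it is already listed.
case ":$PATH:" in
    *":$LASTZ_INSTALL:"*) ;;                        # already present
    *) PATH="$LASTZ_INSTALL:$PATH"; export PATH ;;  # prepend it
esac
```

To keep the setting across sessions, the same lines can be placed in your shell startup file.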

Then to build the LASTZ executable, enter the following commands from bash (or a similar command-line shell). This will build two executables (lastz and lastz_D) and copy them into your installDir.

    cd <somepath>/lastz-distrib-X.XX.XX/src
    make
    make install
The two executables are basically the same program; the only difference is that lastz uses integer scores, while lastz_D uses floating-point scores.

A simple self test is included so you can test whether the build succeeded. To run it, enter the following command:

    make test
If the test is successful, you will see no output from this command. Otherwise, you will see the differences between the expected output and the output of your build, plus a line that looks like this:
    make: *** [test] Error 1

Examples

Aligning a human chromosome to a chicken chromosome

To run a quick low-sensitivity alignment of these sequences:

    lastz hg18.chr4.fa galGal3.chr4.fa C=3 T=2 Z=10 --maf > hg18_4.galGal3_4.maf

Comparing shotgun reads to a human chromosome

    lastz hg18.chr22.fa reads.fa --yasra98 --maf > hg18_22.reads.maf

Command-Line Options

If you are familiar with BLASTZ, you can run LASTZ the same way you ran BLASTZ, with the same options and input files. In addition to this BLASTZ compatibility, LASTZ provides other options.

The general format of the LASTZ command line is

    lastz <target_file> <query_file> [<options>]
The angle brackets <> indicate meta-syntactic variables that should be replaced with your values, while the square brackets [] indicate elements that are optional. Elements can appear in any order; the only constraint is that the query_file must appear after the target_file.

The target_file and query_file are usually just the names of files containing the two sequences to be aligned, in FASTA, Nib, or 2Bit format. However, they can also specify subsequences from these files; running

    lastz --help=files
gives a description of the available filename modifiers.

For options, the general format is ‑‑<keyword> or ‑‑<keyword>=<value>, but for BLASTZ compatibility some options also have an alternative syntax <letter>=<number>.
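
To illustrate how the two syntaxes relate, the sketch below translates a few BLASTZ-style shortcuts into the long-form options, using the correspondences listed under Commonly Used Options. The function to_long_option is a hypothetical helper written for this example; it is not part of LASTZ and covers only some of the shortcuts.

```shell
# Translate one BLASTZ-style shortcut into the long-form options listed
# under "Commonly Used Options". to_long_option is a hypothetical helper
# for this example only.
to_long_option() {
    case "$1" in
        B=2)  printf '%s\n' "--strand=both" ;;
        B=0)  printf '%s\n' "--strand=plus" ;;
        B=-1) printf '%s\n' "--strand=minus" ;;
        T=1)  printf '%s\n' "--seed=12of19 --transition" ;;
        T=2)  printf '%s\n' "--seed=12of19 --notransition" ;;
        T=3)  printf '%s\n' "--seed=14of22 --transition" ;;
        T=4)  printf '%s\n' "--seed=14of22 --notransition" ;;
        C=0)  printf '%s\n' "--nochain --gapped" ;;
        C=1)  printf '%s\n' "--chain --nogapped" ;;
        C=2)  printf '%s\n' "--chain --gapped" ;;
        C=3)  printf '%s\n' "--nochain --nogapped" ;;
        Z=*)  printf '%s\n' "--step=${1#Z=}" ;;
        W=*)  printf '%s\n' "--seed=match${1#W=}" ;;
        *)    printf '%s\n' "$1" ;;   # anything else passes through unchanged
    esac
}

# Spell out the shortcuts used in the human/chicken example.
long_form=""
for arg in C=3 T=2 Z=10; do
    long_form="$long_form $(to_long_option "$arg")"
done
long_form=${long_form# }   # drop the leading space
printf '%s\n' "$long_form"
```

Under the table's correspondences, the shortcuts C=3 T=2 Z=10 from that example expand to the options printed by the last line.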

Running the command lastz without any arguments prints a help message with the most commonly used options, while running

    lastz --help
lists all of the options.

Commonly Used Options

Option    BLASTZ equivalent    Meaning
‑‑strand=both B=2 Search both strands.
‑‑strand=plus B=0 Search forward strand only (the one in the query file).
‑‑strand=minus B=‑1 Search the reverse complement strand only (opposite of the query file).
(By default both strands are searched.)
‑‑seed=12of19 T=1 or T=2 Seed hits require a 19 bp word with matches in 12 specific positions.
‑‑seed=14of22 T=3 or T=4 Seed hits require a 22 bp word with matches in 14 specific positions.
‑‑seed=match<n> W=<n> Seed hits require an n bp word with matches in all positions.
‑‑transition T=1 or T=3 Allow one transition in each seed hit.
‑‑transition=2 Allow two transitions in a seed hit.
‑‑notransition T=2 or T=4 Don't allow any transitions in seed hits.
(By default the 12-of-19 seed is used, and one transition is allowed.)
‑‑step=<n> Z=<n> Number of bases between the start of each target word considered for a seed match.
(By default a step of 1 is used.)
‑‑gfextend Perform gap-free extension of seed hits to HSPs (high scoring segment pairs).
‑‑nogfextend Don't extend seed hits to HSPs.
‑‑chain C=1 or C=2 Perform chaining of HSPs.
‑‑nochain C=0 or C=3 Don't perform chaining of HSPs.
‑‑gapped C=0 or C=2 Perform gapped alignment (instead of gap-free).
‑‑nogapped C=1 or C=3 Perform gap-free alignment.
(By default seed hits are extended to HSPs and gapped alignment is performed, without chaining.)
‑‑scores=<file> Q=<file> Read substitution scores from a file.
‑‑match=<reward>,<penalty> Set the score values for a match (+<reward>) and mismatch (‑<penalty>).
(By default HOXD70 scores are used.)

          A     C     G     T
    A    91  -114   -31  -123
    C  -114   100  -125   -31
    G   -31  -125   100  -114
    T  -123   -31  -114    91
‑‑gap=<[open,]extend> O=<score> E=<score> Set the score penalties for opening and extending a gap.
(Default is 400 for gap open, 30 for gap extend.)
‑‑xdrop=<score> X=<score> Set the x-drop threshold.
(Default is 10 times the A-vs.-A substitution score.)
‑‑ydrop=<score> Y=<score> Set the y-drop threshold.
(Default is the score of a 300 bp gap.)
‑‑hspthresh=<score> K=<score> Set the threshold for HSPs; ungapped extensions scoring lower are discarded.
(Default is 3000.)
‑‑gappedthresh=<score> L=<score> Set the threshold for gapped alignments; gapped extensions scoring lower are discarded.
(Default is to use the same value as ‑‑hspthresh.)
‑‑inner=<score> H=<score> Set the threshold for HSPs during interpolation.
(Default is to not perform interpolation.)
‑‑entropy P=1 Involve entropy when filtering HSPs.
‑‑noentropy P=0 Don't involve entropy when filtering HSPs.
(Default is to involve entropy.)
‑‑traceback=<bytes> m=<bytes> Space to allocate (in RAM) for trace-back information.
(Default is 80.0M)
‑‑identity=<min>[..<max>] Filter alignments by percent identity, 0 ≤ min ≤ max ≤ 100; alignment blocks outside the range are discarded.
(Default is to not perform identity filtering.)
‑‑format=<type> Specify the output format, one of: lav, axt, maf, or text.
(By default output is in LAV format.)
‑‑help List all options.
‑‑help=files List information about file specifiers.
‑‑help=shortcuts List BLASTZ-compatible shortcuts.
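
Two of the defaults in the table are defined relative to other scores, and a little arithmetic makes them concrete. The sketch below uses the default HOXD70 and gap penalty values listed above. The gap-cost formula (one open penalty plus one extend penalty per base) is an assumption about how a gap is scored, so treat the y-drop figure as illustrative.

```shell
# Values taken from the table above (HOXD70 and the default gap penalties).
a_vs_a=91        # A-vs.-A substitution score
gap_open=400     # default gap open penalty
gap_extend=30    # default gap extend penalty

# Default x-drop: 10 times the A-vs.-A score.
xdrop=$((10 * a_vs_a))

# Default y-drop: the score of a 300 bp gap. Assumption: a gap of n bases
# is charged one open penalty plus n extend penalties.
ydrop=$((gap_open + 300 * gap_extend))

echo "xdrop=$xdrop ydrop=$ydrop"
```

If a different scoring file or different gap penalties are supplied, the same arithmetic applies with those values substituted.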

Yasra-Specific Options

There are several options to support the Yasra mapping assembler. These provide canned sets of option settings that work well for aligning an assembled reference sequence (as the target) with a set of shotgun reads (as the query). They are selected based on the expected level of identity between the sequences. For example, ‑‑yasra90 should be used when we expect 90% identity. The ‑‑yasraXXshort options are appropriate when the reads are very short (less than 50 bp).

Option           Equivalent
‑‑yasra98        T=2 Z=20 ‑‑match=1,6 O=8 E=1 Y=20 K=22 L=30 ‑‑identity=98
‑‑yasra95        T=2 Z=20 ‑‑match=1,5 O=8 E=1 Y=20 K=22 L=30 ‑‑identity=95
‑‑yasra90        T=2 Z=20 ‑‑match=1,5 O=6 E=1 Y=20 K=22 L=30 ‑‑identity=90
‑‑yasra85        T=2      ‑‑match=1,2 O=4 E=1 Y=20 K=22 L=30 ‑‑identity=85
‑‑yasra75        T=2      ‑‑match=1,1 O=3 E=1 Y=20 K=22 L=30 ‑‑identity=75
‑‑yasra95short   T=2      ‑‑match=1,7 O=6 E=1 Y=14 K=10 L=14 ‑‑identity=95
‑‑yasra85short   T=2      ‑‑match=1,3 O=4 E=1 Y=14 K=11 L=14 ‑‑identity=85

Change History

Release  Date  Changes
1.0.1  Jul/28/2008  Initial release.
1.0.5  Aug/2/2008  Fixed a bug that in some cases caused a bus error when interpolated alignments (e.g. ‑‑inner=...) were used with multiple queries.
Added xmask=<file> and nmask=<file> file masking actions.
1.0.21  Sep/9/2008  Fixed a bug involving the default value for ‑‑gappedthresh (a.k.a. L) when ‑‑exact is used. The bug caused the gapped threshold to be inordinately low, allowing undesirable alignment blocks to make it to the output file.
Fixed a bug whereby Xs and Ns were treated as desirable substitutions when unit scores (e.g. ‑‑match=...) were used.
Re-implemented ‑‑twins=.... The previous implementation improperly truncated the left-extension of HSPs. The new implementation is slower and uses more memory.
Added ‑‑census=<file>. The census counts the number of times each base in the target sequence is part of an alignment block. Previously, ‑‑census produced a census only if the output format was lav (the census is a special stanza in a lav file). Otherwise the option was ignored. Now, if a file is specified, a census is written to that file. The format of lines in the census is <name> <position> <count>. The position is one-based, and the count is limited to 255.

In situations where a limit of 255 is too low, ‑‑census16=<file> or ‑‑census32=<file> can be used, with limits of about 65 thousand and 4 billion, respectively. Note that these will respectively double and quadruple the amount of memory used for the census. The default census uses one byte per target sequence location.

Added ‑‑format=differences, to support Galaxy. All differences (gaps and runs of mismatches) are reported, one per line.
Added ‑‑anchors=<file>, giving the user the ability to bypass the seeding and anchoring stage.
Changed default gap penalties for unit scores (e.g. ‑‑match=...) to be relative to mismatch score (instead of match score).
Made the <start>#<length> file subrange action better at checking errors, and also allowed <length> to use units such as M and K.
Sped up program exit by no longer freeing dynamically allocated memory.
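
The census line format described under ‑‑census=<file> above lends itself to simple post-processing. The sketch below uses made-up census lines (the sequence name, positions, and counts are hypothetical) and assumes the three fields are separated by whitespace. It counts how many positions are covered at least once and how many have reached the 255 limit.

```shell
# Hypothetical census lines in the documented "<name> <position> <count>"
# layout; real files are written by lastz when --census=<file> is given.
census_sample='chr4 101 1
chr4 102 3
chr4 103 255'

# Summarise: positions covered at least once, and positions at the 255 cap.
summary=$(printf '%s\n' "$census_sample" | awk '
    $3 > 0    { covered++ }
    $3 == 255 { capped++ }
    END       { printf "covered=%d capped=%d\n", covered, capped }')
echo "$summary"
```

A position reported at the cap may have been part of more than 255 alignment blocks; in that case, rerunning with ‑‑census16=<file> or ‑‑census32=<file> raises the limit.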