LAV Format

TABLE OF CONTENTS

Introduction
Example
Stanza Types

Introduction

LAV is a plain-text file format for alignments of two DNA sequences. It is the only output format produced by the BLASTZ alignment program (though often converted to AXT format by post-processing programs), and is the default output format for BLASTZ's successor, LASTZ.

The alignment blocks are grouped by sequence (e.g. chromosome, scaffold, contig, cDNA read, shotgun sequencing read, etc.) and strand, and described by listing the coordinates of the gap-free aligning segments in each block. This format is compact because it does not include the nucleotides, but the tradeoff is that interpretation usually requires access to the original sequence files, and it is not easy for humans to read.

Example

Here's a typical LAV file:

    #:lav
    d {
      "lastz.v0.3 malus.fa aurantium.fa C=2 W=8 T=0 
         A    C    G    T
        91 -114  -31 -123
      -114  100 -125  -31
       -31 -125  100 -114
      -123  -31 -114   91
      O = 400, E = 30, K = 3000, L = 3000, M = 0"
    }
    #:lav
    s {
      "malus.fa" 1 191411218 0 1
      "aurantium.fa" 1 90634903 0 1
    }
    h {
      "> apple"
      "> orange"
    }
    a {
      s 20643
      b 46566766 2083211
      e 46567353 2083795
      l 46566766 2083211 46566796 2083241 61
      l 46566797 2083245 46566814 2083262 78
      l 46566821 2083263 46567353 2083795 65
    }
    a {
      s 4233
      b 47246530 10635696
      e 47246660 10635826
      l 47246530 10635696 47246660 10635826 63
    }
    ... many more a-stanzas ...
    #:lav
    s {
      "malus.fa" 1 191411218 0 1
      "aurantium.fa-" 1 90634903 1 1
    }
    h {
      "> apple"
      "> orange (reverse complement)"
    }
    a {
      s 13897
      b 1005819 5352698
      e 1006099 5352978
      l 1005819 5352698 1006099 5352978 74
    }
    ... many more a-stanzas ...
    #:eof

Stanza Types

An LAV file primarily consists of a series of "stanzas", each of which is a short code (a single letter, except for the Census stanza) followed by a brace-enclosed block. There are also #:lav lines which break the file into sections, and one #:eof line indicating the end of the file. Programs that read LAV format should treat the file as invalid if the #:eof is missing (or if anything appears after it).
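The overall structure described above can be sketched as a small reader. This is a minimal illustration, not the parser used by any existing tool: the function name split_stanzas is hypothetical, each #:lav line is reported as a marker entry with an empty body, trailing blank lines after #:eof are tolerated, and a line consisting solely of "}" is assumed to close the current stanza.

```python
import re

def split_stanzas(text):
    """Split LAV text into (name, body_lines) pairs, in file order.

    Each "#:lav" line is reported as ("#:lav", []) so that callers can see
    the section boundaries. Raises ValueError if "#:eof" is missing or is
    followed by anything other than blank lines (an assumption of this sketch).
    """
    lines = [ln.rstrip() for ln in text.splitlines()]
    while lines and not lines[-1].strip():
        lines.pop()  # ignore trailing blank lines
    if not lines or lines[-1].strip() != "#:eof":
        raise ValueError("missing #:eof line, or content after it")

    stanzas = []
    name, body = None, []
    for ln in lines[:-1]:
        stripped = ln.strip()
        if name is None:
            if not stripped:
                continue
            if stripped == "#:lav":
                stanzas.append(("#:lav", []))
                continue
            m = re.fullmatch(r"(\w+)\s*\{", stripped)
            if m is None:
                raise ValueError(f"unexpected line outside a stanza: {ln!r}")
            name, body = m.group(1), []
        elif stripped == "}":
            stanzas.append((name, body))
            name = None
        else:
            body.append(stripped)
    if name is not None:
        raise ValueError(f"unterminated {name}-stanza")
    return stanzas
```

Because an early #:eof line is not a valid stanza opener, content following it is rejected by the same "unexpected line" check.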

D Stanza

The d-stanza is intended to document the program and parameters used to create the file. Programs reading the file normally treat this as a comment, but it is possible to extract the scoring parameters for further processing.
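As a rough sketch of how the scoring parameters might be extracted, the code below assumes the d-stanza is laid out as in the example earlier in this document: a header row of single-letter nucleotide codes, one row of integer scores per nucleotide, and then NAME = value pairs. The format does not guarantee this layout, so the function name parse_d_body and everything it expects are assumptions for illustration only.

```python
import re

def parse_d_body(body_lines):
    """Return (scores, params) from the lines inside a d-stanza.

    scores maps (row_nucleotide, column_nucleotide) to an integer score;
    params maps names such as "O" or "E" to integers. Only the layout shown
    in this document's example is handled (assumption).
    """
    nucs, matrix, params = None, [], {}
    for ln in body_lines:
        text = ln.strip().strip('"').strip()
        fields = text.split()
        if nucs is None:
            # Look for the header row, e.g. "A C G T".
            if len(fields) > 1 and all(len(f) == 1 and f.isalpha() for f in fields):
                nucs = fields
            continue
        if len(matrix) < len(nucs):
            matrix.append([int(f) for f in fields])
            continue
        # Remaining lines hold "NAME = value" pairs.
        for key, value in re.findall(r"(\w+)\s*=\s*(-?\d+)", text):
            params[key] = int(value)
    scores = {}
    if nucs is not None:
        for i, row in enumerate(matrix):
            for j, col_nuc in enumerate(nucs):
                scores[(nucs[i], col_nuc)] = row[j]
    return scores, params
```

Parameters that appear on the command line before the matrix (C=2, W=8, T=0 in the example) are deliberately ignored here.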

S Stanza

An s-stanza describes the sequences used for the subsequent alignment records (a-stanzas). It contains exactly two lines in the following format.

    "<filename>[-]" <start> <stop> [<rev_comp_flag> <sequence_number>]

Here <start> and <stop> are origin 1 (i.e. the first base in the original given sequence is called "1") and inclusive (both endpoints are included in the interval). Usually <start> is 1 and <stop> is the full length of the given sequence; however, they can specify any subsequence (e.g. if the alignment program was instructed to use only part of the original sequence).

<rev_comp_flag> is 1 if the sequence was reverse-complemented before aligning, or 0 otherwise. Usually the first sequence will have a 0 here, since most alignment programs only ever reverse-complement the second one. If this flag is 1, the <filename> will also have a - appended to it; programs that read LAV format should report an error if these two indicators are contradictory. Note that even when this flag indicates reverse-complement, the <start> and <stop> endpoints are still relative to the original orientation, and <start> is less than <stop>. That is, conceptually the alignment program extracts the requested sequence fragment first, then reverse-complements it (if applicable), and finally tries to align it.

<sequence_number> is useful when the second file contains multiple sequences. The first sequence is 1, the second is 2, and so on. Most programs that write and read LAV format do not allow the first file to contain multiple sequences, so in these cases the sequence number for the first file is always 1 (though the format itself does not require this). Note that <start> and <stop> are relative to each sequence, not to the entire file.

The <rev_comp_flag> and <sequence_number> are shown here as optional because early versions of this format did not include them.
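A reader for one s-stanza line, including the consistency check between the - suffix and <rev_comp_flag>, might look like the sketch below. The names SeqSpec and parse_s_line are hypothetical. The sketch assumes that a filename ending in - with a flag of 0 counts as a contradiction, and that in the older form without the two optional fields the filename is returned unchanged with the flag reported as unknown (None).

```python
import re
from typing import NamedTuple, Optional

class SeqSpec(NamedTuple):
    filename: str               # without the "-" suffix when rev_comp is True
    start: int                  # origin 1, inclusive
    stop: int                   # origin 1, inclusive
    rev_comp: Optional[bool]    # None if the line uses the older short form
    seq_number: Optional[int]   # None if the line uses the older short form

_S_LINE = re.compile(r'"([^"]*)"\s+(\d+)\s+(\d+)(?:\s+([01])\s+(\d+))?\s*$')

def parse_s_line(line):
    """Parse one line of an s-stanza into a SeqSpec."""
    m = _S_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"malformed s-stanza line: {line!r}")
    name, start, stop, flag, number = m.groups()
    start, stop = int(start), int(stop)
    if start > stop:
        raise ValueError("<start> must not exceed <stop>")
    if flag is None:
        return SeqSpec(name, start, stop, None, None)
    rev = flag == "1"
    if rev != name.endswith("-"):
        raise ValueError("filename suffix and <rev_comp_flag> are contradictory")
    return SeqSpec(name[:-1] if rev else name, start, stop, rev, int(number))
```

For instance, parse_s_line('"aurantium.fa-" 1 90634903 1 1') yields a SeqSpec with filename "aurantium.fa" and rev_comp set to True.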

H Stanza

Usually an s-stanza is followed immediately by an h-stanza, which provides a name for each of the two sequences, typically obtained from the FASTA header line. (Before the s-stanza's <sequence_number> field was introduced, this was the only way to identify which sequence from a multi-sequence file was aligned.) A (reverse complement) suffix is appended when applicable; again, programs should report an error if this contradicts the other indicators.

A Stanza

An a-stanza describes a single alignment block, sometimes called a "local alignment", which typically includes gaps due to small insertions and deletions in the aligned sequences. In the example below, the s, b, and e lines indicate that the block has a score of 13916 and an overall range of 4886..5171 in sequence 1 and 21292..21537 in sequence 2.

The l lines describe the block's gap-free segments, with the final field representing the percentage of matching bases in each segment. In this example the alignment starts with a segment from 4886..4899 in sequence 1 and from 21292..21305 in sequence 2, having a percent identity of 79%. Note that the segment length must be the same in both sequences (14 basepairs for this segment). The next segment starts at 4900 and 21308 in sequences 1 and 2, respectively, indicating a two-base gap in sequence 1 (corresponding to positions 21306 and 21307 in sequence 2).

    a {
      s 13916
      b 4886 21292
      e 5171 21537
      l 4886 21292 4899 21305 79
      l 4900 21308 4924 21332 92
      l 4925 21334 5024 21433 88
      l 5027 21434 5040 21447 100
      l 5086 21448 5117 21479 84
      l 5118 21484 5171 21537 87
    }
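The segment arithmetic above can be checked mechanically. The following sketch (function names are hypothetical) reads the lines inside an a-stanza, verifies that each gap-free segment has the same length in both sequences, and reports how many bases are skipped in each sequence between consecutive segments. It assumes the score and percent identity are written as integers, as in the examples in this document.

```python
def parse_a_stanza(body_lines):
    """Return (score, begin, end, segments) from the lines inside an a-stanza.

    Each segment is (start1, start2, end1, end2, percent_identity).
    """
    score = begin = end = None
    segments = []
    for ln in body_lines:
        fields = ln.split()
        if not fields:
            continue
        key, values = fields[0], [int(v) for v in fields[1:]]
        if key == "s":
            (score,) = values
        elif key == "b":
            begin = tuple(values)
        elif key == "e":
            end = tuple(values)
        elif key == "l":
            s1, s2, e1, e2, pct = values
            if e1 - s1 != e2 - s2:
                raise ValueError(f"segment lengths differ: {ln!r}")
            segments.append((s1, s2, e1, e2, pct))
        else:
            raise ValueError(f"unexpected line in a-stanza: {ln!r}")
    return score, begin, end, segments

def gaps_between(segments):
    """Yield (bases skipped in sequence 1, bases skipped in sequence 2)
    for each pair of consecutive segments."""
    for (_, _, e1, e2, _), (s1, s2, _, _, _) in zip(segments, segments[1:]):
        yield (s1 - e1 - 1, s2 - e2 - 1)
```

Applied to the example above, the first pair of segments gives (0, 2): no bases are skipped in sequence 1 and two are skipped in sequence 2, which is the two-base gap in sequence 1 described in the text.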

Coordinates in an a-stanza are origin 1 and inclusive, and are relative to the subsequences indicated in the most recent s-stanza. In the example below the alignment is of apple 1333..1444 to orange 2777..2888.

    s {
      "malus.fa" 1001 2000 0 1
      "aurantium.fa" 2001 5000 0 1
    }
    ...
    a {
      s 7321
      b 333 777
      e 444 888
      l 333 777 444 888 62
    }

If a sequence is reverse-complemented, then the coordinates are relative to the reverse complement, so they are counted back from the end of the subsequence. Thus the example below represents an alignment of apple 1333..1444 to the reverse complement of orange 4113..4224.

In detail: the s-stanza indicates that the first sequence from aurantium.fa should be used, and its subsequence from 2001..5000 should be extracted and then reverse-complemented before aligning with apple. In this 3000 bp reverse-complemented subsequence, the first base corresponds to position 5000 in the original sequence, the second to position 4999, and so on to the last (3000th) base, which corresponds to position 2001. Thus the conversion formula is p = 5000 - (r - 1), where p is the position in the original sequence and r is the position in the reverse-complemented subsequence. Within the reverse-complemented subsequence, the alignment is at 777..888. The starting point, 777, is the nucleotide 776 bp back from 5000, or 4224, while the ending point, 888, is 887 bp back from 5000, or 4113.

    s {
      "malus.fa" 1001 2000 0 1
      "aurantium.fa-" 2001 5000 1 1
    }
    ...
    a {
      s 7321
      b 333 777
      e 444 888
      l 333 777 444 888 62
    }
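Both coordinate conversions (forward and reverse-complemented) reduce to a few lines of arithmetic. The sketch below uses hypothetical function names; start, stop, and rev_comp are the values from the corresponding line of the most recent s-stanza.

```python
def to_original(pos, start, stop, rev_comp):
    """Map an origin-1 a-stanza position to an origin-1 position in the
    original sequence, given the s-stanza's <start>, <stop> and flag."""
    if not 1 <= pos <= stop - start + 1:
        raise ValueError("position lies outside the subsequence")
    if rev_comp:
        return stop - (pos - 1)   # counted back from the end: p = stop - (r - 1)
    return start + (pos - 1)      # counted forward from the start

def interval_to_original(begin, end, start, stop, rev_comp):
    """Map an a-stanza interval to (low, high) in the original sequence."""
    a = to_original(begin, start, stop, rev_comp)
    b = to_original(end, start, stop, rev_comp)
    return (min(a, b), max(a, b))

# Forward strand: apple 333..444 within 1001..2000
print(interval_to_original(333, 444, 1001, 2000, False))   # (1333, 1444)
# Reverse complement: orange 777..888 within 2001..5000
print(interval_to_original(777, 888, 2001, 5000, True))    # (4113, 4224)
```

Note that on the reverse strand the interval's endpoints swap roles: the a-stanza's beginning position maps to the higher original coordinate.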

The fifth numeric field in an a-stanza's l line is the percentage of bases in the aligned segment that match (often called the "percent identity" or "percent id"). This is used by viewer tools such as Laj and PipMaker.

X and M Stanzas

An LAV file may also contain x- and m-stanzas describing dynamic masking. Each section will contain an x-stanza that looks like the one below. The count is the number of bases newly masked as a result of processing the latest query sequence; it does not include bases previously masked.

    x {
      n <count>
    }

A single m-stanza listing the masked regions is then included in the final section, and looks like this:

    m {
      x <start> <end>
      x <start> <end>
      ...
      n <count>
    }

Dynamic masking is invoked by the --masking=<count> option in LASTZ, or the M=<count> option in BLASTZ. For more information about these options, please see the LASTZ documentation.

Census Stanza

In LASTZ, the --census option will produce a Census-stanza. The first field in each line (1, 2, 3, …) is a position in the target (sequence 1). The count indicates the number of times the corresponding base appears in an alignment.

    Census {
      1 <count>
      2 <count>
      ...
    }

Bob Harris and Cathy Riemer, October 2008