MAF 1 April 14, 2009 version 0.1 File formats for genomic data

NAME

MAF - (Multiple Alignment Format) Specifications

DEFINITION

The multiple alignment format stores a series of multiple alignments in a format that is easy to parse and relatively easy to read. It is intended for multiple alignments at the DNA level between entire genomes. Previously used formats are suitable for multiple alignments of single proteins or regions of DNA without rearrangements, but would require considerable extension to cope with genomic issues such as forward and reverse strand directions, multiple pieces to the alignment, and so forth.

General Structure

The .maf format is line-oriented. Words in a line are delimited by any white space. Each sequence in an alignment is on a single line; such lines can get quite long, and there is no length limit. Lines starting with # are considered to be comments. Lines starting with ## contain meta-data of one form or another and can be ignored by most programs.

The file is divided into paragraphs, each terminated by a blank line. Within a paragraph, the first word of a line indicates its type. Each multiple alignment is in a separate paragraph that begins with an "a" line and contains an "s" line for each sequence in the multiple alignment. Some MAF files may contain other optional line types:
i an "i" line containing information about what is in the aligned species DNA before and after the immediately preceding "s" line
e an "e" line containing information about the size of the gap between the alignments that span the current block
q a "q" line indicating the quality of each aligned base for the species
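
The paragraph structure described above can be read with a few lines of code. The following is a minimal sketch in Python, not a reference implementation; the function name iter_paragraphs is hypothetical, and the choice to skip every line whose first word starts with # (including the ##maf header) is an assumption made for brevity.

```python
from typing import Iterable, Iterator, List, Tuple

# One parsed line: (line type, remaining white-space-separated words).
Line = Tuple[str, List[str]]


def iter_paragraphs(lines: Iterable[str]) -> Iterator[List[Line]]:
    """Yield each paragraph as a list of (type, words) tuples.

    Assumption: comment and meta-data lines (first word starts with '#')
    are skipped rather than returned.
    """
    block: List[Line] = []
    for raw in lines:
        words = raw.split()          # any white space delimits words
        if not words:                # a blank line terminates the paragraph
            if block:
                yield block
                block = []
            continue
        if words[0].startswith("#"):
            continue
        block.append((words[0], words[1:]))
    if block:                        # final paragraph without trailing blank line
        yield block
```

Because a file object iterates over its lines, the same function can be applied directly to an open .maf file.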

Header line

The first line of a .maf file begins with ##maf. This word is followed by white-space-separated variable=value pairs. There should be no white space surrounding the "=".
##maf version=1 scoring=tba.v8

The currently defined variables are:

version Required. Currently set to one.

scoring Optional. A name for the scoring scheme used for the alignments.

program Optional. Name of the program generating the alignment.

Undefined variables are ignored by the parser.
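
As an illustration of the header rules above, here is a small Python sketch that splits a ##maf line into its variable=value pairs. The function name parse_maf_header is hypothetical. It returns every pair it finds, so ignoring undefined variables is left to the caller; raising an error when "version" is missing is an assumption about how strict a parser might be.

```python
from typing import Dict


def parse_maf_header(line: str) -> Dict[str, str]:
    """Return the variable=value pairs of a ##maf header line as strings."""
    words = line.split()
    if not words or words[0] != "##maf":
        raise ValueError("not a ##maf header line")
    variables: Dict[str, str] = {}
    for word in words[1:]:
        # No white space is allowed around '=', so each word is one pair.
        name, sep, value = word.partition("=")
        if not sep or not name:
            raise ValueError("malformed variable=value pair: %r" % word)
        variables[name] = value
    if "version" not in variables:
        # Assumption: treat the required variable as a hard error.
        raise ValueError("required variable 'version' is missing")
    return variables
```

For the header shown above, the result maps "version" to "1" and "scoring" to "tba.v8".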

Alignments Parameter Line

The second line displays the parameters that were used to run the alignment program.
# tba.v8 (((human chimp) baboon) (mouse rat))

Alignment Block Lines

Lines starting with 'a' -- parameters for a new alignment block
a score=23262.0

Each alignment begins with an 'a' line that sets variables for the entire alignment block. The 'a' is followed by name=value pairs. There are no required name=value pairs. The currently defined variables are:

score Optional. Floating point score. If this is present, it is good practice to also define scoring in the first line.

pass Optional. Positive integer value. For programs that do multiple pass alignments such as blastz, this shows which pass this alignment came from. Typically, pass 1 will find the strongest alignments genome-wide, and pass 2 will find weaker alignments between two first-pass alignments.
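
The two variables defined above carry typed values, which a reader might convert as follows. This is a sketch only; the function name parse_a_line is hypothetical, and silently dropping names other than score and pass is an assumption.

```python
from typing import Dict, Union


def parse_a_line(line: str) -> Dict[str, Union[int, float]]:
    """Return the defined variables of an 'a' line with typed values."""
    words = line.split()
    if not words or words[0] != "a":
        raise ValueError("not an 'a' line")
    raw = {}
    for word in words[1:]:
        name, sep, value = word.partition("=")
        if not sep:
            raise ValueError("malformed name=value pair: %r" % word)
        raw[name] = value
    result: Dict[str, Union[int, float]] = {}
    if "score" in raw:
        result["score"] = float(raw["score"])      # floating point score
    if "pass" in raw:
        pass_number = int(raw["pass"])             # positive integer
        if pass_number <= 0:
            raise ValueError("pass must be a positive integer")
        result["pass"] = pass_number
    return result
```

Since no name=value pair is required, a bare "a" line yields an empty dictionary.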

Lines starting with 's' -- a sequence within an alignment block

 s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
 s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
 s baboon         249182 13 +   4622798 gcagctgaaaaca
 s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA

The 's' lines together with the 'a' lines define a multiple alignment. The 's' lines have the following fields which are defined by position rather than name=value pairs.

src The name of one of the source sequences for the alignment. For sequences that are resident in a browser assembly, the form 'database.chromosome' allows automatic creation of links to other assemblies. Non-browser sequences are typically referenced by the species name alone.

start The start of the aligning region in the source sequence. This is a zero-based number. If the strand field is '-' then this is the start relative to the reverse-complemented source sequence.

size The size of the aligning region in the source sequence. This number is equal to the number of non-dash characters in the alignment text field below.

strand Either '+' or '-'. If '-', then the alignment is to the reverse-complemented source.

srcSize The size of the entire source sequence, not just the parts involved in the alignment.

text The nucleotides (or amino acids) in the alignment and any insertions (dashes) as well.
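
The positional fields above, the size rule, and the reverse-strand start convention can be sketched in Python as follows. The names SLine, parse_s_line and forward_interval are hypothetical. Reporting the interval as a zero-based, half-open (start, end) pair on the forward strand is a presentation choice made here, not something the format itself prescribes.

```python
from typing import NamedTuple, Tuple


class SLine(NamedTuple):
    src: str
    start: int      # zero-based; relative to the reverse complement if strand is '-'
    size: int       # number of non-dash characters in text
    strand: str     # '+' or '-'
    src_size: int   # size of the entire source sequence
    text: str

    def forward_interval(self) -> Tuple[int, int]:
        """Return (start, end) on the forward strand, zero-based, end exclusive."""
        if self.strand == "+":
            return self.start, self.start + self.size
        # On '-', start counts from the end of the forward sequence.
        forward_start = self.src_size - (self.start + self.size)
        return forward_start, forward_start + self.size


def parse_s_line(line: str) -> SLine:
    """Parse one 's' line and check that size matches the alignment text."""
    words = line.split()
    if len(words) != 7 or words[0] != "s":
        raise ValueError("not a well-formed 's' line")
    _, src, start, size, strand, src_size, text = words
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    record = SLine(src, int(start), int(size), strand, int(src_size), text)
    if record.size != len(text) - text.count("-"):
        raise ValueError("size does not equal the number of non-dash characters")
    return record
```

For example, a hypothetical line "s x 2 3 - 10 A-CG" has three non-dash characters starting at position 2 of the reverse complement of a 10-base source, which corresponds to forward-strand positions 5 up to (but not including) 8.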

Lines starting with 'i' -- information about what's happening before and after this block in the aligning species

 s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
 s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
 i panTro1.chr6 N 0 C 0
 s baboon         249182 13 +   4622798 gcagctgaaaaca
 i baboon       I 234 n 19

Lines starting with 'e' -- information about empty parts of the alignment block

s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
e mm4.chr6     53310102 13 + 151104725 I

EXAMPLE

##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))

a score=23262.0
s hg18.chr7    27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon         116834 38 +   4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6     53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4     81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG

a score=5062.0
s hg18.chr7    27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon         241163 6 +   4622798 TAAAGA
s mm4.chr6     53303881 6 + 151104725 TAAAGA
s rn3.chr4     81444246 6 + 187371129 taagga

a score=6636.0
s hg18.chr7    27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon         249182 13 +   4622798 gcagctgaaaaca
s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA

SEE ALSO

gff(1), bed(1)