GFF 1 April 14, 2009 version 0.1 File formats for genomic data

NAME

GFF - (Gene Feature Format) Specifications

SYNOPSIS

chrom\tpname\tfname\tstart\tend\tscore\tstrand\t.\t[attribute_list]

The genome coordinates are 1-based, meaning that the lowest value of start is 1.
All fields are tab-delimited (\t), except for the attribute_list, which is a semicolon (;)-delimited list of field/value pairs.

Introduction

Essentially all current approaches to feature finding in higher organisms use a variety of recognition methods that give scores to likely signals (starts, splice sites, stops, motifs, etc.) or to extended regions (exons, introns, protein domains, etc.), and then combine these to give complete gene, RNA transcript or protein structures. Normally the combination step is done in the same program as the feature detection, often using dynamic programming methods. To enable these processes to be decoupled, a format called GFF ('Gene-Finding Format' or 'General Feature Format') was proposed as a protocol for the transfer of feature information. It is now possible to take features from an outside source and add them to an existing program, or in the extreme to write a dynamic programming system that takes only external features.

GFF allows people to develop features and have them tested without having to maintain a complete feature-finding system. Equally, it would help those developing and applying integrated gene-finding programs to test new feature detectors developed by others, or even by themselves.

We want the GFF format to be easy to parse and process by a variety of programs in different languages. For example, it would be useful if Unix tools such as grep and sort, and simple Perl and awk scripts, could easily extract information from the file. For these reasons, for the primary format, we propose a record-based structure, where each feature is described on a single line, and line order is not relevant.

We do not intend GFF format to be used for complete data management of the analysis and annotation of genomic sequence. Systems such as Acedb, Genotator etc. that have much richer data representation semantics have been designed for that purpose. The disadvantages of using their formats for data exchange (or other richer formats such as ASN.1) are (1) they require more complexity in parsing/processing, and (2) there is little hope of achieving consensus on how to capture all information. GFF is intentionally aiming for a low common denominator.

With the changes taking place to version 2 of the format, we also allow for feature sets to be defined over RNA and Protein sequences, as well as genomic DNA. This is used for example by the EMBOSS project to provide standard format output for all features as an option. In this case the <strand> and <frame> fields should be set to '.'. To assist this transition in specification, a new #Type Meta-Comment has been added.

Here are some example records:

SEQ1	EMBL	atg	103	105	.	+	0
SEQ1	EMBL	exon	103	172	.	+	0
SEQ1	EMBL	splice5	172	173	.	+	.
SEQ1	netgene	splice5	172	173	0.94	+	.
SEQ1	genie	sp5-20	163	182	2.3	+	.
SEQ1	genie	sp5-10	168	177	2.1	+	.
SEQ2	grail	ATG	17	19	2.1	-	0
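As a minimal sketch of how records like those above might be read, the following Python splits one line on tabs and converts the numeric fields. The names GffRecord and parse_gff_line are illustrative assumptions and not part of the GFF specification; the field names follow the DEFINITION section, and a score of '.' is mapped to None.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """One GFF feature line (hypothetical container, not defined by GFF)."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line: str) -> GffRecord:
    """Split one tab-delimited GFF record into its fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError(
            f"expected at least 8 tab-separated fields, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame = cols[:8]
    # Anything after the eighth field is kept verbatim as attribute text.
    attributes = "\t".join(cols[8:])
    return GffRecord(seqname, source, feature, int(start), int(end),
                     None if score == "." else float(score),
                     strand, frame, attributes)


rec = parse_gff_line("SEQ1\tnetgene\tsplice5\t172\t173\t0.94\t+\t.")
print(rec.feature, rec.start, rec.end, rec.score)
```

Because each feature occupies a single line and line order is not relevant, a whole file can be handled by applying such a function to each line independently.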

DEFINITION

Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

<seqname>
The name of the sequence. Having an explicit sequence name allows a feature file to be prepared for a data set of multiple sequences. Normally the seqname will be the identifier of the sequence in an accompanying fasta format file. An alternative is that <seqname> is the identifier for a sequence in a public database, such as an EMBL/Genbank/DDBJ accession number. Which is the case, and which file or database to use, should be explained in accompanying information.

<source>
The source of this feature. This field will normally be used to indicate the program making the prediction, or if it comes from public database annotation, or is experimentally verified, etc.

<feature>
The feature type name. We hope to suggest a standard set of features, to facilitate import/export, comparison, etc. Of course, people are free to define new ones as needed. For example, Genie splice detectors account for a region of DNA, and multiple detectors may be available for the same site, as shown above.


We would like to enforce a standard nomenclature for common GFF features. This does not forbid the use of other features; rather, if a feature is obviously described in the standard list, the standard label should be used. For this standard table we propose to fall back on the international public standards for genomic database feature annotation, specifically the DDBJ/EMBL/GenBank feature table documentation.
<start>, <end>
Integers. <start> must be less than or equal to <end>. Sequence numbering starts at 1, so these numbers should be between 1 and the length of the relevant sequence, inclusive. (Version 2 change: version 2 condones values of <start> and <end> that extend outside the reference sequence. This is often more natural when dumping from acedb, rather than clipping. It means that some software using the files may need to clip for itself.)
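Since version 2 allows <start> and <end> to extend outside the reference sequence, consuming software may need to clip for itself. The sketch below shows one way this could be done; the function name clip_to_sequence and the choice to return None for a feature lying entirely outside the sequence are assumptions made for illustration.

```python
from typing import Optional, Tuple


def clip_to_sequence(start: int, end: int,
                     seq_length: int) -> Optional[Tuple[int, int]]:
    """Clip 1-based inclusive coordinates to the range 1..seq_length.

    Returns None when the feature lies entirely outside the sequence
    (an assumption of this sketch, not a rule of the format).
    """
    if start > end:
        raise ValueError("<start> must be less than or equal to <end>")
    clipped_start = max(start, 1)
    clipped_end = min(end, seq_length)
    if clipped_start > clipped_end:
        return None
    return clipped_start, clipped_end


print(clip_to_sequence(-5, 20, 100))
print(clip_to_sequence(90, 120, 100))
```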
<score>
A floating point value. When there is no score (i.e. for a sensor that just records the possible presence of a signal, as for the EMBL features above) you should use '.'. (Version 2 change: in version 1 of GFF you had to write 0 in such circumstances.)
<strand>
One of '+', '-' or '.'. '.' should be used when strand is not relevant, e.g. for dinucleotide repeats. Version 2 change: this field is set to '.' for RNA and protein features.
<frame>
One of '0', '1', '2' or '.'. '0' indicates that the specified region is in frame, i.e. that its first base corresponds to the first base of a codon. '1' indicates that there is one extra base, i.e. that the second base of the region corresponds to the first base of a codon, and '2' means that the third base of the region is the first base of a codon. If the strand is '-', then the first base of the region is the value of <end>, because the corresponding coding region runs from <end> to <start> on the reverse strand. As with <strand>, if the frame is not relevant then set <frame> to '.'. It has been pointed out that "phase" might be a better descriptor than "frame" for this field. Version 2 change: this field is set to '.' for RNA and protein features.
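The frame rule can be illustrated with a short sketch that returns the sequence coordinate of the first base of the first complete codon in a region. The function name first_codon_base is hypothetical, and treating a strand of '.' like '+' is an assumption of this sketch rather than something the specification states.

```python
from typing import Optional


def first_codon_base(start: int, end: int, strand: str,
                     frame: str) -> Optional[int]:
    """Coordinate of the first base of the first complete codon.

    Returns None when <frame> is '.', i.e. when frame is not relevant.
    """
    if frame == ".":
        return None
    if frame not in ("0", "1", "2"):
        raise ValueError(f"invalid frame: {frame!r}")
    offset = int(frame)
    if strand == "-":
        # On the reverse strand the region is read from <end> to <start>.
        return end - offset
    # Assumption: a strand of '.' is handled like '+'.
    return start + offset


# Values taken from the example records in this document.
print(first_codon_base(103, 172, "+", "0"))
print(first_codon_base(7105, 7201, "-", "2"))
```

For the coding_exon record on dJ102G20 (7105 to 7201, strand '-', frame '2'), the region is read starting at 7201, so its third base, 7199, begins a codon.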
[attributes]
From version 2 onwards, the attribute field must have a tag-value structure following the syntax used within objects in a .ace file, flattened onto one line by semicolon separators. Tags must be standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must be quoted with double quotes. Note: all non-printing characters in such free text value strings (e.g. newlines, tabs, control characters, etc.) must be explicitly represented by their C (UNIX) style backslash-escaped representation (e.g. newlines as '\n', tabs as '\t'). As in ACEDB, multiple values can follow a specific tag. The aim is to establish consistent use of particular tags, corresponding to an underlying implied ACEDB model if you want to think that way (but acedb is not required). Examples of these would be:
seq1     BLASTX  similarity   101  235 87.1 + 0	Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201   .  - 2 Sequence "dJ102G20.C1.1"

The semantics of tags in attribute field tag-value pairs have intentionally not been formalized. Two useful guidelines are to use DDBJ/EMBL/GenBank feature 'qualifiers' (see DDBJ/EMBL/GenBank feature table documentation), or the features that ACEDB generates when it dumps GFF.
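A version 2 attribute field such as the ones in the examples above could be tokenized roughly as follows. This is only a sketch under stated assumptions: the function names parse_attributes and unescape are hypothetical, unquoted values are returned as strings without numeric conversion, and only the escapes \n, \t, \" and \\ are translated.

```python
import re

TAG = re.compile(r"[A-Za-z][A-Za-z0-9_]*\Z")
TOKEN = re.compile(r'''
    \s*(?:
        (?P<semi>;)
      | "(?P<quoted>(?:[^"\\]|\\.)*)"
      | (?P<bare>[^\s;"]+)
    )''', re.VERBOSE)
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def unescape(text):
    """Translate C-style backslash escapes inside a quoted value."""
    return re.sub(r"\\(.)",
                  lambda m: ESCAPES.get(m.group(1), m.group(1)), text)


def parse_attributes(text):
    """Return a list of (tag, [values]) pairs from an attribute field."""
    text = text.strip()
    pairs, current, pos = [], [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"cannot parse attributes at offset {pos}")
        pos = m.end()
        if m.group("semi"):
            # A semicolon closes the current tag-value group.
            if current:
                pairs.append((current[0], current[1:]))
                current = []
        elif m.group("quoted") is not None:
            if not current:
                raise ValueError("a group must start with an unquoted tag")
            current.append(unescape(m.group("quoted")))
        else:
            if not current and not TAG.match(m.group("bare")):
                raise ValueError(f"invalid tag: {m.group('bare')!r}")
            current.append(m.group("bare"))
    if current:
        pairs.append((current[0], current[1:]))
    return pairs


print(parse_attributes('Target "HBA_HUMAN" 11 55 ; E_value 0.0003'))
```

A list of pairs is used instead of a dictionary so that the order of tags, and any repeated tag, is preserved as written in the file.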

Version 1 note: in version 1 the attribute field was called the group field, with the following specification:
An optional string-valued field that can be used as a name to group together a set of records. Typical uses might be to group the introns and exons in one gene prediction (or experimentally verified gene structure), or to group multiple regions of match to another sequence, such as an EST or a protein.
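As a sketch of how the version 1 group field might be used, the following collects records that share a group name. It assumes the group name is the ninth tab-separated column; the function name group_records and the sample records (geneA, geneB) are made up for illustration.

```python
from collections import defaultdict


def group_records(lines):
    """Map each group name to the list of records (as column lists)."""
    groups = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Records without a ninth column belong to no group.
        if len(cols) >= 9 and cols[8]:
            groups[cols[8]].append(cols)
    return dict(groups)


# Hypothetical version 1 records, invented for this example.
sample = [
    "SEQ1\tgenie\texon\t10\t20\t1.0\t+\t0\tgeneA",
    "SEQ1\tgenie\tintron\t21\t30\t1.0\t+\t.\tgeneA",
    "SEQ1\tgenie\texon\t50\t60\t1.0\t+\t0\tgeneB",
    "SEQ1\tEMBL\tatg\t103\t105\t0\t+\t0",
]
for name, records in sorted(group_records(sample).items()):
    print(name, [r[2] for r in records])
```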

COMMENTS

Comments are allowed, starting with "#" as in Perl, awk etc. Everything following # until the end of the line is ignored. A comment can be used in two ways: either the # appears at the beginning of the line (after any whitespace), making the whole line a comment, or the comment follows all the required fields on the line.
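Comment handling could be sketched as below. The function names strip_comment and iter_records are hypothetical, and the sketch assumes that a '#' inside a double-quoted attribute value is part of the value and does not start a comment, a point the text above does not settle.

```python
def strip_comment(line):
    """Remove a trailing '#' comment, ignoring '#' inside double quotes."""
    in_quotes = False
    escaped = False
    for i, ch in enumerate(line):
        if escaped:
            escaped = False          # character after a backslash
        elif ch == "\\" and in_quotes:
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif ch == "#" and not in_quotes:
            return line[:i].rstrip()
    return line.rstrip()


def iter_records(lines):
    """Yield feature lines with comments and blank lines removed."""
    for line in lines:
        content = strip_comment(line)
        if content.strip():
            yield content


sample = [
    "# a whole-line comment",
    "SEQ1\tEMBL\tatg\t103\t105\t.\t+\t0 # start codon",
    "",
]
print(list(iter_records(sample)))
```

Note that this treats "##" meta-comment lines like any other comment; a parser that needs them would have to inspect each line before stripping.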

## comment lines for meta information

There is a set of standardised (i.e. parsable) ## line types that can be used optionally at the top of a gff file. The philosophy is a little like the special set of %% lines at the top of postscript files, used for example to give the BoundingBox for EPS files.
Currently proposed ## lines are:

FILE NAMING

SEMANTICS

WAYS TO USE GFF

COMPLEX EXAMPLES

Similarities to Other Sequences

CUMULATIVE SCORE ARRAYS