Galaxy data formats

When a query dataset is missing from a tool's selection box, the most common reason is that the tool lists only history items whose data format is compatible with the tool. Some formats are subsets of others, and Galaxy should also list datasets in those compatible subformats. If the query still does not show up and you believe it is in the correct format, click the pencil icon and change the format manually. This does not edit the file; it only changes the metadata for the file. In some cases you will need to change the file contents themselves. For example, if the file is space delimited and a tabular file is required, the "Convert delimiters to TAB" tool under "Text Manipulation" can be used to reformat the file.
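To make the space-delimited versus tabular distinction concrete, here is a minimal Python sketch of the kind of change a delimiter conversion makes. It is not the code behind Galaxy's "Convert delimiters to TAB" tool; the function name `delimiters_to_tab` is hypothetical, and collapsing each run of spaces into one tab is an assumption made for illustration.

```python
import re


def delimiters_to_tab(line: str) -> str:
    """Replace each run of one or more spaces with a single tab.

    Illustrative assumption: consecutive spaces count as one delimiter,
    and leading or trailing spaces are dropped.
    """
    return re.sub(r" +", "\t", line.strip(" "))


if __name__ == "__main__":
    # Two space-delimited rows, the second with uneven spacing.
    for row in ["chr1 10 100 +", "chrX  1000  10050 -"]:
        print(delimiters_to_tab(row))
```

After a conversion like this, each row has exactly one tab between fields, which is what tools expecting tabular input rely on.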

Some of the most commonly used formats are closely related. Start with the basic tabular file, which has few requirements other than one or more columns of data separated by tabs. Next is interval, which is tabular with the added requirement that three of the columns must be the chromosome, start point, and end point. A strand column and a header line labelling the columns are optional. Next are BED and GFF, which are also tabular and interval formats, but with more restrictions. BED can have between 3 and 12 columns, each precisely defined. Here the order of the columns also matters, and only columns at the end can be omitted. Some groups of columns must be either all present or all absent. GFF is similar in setup, but all 9 columns are required and their definitions differ. See the more detailed descriptions below.

Formats

Ab1
AXT
BAM
Binseq.zip
BED
BedGraph
FASTA
FastqSolexa
fped
GFF
GFF3
GTF
Html
Interval
LAV
Lped
MAF
pbed
PSL
Scf
Sff
Table
Tabular
Txtseq.zip
Wiggle custom track
Other text type

Ab1

A binary sequence file in 'ab1' format with a '.ab1' file extension. You must manually select this 'File Format' when uploading the file.

AXT

The blastz pairwise alignment format. Each alignment block in an axt file contains three lines: a summary line and two sequence lines. Blocks are separated from one another by blank lines. The summary line contains chromosomal position and size information about the alignment and consists of 9 required fields. Click here for more information about axt format.
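As an illustration of the nine-field summary line, the sketch below splits one into named values. The field order (alignment number, primary chromosome, start and end, aligning chromosome, start and end, strand, score) is taken from the UCSC description of the axt format rather than from this page, and the names `AxtSummary` and `parse_axt_summary` are hypothetical.

```python
from typing import NamedTuple


class AxtSummary(NamedTuple):
    number: int          # alignment number within the file
    primary_chrom: str
    primary_start: int
    primary_end: int
    aligned_chrom: str
    aligned_start: int
    aligned_end: int
    strand: str          # strand of the aligning sequence
    score: int           # blastz score


def parse_axt_summary(line: str) -> AxtSummary:
    """Split an axt summary line into its nine whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, found {len(fields)}")
    n, c1, s1, e1, c2, s2, e2, strand, score = fields
    return AxtSummary(int(n), c1, int(s1), int(e1),
                      c2, int(s2), int(e2), strand, int(score))


if __name__ == "__main__":
    # Illustrative summary line, not taken from a real dataset.
    print(parse_axt_summary(
        "0 chr19 3001012 3001075 chr11 70568380 70568443 - 3500"))
```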

Can be converted to:

FASTA
Convert Formats→AXT to FASTA
LAV
Convert Formats→AXT to LAV

BAM

A binary file compressed in the BGZF format with a '.bam' file extension. SAM format is the human readable text version of these files.

Can be converted to:

pileup
NGS: SAM Tools→Generate pileup
interval
First generate a pileup as above, then run NGS: SAM Tools→Pileup-to-Interval

Binseq.zip

A zipped archive consisting of binary sequence files in either 'ab1' or 'scf' format. All files in this archive must have the same file extension which is one of '.ab1' or '.scf'. You must manually select this 'File Format' when uploading the file.

BED

also tabular
also interval

This describes a genomic interval, but has strict field specifications for use in browsers. Click here for field specifications.
Example:

chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
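The last three columns of the 12-column example describe blocks (for instance exons). The sketch below turns them into absolute coordinates, assuming the UCSC column definitions, which are not spelled out on this page: column 10 is the block count, column 11 the block sizes, and column 12 the block starts relative to the start position in column 2. The function name `bed12_blocks` is hypothetical.

```python
from typing import List, Tuple


def bed12_blocks(line: str) -> List[Tuple[str, int, int]]:
    """Return (chrom, start, end) for each block of a 12-column BED line."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    chrom, chrom_start = fields[0], int(fields[1])
    block_count = int(fields[9])
    # Size and start lists are comma separated and may end with a comma.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (block_count == len(sizes) == len(starts)):
        raise ValueError("block count does not match the size/start lists")
    return [(chrom, chrom_start + s, chrom_start + s + size)
            for s, size in zip(starts, sizes)]


if __name__ == "__main__":
    line = "chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512"
    print(bed12_blocks(line))
```

For the first example line the two blocks come out as 1000-1567 and 4512-5000, so the last block ends exactly at the feature's end position.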

Can be converted to:

GFF
Convert Formats→BED-to-GFF

BedGraph

also tabular
also interval
also BED

BedGraph is a BED file in which the name column holds a float value that is displayed as a wiggle in tracks. Unlike wiggle data, this score can be retrieved as its exact value after being loaded as a track.

FASTA

A sequence in FASTA format consists of a single-line description, followed by lines of sequence data. The first character of the description line is a greater-than (">") symbol in the first column. All lines should be shorter than 80 characters:

>sequence1
atgcgtttgcgtgc
gtcggtttcgttgc
>sequence2
tttcgtgcgtatag
tggcgcggtga
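The structure above (one description line, then sequence lines that belong to it until the next ">") can be read with a few lines of Python. This is a minimal sketch that produces one (title, sequence) row per record, similar in spirit to a FASTA-to-tabular conversion; it is not the Galaxy converter, and the name `fasta_to_rows` is hypothetical.

```python
from typing import Iterable, List, Tuple


def fasta_to_rows(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (title, sequence) pairs, joining wrapped sequence lines."""
    rows: List[Tuple[str, str]] = []
    title = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if title is not None:
                rows.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is None:
            raise ValueError("sequence data found before any '>' line")
        else:
            chunks.append(line)
    if title is not None:
        rows.append((title, "".join(chunks)))
    return rows


if __name__ == "__main__":
    example = [">sequence1", "atgcgtttgcgtgc", "gtcggtttcgttgc",
               ">sequence2", "tttcgtgcgtatag", "tggcgcggtga"]
    for title, seq in fasta_to_rows(example):
        print(f"{title}\t{seq}")
```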

Can be converted to:

tabular
Convert Formats→FASTA-to-Tabular

FastqSolexa

FastqSolexa is the Illumina (Solexa) variant of the Fastq format, which stores sequences and quality scores in a single file:

@seq1  
GACAGCTTGGTTTTTAGTGAGTTGTTCCTTTCTTT  
+seq1  
hhhhhhhhhhhhhhhhhhhhhhhhhhPW@hhhhhh  
@seq2  
GCAATGACGGCAGCAATAAACTCAACAGGTGCTGG  
+seq2  
hhhhhhhhhhhhhhYhhahhhhWhAhFhSIJGChO

Or, with the quality scores written as space-separated numbers:

@seq1
GAATTGATCAGGACATAGGACAACTGTAGGCACCAT
+seq1
40 40 40 40 35 40 40 40 25 40 40 26 40 9 33 11 40 35 17 40 40 33 40 7 9 15 3 22 15 30 11 17 9 4 9 4
@seq2
GAGTTCTCGTCGCCTGTAGGCACCATCAATCGTATG
+seq2
40 15 40 17 6 36 40 40 40 25 40 9 35 33 40 14 14 18 15 17 19 28 31 4 24 18 27 14 15 18 2 8 12 8 11 9
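The two layouts above carry the same kind of information: one quality score per base, written either as one character per base or as space-separated integers. The sketch below reads either form. The ASCII offset of 64 is an assumption based on the Solexa/Illumina convention (under it, "h" decodes to 40), and the function name `parse_quality` is hypothetical.

```python
from typing import List


def parse_quality(line: str, offset: int = 64) -> List[int]:
    """Return integer quality scores from either quality-line layout.

    Assumption: a line containing spaces is the numeric layout; anything
    else is treated as one ASCII character per base. A numeric line with
    a single score (a one-base read) would be misread by this heuristic.
    """
    line = line.strip()
    if " " in line:
        return [int(token) for token in line.split()]
    return [ord(ch) - offset for ch in line]


if __name__ == "__main__":
    print(parse_quality("hhhhhhhhhhhhhhhhhhhhhhhhhhPW@hhhhhh"))
    print(parse_quality("40 40 40 40 35 40 40 40 25"))
```

Whichever layout is used, the number of scores should equal the number of bases in the read, which is a cheap sanity check before converting.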

Can be converted to:

FASTA
Convert Formats→FASTQ to FASTA

fped

Also known as the FBAT format, for use in the FBAT program. It consists of a pedigree file and a phenotype file.

GFF

also tabular
also interval

GFF lines have nine required fields that must be tab-separated.

Can be converted to:

BED
Convert Formats→GFF-to-BED
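The main pitfall when moving between GFF and BED is the coordinate system: GFF positions are 1-based with the end included, while BED starts are 0-based with the end excluded. The sketch below shows that adjustment on a single line. It is not the Galaxy GFF-to-BED tool; using the feature column as the BED name and writing 0 for a missing score are illustrative choices, and `gff_to_bed6` is a hypothetical name.

```python
def gff_to_bed6(line: str) -> str:
    """Convert one tab-separated, nine-field GFF line to a 6-column BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, found {len(fields)}")
    seqname, _source, feature, start, end, score, strand, _frame, _group = fields
    bed_start = int(start) - 1            # 1-based inclusive -> 0-based
    bed_score = "0" if score == "." else score  # assumption: 0 for no score
    return "\t".join([seqname, str(bed_start), end, feature, bed_score, strand])


if __name__ == "__main__":
    # Illustrative GFF line, not taken from a real dataset.
    gff = "chr22\texample\texon\t1001\t5000\t.\t+\t.\tcloneA"
    print(gff_to_bed6(gff))
```

Only the start changes: the GFF end value is already the correct exclusive end in BED terms.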

GFF3

also tabular
also interval

The GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats.

GTF

also tabular
also interval

GTF is a format for describing genes and other features associated with DNA, RNA and protein sequences.

Can be converted to:

BedGraph
Convert Formats→GTF-to-BEDGraph

Html

This format is an HTML web page. Click the eye icon to view the dataset in your browser.

Interval (Genomic Intervals)

also tabular

Required fields

CHROM - The name of the chromosome (e.g. chr3, chrY, chr2_random) or contig (e.g. ctgY1).
START - The starting position of the feature in the chromosome or contig. The first base in a chromosome is numbered 0.
END - The ending position of the feature in the chromosome or contig. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

Optional

STRAND - Defines the strand - either '+' or '-'.
Headers

Example:

    #CHROM START END   STRAND NAME COMMENT
    chr1   10    100   +      exon myExon
    chrX   1000  10050 -      gene myGene
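Because interval files only require that the chromosome, start, and end columns exist somewhere, a reader has to find them rather than assume fixed positions. The sketch below uses a "#" header line like the one in the example to locate columns by name and computes each feature's length as END minus START, which follows from the 0-based, end-excluded convention described above. The function name `read_intervals` and the reliance on a header line are assumptions for illustration.

```python
from typing import Dict, Iterable, List


def read_intervals(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse tab-separated interval lines using a '#' header for column names."""
    header = None
    records: List[Dict[str, object]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            header = line.lstrip("#").split("\t")
            continue
        if header is None:
            raise ValueError("this sketch needs a header line to find columns")
        row = dict(zip(header, line.split("\t")))
        start, end = int(row["START"]), int(row["END"])
        records.append({
            "chrom": row["CHROM"],
            "start": start,
            "end": end,
            "strand": row.get("STRAND"),  # optional column
            "length": end - start,        # end base is not included
        })
    return records


if __name__ == "__main__":
    example = [
        "#CHROM\tSTART\tEND\tSTRAND\tNAME\tCOMMENT",
        "chr1\t10\t100\t+\texon\tmyExon",
        "chrX\t1000\t10050\t-\tgene\tmyGene",
    ]
    for record in read_intervals(example):
        print(record)
```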

Can be converted to:

BED
The exact changes needed and tools to run can vary with what fields are in the interval file and what size BED you are converting to. In general you will likely use Text Manipulation→Compute, Cut or Merge Columns.

LAV

LAV is the primary output format for BLASTZ. The first line of a .lav file begins with #:lav.

Can be converted to:

BED
Convert Formats→LAV to BED

Lped

This is the linkage pedigree format, which uses separate map and ped files. Together these files describe SNPs: the map file has the position and an identifier for each SNP, and the pedigree file has the alleles. To upload this format into Galaxy, do not use auto-detect for the file format; select lped instead. You will then be given two sections for uploading files, one for the pedigree file and one for the map file. For more information see linkage pedigree or map or ped.

Can be converted to:

pbed
Automatic
fped
Automatic

MAF

TBA and multiz multiple alignment format. The first line of a .maf file begins with ##maf. This word is followed by white-space-separated "variable=value pairs". There should be no white space surrounding the "=". Click here for more about MAF format.
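The header rule described above (the word ##maf followed by whitespace-separated variable=value pairs with no space around "=") is simple enough to check programmatically. A minimal sketch follows; the function name `parse_maf_header` is hypothetical, and the sample header values are illustrative.

```python
from typing import Dict


def parse_maf_header(line: str) -> Dict[str, str]:
    """Return the variable=value pairs from the first line of a MAF file."""
    if not line.startswith("##maf"):
        raise ValueError("first line of a MAF file must begin with ##maf")
    pairs: Dict[str, str] = {}
    for token in line[len("##maf"):].split():
        key, sep, value = token.partition("=")
        # A token without '=' usually means white space crept in around it.
        if not sep or not key or not value:
            raise ValueError(f"malformed variable=value pair: {token!r}")
        pairs[key] = value
    return pairs


if __name__ == "__main__":
    print(parse_maf_header("##maf version=1 scoring=tba.v8"))
```

A header written with spaces around "=" is rejected, since the pieces no longer form complete pairs.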

Can be converted to:

BED
Convert Formats→Maf to BED
Interval
Convert Formats→Maf to Interval
FASTA
Convert Formats→Maf to FASTA

pbed

This is the binary version of the lped file format.

Can be converted to:

lped
Automatic

PSL

PSL is an alignment format returned by BLAT. It does not include any sequence.

Scf

A binary sequence file in 'scf' format with a '.scf' file extension. You must manually select this 'File Format' when uploading the file. Click here for more information.

Sff

A binary file in 'Standard Flowgram Format' with a '.sff' file extension.

Can be converted to:

FASTA
Convert Formats→SFF converter
FASTQ
Convert Formats→SFF converter

Table

Text delimited into columns by something other than a tab.

Tabular (tab delimited)

Any data in tab delimited format (tabular)

Can be converted to:

FASTA
Convert Formats→Tabular-to-FASTA
Tabular file must have a title and sequence column.
interval
If the tabular file has a chromosome column (or all rows are on one chromosome) and a position column, you can create an interval file. If all rows are on one chromosome, use Text Manipulation→Add column to add the chromosome. If the given position is 1-based, use Text Manipulation→Compute to subtract 1 from the position column to get the start; the original position then serves as the end. Otherwise, add 1 to the position to get the end.
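The position arithmetic for building an interval from a single-base position can be sketched as follows. This only illustrates the plus-or-minus-1 rule described above; it is not a Galaxy tool, and the name `position_to_interval` is hypothetical.

```python
from typing import Tuple


def position_to_interval(chrom: str, position: int,
                         one_based: bool = True) -> Tuple[str, int, int]:
    """Turn a single-base position into a (chrom, start, end) interval.

    Interval starts are 0-based and the end base is not included, so a
    1-based position p becomes (p - 1, p) and a 0-based position p
    becomes (p, p + 1).
    """
    if one_based:
        return (chrom, position - 1, position)
    return (chrom, position, position + 1)


if __name__ == "__main__":
    print(position_to_interval("chr1", 100))                   # 1-based input
    print(position_to_interval("chr1", 100, one_based=False))  # 0-based input
```

Either way the resulting interval spans exactly one base, so END minus START is always 1.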

Txtseq.zip

A zipped archive consisting of flat text sequence files. All files in this archive must have the same file extension of '.txt'. You must manually select this 'File Format' when uploading the file.

Wiggle custom track

The wiggle format is line-oriented. Wiggle data is preceded by a track definition line, which gives the type of wiggle. There are 3 different types, each with their uses. More information here.

Can be converted to:

interval
Convert Formats→Wiggle-to-Interval
As a second step, the result can be converted to BED 3 or BED 4 by removing columns with Text Manipulation→Cut columns from a table

Other text type

Any text file

Can be converted to:

tabular
If the fields are separated by spaces or some other delimiter, the file can be converted to tabular with Text Manipulation→Convert delimiters to TAB