Galaxy data formats

When a query is missing from a tool's selection box, the most common reason is that the tool lists only history items whose data format is compatible with it. Some formats are subsets of others, and Galaxy should also list items in those compatible subformats. If the query still does not show up and you believe it is in the correct format, you can click the pencil icon and change the format manually. This does not edit the file; it only changes the file's metadata. In some cases you will need to actually change the file format. For example, if the file is space delimited and a tabular file is required, the "Convert delimiters to TAB" tool under "Text Manipulation" can be used to reformat the file.

Some of the most commonly used formats are closely related. Start with the basic tabular file, which has few requirements beyond one or more columns of data separated by tabs. Next is interval, which is tabular with the added requirement that three of the columns must be the chromosome, start position, and end position. A strand column and a header line labelling the columns are optional. Next are BED and GFF, which are also tabular interval formats but with more restrictions. BED can have between 3 and 12 columns, each precisely defined. Here the order of the columns also matters, and only trailing columns can be omitted. Some groups of columns must be either all present or all omitted. GFF is similar in layout, but all 9 columns are required and their definitions differ. See the more detailed descriptions below.


Formats



Ab1

A binary sequence file in 'ab1' format with a '.ab1' file extension. You must manually select this 'File Format' when uploading the file.


AXT

blastz pairwise alignment format. Each alignment block in an axt file contains three lines: a summary line and 2 sequence lines. Blocks are separated from one another by blank lines. The summary line contains chromosomal position and size information about the alignment. It consists of 9 required fields. Click here for more information about axt format.
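
As a rough illustration of the block layout described above, the following Python sketch splits axt text into blocks and reads the 9 summary fields. The field names (`aln_number`, `t_chrom`, and so on) are labels chosen for this example, not names used by Galaxy, and the sample record is made up.

```python
import re
from typing import List, NamedTuple


class AxtBlock(NamedTuple):
    # Labels below are illustrative names for the 9 summary fields.
    aln_number: int
    t_chrom: str
    t_start: int
    t_end: int
    q_chrom: str
    q_start: int
    q_end: int
    q_strand: str
    score: int
    t_seq: str
    q_seq: str


def parse_axt(text: str) -> List[AxtBlock]:
    """Parse axt text: blocks of 3 lines separated by blank lines."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", text.strip()):
        # Skip empty lines and '#' comment lines inside a chunk.
        lines = [ln.strip() for ln in chunk.splitlines()
                 if ln.strip() and not ln.startswith("#")]
        if not lines:
            continue
        if len(lines) != 3:
            raise ValueError("expected a summary line and 2 sequence lines")
        f = lines[0].split()
        if len(f) != 9:
            raise ValueError("summary line must have 9 fields")
        blocks.append(AxtBlock(int(f[0]), f[1], int(f[2]), int(f[3]),
                               f[4], int(f[5]), int(f[6]), f[7], int(f[8]),
                               lines[1], lines[2]))
    return blocks


# Hypothetical record, for illustration only.
SAMPLE_AXT = "0 chr1 101 108 chr2 201 208 + 500\nACGTACGT\nACGAACGT\n"
axt_blocks = parse_axt(SAMPLE_AXT)
```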

Can be converted to:

Bam

A binary file compressed in the BGZF format with a '.bam' file extension. SAM format is the human readable text version of these files.

Can be converted to:

Binseq.zip

A zipped archive consisting of binary sequence files in either 'ab1' or 'scf' format. All files in this archive must have the same file extension which is one of '.ab1' or '.scf'. You must manually select this 'File Format' when uploading the file.


BED

This describes a genomic interval, but has strict field specifications for use in browsers.
Click here for field specifications.
Example:
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
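
The column layout can be sketched in Python as follows. This is a minimal example, not Galaxy's own parser: the column names follow the commonly published BED field names, and the code splits on any whitespace because the example above is shown with spaces (real BED files are normally tab separated, so names containing spaces would need `split("\t")` instead). The score is left as text.

```python
BED_COLUMNS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb",
    "blockCount", "blockSizes", "blockStarts",
]


def parse_bed_line(line: str) -> dict:
    """Map the 3 to 12 columns of one BED line onto their field names."""
    fields = line.split()
    if not 3 <= len(fields) <= 12:
        raise ValueError("a BED line has between 3 and 12 columns")
    # zip stops at the shorter list, so trailing columns may be absent.
    record = dict(zip(BED_COLUMNS, fields))
    for key in ("chromStart", "chromEnd", "thickStart", "thickEnd", "blockCount"):
        if key in record:
            record[key] = int(record[key])
    for key in ("blockSizes", "blockStarts"):
        if key in record:
            # Comma-separated lists may end with a trailing comma.
            record[key] = [int(x) for x in record[key].split(",") if x]
    return record


bed_record = parse_bed_line(
    "chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512"
)
```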
Can be converted to:

BedGraph

BedGraph is a BED file in which the name column holds a float value that is displayed as a wiggle score in tracks. Unlike with wiggle files, the exact value of this score can be retrieved after the file is loaded as a track.


Fasta

A sequence in FASTA format consists of a single-line description, followed by lines of sequence data. The first character of the description line is a greater-than (">") symbol in the first column. All lines should be shorter than 80 characters:

>sequence1
atgcgtttgcgtgc
gtcggtttcgttgc
>sequence2
tttcgtgcgtatag
tggcgcggtga
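
A minimal Python sketch of reading this layout is shown below. The function name `parse_fasta` is chosen for this example; it joins the sequence lines that follow each description line and does not enforce the 80-character guideline.

```python
def parse_fasta(text: str) -> dict:
    """Return {description: sequence} for FASTA-formatted text."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Description line: everything after '>' is the record name.
            name = line[1:]
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data found before a '>' description line")
        else:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


fasta_records = parse_fasta(
    ">sequence1\natgcgtttgcgtgc\ngtcggtttcgttgc\n"
    ">sequence2\ntttcgtgcgtatag\ntggcgcggtga\n"
)
```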
Can be converted to:

FastqSolexa

FastqSolexa is the Illumina (Solexa) variant of the Fastq format, which stores sequences and quality scores in a single file:

@seq1
GACAGCTTGGTTTTTAGTGAGTTGTTCCTTTCTTT
+seq1
hhhhhhhhhhhhhhhhhhhhhhhhhhPW@hhhhhh
@seq2
GCAATGACGGCAGCAATAAACTCAACAGGTGCTGG
+seq2
hhhhhhhhhhhhhhYhhahhhhWhAhFhSIJGChO

Or, with the quality scores written as space-separated numbers:

@seq1
GAATTGATCAGGACATAGGACAACTGTAGGCACCAT
+seq1
40 40 40 40 35 40 40 40 25 40 40 26 40 9 33 11 40 35 17 40 40 33 40 7 9 15 3 22 15 30 11 17 9 4 9 4
@seq2
GAGTTCTCGTCGCCTGTAGGCACCATCAATCGTATG
+seq2
40 15 40 17 6 36 40 40 40 25 40 9 35 33 40 14 14 18 15 17 19 28 31 4 24 18 27 14 15 18 2 8 12 8 11 9
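
The two quality encodings shown above can be read with a small Python sketch like the one below. It assumes the character form uses an ASCII offset of 64, which is the usual convention for Solexa/Illumina-era files but is not stated on this page; pass a different `offset` if your data differs. A one-character numeric quality line would be ambiguous and is treated as the character form here.

```python
from typing import List


def decode_quality(qual_line: str, offset: int = 64) -> List[int]:
    """Decode one quality line into a list of integer scores."""
    tokens = qual_line.split()
    if len(tokens) > 1:
        # Space-separated form: each token is already a number.
        return [int(t) for t in tokens]
    # Character form: one character per base, score = ord(char) - offset.
    return [ord(c) - offset for c in qual_line.strip()]


ascii_scores = decode_quality("hhPW@")
numeric_scores = decode_quality("40 40 35 9")
```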
Can be converted to:

fped

Also known as the FBAT format, for use in the FBAT program. It consists of a pedigree file and a phenotype file.


Gff

GFF lines have nine required fields that must be tab-separated.
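
The nine-field requirement can be checked with a short Python sketch. The field names used here (`seqname` through `group`) follow the commonly published GFF description and are an assumption of this example, as is the sample line.

```python
GFF_FIELDS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "group",
]


def parse_gff_line(line: str) -> dict:
    """Split one GFF line on tabs and require exactly nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    record = dict(zip(GFF_FIELDS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


# Illustrative line, not taken from a real dataset.
gff_record = parse_gff_line(
    "chr22\tTeleGene\tenhancer\t1000000\t1001000\t500\t+\t.\ttouch1"
)
```

A line separated by spaces instead of tabs fails the nine-field check, which is the kind of mismatch that keeps a dataset from being recognized as GFF.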
Can be converted to:

GFF3

The GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats.


GTF

GTF is a format for describing genes and other features associated with DNA, RNA and protein sequences.
Can be converted to:

Html

This format is an HTML web page. Click the eye icon to view the dataset in your browser.


Interval (Genomic Intervals)

Required fields: chromosome, start, and end. Optional: strand and a header line labelling the columns. Example:
    #CHROM START END   STRAND NAME COMMENT
    chr1   10    100   +      exon myExon
    chrX   1000  10050 -      gene myGene
Can be converted to:
  • BED
    The exact changes needed and tools to run can vary with what fields are in the interval file and what size BED you are converting to. In general you will likely use Text Manipulation→Compute, Cut or Merge Columns.
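
What such a column rearrangement amounts to can be sketched in Python. This is only an illustration of the reordering, not what the Galaxy tools do internally; the function name, the column-index parameters, and the placeholder values (`0` for score, `.` for a missing name or strand) are assumptions of this example.

```python
from typing import Optional


def interval_to_bed6(line: str, chrom_col: int = 0, start_col: int = 1,
                     end_col: int = 2, strand_col: Optional[int] = None,
                     name_col: Optional[int] = None) -> str:
    """Reorder one tab-separated interval line into 6-column BED order."""
    fields = line.rstrip("\n").split("\t")
    name = fields[name_col] if name_col is not None else "."
    strand = fields[strand_col] if strand_col is not None else "."
    # BED order: chrom, start, end, name, score, strand.
    return "\t".join([fields[chrom_col], fields[start_col], fields[end_col],
                      name, "0", strand])


bed_line = interval_to_bed6("chr1\t10\t100\t+\texon\tmyExon",
                            strand_col=3, name_col=4)
```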

LAV

LAV is the primary output format for BLASTZ. The first line of a .lav file begins with #:lav.

Can be converted to:

Lped

This is the linkage pedigree format, which consists of separate map and ped files. Together these files describe SNPs: the map file has the position and an identifier for each SNP, and the pedigree file has the alleles. To upload this format into Galaxy, do not use auto-detect for the file format; instead select lped. You will then be given two sections for uploading files, one for the pedigree file and one for the map file. For more information see linkage pedigree or map or ped.

Can be converted to:

MAF

TBA and multiz multiple alignment format. The first line of a .maf file begins with ##maf. This word is followed by white-space-separated "variable=value pairs". There should be no white space surrounding the "=". Click here for more about MAF format.
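
The header rule above can be sketched in Python as follows. The function name is chosen for this example, and the sample header line is illustrative.

```python
def parse_maf_header(line: str) -> dict:
    """Parse a '##maf' header line into its variable=value pairs."""
    if not line.startswith("##maf"):
        raise ValueError("first line of a .maf file must begin with ##maf")
    pairs = {}
    # Pairs are separated by white space, with none around the '='.
    for token in line[len("##maf"):].split():
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError("malformed variable=value pair: %r" % token)
        pairs[key] = value
    return pairs


maf_header = parse_maf_header("##maf version=1 scoring=tba.v8")
```

A header written as `version = 1` splits into three tokens and is rejected, which mirrors the rule that no white space may surround the "=".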

Can be converted to:

pbed

This is the binary version of the lped file format.

Can be converted to:
  • lped
    Automatic

PSL

PSL format is for alignments; it is returned by BLAT. It does not include any sequence.


Scf

A binary sequence file in 'scf' format with a '.scf' file extension. You must manually select this 'File Format' when uploading the file. Click here for more information.


Sff

A binary file in 'Standard Flowgram Format' with a '.sff' file extension.

Can be converted to:
  • FASTA
    Convert Formats→SFF converter
  • FASTQ
    Convert Formats→SFF converter

Table

Text delimited into columns by something other than a tab.


Tabular (tab delimited)

Any data in tab delimited format (tabular)

Can be converted to:
  • FASTA
    Convert Formats→Tabular-to-FASTA
    Tabular file must have a title and sequence column.
  • interval
    If the tabular file has a position and either a chromosome column or data that is all on one chromosome, you can create an interval file. If the data is all on one chromosome, use Text Manipulation→Add column to add the chromosome. If the given position is 1-based, use Text Manipulation→Compute to subtract 1 from the position column to get the start; the original position then serves as the end. Otherwise, add 1 to the position to get the end.
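
The arithmetic behind this conversion can be sketched in Python. The sketch assumes interval coordinates have a 0-based start and an end that is not included; the function names and parameters are invented for this example, and only the chromosome, start, and end columns are written out.

```python
from typing import Optional, Tuple


def position_to_interval(position: int, one_based: bool = True) -> Tuple[int, int]:
    """Turn a single position into a (start, end) pair of length 1."""
    if one_based:
        # 1-based position: start is position minus 1, end is the position.
        return position - 1, position
    # 0-based position: start is the position, end is position plus 1.
    return position, position + 1


def row_to_interval(row: str, pos_col: int, one_based: bool = True,
                    chrom: Optional[str] = None,
                    chrom_col: Optional[int] = None) -> str:
    """Build a chrom/start/end line from one tab-separated row."""
    fields = row.rstrip("\n").split("\t")
    if chrom is None:
        if chrom_col is None:
            raise ValueError("give either a fixed chrom or a chrom_col")
        chrom = fields[chrom_col]
    start, end = position_to_interval(int(fields[pos_col]), one_based)
    return "\t".join([chrom, str(start), str(end)])


interval_line = row_to_interval("rs1\t100", pos_col=1, chrom="chr1")
```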

Txtseq.zip

A zipped archive consisting of flat text sequence files. All files in this archive must have the same file extension of '.txt'. You must manually select this 'File Format' when uploading the file.


Wiggle custom track

The wiggle format is line-oriented. Wiggle data is preceded by a track definition line, which gives the type of wiggle. There are 3 different types, each with their uses. More information here.

Can be converted to:

Other text type

Any text file

Can be converted to:
  • tabular
    If the fields are separated by spaces or some other delimiter, the file can be converted to tabular with Text Manipulation→Convert delimiters to TAB.
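
Outside Galaxy, the same kind of delimiter conversion can be approximated with a few lines of Python. This is a sketch of the idea only, not the Galaxy tool's implementation; the function name and the default pattern (one or more spaces) are assumptions of this example.

```python
import re


def to_tabular(text: str, delimiter_pattern: str = r" +") -> str:
    """Replace each run of the delimiter with a single tab, line by line."""
    lines = []
    for line in text.splitlines():
        # Strip leading/trailing delimiters so no empty columns appear.
        lines.append(re.sub(delimiter_pattern, "\t", line.strip()))
    return "\n".join(lines)


tabular_text = to_tabular("chr1 10  100\nchrX 1000 10050")
```

Passing another pattern, for example `r","`, handles comma-separated input in the same way.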