Phenotype Association Tools in Galaxy

Example 1: Using Galaxy to look for disease SNPs in a full genome

For this example we will use an artificial dataset: the SNP calls from the Complete Genomics genome GS12880, with a few known disease variants added. This provides a realistic background against which to search for the disease SNPs, although the planted variants are not a realistic collection for a single individual to carry. We chose an assortment of six SNPs from the PhenCode database, representing different genes and different parts of a gene. Two are coding SNPs (both heterozygous) and four are non-coding (one heterozygous and three homozygous). The four non-coding SNPs are located in a promoter region, a UTR, and two introns.

Disease SNPs planted in the sample dataset (columns: chromosome, start, end, observed allele(s), region, source database:phenotype)

chr11	5248153	5248154	G	intron	HbVar:thalassemia
chr11	5255743	5255744	C	UTR	HbVar:thalassemia
chr11	5275879	5275880	C/T	coding	HbVar:Hb E
chr16	222915	222916	T/C	coding	HbVar:Hb Lyon-Bron
chrX	31279779	31279780	T/C	intron	LMDp:muscular dystrophy
chrX	100641248	100641249	G	promoter	BTKbase:Agammaglobulinemia
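As a quick sanity check outside Galaxy, the table can be parsed and tallied with a few lines of Python. This is only a sketch: it assumes the columns are tab-separated, and it assumes that an entry with two alleles separated by a slash (such as C/T) is heterozygous while a single allele is homozygous, which matches the counts given in the description above. The names `PLANTED_SNPS` and `parse_snps` are made up for illustration and are not part of Galaxy.

```python
from collections import Counter

# The six planted SNPs, copied from the table above.
# Columns: chromosome, start, end, allele(s), region, source:phenotype.
PLANTED_SNPS = (
    "chr11\t5248153\t5248154\tG\tintron\tHbVar:thalassemia\n"
    "chr11\t5255743\t5255744\tC\tUTR\tHbVar:thalassemia\n"
    "chr11\t5275879\t5275880\tC/T\tcoding\tHbVar:Hb E\n"
    "chr16\t222915\t222916\tT/C\tcoding\tHbVar:Hb Lyon-Bron\n"
    "chrX\t31279779\t31279780\tT/C\tintron\tLMDp:muscular dystrophy\n"
    "chrX\t100641248\t100641249\tG\tpromoter\tBTKbase:Agammaglobulinemia\n"
)


def parse_snps(text):
    """Turn the tab-separated table into a list of dictionaries."""
    records = []
    for line in text.strip().splitlines():
        chrom, start, end, alleles, region, source = line.split("\t")
        records.append({
            "chrom": chrom,
            "start": int(start),
            "end": int(end),
            "alleles": alleles,
            "region": region,
            "source": source,
            # Assumption: "A/B" means heterozygous, a single allele homozygous.
            "zygosity": "heterozygous" if "/" in alleles else "homozygous",
            "kind": "coding" if region == "coding" else "non-coding",
        })
    return records


snps = parse_snps(PLANTED_SNPS)

# Tally SNPs by (coding or non-coding, zygosity).
counts = Counter((s["kind"], s["zygosity"]) for s in snps)
for (kind, zygosity), n in sorted(counts.items()):
    print(kind, zygosity, n)
```

Each interval in the table is exactly one base long (end = start + 1), so the same parsing would apply to any single-base SNP list in this layout.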

This example builds a single sequential history, organized into several parts according to the type of analysis being performed. The parts illustrate the following skills:

Part 1: Preparing input data.

Uploading files
Using Galaxy libraries
Basic filtering

Part 2: Selecting known coding SNPs predicted to be damaging, then finding their genes and associated pathways.

PolyPhen-2
Gene-based analysis

Part 3: Running new predictions of coding SNPs likely to be detrimental.

SIFT
Using published workflows

Part 4: Finding SNPs that fall in suspected functional regions.

Predicted regulatory regions
ENCODE functional data
PhyloP conserved positions
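The idea behind Part 4, finding SNPs that fall inside a set of regions, comes down to an interval overlap test. The sketch below illustrates that test in plain Python; it is not how Galaxy implements its interval tools. It assumes half-open coordinates (start included, end excluded), and the region coordinates and labels are invented purely for illustration. The function names `overlaps` and `snps_in_regions` are likewise hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def snps_in_regions(snps, regions):
    """Return (chrom, start, end, label) for every SNP that falls in a region."""
    hits = []
    for chrom, start, end in snps:
        for r_chrom, r_start, r_end, label in regions:
            if chrom == r_chrom and overlaps(start, end, r_start, r_end):
                hits.append((chrom, start, end, label))
    return hits


# Three of the planted SNPs from the table earlier in this example.
snps = [
    ("chr11", 5248153, 5248154),
    ("chr16", 222915, 222916),
    ("chrX", 100641248, 100641249),
]

# Hypothetical regions: coordinates and labels are made up for this sketch.
regions = [
    ("chrX", 100641000, 100642000, "hypothetical_promoter"),
    ("chr16", 300000, 301000, "hypothetical_enhancer"),
]

hits = snps_in_regions(snps, regions)
for hit in hits:
    print(*hit, sep="\t")
```

The nested loop is fine for a handful of SNPs; for a full genome, sorted intervals or an interval tree would be the usual choice.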