Example 1:  Using Galaxy to look for disease SNPs in a full genome

For this example we use an artificial dataset: the SNP calls from the Complete Genomics genome GS12880, with a few known disease variants added. This provides a realistic background in which to search for the disease SNPs, although the planted variants are not necessarily a realistic collection for a single individual to carry. We chose an assortment of six SNPs from the PhenCode database, representing different genes and different parts of a gene. Two are coding SNPs (both heterozygous) and four are non-coding (one heterozygous and three homozygous). The four non-coding SNPs are located in a promoter region, a UTR, and two introns.

Disease SNPs planted in the sample dataset

Chrom  Start      End        Alleles  Location  Source:phenotype
chr11  5248153    5248154    G        intron    HbVar:thalassemia
chr11  5255743    5255744    C        UTR       HbVar:thalassemia
chr11  5275879    5275880    C/T      coding    HbVar:Hb E
chr16  222915     222916     T/C      coding    HbVar:Hb Lyon-Bron
chrX   31279779   31279780   T/C      intron    LMDp:muscular dystrophy
chrX   100641248  100641249  G        promoter  BTKbase:Agammaglobulinemia
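
As a quick sanity check outside Galaxy, the planted SNPs can be parsed and tallied with a few lines of Python. This is only a minimal sketch under stated assumptions: the field names (chrom, start, end, alleles, region, source) are our own labels for the six columns, and we assume that a genotype written as two alleles separated by "/" is heterozygous while a single allele is homozygous. The helper parse_snp is hypothetical and is not part of Galaxy.

```python
from collections import Counter

# The six planted SNPs, copied from the table above.
PLANTED_SNPS = """\
chr11 5248153 5248154 G intron HbVar:thalassemia
chr11 5255743 5255744 C UTR HbVar:thalassemia
chr11 5275879 5275880 C/T coding HbVar:Hb E
chr16 222915 222916 T/C coding HbVar:Hb Lyon-Bron
chrX 31279779 31279780 T/C intron LMDp:muscular dystrophy
chrX 100641248 100641249 G promoter BTKbase:Agammaglobulinemia
"""


def parse_snp(line):
    """Parse one whitespace-delimited row into a dict (hypothetical helper)."""
    # Split at most 5 times so the last field keeps its internal spaces
    # (for example "HbVar:Hb E").
    chrom, start, end, alleles, region, source = line.strip().split(None, 5)
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "alleles": alleles,
        "region": region,
        "source": source,
        # Assumption: "X/Y" denotes a heterozygous call, a single allele
        # denotes a homozygous call.
        "zygosity": "heterozygous" if "/" in alleles else "homozygous",
    }


snps = [parse_snp(line) for line in PLANTED_SNPS.splitlines() if line.strip()]

# Tally by (coding or non-coding, zygosity) to compare with the description.
counts = Counter(
    ("coding" if s["region"] == "coding" else "non-coding", s["zygosity"])
    for s in snps
)

for (kind, zygosity), n in sorted(counts.items()):
    print(f"{kind:<11} {zygosity:<13} {n}")
```

The tallies should agree with the description of the dataset: two heterozygous coding SNPs, and four non-coding SNPs of which one is heterozygous and three are homozygous. Each row spans exactly one base (end minus start equals 1).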

This example builds a single sequential history, organized into several parts according to the type of analysis being performed. The parts illustrate the following skills:

Part 1:  Preparing input data.

Part 2:  Selecting known coding SNPs predicted to be damaging, then finding their genes and associated pathways.

Part 3:  Running new predictions of coding SNPs likely to be detrimental.

Part 4:  Finding SNPs that fall in suspected functional regions.