Example 1: Using Galaxy to look for disease SNPs in a full genome
For this example we will use an artificial dataset consisting of the
SNP calls from the Complete Genomics genome GS12880, with a few known
disease variants added. This provides a realistic background against
which to search for the disease SNPs, though not necessarily a realistic
collection of disease SNPs for a single individual to carry. We chose an
assortment of six SNPs from the PhenCode database, representing
different genes and different parts of a gene. Two are coding SNPs
(both heterozygous) and four are non-coding (one heterozygous and three
homozygous). The four non-coding SNPs are located in a promoter region,
a UTR, and two introns.
Disease SNPs planted in the sample dataset
| Chromosome | Start | End | Allele(s) | Region | Source: phenotype |
|---|---|---|---|---|---|
| chr11 | 5248153 | 5248154 | G | intron | HbVar:thalassemia |
| chr11 | 5255743 | 5255744 | C | UTR | HbVar:thalassemia |
| chr11 | 5275879 | 5275880 | C/T | coding | HbVar:Hb E |
| chr16 | 222915 | 222916 | T/C | coding | HbVar:Hb Lyon-Bron |
| chrX | 31279779 | 31279780 | T/C | intron | LMDp:muscular dystrophy |
| chrX | 100641248 | 100641249 | G | promoter | BTKbase:Agammaglobulinemia |
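As a quick cross-check of the planted SNPs listed above, the following Python sketch tabulates them by coding status and zygosity. It is not part of the Galaxy history. The `PlantedSnp` type and the helper names are made up for illustration, and the sketch assumes that a single listed allele means a homozygous call while two alleles separated by `/` mean a heterozygous call. That reading is an assumption, although it does reproduce the counts given in the text.

```python
from collections import Counter
from typing import NamedTuple


class PlantedSnp(NamedTuple):
    """One row of the planted-SNP table (illustrative type, not a Galaxy format)."""
    chrom: str
    start: int
    end: int
    alleles: str   # "G" for one allele, "C/T" for two
    region: str    # intron, UTR, coding, or promoter
    source: str    # source database and phenotype


PLANTED_SNPS = [
    PlantedSnp("chr11", 5248153, 5248154, "G", "intron", "HbVar:thalassemia"),
    PlantedSnp("chr11", 5255743, 5255744, "C", "UTR", "HbVar:thalassemia"),
    PlantedSnp("chr11", 5275879, 5275880, "C/T", "coding", "HbVar:Hb E"),
    PlantedSnp("chr16", 222915, 222916, "T/C", "coding", "HbVar:Hb Lyon-Bron"),
    PlantedSnp("chrX", 31279779, 31279780, "T/C", "intron", "LMDp:muscular dystrophy"),
    PlantedSnp("chrX", 100641248, 100641249, "G", "promoter", "BTKbase:Agammaglobulinemia"),
]


def zygosity(snp: PlantedSnp) -> str:
    # Assumption: two alleles separated by "/" indicate a heterozygous call.
    return "heterozygous" if "/" in snp.alleles else "homozygous"


def coding_status(snp: PlantedSnp) -> str:
    return "coding" if snp.region == "coding" else "non-coding"


# Count SNPs for each (coding status, zygosity) combination.
summary = Counter((coding_status(s), zygosity(s)) for s in PLANTED_SNPS)

for (status, zyg), count in sorted(summary.items()):
    print(f"{status:10s} {zyg:12s} {count}")
```

Under that assumption the tally matches the description: two heterozygous coding SNPs, one heterozygous non-coding SNP, and three homozygous non-coding SNPs.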
This example builds a single sequential history, but it is organized
into several parts according to the type of analysis being performed.
The parts illustrate the following skills.
Part 1:
Preparing input data.
- Uploading files
- Using Galaxy libraries
- Basic filtering
Part 2:
Selecting known coding SNPs predicted to be damaging, then finding their
genes and associated pathways.
- PolyPhen-2
- Gene-based analysis
Part 3:
Running new predictions of coding SNPs likely to be detrimental.
- SIFT
- Using published workflows
Part 4:
Finding SNPs that fall in suspected functional regions.
- Predicted regulatory regions
- ENCODE functional data
- PhyloP conserved positions
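
The idea behind Part 4, finding SNPs that fall inside functional regions, comes down to an interval-overlap test. Galaxy does this through its own tools, so the Python sketch below only illustrates the underlying logic. The two SNP positions are taken from the planted-SNP table. The region coordinates and labels are hypothetical placeholders, not real annotations. The `Interval` type and `overlaps` function are illustrative names, and the intervals are assumed to be half-open (start included, end excluded).

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int   # assumed inclusive
    end: int     # assumed exclusive
    label: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


# Two of the planted SNPs from the table.
snps = [
    Interval("chrX", 100641248, 100641249, "planted promoter SNP"),
    Interval("chr16", 222915, 222916, "planted coding SNP"),
]

# HYPOTHETICAL regions, invented purely to exercise the overlap test.
regions = [
    Interval("chrX", 100641000, 100642000, "hypothetical regulatory region"),
    Interval("chr16", 300000, 301000, "hypothetical conserved block"),
]

# Pair every SNP with every region it falls inside.
hits = [(s.label, r.label) for s in snps for r in regions if overlaps(s, r)]

for snp_label, region_label in hits:
    print(f"{snp_label} falls in {region_label}")
```

With these made-up regions only the chrX SNP is reported. A nested loop is adequate for a handful of intervals. For genome-scale data, sorted intervals or an interval tree would be the usual choice.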