Computing the sequence coverage distributions

Start with a new history on the main public Galaxy server, and give it a name that is meaningful to you, perhaps "Example 4". For this example we will use the dataset called "human SNPs, like aye-aye" from the Genome Diversity shared library. It contains low-coverage SNP calls for multiple individuals, in gd_snp format. Import this dataset into your history, and observe that it includes approximately nine million SNPs (blue arrow).

For our analysis we want to use only the more reliable SNPs. Although our dataset includes quality scores, they are not a dependable indicator of reliability for low-coverage data. Instead we will use the coverage of each SNP as an indicator, since SNPs with higher coverage are more likely to be correct.

To get a detailed view of what the coverage is like, and to help us choose parameters for later queries, we will compute the coverage distributions. In the Genome Diversity section of the tool panel, click on the Coverage Distributions tool. The dataset we just imported should already be selected in the center form, and we want the default setting of computing the distributions for all of the individuals, so simply click the Execute button.
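
To make the idea of a coverage distribution concrete, here is a minimal Python sketch of the kind of tally involved. It is not the Coverage Distributions tool's actual implementation. The column layout is an assumption made for illustration (four leading SNP columns, then four columns per individual, the first two of which are read counts for the two alleles), and the individual names and data rows are made up.

```python
from collections import Counter

# Hypothetical column layout (an assumption, not taken from the tutorial):
# four leading SNP columns, then four columns per individual, of which the
# first two are read counts for the two alleles.
LEADING_COLUMNS = 4
COLUMNS_PER_INDIVIDUAL = 4


def coverage_distributions(rows, individuals):
    """Return {individual: Counter({coverage: number_of_snps})}."""
    distributions = {name: Counter() for name in individuals}
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = row.rstrip("\n").split("\t")
        for index, name in enumerate(individuals):
            start = LEADING_COLUMNS + index * COLUMNS_PER_INDIVIDUAL
            # Coverage of this SNP in this individual = reads for both alleles.
            coverage = int(fields[start]) + int(fields[start + 1])
            distributions[name][coverage] += 1
    return distributions


def cumulative_percent(distribution):
    """Percent of SNPs with coverage <= each observed value, ascending."""
    total = sum(distribution.values())
    running = 0
    result = []
    for coverage in sorted(distribution):
        running += distribution[coverage]
        result.append((coverage, 100.0 * running / total))
    return result


# Made-up example rows for two hypothetical individuals, "ind1" and "ind2".
example_rows = [
    "chr1\t100\tA\tG\t2\t1\t1\t30\t0\t4\t0\t40\n",
    "chr1\t250\tC\tT\t3\t0\t2\t25\t1\t1\t1\t20\n",
    "chr2\t75\tG\tA\t1\t2\t1\t35\t5\t0\t2\t50\n",
]
dists = coverage_distributions(example_rows, ["ind1", "ind2"])
for name, dist in dists.items():
    print(name, cumulative_percent(dist))
```

A cumulative view like this is one way to judge how many SNPs would survive a given minimum-coverage cutoff, which is the kind of parameter the later queries need.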

Note: the red arrows in the screen shots show the selections to be made, the blue arrows point out items of interest, and the green arrows are the actions that take you to the next step.

[screen shot]