Genome Diversity SNP (gd_snp) Format

The gd_snp file format was designed for use with the tools in the Genome Diversity section of Galaxy's tool panel. It is a subtype of Galaxy's Tabular format: plain text, with columns of data separated by tab characters. The format specifies the columns listed below, which may be followed by additional, arbitrary columns. The file begins with header lines containing metadata needed by programs that use this format, such as the column names, labels for the individuals and/or groups whose genotype data is in the table, and the starting column number (1-based) for each one's data.
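
The following is a minimal sketch of how a program might separate the header metadata from the data rows. It assumes, based on the examples further down, that every header line starts with "#" and that the header lines, with the leading "#" removed, join into a single JSON object. The function name read_gd_snp and the sample lines are illustrative only and are not part of the format or of Galaxy.

```python
import json

def read_gd_snp(lines):
    """Split gd_snp lines into (metadata dict, list of data rows).

    Assumption: header lines start with '#' and, once the '#' is
    removed, concatenate into one JSON object.
    """
    header_parts = []
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Header line: drop the leading '#' and keep the JSON fragment.
            header_parts.append(line[1:])
        elif line:
            # Data line: columns are separated by tab characters.
            rows.append(line.split("\t"))
    metadata = json.loads("".join(header_parts)) if header_parts else {}
    return metadata, rows

# Hypothetical, shortened sample with a single group.
sample = [
    '#{"column_names":["chr","pos","A","B","Q","1A","1B","1G","1Q"],\n',
    '#"individuals":[["CEU",6]],\n',
    '#"dbkey":"hg19","pos":2,"rPos":2,"ref":1,"scaffold":1,"species":"hg19"}\n',
    "chr1\t10582\tG\tA\t-1\t133\t37\t1\t0\n",
]
metadata, rows = read_gd_snp(sample)
```

Here metadata["individuals"] holds each label with its 1-based starting column, and rows holds the data fields as strings, leaving any numeric conversion to the caller.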

===> The initial columns should always be:

  1. chr(scaf) = chromosome or scaffold
  2. pos = position (0-based)
  3. A = reference allele
  4. B = alternate allele
  5. Q = summary genotype quality score over all individuals or groups, or -1 if not available

If there is no reference sequence for the species, another set of columns follows, describing the position in the reference of a similar species where this SNP aligned. If the species has its own reference, these columns can be omitted.

  1. ref = chromosome of reference species
  2. rPos = position (0-based) in reference species
  3. rnuc = reference allele

The following columns are in sets of four, one set per individual or group of individuals.

  1. 1A = for an individual, the read count supporting the reference allele; for a group, the number of chromosomes with the reference allele
  2. 1B = for an individual, the read count supporting the alternate allele; for a group, the number of chromosomes with the alternate allele
  3. 1G = genotype call for this individual or group, given as the number of reference alleles (0, 1, or 2)
  4. 1Q = quality score for the genotype call

The four columns are repeated as many times as needed, with the number in the column name incremented for each individual or group. A value of -1 indicates missing or unknown data.
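
The naming scheme and the meaning of the G column can be sketched as follows. Both helper names are hypothetical. The "A/B" style string returned by describe_genotype is only an illustrative rendering, and it assumes a diploid call, as the values 0, 1, and 2 suggest.

```python
def genotype_column_names(n):
    """Column names for n individuals or groups: 1A, 1B, 1G, 1Q, 2A, ..."""
    return [f"{i}{suffix}" for i in range(1, n + 1) for suffix in "ABGQ"]

def describe_genotype(g, ref_allele, alt_allele):
    """Render a G value (the number of reference alleles) as an allele pair.

    Returns None when the value is -1, i.e. missing or unknown.
    """
    g = int(g)
    if g == -1:
        return None
    if g not in (0, 1, 2):
        raise ValueError(f"unexpected genotype value: {g}")
    # g copies of the reference allele, the rest are the alternate allele.
    return "/".join([ref_allele] * g + [alt_allele] * (2 - g))

names = genotype_column_names(2)
het = describe_genotype("1", "G", "A")
hom_ref = describe_genotype(2, "C", "G")
missing = describe_genotype(-1, "C", "T")
```

For two individuals this yields the eight names 1A through 2Q. A G value of 1 with reference G and alternate A is rendered as "G/A", and a value of 2 as two copies of the reference allele.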

Example of human data where there is a reference:

#{"column_names":["chr","pos","A","B","Q","1A","1B","1G","1Q","2A","2B","2G","2Q","3A","3B","3G","3Q","4A","4B","4G","4Q"],
#"individuals":[["CEU",6],["GBR",10],["YRI",14],["LWK",18]],
#"dbkey":"hg19","pos":2,"rPos":2,"ref":1,"scaffold":1,"species":"hg19"}
chr1	10582	G	A	-1	133	37	1	0	152	26	1	0	172	4	1	0	190	4	1	0
chr1	10610	C	G	-1	170	0	2	0	171	7	1	0	176	0	2	0	189	5	1	0
chr1	13301	C	T	-1	150	20	1	0	160	18	1	0	145	31	1	0	145	49	1	0
chr1	13326	G	C	-1	170	0	2	0	169	9	1	0	174	2	1	0	190	4	1	0

Example of data without its own reference (bear reads aligned to canFam2 reference):

#{"column_names":["scaf","pos","A","B","qual","ref","rpos","rnuc","1A","1B","1G","1Q","2A","2B","2G","2Q","3A","3B","3G","3Q","4A","4B","4G","4Q","5A","5B","5G","5Q","6A","6B","6G","6Q","pair","dist",
#"prim","rflp"],"dbkey":"canFam2","individuals":[["PB1",9],["PB2",13],["PB3",17],["PB4",21],["PB6",25],["PB8",29]],"pos":2,"rPos":7,"ref":6,"scaffold":1,"species":"bear"}
Contig161_chr1_4641264_4641879	115	C	T	73.5	chr1	4641382	C	6	0	2	45	8	0	2	51	15	0	2	72	5	0	2	42	6	0	2	45	10	0	2	57	Y	54	0.323	0
Contig48_chr1_10150253_10151311	11	A	G	94.3	chr1	10150264	A	1	0	2	30	1	0	2	30	1	0	2	30	3	0	2	36	1	0	2	30	1	0	2	30	Y	22	+99.	0
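
As a rough illustration of how the "individuals" entries in the header locate each set of four columns, the sketch below pulls the A, B, G, and Q values for every individual out of the first bear data line. It assumes the fields are tab-separated and that each starting column number is 1-based, as described above. The function name individual_fields and the choice to parse Q as a float are assumptions made for this example.

```python
def individual_fields(row, individuals):
    """Map each label to its A, B, G, Q values, given 1-based start columns."""
    result = {}
    for label, start in individuals:
        first = start - 1  # 1-based column number -> 0-based list index
        a, b, g, q = row[first:first + 4]
        result[label] = {"A": int(a), "B": int(b), "G": int(g), "Q": float(q)}
    return result

# "individuals" entry copied from the bear example header.
individuals = [["PB1", 9], ["PB2", 13], ["PB3", 17],
               ["PB4", 21], ["PB6", 25], ["PB8", 29]]

# First data line of the bear example, split on tabs.
row = ("Contig161_chr1_4641264_4641879\t115\tC\tT\t73.5\tchr1\t4641382\tC\t"
       "6\t0\t2\t45\t8\t0\t2\t51\t15\t0\t2\t72\t5\t0\t2\t42\t"
       "6\t0\t2\t45\t10\t0\t2\t57\tY\t54\t0.323\t0").split("\t")

calls = individual_fields(row, individuals)
```

The trailing columns (pair, dist, prim, rflp) are untouched, since the lookup relies only on the starting columns recorded in the header.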