gd_snp format

gd_snp is a plain-text, tab-delimited file format. The leading columns carry specified information, and additional columns may follow them. Comment lines at the beginning of the file hold metadata about the columns: the column names, a label for each individual or group with genotype data in the table together with the first column of that data (1-based), and other metadata required by programs that use this format.

The initial columns should always be:

  1. chr(scaf) = chromosome or scaffold
  2. pos = position 0-based
  3. A = reference allele
  4. B = alternate allele
  5. Q = summary genotype quality score over all individuals or groups, or -1 if not available

If the species has no reference sequence of its own, another set of columns gives the position in the reference of a similar species to which the sequence aligned. If the species has its own reference, these columns can be omitted.

  1. ref = chromosome of reference species
  2. rPos = position 0-based in reference species
  3. rnuc = reference allele

The following columns are in sets of four, one set per individual or group of individuals.

  1. 1A = for an individual, the read count supporting the reference allele; for a group, the number of chromosomes with the reference allele
  2. 1B = for an individual, the read count supporting the alternate allele; for a group, the number of chromosomes with the alternate allele
  3. 1G = genotype call for this individual or group, given as the number of reference alleles (0, 1, or 2)
  4. 1Q = quality score for the genotype call

The set of four columns is repeated as many times as needed, incrementing the number in the column names for each individual or group (2A, 2B, 2G, 2Q, and so on). A value of -1 indicates missing or unknown data.
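
As an illustration of the layout described above, the following Python sketch pulls the sets of four out of one data row. The function name split_genotype_sets is hypothetical, and the sketch assumes that the counts and genotype calls are integers and that quality scores may be decimal numbers.

```python
def split_genotype_sets(fields, individuals):
    """Return {label: (a_count, b_count, genotype, quality)} for one data row.

    fields      -- the row, already split on tabs
    individuals -- [label, first_column] pairs; first_column is 1-based
    """
    sets = {}
    for label, first_col in individuals:
        start = first_col - 1  # convert the 1-based column to a 0-based index
        a, b, g, q = fields[start:start + 4]
        # Assumption: counts and genotype are integers, quality may be decimal.
        sets[label] = (int(a), int(b), int(g), float(q))
    return sets


# First two groups of a row with five leading columns (chr, pos, A, B, Q).
row = "chr1\t10582\tG\tA\t-1\t133\t37\t1\t0\t152\t26\t1\t0".split("\t")
print(split_genotype_sets(row, [["CEU", 6], ["GBR", 10]]))
```

A value of -1 in any of the four positions is returned unchanged, so callers would still need to treat it as missing data.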

Example of human data where there is a reference:

#{"column_names":["chr","pos","A","B","Q","1A","1B","1G","1Q","2A","2B","2G","2Q","3A","3B","3G","3Q","4A","4B","4G","4Q"],
#"individuals":[["CEU",6],["GBR",10],["YRI",14],["LWK",18]],
#"dbkey":"hg19","pos":2,"rPos":2,"ref":1,"scaffold":1,"species":"hg19"}
chr1	10582	G	A	-1	133	37	1	0	152	26	1	0	172	4	1	0	190	4	1	0
chr1	10610	C	G	-1	170	0	2	0	171	7	1	0	176	0	2	0	189	5	1	0
chr1	13301	C	T	-1	150	20	1	0	160	18	1	0	145	31	1	0	145	49	1	0
chr1	13326	G	C	-1	170	0	2	0	169	9	1	0	174	2	1	0	190	4	1	0
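
The metadata in the example above can be read with a few lines of Python. This is a minimal sketch, not a reference parser: the function name read_gd_snp_header is hypothetical, and it assumes that the leading "#" lines, with the "#" removed and joined together, form a single JSON object, as they do in this example.

```python
import json


def read_gd_snp_header(lines):
    """Collect the leading '#' lines and decode them as one JSON object."""
    parts = []
    for line in lines:
        if not line.startswith("#"):
            break  # assumption: metadata appears only at the top of the file
        parts.append(line[1:].strip())
    return json.loads("".join(parts))


# Shortened version of the human example: one group instead of four.
header_lines = [
    '#{"column_names":["chr","pos","A","B","Q","1A","1B","1G","1Q"],',
    '#"individuals":[["CEU",6]],',
    '#"dbkey":"hg19","pos":2,"rPos":2,"ref":1,"scaffold":1,"species":"hg19"}',
    "chr1\t10582\tG\tA\t-1\t133\t37\t1\t0",
]
meta = read_gd_snp_header(header_lines)
print(meta["individuals"])
print(meta["column_names"][meta["pos"] - 1])
```

The second print statement looks up the name of the position column through the 1-based "pos" entry of the metadata.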

Example of data without its own reference (bear reads aligned to the canFam2 reference):

#{"column_names":["scaf","pos","A","B","qual","ref","rpos","rnuc","1A","1B","1G","1Q","2A","2B","2G","2Q","3A","3B","3G","3Q","4A","4B","4G","4Q","5A","5B","5G","5Q","6A","6B","6G","6Q","pair","dist",
#"prim","rflp"],"dbkey":"canFam2","individuals":[["PB1",9],["PB2",13],["PB3",17],["PB4",21],["PB6",25],["PB8",29]],"pos":2,"rPos":7,"ref":6,"scaffold":1,"species":"bear"}
Contig161_chr1_4641264_4641879	115	C	T	73.5	chr1	4641382	C	6	0	2	45	8	0	2	51	15	0	2	72	5	0	2	42	6	0	2	45	10	0	2	57	Y	54	0.323	0
Contig48_chr1_10150253_10151311	11	A	G	94.3	chr1	10150264	A	1	0	2	30	1	0	2	30	1	0	2	30	3	0	2	36	1	0	2	30	1	0	2	30	Y	22	+99.	0
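
In both examples the "ref" and "rPos" metadata entries appear to be 1-based column numbers: 6 and 7 for the bear data, where they point at the canFam2 columns, and 1 and 2 for the human data, where they point at the species' own chromosome and position. Under that assumption, which is inferred from the examples rather than stated by the format, a hypothetical helper could look up the reference position of a row like this:

```python
def reference_position(fields, meta):
    """Return (chromosome, position) in the reference for one data row.

    Assumption: meta["ref"] and meta["rPos"] are 1-based column numbers,
    as the two examples suggest.
    """
    chrom = fields[meta["ref"] - 1]
    pos = int(fields[meta["rPos"] - 1])
    return chrom, pos


# Leading columns of the first bear row: ref is column 6, rPos is column 7.
bear_row = "Contig161_chr1_4641264_4641879\t115\tC\tT\t73.5\tchr1\t4641382\tC".split("\t")
print(reference_position(bear_row, {"ref": 6, "rPos": 7}))

# Leading columns of the first human row: ref is column 1, rPos is column 2.
human_row = "chr1\t10582\tG\tA\t-1".split("\t")
print(reference_position(human_row, {"ref": 1, "rPos": 2}))
```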