The Cluster History Analysis Package (CHAP)

General Instructions

Output Files

The primary output file is all.gc, which contains the details of every conversion observation in each species; it can be inspected directly if gc-info and Gmaj do not convey the desired information. Additional output files include non-redundant.gc, which lists only one representative line from all.gc for each distinct conversion event; species_tree_with_index.txt, which numbers the tree branches consecutively for reference purposes; and an assortment of MAF alignments, some of which are used by Gmaj.

The all.gc and non-redundant.gc files use the same format. The first line is a copy of the species_tree_with_index.txt file, labeling the tree edges so those associated with each conversion event can be indicated. The next line provides brief headers for the data columns, and subsequent lines contain detailed information for each paralogous pair of intervals where conversion was detected, using the following tab-separated fields.
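
The three-part layout described above (tree line, header line, data lines) can be separated with a few lines of Python. This is only a minimal sketch, not part of CHAP: the function name split_gc_text is hypothetical, blank lines are assumed to be ignorable, and the header line is returned unsplit because its exact delimiter is not stated here.

```python
from typing import List, Tuple


def split_gc_text(text: str) -> Tuple[str, str, List[List[str]]]:
    """Split the contents of a .gc file into (tree line, header line, data rows).

    Assumptions (not confirmed by the CHAP documentation):
      - blank lines carry no information and can be skipped;
      - every data line is a single tab-separated record.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected at least a tree line and a header line")
    tree_line = lines[0]      # copy of species_tree_with_index.txt
    header_line = lines[1]    # brief column headers, kept verbatim
    rows = [ln.split("\t") for ln in lines[2:]]  # one list of fields per observation
    return tree_line, header_line, rows
```

To process a real file, read it first, for example with `open("all.gc").read()`, and pass the resulting string to the function.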

Note that all position coordinates are one-based, closed-interval (i.e., the first nucleotide in the FastA sequence is called "1", and the intervals include both endpoints), and are specified relative to the entire given sequence for that species (e.g., conversion regions are not relative to the paralogs in which they are found). If an interval has an orientation (strand) of "-", the endpoints are reported the same as if it were "+".
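
Because many programming languages index sequences from zero and exclude the end position, the one-based, closed-interval convention needs a small adjustment before slicing. The following sketch shows the conversion in Python; the helper names to_slice and interval_length are hypothetical and are not part of CHAP.

```python
def to_slice(beg: int, end: int) -> slice:
    """Convert a one-based, closed interval into a zero-based, half-open slice."""
    if beg < 1 or end < beg:
        raise ValueError(f"invalid one-based closed interval: {beg}..{end}")
    # Shift the start down by one; the closed end already equals the open end.
    return slice(beg - 1, end)


def interval_length(beg: int, end: int) -> int:
    """Number of positions in a one-based, closed interval (both endpoints count)."""
    return end - beg + 1


# Hypothetical example sequence; positions 1..4 are its first four nucleotides.
seq = "ACGTTGCA"
first_four = seq[to_slice(1, 4)]
```

Since the endpoints are reported the same way regardless of orientation, the same slice applies to either strand; any strand-specific handling of the extracted bases is left to the caller.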

pair: index for each pair of paralogous sequences within a species
species: name of species containing the conversion
beg1, end1: start and end positions of the first sequence (i.e., the first paralogous interval of the pair, in the named species)
species: name of species (again)
beg2, end2: start and end positions of the second sequence (i.e., the second paralogous interval)
orient: orientation (strand) of the second sequence with respect to the first
length: length of the first sequence
identity: fraction of identical nucleotides for the two sequences
gc_len: length of the conversion region (measured in the first sequence)
p-value: P-value for the conversion test
gc_beg1, gc_end1: start and end positions for the conversion region in the first sequence
gc_beg2, gc_end2: start and end positions for the conversion region in the second sequence
direction: direction of conversion, encoded as:
    0: unknown
    1: the first sequence is converted
    2: the second sequence is converted
c1_name, c1_start, c1_end, c1_orient: ortholog of the first sequence in the outgroup species
c2_name, c2_start, c2_end, c2_orient: ortholog of the second sequence in the outgroup species
event_id: identifying number for the conversion event (note that multiple observation lines may reflect the same event)
tree_branch: indication of where the conversion event occurred in the tree topology, specified as a comma-separated list of possible edges
c1_blocks: indices of alignment blocks containing the ortholog of the first sequence
c2_blocks: indices of alignment blocks containing the ortholog of the second sequence
ortholog_status: status of orthologs in the outgroup species, encoded as:
    0: no orthologs
    1: triplet test; only paralog #1 has an ortholog
    2: triplet test; only paralog #2 has an ortholog
    3: quadruplet test; both paralogs have distinct orthologs
    4: "old dup" test; the conversion event covers almost the entire duplicated region, and both paralogs have distinct orthologs, indicating that the duplication preceded the speciation (this is the "alternative criterion" discussed in Song et al. (2011))
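
As an illustration of how one data line might be interpreted, the sketch below maps the tab-separated values of a single observation onto the field names listed above and decodes the two numeric codes. It assumes that the columns appear in exactly the order of this list (30 values per line), which should be checked against the header line of an actual file. The key species_2 for the repeated species column, the added *_label and tree_branch_edges keys, and the function name decode_row are all hypothetical.

```python
from typing import Dict, List

# Field names in the order listed above (assumed to match the column order).
# "species_2" is a hypothetical key for the second, repeated species column.
FIELDS = [
    "pair", "species", "beg1", "end1", "species_2", "beg2", "end2",
    "orient", "length", "identity", "gc_len", "p-value",
    "gc_beg1", "gc_end1", "gc_beg2", "gc_end2", "direction",
    "c1_name", "c1_start", "c1_end", "c1_orient",
    "c2_name", "c2_start", "c2_end", "c2_orient",
    "event_id", "tree_branch", "c1_blocks", "c2_blocks", "ortholog_status",
]

DIRECTION = {
    "0": "unknown",
    "1": "the first sequence is converted",
    "2": "the second sequence is converted",
}

ORTHOLOG_STATUS = {
    "0": "no orthologs",
    "1": "triplet test; only paralog #1 has an ortholog",
    "2": "triplet test; only paralog #2 has an ortholog",
    "3": "quadruplet test; both paralogs have distinct orthologs",
    "4": '"old dup" test',
}


def decode_row(values: List[str]) -> Dict[str, object]:
    """Name the fields of one observation line and decode its coded columns."""
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record: Dict[str, object] = dict(zip(FIELDS, values))
    # Human-readable labels for the two encoded columns.
    record["direction_label"] = DIRECTION.get(values[16], "unrecognized code")
    record["ortholog_status_label"] = ORTHOLOG_STATUS.get(values[29], "unrecognized code")
    # tree_branch is a comma-separated list of possible edges.
    record["tree_branch_edges"] = [e for e in values[26].split(",") if e]
    return record
```

All values are kept as strings so that nothing is lost to an incorrect type guess; numeric columns such as beg1 or identity can be converted by the caller once the actual file contents have been confirmed.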

References

Song G, Hsu C-H, Riemer C, Zhang Y, Kim HL, Hoffmann F, Zhang L, Hardison RC, NISC Comparative Sequencing Program, Green ED, Miller W (2011) Conversion events in gene clusters. Submitted.


June 2011