make and gcc (though other C compilers could probably be used by adjusting
the Makefiles). User commands are provided in the form of Bourne shell
scripts, which use various standard utilities such as cat, grep, sed, tr,
etc.
If you want to use our Gmaj program to view the results, you will also need
a Java runtime environment; for best compatibility, Sun's JRE (or JDK) is
recommended.
REPEATMASKER=RepeatMasker near the start of the file conversion.sh to
indicate its location on your computer.
make to compile the component programs and install them in the bin
subdirectory.
seq.d and put your FastA-formatted sequence files in it, giving each file
the appropriate species name, e.g., human, vervet.
annot.d and put your gene annotation files in it. These files use a "coding
exons" format that is similar to the exons format supported by our PipMaker
server, except that the position endpoints reflect coding regions only
(i.e., translation rather than transcription, so UTRs are excluded). The
CHAP distribution includes sample files in this format. The file names must
consist of the species name followed by a .codex extension, e.g.,
human.codex, vervet.codex, etc.
In this format, the directionality of a gene (>, <, or |), the start and
end positions of its coding sequence, and its name should be on one line,
followed by lines specifying the coding start and end positions of each
exon, which must be listed in order of increasing address even if the gene
is on the reverse strand (<). All positions use a 1-based, closed-interval
coordinate system (i.e., the first nucleotide in your corresponding
sequence file is called "1", and the specified ranges include both
endpoints). Names ending in _ps indicate pseudogenes (an exception to the
"coding only" rule). Thus, the file might begin as follows:
> 12910 14400 HBZ
12910 13004
13892 14096
14272 14400
> 23122 25156 HBZ_ps
23122 25156
> 25998 26708 HBM
25998 26089
26268 26472
26580 26708
... etc.
These annotation files are not strictly necessary for finding conversions,
but they assist somewhat in ortholog detection. If present, they are also
used by the gc-info summary program to compute statistics for coding
regions (with pseudogenes excluded), and by Gmaj to annotate its display.
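The "coding exons" layout described above can be read with a few lines of code. The following is only a minimal sketch, not part of the CHAP distribution: the names `Gene` and `parse_codex` are hypothetical, and it assumes whitespace-separated fields with one exon per line, as in the sample shown earlier.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Gene:
    strand: str                      # ">", "<", or "|"
    start: int                       # coding start, 1-based, inclusive
    end: int                         # coding end, 1-based, inclusive
    name: str
    exons: List[Tuple[int, int]] = field(default_factory=list)

    @property
    def is_pseudogene(self) -> bool:
        # Names ending in "_ps" mark pseudogenes.
        return self.name.endswith("_ps")

    def coding_length(self) -> int:
        # Closed intervals: a range from a to b covers b - a + 1 bases.
        return sum(e - s + 1 for s, e in self.exons)


def parse_codex(text: str) -> List[Gene]:
    """Parse .codex text (assumed layout: gene line, then exon lines)."""
    genes: List[Gene] = []
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        if parts[0] in (">", "<", "|"):
            # Gene line: direction, coding start, coding end, name.
            genes.append(Gene(parts[0], int(parts[1]), int(parts[2]), parts[3]))
        else:
            if not genes:
                raise ValueError("exon line before any gene line: %r" % raw)
            start, end = int(parts[0]), int(parts[1])
            exons = genes[-1].exons
            # Exons must be listed by increasing address on either strand.
            if exons and start <= exons[-1][1]:
                raise ValueError("exons out of order: %r" % raw)
            exons.append((start, end))
    return genes


SAMPLE = """\
> 12910 14400 HBZ
12910 13004
13892 14096
14272 14400
> 23122 25156 HBZ_ps
23122 25156
"""

genes = parse_codex(SAMPLE)
for g in genes:
    print(g.name, g.strand, g.coding_length(), g.is_pseudogene)
```

For the HBZ entry the three exons cover 95 + 205 + 129 = 429 coding bases, which illustrates the closed-interval arithmetic (end - start + 1).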
species_tree.txt.
seq.d and annot.d as well as the species tree, from steps 2-4), run the
command:
../conversion.sh species_tree.txt
The pipeline may run for an hour or more.
../bin/gc-info non-redundant.gc annot.d
and/or examine them in detail by running Gmaj with commands like
../gmaj.sh human
or
../gmaj.sh vervet
gmaj_geneconv.html provides a short tour of how to use Gmaj to investigate
conversions, while more general documentation for Gmaj is available at
www.bx.psu.edu/miller_lab.
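Because the pipeline may run for a long time, it can help to check the working directory before starting it. The sketch below is an illustration only, not a CHAP tool: `check_layout` is a hypothetical helper, and it assumes that each file in seq.d is named exactly after its species, that annot.d is optional, and that the tree file sits in the working directory.

```python
import os
import tempfile
from typing import List


def check_layout(workdir: str, tree_file: str = "species_tree.txt") -> List[str]:
    """Return a list of problems found in the working directory layout."""
    problems: List[str] = []
    seq_dir = os.path.join(workdir, "seq.d")
    annot_dir = os.path.join(workdir, "annot.d")

    if not os.path.isfile(os.path.join(workdir, tree_file)):
        problems.append("missing species tree file: " + tree_file)

    # Sequence files are assumed to be named exactly after the species.
    species = set(os.listdir(seq_dir)) if os.path.isdir(seq_dir) else set()
    if not species:
        problems.append("seq.d is missing or empty")

    # annot.d is optional, so only the files that are present are checked.
    if os.path.isdir(annot_dir):
        for name in sorted(os.listdir(annot_dir)):
            stem, ext = os.path.splitext(name)
            if ext != ".codex":
                problems.append("unexpected file in annot.d: " + name)
            elif stem not in species:
                problems.append("no sequence file for annotation: " + name)
    return problems


# Demonstration on a throwaway directory with one mismatched annotation file.
with tempfile.TemporaryDirectory() as work:
    os.makedirs(os.path.join(work, "seq.d"))
    os.makedirs(os.path.join(work, "annot.d"))
    for rel in ("species_tree.txt", "seq.d/human", "seq.d/vervet",
                "annot.d/human.codex", "annot.d/chimp.codex"):
        with open(os.path.join(work, *rel.split("/")), "w") as handle:
            handle.write("")
    demo_problems = check_layout(work)

print(demo_problems)
```

An empty result means the layout matches the assumptions above; it does not validate the contents of the files.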
CRIT_BOUND=0.8 near the start of the file conversion.sh. Note that values
below 60% or above 90% are generally not recommended.
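As a small illustration of the recommended range, the sketch below reads the setting from the text of a script and checks it. It assumes that CRIT_BOUND is a fraction, so that 0.8 corresponds to 80%, and that the line has exactly the form shown above; the function names are hypothetical and not part of CHAP.

```python
import re
from typing import Optional


def read_crit_bound(script_text: str) -> Optional[float]:
    """Return the value from the first CRIT_BOUND=<number> line, if any."""
    match = re.search(r"^CRIT_BOUND=([0-9.]+)\s*$", script_text,
                      flags=re.MULTILINE)
    return float(match.group(1)) if match else None


def in_recommended_range(value: float, low: float = 0.6,
                         high: float = 0.9) -> bool:
    # Assumption: 0.6 and 0.9 correspond to the 60% and 90% limits.
    return low <= value <= high


# Made-up script header used only for this demonstration.
example_script = "#!/bin/sh\nREPEATMASKER=RepeatMasker\nCRIT_BOUND=0.8\n"
value = read_crit_bound(example_script)
print(value, in_recommended_range(value))
```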
The primary output file is all.gc, which contains the details of all of the
conversion observations in each species, and can be inspected directly if
gc-info and Gmaj do not convey the desired information. Additional output
files include non-redundant.gc, which lists only one representative line
from all.gc for each distinct conversion event,
species_tree_with_index.txt, which simply numbers the tree branches
consecutively for reference purposes, and an assortment of MAF alignments,
some of which are used by Gmaj.
The all.gc and non-redundant.gc files use the same format. The first line
is a copy of the species_tree_with_index.txt file, labeling the tree edges
so those associated with each conversion event can be indicated. The next
line provides brief headers for the data columns, and subsequent lines
contain detailed information for each paralogous pair of intervals where
conversion was detected, using the following tab-separated fields.
Note that all position coordinates are one-based, closed-interval (i.e.,
the first nucleotide in the FastA sequence is called "1", and the intervals
include both endpoints), and are specified relative to the entire given
sequence for that species (e.g., conversion regions are not relative to the
paralogs in which they are found). If an interval has an orientation
(strand) of "−", the endpoints are reported the same as if it were "+".
pair | index for each pair of paralogous sequences within a species
species | name of species containing the conversion
beg1, end1 | start and end positions of the first sequence (i.e., the first paralogous interval of the pair, in the named species)
species | name of species (again)
beg2, end2 | start and end positions of the second sequence (i.e., the second paralogous interval)
orient | orientation (strand) of the second sequence with respect to the first
length | length of the first sequence
identity | fraction of identical nucleotides for the two sequences
gc_len | length of the conversion region (measured in the first sequence)
p-value | P-value for the conversion test
gc_beg1, gc_end1 | start and end positions for the conversion region in the first sequence
gc_beg2, gc_end2 | start and end positions for the conversion region in the second sequence
direction | direction of conversion, encoded as:
c1_name, c1_start, c1_end, c1_orient | ortholog of the first sequence in the outgroup species
c2_name, c2_start, c2_end, c2_orient | ortholog of the second sequence in the outgroup species
event_id | identifying number for the conversion event (note that multiple observation lines may reflect the same event)
tree_branch | indication of where the conversion event occurred in the tree topology, specified as a comma-separated list of possible edges
c1_blocks | indices of alignment blocks containing the ortholog of the first sequence
c2_blocks | indices of alignment blocks containing the ortholog of the second sequence
ortholog_status | status of orientation of orthologs in the outgroup species, encoded as:
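The file layout described above (tree line, header line, then tab-separated observation lines) can be read with a short script. This is a hedged sketch rather than a reference reader: `read_gc`, `group_by_event`, and `closed_length` are hypothetical names, the header line is assumed to be tab-separated and to contain a column labeled event_id, and the sample data is invented with only a few columns for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def read_gc(text: str) -> Tuple[str, List[str], List[List[str]]]:
    """Split .gc text into the tree line, the header fields, and data rows."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    tree = lines[0]
    header = [h.strip() for h in lines[1].split("\t")]
    rows = [ln.split("\t") for ln in lines[2:]]
    return tree, header, rows


def group_by_event(header: List[str],
                   rows: List[List[str]]) -> Dict[str, List[List[str]]]:
    """Collect observation rows that share the same event_id."""
    col = header.index("event_id")  # assumed column label
    events: Dict[str, List[List[str]]] = defaultdict(list)
    for row in rows:
        events[row[col]].append(row)
    return dict(events)


def closed_length(beg: int, end: int) -> int:
    # One-based, closed interval: both endpoints are included.
    return end - beg + 1


# Invented sample, not real pipeline output; real files have more columns.
SAMPLE_GC = (
    "<tree line placeholder>\n"
    "pair\tspecies\tgc_beg1\tgc_end1\tevent_id\n"
    "1\thuman\t101\t350\t1\n"
    "1\tvervet\t120\t369\t1\n"
    "2\thuman\t5000\t5099\t2\n"
)

tree, header, rows = read_gc(SAMPLE_GC)
events = group_by_event(header, rows)
beg_col, end_col = header.index("gc_beg1"), header.index("gc_end1")
for event_id, members in sorted(events.items()):
    first = members[0]
    span = closed_length(int(first[beg_col]), int(first[end_col]))
    print(event_id, len(members), span)
```

Grouping by event_id reflects the note that several observation lines may describe the same conversion event; which line the pipeline itself keeps as the representative in non-redundant.gc is not specified here.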
Song G, Hsu C-H, Riemer C, Zhang Y, Kim HL, Hoffmann F, Zhang L, Hardison RC, NISC Comparative Sequencing Program, Green ED, Miller W (2011) Conversion events in gene clusters. Submitted.
June 2011