Index of /miller_lab/dist/CHAP

Name	Last modified	Size

Parent Directory		-
aglobin.example/	2012-07-26 12:58	-
annot.d/	2012-03-24 23:55	-
docs/	2012-03-30 18:59	-
gmaj_geneconv.images/	2011-06-09 23:13	-
CHAP.2011-03-17.compact.tar.gz	2011-03-17 16:40	2.4M
CHAP.2011-03-17.fast.tar.gz	2011-03-17 16:40	67M
CHAP.2011-06-10.compact.tar.gz	2011-06-10 14:54	2.3M
CHAP.2011-06-10.fast.tar.gz	2011-06-10 14:53	67M
CHAP.2011-08-02.compact.tar.gz	2011-08-02 16:21	2.3M
CHAP.2011-08-02.fast.tar.gz	2011-08-02 16:21	67M
CHAP.2012-03-30.compact.tar.gz	2012-03-30 23:34	3.3M
CHAP.2012-03-30.fast.tar.gz	2012-03-30 23:34	68M
CHAP.2012-05-03.compact.tar.gz	2012-05-03 16:39	3.3M
CHAP.2012-05-03.fast.tar.gz	2012-05-03 16:39	68M
CHAP.2012-07-02.compact.tar.gz	2012-07-02 17:27	3.3M
CHAP.2012-07-02.fast.tar.gz	2012-07-02 17:27	68M
CHAP.2012-07-26.compact.tar.gz	2012-07-26 12:54	3.3M
CHAP.2012-07-26.fast.tar.gz	2012-07-26 12:54	68M
gmaj_geneconv.html	2011-06-10 14:17	26K

The Cluster History Analysis Package (CHAP 2)

TABLE OF CONTENTS

Introduction
Installation
Data Preparation
Orthology Pipeline
Orthology Output
Conversion Pipeline
Conversion Output
Utility Programs
References

Introduction

This is the second major release of the CHAP package for analyzing the evolutionary history of gene clusters, discussed in Song et al. (2012). It includes the conversion detector pipeline from the original CHAP package (Song et al. 2011), and adds a new pipeline that focuses on identifying orthologous regions between species, using two distinct paradigms for defining orthology: X-orthology (based on genomic context) and N-orthology (based on sequence content).
Both of these methods rely on the conversion calls from the original pipeline, so the new pipeline always calls the old one automatically. Thus you can run the orthology script and get both orthology and conversion results, or if you are only interested in conversions you can just run the old command as before. (Actually the original pipeline produces orthologs too, because it needs them for detecting conversions, but they are obtained by a different method and are rough and preliminary.) Note that the conversion pipeline runs for all species at once, but the new orthology mapper currently runs only for the single reference species you specify.
Preparation of input data is nearly identical for the two pipelines, except that the new orthology one makes more use of gene annotations, especially for visualizing gene orthology, whereas annotations are recommended but not strictly necessary for the original conversion pipeline. We will discuss the input files for both pipelines together in one section, and then devote separate sections to the commands and output of the two programs.
Platform: This package is designed for Unix/Linux systems. The core programs are written in C and compiled with make and gcc (though other C compilers could probably be used by adjusting the Makefiles). User commands are provided in the form of Bourne shell scripts, which use various standard utilities such as cat, grep, sed, tr, etc. If you want to get automatic orthology diagrams or use the included Gmaj program to view the results interactively, you will also need a Java runtime environment; for best compatibility Sun's JRE (or JDK) is recommended.

Installation

The CHAP pipelines need to run the RepeatMasker program (Smit et al. 1996-2010), which can be obtained from www.repeatmasker.org. When installing RepeatMasker you will need to choose which sequence search engine and which repeat database library to use; we suggest Cross_Match and RepBase respectively, which are both free for academic use. If the RepeatMasker executable is not in your command path, modify the right-hand side of the line
```
    REPEATMASKER=RepeatMasker
```
near the start of the file conversion.sh to indicate its location on your computer.
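Before editing conversion.sh or running make, it may help to check which of the required programs are already on your command path. The following is a minimal sketch; the function name `missing_tools` is an illustrative assumption and is not part of the CHAP distribution.

```shell
#!/bin/sh
# Hypothetical helper (not part of CHAP): print the name of each given
# command that cannot be found on the current command path.
missing_tools() {
    for tool in "$@"; do
        # 'command -v' is the POSIX way to look up a command by name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call (prints nothing if everything is found):
#   missing_tools make gcc java RepeatMasker
```

If RepeatMasker is reported as missing but is installed elsewhere, that is the case where the REPEATMASKER line needs to be edited.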
In the directory containing the unpacked files from the CHAP distribution archive, which we will call the "package directory", type
```
    make
```
to compile the component programs and install them in the bin subdirectory.
For advanced users: By default (if you just run make), CHAP is configured to keep its scripts and Java programs in the package directory, while compiled binaries and resource data files are located in the bin and resources subdirectories, respectively. If you want to install it elsewhere (e.g. centrally for multiple users), you can edit the lines for CHAP_SCRIPT_DIR, CHAP_JAVA_DIR, CHAP_BINARY_DIR, and CHAP_RESOURCE_DIR at the top of the Makefile to specify the desired locations, and then run
```
    make install
```
(it is not necessary to run make first, but it doesn't hurt either). This will configure the installed scripts to look for their programs and resource files in the directories you have specified, instead of relative to the working cluster directory (which then no longer needs to be inside the package directory). However, it also means that users will need to modify the command paths in the examples accordingly.

Data Preparation

For each gene cluster that you want to analyze, do the following.

1. In the package directory, create a subdirectory for the cluster, which we will call the "cluster directory".
2. Sequences. In the cluster directory, create a subdirectory called seq.d and put your FastA-formatted sequence files in it, giving each file the appropriate species name, e.g., human, vervet.
3. Annotations. In the cluster directory, create another subdirectory called annot.d and put your gene annotation files in it. These files use a "coding exons" format that is similar to the exons format supported by our PipMaker server, except that the position endpoints reflect coding regions only (i.e. translation rather than transcription, so UTRs are excluded). The CHAP distribution includes sample files in this format. The file names must consist of the species name followed by a .codex extension, e.g. human.codex, vervet.codex, etc.
In this format, the directionality of a gene (>, <, or |), the start and end positions of its coding sequence, and its name should be on one line, followed by lines specifying the coding start and end positions of each exon, which must be listed in order of increasing address even if the gene is on the reverse strand (<). All positions are relative to the cluster sequence files you provide (not the entire chromosomes), and use a 1-based, closed-interval coordinate system (i.e., the first nucleotide in your corresponding sequence file is called "1", and the specified ranges include both endpoints). Names ending in _ps indicate pseudogenes (an exception to the "coding only" rule). We recommend limiting each gene name to a single word (i.e. without spaces), but if it has multiple words then the _ps suffix must be on the first word rather than the last one in order to be properly recognized.
Thus, the file might begin as follows:
```
     > 12910 14400 HBZ-T1
     12910 13004
     13892 14096
     14272 14400
     > 23122 25156 HBZ-T2_ps
     23122 25156
     > 25998 26708 HBK
     25998 26089
     26268 26472
     26580 26708
     ... etc.
```
The orthology pipeline requires these annotation files for making its gene orthology diagrams; you must supply gene annotations for the reference and at least one other species to get any figures. If you just want orthologous alignments for the sequences, or if you are just running the conversion pipeline, then these files are not strictly necessary but are still recommended for best accuracy. They assist somewhat in the preliminary ortholog detection for finding conversions, and the enhanced orthology mapper uses them to refine its similarity scoring. If present, they are also used by the gc-info summary program to compute conversion statistics for coding regions (with pseudogenes excluded), and by Gmaj to annotate its display.
If you do not know the actual gene locations in some of your sequences, you may be able to estimate them with a program such as Wise2 (Birney et al. 2004), using known protein sequences in, say, human to find gene structures within the DNA sequences of other species. The CHAP package includes a utility script called infer-annot.sh to help with automating this approach; please see the Utility Programs section for information on how to use it.
4. Species tree. Put a text file containing a binary species tree in the cluster directory. The species names in the tree must match the file names used for your sequences. CHAP uses a simplified version of the Newick format, where branch lengths are omitted, all leaf nodes have names but interior nodes do not, and the tree is rooted at an interior node. Quoted labels are not supported, nor are comments in square brackets. (Unlike TBA, however, CHAP does expect the usual commas and ending semicolon.) The tree can include line breaks and extra spaces/tabs, but its maximum total length (excluding whitespace) is currently 1000 characters. This file can have any name; for concreteness, let's suppose it is called species_tree.txt.
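Because the species names in the tree must match the sequence file names, a quick consistency check before starting a long pipeline run can save time. The sketch below is not part of CHAP; the function names are illustrative, and it assumes that seq.d contains only your sequence files and that the tree follows the simplified Newick format described above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CHAP).

# Print the leaf names of a simplified Newick tree, one per line, sorted.
# Assumes no branch lengths, no quoted labels, no bracketed comments, and
# unnamed interior nodes, as described for CHAP's tree files.
tree_leaves() {
    tr -d ' \t\r\n' < "$1" | tr '(),;' '\n\n\n\n' | grep -v '^$' | LC_ALL=C sort
}

# Report names found in the tree but not in the sequence directory, and
# vice versa. Returns non-zero if any mismatch is found.
# Usage: check_tree_names species_tree.txt seq.d
check_tree_names() {
    leaves=$(tree_leaves "$1")
    seqs=$(ls "$2" | LC_ALL=C sort)
    status=0
    for name in $leaves; do
        printf '%s\n' "$seqs" | grep -qxF -- "$name" || {
            echo "in tree but not in $2: $name"; status=1; }
    done
    for name in $seqs; do
        printf '%s\n' "$leaves" | grep -qxF -- "$name" || {
            echo "in $2 but not in tree: $name"; status=1; }
    done
    return $status
}
```

From the cluster directory, a call such as `check_tree_names species_tree.txt seq.d` prints one line per mismatch and nothing when the names agree.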

Orthology Pipeline

In the cluster directory (which contains subdirectories seq.d and annot.d as well as the species tree, from steps 2-4 of the Data Preparation section), run a command like
```
    ../ortho.sh species_tree.txt human
```
where the last argument is the name of the sequence to use as the reference. The pipeline may take from several minutes to an hour or more, depending on the complexity of the cluster's history and the number of sequences.
If desired, you can run the pipeline again for a different reference, e.g.
```
    ../ortho.sh species_tree.txt vervet --no_rm
```
This is currently rather inefficient because it runs the conversion pipeline again unnecessarily, but at least the --no_rm option avoids re-running RepeatMasker. Since the output files include the reference in their names, your earlier results should coexist peacefully without being overwritten. One small exception, however, is the inferred pseudogenes in the fig_annot.d directory, which are used by default for the PostScript figures and Gmaj viewer. These are computed in a theoretically reference-dependent manner, so their endpoints may change slightly when they are overwritten by a new run. Note that running multiple jobs simultaneously in the same cluster directory is not supported and may produce erroneous output, since they will attempt to use the same temporary scratch files.
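Since simultaneous jobs in one cluster directory are not supported, runs for several references are best done one after another. The following sketch wraps the two commands shown above in a loop; the function name and the `ORTHO_CMD` variable are assumptions made for illustration and are not part of CHAP.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of CHAP): run the orthology pipeline for
# several reference species sequentially, never in parallel.
# ORTHO_CMD is an assumed variable naming the script to run; setting it to
# "echo" prints the arguments instead of running anything (a dry run).
# Usage: run_ortho_refs species_tree.txt human vervet
run_ortho_refs() {
    tree=$1; shift
    first=yes
    for ref in "$@"; do
        if [ "$first" = yes ]; then
            # First run: let RepeatMasker do its work.
            ${ORTHO_CMD:-../ortho.sh} "$tree" "$ref" || return 1
            first=no
        else
            # Later runs: reuse the existing RepeatMasker results.
            ${ORTHO_CMD:-../ortho.sh} "$tree" "$ref" --no_rm || return 1
        fi
    done
}
```

The loop stops at the first failing run, so a later reference is never started on top of a broken earlier one.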

Orthology Output

If you have Java installed and have provided gene annotation files for the reference and at least one other species, you will automatically get PostScript figures that summarize the genes' X-orthology and N-orthology. These files have names like human.x-ortho.eps and human.n-ortho.eps respectively, where the first part of the name indicates the reference sequence. They are placed in the figures.d directory.

The files in the ortho.d/events.d directory list the evolutionary events identified in the reference sequence by the orthology mapper; these may be helpful for interpreting the figures. Events are listed in reverse chronological order (i.e., most recent first). Lines beginning with "# sp" represent speciation events where the reference lineage split from the subtree containing the indicated species. The other lines contain six space-separated columns, listed in Table 1.

Table 1. Fields in event output files (ortho.d/events.d/*.events).

| Column | Meaning |
| --- | --- |
| col_1 | event type, encoded as: d: deletion; +: duplication (same orientation); −: duplication (inverted); c: conversion (between paralogs with the same orientation); v: conversion (between paralogs with opposite orientations) |
| col_2, col_3 | start and end positions of the source region (or deleted region) |
| col_4, col_5 | start and end positions of the target region (0 0 for deletions) |
| col_6 | percent identity of the two regions (0 for deletions) |
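For a quick overview of an events file, the event lines can be tallied by type. This is a minimal sketch, not part of CHAP; the function names are illustrative, and it assumes that every non-comment line has the six columns of Table 1.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CHAP).

# Count the events of each type in one events file. Comment lines
# (including the "# sp" speciation markers) are skipped; the event type is
# taken from the first column.
count_events() {
    awk '!/^#/ && NF >= 6 { n[$1]++ }
         END { for (t in n) print t, n[t] }' "$1" | LC_ALL=C sort
}

# Print the speciation lines in file order (i.e., most recent first).
speciations() {
    grep '^# sp' "$1"
}
```

Each output line of `count_events` gives an event code from Table 1 followed by the number of times it occurs.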

The detailed orthology calls (for the entire sequences, not just genes) are stored as pairwise orthologous alignments in MAF format, in directories ortho.d/x-ortho.d and ortho.d/n-ortho.d. You can pass these to other tools for further analysis, or examine them visually by running Gmaj with commands like
```
    ../gmaj-ortho.sh human vervet context
```
or
```
    ../gmaj-ortho.sh human vervet content
```
See the Utility Programs section for more information about gmaj-ortho.sh.
The file docs/ortho.html provides examples of how to interpret the PostScript figures and use Gmaj to investigate the orthology results.
Note that the results for different choices of the reference sequence might not be completely consistent, because each run considers the cluster from the viewpoint of a single reference. Determining the full evolutionary history of the cluster in all of the species simultaneously is planned for future work.
The orthology pipeline also produces all of the conversion output, which is described in the Conversion Output section.

Conversion Pipeline

Note that the orthology pipeline will run this for you automatically, so you only need to run it manually if you are not interested in the improved orthology calls. Also, the conversion pipeline always runs for all reference sequences, not just the one you specify for orthology.

In the cluster directory (which contains subdirectories seq.d and annot.d as well as the species tree, from steps 2-4 of the Data Preparation section), run the command
```
    ../conversion.sh species_tree.txt
```
The pipeline may run for an hour or more.
For advanced users: The conversion pipeline has a number of internal parameters that have been carefully tuned to reasonable defaults. One of these that is fundamental to our method for detecting conversions is the paralog coverage threshold for choosing whether to use the regular triplet/quadruplet criterion or the alternative "old dup" criterion: if a particular putative conversion covers more than the given fraction of its paralog pair by length, then the alternative criterion is used to test it. The default value for this threshold is 80%, and our simulation study showed that the results are not greatly affected by its exact value. However, if you do want to adjust it (e.g. for an unusual situation), you can edit the line
```
    CRIT_BOUND=0.8
```
near the start of the file conversion.sh. Note that values below 60% or above 90% are generally not recommended.
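If you prefer to script this edit rather than use a text editor, something like the following could be used. It is only a sketch under stated assumptions: the function name is hypothetical, and it assumes the setting appears in conversion.sh as a line beginning with `CRIT_BOUND=`. A backup copy is kept with a .bak suffix.

```shell
#!/bin/sh
# Hypothetical helper (not part of CHAP): change the CRIT_BOUND setting.
# Assumes the script contains a line that starts with "CRIT_BOUND=".
# Usage: set_crit_bound ../conversion.sh 0.75
set_crit_bound() {
    file=$1; value=$2
    # Accept only a decimal fraction strictly between 0 and 1.
    awk -v v="$value" 'BEGIN { exit !(v ~ /^0?\.[0-9]+$/ && v > 0 && v < 1) }' || {
        echo "CRIT_BOUND must be a fraction between 0 and 1" >&2; return 1; }
    # Warn (but continue) outside the generally recommended range.
    awk -v v="$value" 'BEGIN { exit !(v < 0.6 || v > 0.9) }' &&
        echo "warning: values below 0.6 or above 0.9 are not recommended" >&2
    grep -q '^CRIT_BOUND=' "$file" || {
        echo "no CRIT_BOUND line found in $file" >&2; return 1; }
    # Keep a backup, then rewrite the original file in place so that its
    # permissions (e.g. the executable bit) are preserved.
    cp "$file" "$file.bak" &&
        sed "s/^CRIT_BOUND=.*/CRIT_BOUND=$value/" "$file.bak" > "$file"
}
```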

Conversion Output

You can get a tab-separated file with summary statistics on the conversions found in each species by running the command
```
    ../bin/gc-info non-redundant.gc annot.d self.d
```
To examine the conversion evidence in detail using Gmaj, run commands like
```
    ../gmaj-conv.sh human
```
where the argument is the reference species whose conversions you want to see. The Utility Programs section has more information about gmaj-conv.sh, and the file docs/gmaj_geneconv.html provides a short tour of how to use Gmaj to investigate conversions.
The primary output file from the pipeline is all.gc, which contains the details of all of the conversion observations in each species, and can be inspected directly if gc-info and Gmaj do not convey the desired information. Additional output files include non-redundant.gc, which lists only one representative line from all.gc for each distinct conversion event, species_tree_with_index.txt, which simply numbers the tree branches consecutively for reference purposes, and an assortment of MAF alignments, some of which are used by Gmaj.

The all.gc and non-redundant.gc files use the same format. The first line is a copy of the species_tree_with_index.txt file, labeling the tree edges so those associated with each conversion event can be indicated. The next line provides brief headers for the data columns, and subsequent lines contain detailed information for each paralogous pair of intervals where conversion was detected, using the tab-separated fields listed in Table 2.

Note that all position coordinates are 1-based, closed-interval (i.e. the first nucleotide in the FastA sequence is called "1", and the intervals include both endpoints), and are specified relative to the entire given sequence for that species (e.g. conversion regions are not relative to the paralogs in which they are found). If an interval has an orientation (strand) of "−", the endpoints are reported the same as if it were "+".
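The arithmetic implied by this coordinate convention is easy to get wrong by one, so here is a small sketch. The function names are illustrative and not part of CHAP; the conversion target is the 0-based, half-open convention used by formats such as BED.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CHAP) for 1-based, closed intervals.

# Length of a closed interval: both endpoints are counted.
interval_len() {
    echo $(( $2 - $1 + 1 ))
}

# Convert to 0-based, half-open coordinates: the start moves down by one
# and the end stays the same.
to_half_open() {
    echo "$(( $1 - 1 )) $2"
}
```

For the first exon in the sample annotation file (12910 to 13004), `interval_len 12910 13004` gives 95 and `to_half_open 12910 13004` gives `12909 13004`.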

Table 2. Fields in conversion output files all.gc and non-redundant.gc.

| Field | Meaning |
| --- | --- |
| pair | index for each pair of paralogous sequences within a species |
| species | name of species containing the conversion |
| beg1, end1 | start and end positions of the first sequence (i.e., the first paralogous interval of the pair, in the named species) |
| species | name of species (again) |
| beg2, end2 | start and end positions of the second sequence (i.e., the second paralogous interval) |
| orient | orientation (strand) of the second sequence with respect to the first |
| length | length of the first sequence |
| identity | fraction of identical nucleotides for the two sequences |
| gc_len | length of the conversion region (measured in the first sequence) |
| p-value | P-value for the conversion test |
| gc_beg1, gc_end1 | start and end positions for the conversion region in the first sequence |
| gc_beg2, gc_end2 | start and end positions for the conversion region in the second sequence |
| direction | direction of conversion, encoded as: 0: unknown; 1: the first sequence is converted; 2: the second sequence is converted |
| c1_name, c1_start, c1_end, c1_orient | ortholog of the first sequence in the outgroup species |
| c2_name, c2_start, c2_end, c2_orient | ortholog of the second sequence in the outgroup species |
| event_id | identifying number for the conversion event (note that multiple observation lines may reflect the same event) |
| tree_branch | indication of where the conversion event occurred in the tree topology, specified as a comma-separated list of possible edges |
| c1_blocks | indices of alignment blocks containing the ortholog of the first sequence |
| c2_blocks | indices of alignment blocks containing the ortholog of the second sequence |
| ortholog_status | status of orthologs in the outgroup species, encoded as: 0: no orthologs; 1: triplet test, only paralog #1 has an ortholog; 2: triplet test, only paralog #2 has an ortholog; 3: quadruplet test, both paralogs have distinct orthologs; 4: "old dup" test, the conversion event covers almost the entire duplicated region, and both paralogs have distinct orthologs, indicating that the duplication preceded the speciation (this is the "alternative criterion" discussed in Song et al. (2011)) |
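If gc-info does not report what you need, the tab-separated .gc files can also be processed directly. The sketch below is not part of CHAP; it assumes that the tree occupies exactly the first line, the column headers the second, and that the data columns appear in the order of Table 2 (so species is column 2, gc_len is column 11, and direction is column 17). Please verify these assumptions against the header line of your own file.

```shell
#!/bin/sh
# Hypothetical helper (not part of CHAP): per-species summary of a .gc file.
# Output columns (tab-separated): species, number of data lines, total
# gc_len, number of lines whose conversion direction is unknown (0).
# Assumes column order as in Table 2 and two leading non-data lines.
gc_summary() {
    awk -F '\t' 'NR > 2 && NF >= 17 {
             rows[$2]++; bases[$2] += $11
             if ($17 == 0) unknown[$2]++
         }
         END {
             for (s in rows)
                 printf "%s\t%d\t%d\t%d\n", s, rows[s], bases[s], unknown[s]
         }' "$1" | LC_ALL=C sort
}
```

Running it on non-redundant.gc rather than all.gc avoids counting the same conversion event more than once.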

Utility Programs

Note that running these programs without any arguments will typically give you a brief reminder of the usage syntax.

ortho-fig.sh
This script generates the PostScript orthology figures, by first creating the *.fig files describing the diagrams and then running the orthofig.jar program with appropriate parameters to do the actual drawing (*.eps). It is normally called automatically by the main ortho.sh pipeline, but you can rerun it manually if needed (e.g. to use a different set of gene annotations), via a command like
```
    ../ortho-fig.sh human annot_dir
```
The reference species must be one for which you have already run ortho.sh. If you do not specify an annotation directory, the default is to use fig_annot.d, which contains your original annotations from annot.d plus pseudogenes that have been inferred by the pipeline. The colors for the gene boxes are specified in the file ortho-fig.colors, which you can edit if desired. By default this file is located in the package's resources directory.
orthofig.jar
This is the Java program that draws the PostScript figures from the *.fig files. You can rerun it manually to change the drawing parameters, but only after the *.fig files have been created. The ortho-fig.sh script prints the parameters it is using, so you can just tweak the ones you want to. For an explanation of the available parameters, run the command
```
    java -jar ../orthofig.jar -help
```
gmaj-ortho.sh
This script runs Gmaj to view the orthology calls between the reference and another species. The orthologous alignments are shown superimposed (in black) on the full set of chained pairwise alignments between the two sequences (brown). Use a command like
```
    ../gmaj-ortho.sh human vervet orth_type annot_dir
```
where orth_type is one of "context", "content", or "cage" (the last specifies the preliminary orthology calls made by the conversion pipeline's CAGE program). Orthology results by context (X-orthology) and by content (N-orthology) are only available for the reference species you specified when running ortho.sh, but the CAGE calls are produced for all reference species. Again, if you don't specify an annotation directory, then fig_annot.d is used by default (unless you only ran conversion.sh instead of ortho.sh, in which case fig_annot.d was not created, so annot.d is used).

Gmaj can draw annotations on the alignment plots in the form of colored background bands called underlays. By default the CHAP scripts build underlays for Gmaj automatically from your gene annotation files, but you can override this by supplying your own underlay files (e.g. to include items other than genes and exons). These files must have names like human.underlays, vervet.underlays, etc., and follow the format specified in the documentation for the main release of Gmaj (except that a new color PaleGray has been added for CHAP). You can put them either in the annotation directory you specify or in annot.d (putting them in fig_annot.d is also possible but not recommended because that directory is wiped out and recreated each time ortho.sh is run). Note that the default underlays are placed in temp_underlays.d; this directory is wiped out and recreated whenever the Gmaj scripts are run, but you can use the files in it as examples or templates for making your own custom underlay files.
gmaj-conv.sh
This script runs Gmaj to view the conversion calls for a particular reference species and examine the evidence for them. It has a number of parameters available for customization, but only the reference species is required.
```
    ../gmaj-conv.sh human annot_dir genomic_offset "title" exon_color
```
Examples:
```
    ../gmaj-conv.sh human
    ../gmaj-conv.sh human my_annot.d 31334805 "Conversions in the human CCL region" LightYellow
    ../gmaj-conv.sh human "" 0 "" None
```
The parameters are position-dependent; if you want to keep the default annotations, or if you do not want an offset or a title, then use "", 0, and "" respectively to reach subsequent options. As before, the default annotation directory is fig_annot.d if it exists, otherwise annot.d. The genomic_offset is added to all position labels in the reference sequence, so they can be displayed with respect to e.g. the entire chromosome instead of the provided cluster sequence. The title is applied to the Gmaj window, and exon_color is used for building default underlays if you haven't supplied custom ones for a particular sequence (see the discussion of underlay files above). The list of valid underlay colors is available in the documentation for the main release of Gmaj (except that a new color PaleGray has been added for CHAP), or you can specify "None" to have this script suppress all underlays. The default exon color is LightGray.
gc-info
This is a compiled C program located with the other binaries (by default in the package's bin directory). It computes some summary statistics about the detected conversions in each of the species, and prints them in a tab-separated format with column headers.
```
    ../bin/gc-info non-redundant.gc annot.d self.d
```
infer-annot.sh
This script aims to help you obtain estimated gene annotations for non-reference species from those of a reference species, using the Wise2 software from EBI (Birney et al. 2004).
First, download and install Wise2 according to the instructions that come with it. If the installed location is not in your command path, modify the right-hand side of the line
```
    GENEWISE=genewise
```
near the start of CHAP's infer-annot.sh script to specify the path for the genewise executable on your computer.
Next, go to your cluster directory, create the annot.d subdirectory, and put your annotation file for the reference species in it (e.g. human.codex), as described in the Data Preparation section. (You could use any subdirectory name for this inference step, but it needs to be called annot.d in order for the ortho.sh and conversion.sh scripts to find it later.) Put all of your sequences in the seq.d directory as usual.
Finally, from the cluster directory, run the command
```
    ../infer-annot.sh human annot.d
```
This will use the reference annotations to estimate gene and exon locations for all of the other sequences that don't already have annotation files, and put the new *.codex files in the same directory (annot.d). Of course, you may edit these as desired before going on to run the ortho.sh or conversion.sh pipeline.
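After editing inferred annotations by hand, a quick format check can catch mistakes before a long pipeline run. The following sketch is not part of CHAP. It assumes that every non-blank line is either a gene line (direction, start, end, name) or an exon line (start, end), and that each exon lies within the coding range given on its gene line; the second assumption is an interpretation of the format description, not a documented rule.

```shell
#!/bin/sh
# Hypothetical checker (not part of CHAP) for the "coding exons" format
# described in the Data Preparation section.
# Prints one message per problem and returns non-zero if any were found.
# Usage: check_codex annot.d/vervet.codex
check_codex() {
    awk '
        function bad(msg) { printf "line %d: %s\n", NR, msg; errors++ }
        NF == 0 { next }
        $1 == ">" || $1 == "<" || $1 == "|" {
            # Gene line: direction, coding start, coding end, name.
            if (NF < 4 || $2 + 0 > $3 + 0) bad("malformed gene line")
            gstart = $2 + 0; gend = $3 + 0; prev = 0; ingene = 1
            next
        }
        NF == 2 && ingene {
            # Exon line: must lie in the gene and follow the previous exon.
            if ($1 + 0 > $2 + 0)
                bad("exon start exceeds end")
            if ($1 + 0 < gstart || $2 + 0 > gend)
                bad("exon outside gene range")
            if ($1 + 0 <= prev)
                bad("exons overlap or are not in increasing order")
            prev = $2 + 0
            next
        }
        { bad("unrecognized line") }
        END { exit (errors > 0) }
    ' "$1"
}
```

A file that passes this check is not guaranteed to be accepted by the pipelines; the check only covers the rules stated in this document.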

cleanout.sh

This script is provided to help clean up a specified cluster directory, removing files and subdirectories added by the CHAP pipelines. Any material added by users to pipeline-created directories will be wiped out when the directories are removed, but other user files will generally be left alone.

```
    ../cleanout.sh cluster_dir clean_level refseq_name
```

If you are already in the cluster directory to be cleaned, you can use "." for that parameter. The clean_level controls which files and directories are removed, with higher values specifying increasingly thorough/drastic cleanup, as follows. They are cumulative, with each level including all lower ones.

`0`:	temporary scratch files normally deleted automatically by the pipelines; useful if a script did not finish due to an error
`1`:	additional intermediate output from pipeline programs (but final results and files needed by Gmaj and the figure generator are kept)
`2`:	result files for all reference sequences other than the specified one
`3`:	all output except RepeatMasker results and Gmaj user preferences; useful with the pipelines' `--no_rm` option to avoid the delay of re-masking
`4`:	all output; only the original user input should remain

The refseq_name is only used for level 2. It specifies the reference sequence of the results you want to keep; others will be discarded.

Examples:

```
    ../cleanout.sh . 1          # good for routine tidying
    ../cleanout.sh . 2 human    # used on aglobin.example to save space
```

For details on exactly which files are deleted at which levels, please see the comments for the variable assignments in the top section of the script.

References

Birney E, Clamp M, Durbin R (2004) GeneWise and Genomewise. Genome Res. 14:988. PubMed 15123596

Smit AFA, Hubley R, Green P (1996-2010) RepeatMasker Open-3.0. Unpublished; http://www.repeatmasker.org.

Song G, Hsu C-H, Riemer C, Zhang Y, Kim HL, Hoffmann F, Zhang L, Hardison RC, NISC Comparative Sequencing Program, Green ED, Miller W (2011) Conversion events in gene clusters. BMC Evol. Biol. 11:226. PubMed 21798034

Song G, Riemer C, Dickins B, Kim HL, Zhang L, Zhang Y, Hsu C-H, Hardison RC, NISC Comparative Sequencing Program, Green ED, Miller W (2012) Revealing mammalian evolutionary relationships by comparative analysis of gene clusters. To appear in Genome Biol. Evol.

March 2012