Genomic Environment Predicts
Expression Patterns
on the Human Inactive X
Chromosome
Written by Chungoo Park
If you have any questions concerining the
source codes, their usages, or need assistance, please feel free to contact me
(cxp440@psu.edu).
v Analyze I or E subgenomes in Xp22.
Step
One: Define the Inactivated
and Escape subgenome composition.
Extract
gene lists from X-inactivation profile which came from a supplementary file in
Carrel and Willard (2005) ÒX-inactivation profile reveals extensive variability
in X-linked gene expression in femalesÓ Nature 17;434(7031):400-404. Genes were
considered to be X inactivated if silenced in all nine inactive X containing
hybrids assayed or if expressed in only a single hybrid (0/9 or 1/9); they will
be a candidate members for I subgenome. Similarly, genes were scored as
escaping XCI if expressed in eight or nine out of nine inactivate X hybrids
tested (8/9 or 9/9); they will be used for E subgenome. Note especially to
analyze the entire X chromsome, ESTs were not considered to make I or E
subgenome.
1.
Create two files (ÒWhole_X_new_subject_list.txtÓ and ÒWhole_X_new_escape_
list.txtÓ). Each file has geneÕs profile which met the
each criteria.
*
Source Code: Make_list_for_subject_escape.pl
#!/usr/bin/perl
-w
use
strict;
use
warnings;
my
$file = "../../SourceData/Expression_X.txt";
my
$Out_Subject = "Whole_X_new_subject_list.txt";
my
$Out_Escape = "Whole_X_new_escape_list.txt";
my
$cLine_File;
open
(FILE, "<$file") || die
"Sorry, I couldn't open the After_lists_known.txt file: $!\n";
while
(defined($cLine_File = <FILE>)){
chomp($cLine_File);
if
(!($cLine_File =~ /Pseudoautosomal/)){
if
($cLine_File =~ /0 \/ 9/){
open
(SUB, ">>$Out_Subject") || die
"Sorry, I couldn't open the Whole_X_subject_list.txt file: $!\n";
print
SUB "$cLine_File\n";
close
(SUB);
}
if
($cLine_File =~ /1 \/ 9/){
open
(SUB, ">>$Out_Subject") || die
"Sorry, I couldn't open the Whole_X_subject_list.txt file: $!\n";
print
SUB "$cLine_File\n";
close
(SUB);
}
if
($cLine_File =~ /8 \/ 9/){
open
(ESC, ">>$Out_Escape") || die
"Sorry, I couldn't open the Whole_X_escape_list.txt file: $!\n";
print
ESC "$cLine_File\n";
close
(ESC);
}
if
($cLine_File =~ /9 \/ 9/){
open
(ESC, ">>$Out_Escape") || die
"Sorry, I couldn't open the Whole_X_escape_list.txt file: $!\n";
print
ESC "$cLine_File\n";
close
(ESC);
}
}
}
close(FILE);
2.
Divide an input file (ÒWhole_X_new_subject_list.txtÓ or ÒWhole_X_new_escape_
list.txtÓ) into two files (ÒWhole_X_new_subject_list_none_ESTs.txtÓ
and ÒWhole _X_new_subject_list_Ests.txtÓ OR
ÒWhole_X_new_escape_list_none_ESTs.txtÓ and ÒWhole_X_new_escape_list_Ests.txtÓ)
by the presence or absence of EST.
*
Source Code: Make_list_by_EST.pl
#!/usr/bin/perl
-w
use
strict;
use
warnings;
my
$file = "Whole_X_new_escape_list.txt";
my
$Out_None_Ests = "Whole_X_new_escape_list_none_ESTs.txt";
my
$Out_Ests = "Whole_X_new_escape_list_Ests.txt";
my
$cLine_File;
open
(FILE, "<$file") || die
"Sorry, I couldn't open the input file: $!\n";
while
(defined($cLine_File = <FILE>)){
chomp($cLine_File);
if
($cLine_File =~ /EST/){
if
($cLine_File =~ /EST support/){
open
(NON_EST, ">>$Out_None_Ests") || die
"Sorry, I couldn't open the output_none_EST file: $!\n";
print
NON_EST "$cLine_File\n";
close
(NON_EST);
}else{
open
(EST, ">>$Out_Ests") || die
"Sorry, I couldn't open the output_EST file: $!\n";
print
EST "$cLine_File\n";
close
(EST);
}
}else{
open
(NON_EST, ">>$Out_None_Ests") || die
"Sorry, I couldn't open the output_none_EST file: $!\n";
print
NON_EST "$cLine_File\n";
close
(NON_EST);
}
}
close(FILE);
3.
Build three different sized subgenomes (100kb, 200kb and 500kb) surrounding the
transcription start sites (TSSs) of genes. Note that the 100/200/500kb contigs
include 50/100/250kb upstream and downstream of the TSS.
* Source Code for I subgenome: Make_Subgenome_subject_against_WholeX.pl
- Output: Subgenome_Subject_50K_wo_EST.txt or
Subgenome_Subject_100K_wo_EST.txt or
Subgenome_Subject_250K_wo_EST.txt
#!/usr/bin/perl
-w
use
strict;
use
warnings;
my
$subject = "Whole_X_new_subject_list_none_ESTs.txt";
my
$escape = "Whole_X_new_escape_list_none_ESTs.txt";
my
%unManaged = ();
my
%Managed = ();
my
$sequence_length = 250000;
sub
check_escape{
my
$func_start_seq = $_[0];
my
$func_end_seq = $_[1];
open
(CLON_ESC, "< $escape") ||
die
"Sorry, I couldn't open the escape.txt for clone: $!\n";
my
$func_cLine;
my
($esc_start_seq, $esc_end_seq);
while
(defined($func_cLine = <CLON_ESC>)){
my
@func_Database = $func_cLine =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;
my
$func_gene_start = $func_Database[0];
my
$func_gene_end = $func_Database[1];
my
$func_strand = $func_Database[2];
if
($func_strand eq "-"){
$esc_start_seq
= $func_gene_end - $sequence_length;
$esc_end_seq
= $func_gene_end + $sequence_length;
}else{
$esc_start_seq
= $func_gene_start - $sequence_length;
$esc_end_seq
= $func_gene_start + $sequence_length;
}
if
(($func_start_seq >= $esc_start_seq) && ($func_start_seq <
$esc_end_seq)){
return
0;
}
if
(($func_end_seq > $esc_start_seq) && ($func_end_seq <=
$esc_end_seq)){
return
0;
}
}
return
1;
close(CLON_ESC);
}
sub
merge_seq{
my
$func_first = $_[0];
my
$func_end = $_[1];
my
($begin, $end);
delete
$unManaged{$func_first};
while
(($begin, $end) = each(%unManaged)){
if
(($func_first < $begin) && ($func_end > $begin)){
$func_end
= $end;
delete
$unManaged{$begin};
}
if
(($func_first < $end) && ($func_end > $end)){
$func_first
= $begin;
delete
$unManaged{$begin};
}
}
$Managed{$func_first}
= $func_end;
}
open
(SUBJECT, "< $subject") ||
die
"Sorry, I couldn't open the subject.txt: $!\n";
my
$cLine;
my
@Database = ();
my
($first, $last);
while
(defined($cLine = <SUBJECT>)){
my
($start_seq, $end_seq);
@Database
= $cLine =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;
my
$gene_start = $Database[0];
my
$gene_end = $Database[1];
my
$strand = $Database[2];
if
($strand eq "-"){
$start_seq
= $gene_end - $sequence_length;
$end_seq
= $gene_end + $sequence_length;
}else{
$start_seq
= $gene_start - $sequence_length;
$end_seq
= $gene_start + $sequence_length;
}
my
$result_escape = check_escape($start_seq, $end_seq);
if
($result_escape eq "1"){
$unManaged{$start_seq}
= $end_seq;
}else{
# print
"$gene_start $gene_end $start_seq $end_seq\n"
}
}
while
(($first, $last) = each(%unManaged)){
merge_seq($first,
$last);
}
my
$myCount = 0;
for(;
$myCount < 100; $myCount++){
%unManaged
= %Managed;
%Managed
= ();
while
(($first, $last) = each(%unManaged)){
merge_seq($first,
$last);
}
}
my
($aa, $bb);
while
(($aa, $bb) = each(%Managed)){
print
"Managed \t $aa \t $bb\n";
}
close(SUBJECT);
* Source Code for E subgenome: Make_Subgenome_escape_against_WholeX.pl
-
Output: Subgenome_Escape_50K_wo_EST.txt or
Subgenome_Escape_100K_wo_EST.txt or
Subgenome_Escape_250K_wo_EST.txt
#!/usr/bin/perl
-w
use
strict;
use
warnings;
my
$subject = "Whole_X_new_subject_list_none_ESTs.txt";
my
$escape = "Whole_X_new_escape_list_none_ESTs.txt";
my
%unManaged = ();
my
%Managed = ();
my
$sequence_length = 250000;
sub
check_subject{
my
$func_start_seq = $_[0];
my
$func_end_seq = $_[1];
open
(CLON_SUB, "< $subject") ||
die
"Sorry, I couldn't open the subject.txt for clone: $!\n";
my
$func_cLine;
my
($esc_start_seq, $esc_end_seq);
while
(defined($func_cLine = <CLON_SUB>)){
my
@func_Database = $func_cLine =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;
my
$func_gene_start = $func_Database[0];
my
$func_gene_end = $func_Database[1];
my
$func_strand = $func_Database[2];
if
($func_strand eq "-"){
$esc_start_seq
= $func_gene_end - $sequence_length;
$esc_end_seq
= $func_gene_end + $sequence_length;
}else{
$esc_start_seq
= $func_gene_start - $sequence_length;
$esc_end_seq
= $func_gene_start + $sequence_length;
}
if
(($func_start_seq >= $esc_start_seq) && ($func_start_seq <
$esc_end_seq)){
return
0;
}
if
(($func_end_seq > $esc_start_seq) && ($func_end_seq <=
$esc_end_seq)){
return
0;
}
}
return
1;
close(CLON_SUB);
}
sub
merge_seq{
my
$func_first = $_[0];
my
$func_end = $_[1];
my
($begin, $end);
delete
$unManaged{$func_first};
while
(($begin, $end) = each(%unManaged)){
if
(($func_first < $begin) && ($func_end > $begin)){
$func_end
= $end;
delete
$unManaged{$begin};
}
if
(($func_first < $end) && ($func_end > $end)){
$func_first
= $begin;
delete
$unManaged{$begin};
}
}
$Managed{$func_first}
= $func_end;
}
open
(ESCAPE, "< $escape") ||
die
"Sorry, I couldn't open the escape.txt: $!\n";
my
$cLine;
my
@Database = ();
my
($first, $last);
while
(defined($cLine = <ESCAPE>)){
my
($start_seq, $end_seq);
@Database
= $cLine =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;
my
$gene_start = $Database[0];
my
$gene_end = $Database[1];
my
$strand = $Database[2];
if
($strand eq "-"){
$start_seq
= $gene_end - $sequence_length;
$end_seq
= $gene_end + $sequence_length;
}else{
$start_seq
= $gene_start - $sequence_length;
$end_seq
= $gene_start + $sequence_length;
}
my
$result_subject = check_subject($start_seq, $end_seq);
if
($result_subject eq "1"){
$unManaged{$start_seq}
= $end_seq;
}else{
# print
"$gene_start $gene_end $start_seq $end_seq\n";
}
}
while
(($first, $last) = each(%unManaged)){
merge_seq($first,
$last);
}
my
$myCount = 0;
for(;
$myCount < 100; $myCount++){
%unManaged
= %Managed;
%Managed
= ();
while
(($first, $last) = each(%unManaged)){
merge_seq($first,
$last);
}
}
my
($aa, $bb);
while
(($aa, $bb) = each(%Managed)){
print
"Managed \t $aa \t $bb\n";
}
close(ESCAPE);
Note:
at the $sequence_length variable, the length of contig was assigned.
4.
Confirm the outputs by the step 2: Whether the subgenomes include the ESTs.
* Source Code for I subgenome: Check_Subject_Subgenome.pl
#!/usr/bin/perl
-w
use
strict;
use
warnings;
my
$Subject_file = "Subgenome_Subject_250K_wo_EST.txt";
my
$Escape_Ests = "Whole_X_new_escape_list_Ests.txt";
my
$cLine_File;
open
(ESCAPE, "<$Escape_Ests") || die
"Sorry, I couldn't open the Escape_Ests file: $!\n";
while
(defined($cLine_File = <ESCAPE>)){
chomp($cLine_File);
my
@EST_Database = $cLine_File =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;
my
$est_start = $EST_Database[0];
my
$est_end = $EST_Database[1];
my
$cLine_Subject;
open
(FILE, "<$Subject_file") || die
"Sorry, I couldn't open the Subject_file: $!\n";
while
(defined($cLine_Subject = <FILE>)){
chomp($cLine_Subject);
my
@Subject_Database = $cLine_Subject =~ /(\S+)\s*/g;
my
$contig_start = $Subject_Database[1];
my
$contig_end = $Subject_Database[2];
if
(($contig_start <= $est_start) && ($contig_end >= $est_start)){
if
($contig_end >= $est_end){
print
"$cLine_Subject \t Include_EST \t $cLine_File \n";
}else{ print
"$cLine_Subject \t Overlap_Start \t $cLine_File \n"; }
}else{
if (($contig_start >= $est_start)
&& ($contig_start <= $est_end)){
if
($contig_end >= $est_end){
print
"$cLine_Subject \t Overlap_End \t $cLine_File \n";
}else{ print
"$cLine_Subject \t Include_Contig \t $cLine_File \n"; }
}
}
}
close(FILE);
}
close(ESCAPE);
* Source Code for E subgenome: Check_Escape_Subgenome.pl
#!/usr/bin/perl
-w
use
strict;
use
warnings;
my
$Subject_file = "Subgenome_Escape_250K_wo_EST.txt";
my
$Escape_Ests = "Whole_X_new_subject_list_Ests.txt";
my
$cLine_File;
open
(ESCAPE, "<$Escape_Ests") || die
"Sorry, I couldn't open the Escape_Ests file: $!\n";
while
(defined($cLine_File = <ESCAPE>)){
chomp($cLine_File);
my
@EST_Database = $cLine_File =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;
my
$est_start = $EST_Database[0];
my
$est_end = $EST_Database[1];
my
$cLine_Subject;
open
(FILE, "<$Subject_file") || die
"Sorry, I couldn't open the Subject_file: $!\n";
while
(defined($cLine_Subject = <FILE>)){
chomp($cLine_Subject);
my
@Subject_Database = $cLine_Subject =~ /(\S+)\s*/g;
my
$contig_start = $Subject_Database[1];
my
$contig_end = $Subject_Database[2];
if
(($contig_start <= $est_start) && ($contig_end >= $est_start)){
if
($contig_end >= $est_end){
print
"$cLine_Subject \t Include_EST \t $cLine_File \n";
}else{ print
"$cLine_Subject \t Overlap_Start \t $cLine_File \n"; }
}else{
if (($contig_start >= $est_start)
&& ($contig_start <= $est_end)){
if
($contig_end >= $est_end){
print
"$cLine_Subject \t Overlap_End \t $cLine_File \n";
}else{ print
"$cLine_Subject \t Include_Contig \t $cLine_File \n"; }
}
}
}
close(FILE);
}
close(ESCAPE);
5.
Confirm the outputs by the step 3: (1) whether contigs are overlapped with
contigs having opposite inactivation profile; (2) within the range of contigs
whether there exist genes having opposite inactivation profile
* Source Code for (1): Verify_Subgenome.pl
#!/usr/bin/perl
-w
use
strict;
use
warnings;
my
$subject = "Subgenome_Subject_250K_wo_EST.txt";
my
$escape = "Subgenome_Escape_250K_wo_EST.txt";
open
(SUB, "< $subject") ||
die
"Sorry, I couldn't open the subject for clone: $!\n";
my
$cLine_Sub;
while
(defined($cLine_Sub = <SUB>)){
my
@cSub_DB = $cLine_Sub =~ /(\S+)\s*/g;
my
$Sub_begin = $cSub_DB[1];
my
$Sub_end = $cSub_DB[2];
open
(ESC, "< $escape") ||
die
"Sorry, I couldn't open the escape for clone: $!\n";
my
$cLine_Esc;
while
(defined($cLine_Esc = <ESC>)){
my
@cEsc_DB = $cLine_Esc =~ /(\S+)\s*/g;
my
$Esc_begin = $cEsc_DB[1];
my
$Esc_end = $cEsc_DB[2];
if
(($Sub_begin <= $Esc_begin) && ($Sub_end >= $Esc_begin)){
print
"$Sub_begin \t $Sub_end \t $Esc_begin \t $Esc_end\n";
}
if
(($Sub_begin <= $Esc_end) && ($Sub_end >= $Esc_end)){
print
"$Sub_begin \t $Sub_end \t $Esc_begin \t $Esc_end\n";
}
}
close(ESC);
}
close(SUB);
* Source Code for (2): Verify_Subgenome_2.pl
#!/usr/bin/perl
-w
use
strict;
use
warnings;
#my
$subject = "Subgenome_Subject_250K_wo_EST.txt";
#my
$escape = "Whole_X_new_escape_list.txt";
my
$subject = "Subgenome_Escape_250K_wo_EST.txt";
my
$escape = "Whole_X_new_subject_list.txt";
open
(SUB, "< $subject") ||
die
"Sorry, I couldn't open the subject for clone: $!\n";
my
$cLine_Sub;
while
(defined($cLine_Sub = <SUB>)){
my
@cSub_DB = $cLine_Sub =~ /(\S+)\s*/g;
my
$Sub_begin = $cSub_DB[1];
my
$Sub_end = $cSub_DB[2];
open
(ESC, "< $escape") ||
die
"Sorry, I couldn't open the escape for clone: $!\n";
my
$cLine_Esc;
while
(defined($cLine_Esc = <ESC>)){
my
@cEsc_DB = $cLine_Esc =~ /(\S+)\s*/g;
my
$Esc_begin = $cEsc_DB[0];
my
$Esc_end = $cEsc_DB[2];
if
(($Sub_begin <= $Esc_begin) && ($Sub_end >= $Esc_begin)){
print
"$Sub_begin \t $Sub_end \t $Esc_begin \t $Esc_end\n";
}
if
(($Sub_begin <= $Esc_end) && ($Sub_end >= $Esc_end)){
print
"$Sub_begin \t $Sub_end \t $Esc_begin \t $Esc_end\n";
}
}
close(ESC);
}
close(SUB);
Step
Two: Assess the presence of
each possible oligomer of specified size (8-, 12-, 16-, 20-, and 24-mers) and
count these oligomers within each subgenomes. First of all, we recall the
procedure which criteria was used for the Xp22 analysis; oligomers were
considered significantly overrepresented in a subgenome if they were enriched
at least five-fold and occurred at least ten times in that subgenome.
Remarkable one was that the size of I subgenome was similar to that of E
subgenome.
After
finding overrepresented oligomer, they were assigned to either specific class
of repeatitive elements or to none-repeatitive elements.
1.
Extract genomic sequence using a coordinate of each contigs for I and E Subgenome.
* Source Code: GetSeq_From_Position_PCG.pl
-
input: ÒSubgenome_Subject_50K_wo_EST.txtÓ
- output:
../[50k|100K|250K]_Data/[input_name].txtÓnÓ
- variables
to be modified: $outfile
#!/usr/bin/perl
-w
#########################################################################
##
Extract the genomic sequences from input files ##
##
including locations of start and end with respect to subgenome ##
#########################################################################
use
Getopt::Std;
getopts('i:');
$path
= $opt_i; #The list of files
#
Define the variable for file analysis
my
@cDatabase = ();
my
$file;
my
$cTempDatabase = "";
my
$cTempFile = "";
my
@cFile = ();
my
$X_chr =
"/Users/chungoopark/Research/X-inactivation/Sequence_Analysis/Chromosome/chromFa/chrX.fa";
my
$outfile = "/Users/chungoopark/Research/X-inactivation/Whole_X_New_Oligomer/50K_Data/";
my
($dStart, $dEnd);
my
$index = 1;
my
@sequence = ();
my
$sequenceintoDB;
my
($dbID, $dbNum, $indexSize);
undef
$/;
open(DBFILE,
"< $X_chr") ||
die
"Sorry, I couldn't open the database file: $!\n";
while(defined
($cTempDatabase = <DBFILE>)){
@cDatabase
= split(/\n/, $cTempDatabase);
}
close(DBFILE);
calPositionFile();
sub
calPositionFile{
$/
= "\n";
#open
the novel.txt file
open(FILE,
"< $path") ||
die
"Sorry, I couldn't open the input file with respect to subgenome:
$!\n";
while(defined
($cTempFile = <FILE>)){
@cFile
= $cTempFile =~ /(\S+)\s*/g;
$dStart
= $cFile[1];
$dEnd
= $cFile[2];
calSequence();
$sequenceintoDB
= join(" ", @sequence);
$sequenceintoDB
=~ s/\s//g;
open(OUT,
">$outfile$path$index") ||
die
"Sorry, I couldn't make outfile: $!\n";
print
OUT $sequenceintoDB;
$index++;
}
close(FILE);
}
sub
calSequence{
my
$iCount;
my
($StartArray, $StartPosition);
my
($EndArray, $EndPosition);
my
$NumArray;
my
@TempArray;
@sequence
= ();
($StartArray,
$StartPosition) = extractArray($dStart);
($EndArray,
$EndPosition) = extractArray($dEnd);
foreach
$NumArray ($StartArray .. $EndArray){
#Initiate
varable
@TempArray
=();
if
($NumArray == $StartArray || $NumArray == $EndArray){
@TempArray
= $cDatabase[$NumArray] =~ /([a-zA-Z])/g;
#
Error Check
if
(( $#TempArray + 1) != 50){
die
"Parsing error ... Array Count...\n";
}
if
($NumArray == $StartArray){
$iCount
= $StartPosition;
while
($iCount < 50){ push(@sequence,
$TempArray[$iCount++]); }
}else{
$iCount
= 0;
while
($iCount < $EndPosition + 1){ push(@sequence,
$TempArray[$iCount++]); }
}
}elsif
(( $StartArray < $NumArray) && ($EndArray > $NumArray)){
@TempArray
= $cDatabase[$NumArray] =~ /(\S)/g;
#
Error Check
if
(( $#TempArray + 1) != 50){
die
"Parsing error ... Array Count...\n";
}
$iCount
= 0;
while
($iCount < 50){ push(@sequence,
$TempArray[$iCount++]); }
}
}
}
sub
extractArray{
my
($Array, $Position);
my
$Value = $_[0];
$Array
= ($Value / 50) + 1;
$Array
= sprintf("%d", $Array);
$Position
= ($Value % 50) - 1;
return
($Array, $Position);
}
2.
Concatenate all contigs representing either genes which were inactivated or
escaping inactivation. Finally make two subgenomes that were call I subgenome
and E subgenome, respectively.
* Source Code: Make_Whole_Subgenome.pl
-
input: all files that were generated by the previous step (1)
- output:
../[50k|100K|250K]_Data/Whole_[Subject|Escape]_[50K|100K|250K].txt[input_name].txt
- variables to be modified: $file,
$Out, $number_contig
#!/usr/bin/perl
-w
use
strict;
use
warnings;
#my
$file = "../50K_Data/Subgenome_Subject_50K_wo_EST.txt";
#my
$Out = "../50K_Data/Whole_Subject_50K.txt";
#my
$number_contig = 172;
my
$file = "../50K_Data/Subgenome_Escape_50K_wo_EST.txt";
my
$Out = "../50K_Data/Whole_Escape_50K.txt";
my
$number_contig = 21;
my
$temp_number = 1;
for(;
$temp_number <= $number_contig; $temp_number++){
my
$contig_name = $file . $temp_number;
my
@temp = ("cat $contig_name >> $Out");
system(@temp);
}
3.
Count possible oligomers of specified size (8-, 12-, 16-, 20- and 24-mers)
within each subgenome.
* Source Code: CountMer_PCG.pl
- input: output of previous step(2) with option Ò-sÓ
and number (denote the length of oligomer; e.g. 8, 12, 16, 20 and 24) with
option Ò-pÓ
- output: name of first parameter + name of second parameter
- variables to be modified: None
#!/usr/bin/perl
#CountMer_PCG.pl
use
Getopt::Std;
sub
complement {
my
$basein = @_[0];
if
($basein =~ s/G/c/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}elsif
($basein =~ s/A/t/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}elsif
($basein =~ s/T/a/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}elsif
($basein =~ s/C/g/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}
}
sub
reversesite {
my
$sitein = @_[0];
my
@REsite = split //, $sitein;
my
@revsite; #Array to hold the reversed site
my
$countletter = 0;
while
($#REsite >= 0) {
$revsite[$countletter]
= pop(@REsite);
$revsite[$countletter]
= complement($revsite[$countletter]);
$countletter++;
}
my
$returnsite = "";
foreach
$revsite (@revsite){
$returnsite
= $returnsite . $revsite;
}
return
$returnsite;
}
sub
notpalin {
my
$InHex = @_[0];
if
($InHex eq reversesite($InHex)) {
return
0;
}
else {
return
1;
}
}
getopts('p:s:');
$SeqFile
= $opt_s;
$PolSize
= $opt_p;
$OutFile
= $SeqFile . "." . $PolSize;
#Grab
the sequence from the files and throw it into an array
#file
is a flat text FASTA file, so skip lines starting with >
open(SEQFILE,
"$SeqFile") or die("Could not open Sequence File $SeqFile
\n");
for
($i = 0; $i < ($PolSize - 1); $i++) {
$PolHold[$i]
= "N";
}
%PolyHash
= ();
$ContigSize
= 0;
while
($line = <SEQFILE>) {
chomp($line);
@bases
= split //,$line;
if
($bases[0] ne '>'){
for
($i = 0; $i < @bases; $i++){
$bases[$i]
=~ tr/a-z/A-Z/;
unless
($bases[$i] =~ m/G|A|T|C/){
$bases[$i]
= "N";
}
$PolHold[($PolSize
- 1)] = $bases[$i];
$hexout
= "";
for
($hexpos = 0; $hexpos <= ($PolSize -1); $hexpos++) {
$hexout .= $PolHold[$hexpos];
}
$PolyHash{$hexout}++;
#print
"$hexout \t $PolyHash{$hexout} \n";
for
($j =0; $j <($PolSize - 1); $j++){
$PolHold[$j]
= $PolHold [($j + 1)];
}
$ContigSize++;
}
}
}
close(SEQFILE);
print
"Read Sequence. \n";
#Step
through the Hash to eliminate duplicate entries and N-containing entries.
while
(($key, $value) = each %PolyHash) {
# print
"$key \t $value\n";
if
($key =~ m/N/) {
delete
$PolyHash{$key};
}
elsif (notpalin($key)) {
$revkey
= reversesite($key);
# if
(exists $PolyHash{$revkey}) {
############################
#
Modified #
############################
if
($revkey gt $key) {
$PolyHash{$key}
+= $PolyHash{$revkey};
delete
$PolyHash{$revkey};
}elsif
($key gt $revkey) {
$PolyHash{$revkey}
+= $PolyHash{$key};
delete
$PolyHash{$key};
}
# }
}
}
print
"Cleaned up Escape. \n ";
#Open
a file to output the polymer count results
open(RESTFILE,
">$OutFile") or die ('Could not open Result File');
while
(($key, $value) = each %PolyHash) {
# print
"$key \t $value\n";
print
RESTFILE "$key\t$value\n";
}
close
(RESTFILE);
print
"Reported Polymers. \n";
print
"Contig is $ContigSize Base Pairs. \n";
4.
Pick up oligomers that satisfy our criteria in terms of number of occurrence
(10 or more instance).
*
Source Code: New_10Inst.pl
- input: output
of previous step(3) with option Ò-iÓ
- output: name of input file + Ò10InstÓ
#!/usr/bin/perl
#10Inst.pl
#Opens
up a Count output file and reprints just those Polymers with 10 or more
occurences
#Output
File = InputFile.10Inst
use
Getopt::Std;
getopts('i:');
$InFile
= $opt_i;
$OutFile
= $InFile . ".10Inst";
open(INFILE,
"$InFile") or die("Failed on Input $InFile $!\n");
open(OUTFILE,">$OutFile")
or die("Failed on Output $OutFile $!\n");
while
($line = <INFILE>){
chomp($line);
@lineparts
= split /\t/, $line;
if
($lineparts[1] > 9){
print
OUTFILE "$line\n";
# print
"$line\n";
}
}
close(INFILE);
close(OUTFILE);
5.
Run shell scripts in order to find how many oligomers exist each contigs in I
and E subgenomes. These shell scripts used the FFP2_PCG.pl
*
Source Code: [50K | 100K | 250K]_myScript_for10Inst_[8|12|16|20|24].sh
perl FFP2_PCG.pl -s ./50K_Data/Subgenome_Escape_50K_wo_EST.txt1 -p
Whole_Escape_50K.txt.8.10Inst
****
* Additional source Code: FFP2_PCG.pl
#!/usr/bin/perl
use
Getopt::Std;
sub
complement {
my
$basein = @_[0];
if
($basein =~ s/G/c/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}elsif
($basein =~ s/A/t/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}elsif
($basein =~ s/T/a/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}elsif
($basein =~ s/C/g/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}
}
sub
reversesite {
my
$sitein = @_[0];
my
@REsite = split //, $sitein;
my
@revsite; #Array to hold the reversed site
my
$countletter = 0;
while
($#REsite >= 0) {
$revsite[$countletter]
= pop(@REsite);
# print
"$revsite[$countletter]";
$revsite[$countletter]
= complement($revsite[$countletter]);
# print
"$revsite[$countletter]";
$countletter++;
}
my
$returnsite = "";
foreach
$revsite (@revsite){
$returnsite
= $returnsite . $revsite;
}
#print
"$sitein $returnsite\n";
return
$returnsite;
}
getopts('p:s:');
$SeqFile
= $opt_s;
$PolFile
= $opt_p;
$OutFile
= $SeqFile . $PolFile;
%TestPol
= (); #Initialize Hash for storing Test Polymers
#Grab
the polymers from files and throw it into an array
open
(POLFILE, "$PolFile") or die ("Could not open Polymer File
$PolFile\n");
while
($line = <POLFILE>) {
@PolInfo
= split /\t/,$line;
$PolInfo[0]
=~ s/\s//;
$TestPol{$PolInfo[0]}
= "NONE found";
}
close
(POLFILE);
print
"Read Polymers in.\n";
@PolymerSize
= split //,$PolInfo[0];
$PolSize
= scalar(@PolymerSize);
#file
is a flat text FASTA file, so skip lines starting with >
open(SEQFILE,
"$SeqFile") or die("Could not open Sequence File $SeqFile \n");
for
($i = 0; $i < ($PolSize - 1); $i++) {
$PolHold[$i]
= "N";
}
$NotFound
= 1;
%PolyHash
= ();
while
($line = <SEQFILE>) {
chomp($line);
@bases
= split //,$line;
if
($bases[0] ne '>'){
for
($i = 0; $i < @bases; $i++){
$PolHold[($PolSize
- 1)] = $bases[$i];
$hexout
= "";
for
($hexpos = 0; $hexpos <= ($PolSize -1); $hexpos++) {
############################
#
Modified #
############################
$PolHold[$hexpos]
=~ tr/a-z/A-Z/;
$hexout .= $PolHold[$hexpos];
}
#Code
to make sure that you're using the lowest version of $hexout
$testseq1
= reversesite($hexout);
if
($hexout gt $testseq1) {
# print
"$hexout $testseq1\n";
$hexout
= $testseq1;
}
$PolyHash{$hexout}++;
#print
"$hexout \t $PolyHash{$hexout} \n";
for
($j =0; $j <($PolSize - 1); $j++){
$PolHold[$j]
= $PolHold [($j + 1)];
}
}
}
else {
print
"Error\n";
#Code
to compare the hashes
while
(($key, $value) = each %TestPol) {
if
(exists $PolyHash{$key}) {
if
($TestPol{$key} eq "NONE found") {
$TestPol{$key}
= "$CurrNT\t$PolyHash{$key}\t"
}
else {
$TestPol{$key}
.= "$CurrNT\t$PolyHash{$key}\t";
}
}
}
@NTDesc
= split /\s/,$line;
@NTName
= split /\W/, $NTDesc[0];
$CurrNT
= $NTName[4];
%PolyHash
= ();
print
"Starting NT $CurrNT\n";
}
}
while
(($key, $value) = each %TestPol) {
#############################
#
Modified #
#############################
$myRevkey
= reversesite($key);
if
(exists $PolyHash{$key}) {
if
($TestPol{$key} eq "NONE found") {
$TestPol{$key}
= "$CurrNT\t$PolyHash{$key}\t";
}
else {
$TestPol{$key}
.= "$CurrNT\t$PolyHash{$key}\t";
}
}elsif
(exists $PolyHash{$myRevkey}) {
if
($TestPol{$key} eq "NONE found") {
$TestPol{$key}
= "$CurrNT\t$PolyHash{$myRevkey}\t";
}
else {
$TestPol{$key}
.= "$CurrNT\t$PolyHash{$myRevkey}\t";
}
}
}
close(SEQFILE);
#Open
a file to output the polymer count results
open(RESTFILE,
">$OutFile") or die ('Could not open Result File');
while
(($key, $value) = each %TestPol) {
# unless
($value eq "NONE found"){
print
RESTFILE "$key\t$value\n";
# }
}
close
(RESTFILE);
6.
Apply second criteria in terms of the fold difference (i.e., to be
overrepresented in I subgenome, an oligomer has to be overrepsented 5-fold
based on the difference of subgenome size between I and E).
* Source
Code: Comp_Subgenome_5_against_[subject | escape]_PCG.pl50K_[Subject |
Escape]_contig_PCG.pl
- input:
For subject, Ò[50K | 100K| 250K]_Subject_Contig_[12 |
16 | 20 | 24].txtÓ with option Ò-sÓ and ÒWhol_Escape_[50K | 100K |
250K].txt.[12 | 16 | 20 | 24]Ó with option Ò-eÓ.
For escape, Ò[50K | 100K|
250K]_Escape_Contig_[12 | 16 | 20 | 24].txtÓ with option Ò-sÓ and
ÒWhol_Subject_[50K | 100K | 250K].txt.[12 | 16 | 20 | 24]Ó with option Ò-eÓ.
- output: Diff_5_ + name following with option Ò-sÓ + name following with option Ò-eÓ
For
Comp_Subgenome_5_against_subject_PCG.pl
#!/usr/bin/perl
#DecMatchMer5.pl
use
Getopt::Std;
getopts('e:s:');
$EscFile
= $opt_e;
$SubFile
= $opt_s;
$OutFile
= "Diff_5_" . $opt_s .
$opt_e;
$key
=0;
$line
= 0;
$value
= 0;
open(SEQFILE,
"$EscFile") or die("Could not open Data File $EscFile\n");
while
($line = <SEQFILE>) {
chomp($line);
($key,
$value) = split /\t/,$line;
$key
=~ s/\s//gis;
$value
=~ s/\s//gis;
$EscCnt{$key}
= $value;
}
close(SEQFILE);
print
"Read Escape Data. \n";
open(SEQFILE,
"$SubFile") or die("Could not open Data File $SubFile\n");
while
($line = <SEQFILE>) {
chomp($line);
($key,
$value) = split /\t/,$line;
$key
=~ s/\s//gis;
$value
=~ s/\s//gis;
$SubCnt{$key}
= $value;
}
close(SEQFILE);
print
"Read Subject Data. \n";
#Open
a file to output the comparison results
open(RESTFILE,
">$OutFile") or die ('Could not open Result File');
while
(($sub_oligo, $sub_number) = each(%SubCnt)){
if
($EscCnt{$sub_oligo}){
my
$number = $EscCnt{$sub_oligo};
if($sub_number
>= (5 * $number)){
print
RESTFILE "$sub_oligo\tESCAPE\t$number\t$sub_number\n";
}
}else{
if
($sub_number >= 5){
print
RESTFILE "$sub_oligo\tESCAPE\t0\t$sub_number\n";
}
}
}
close
(RESTFILE);
7.
Carry out LocFFP_PCG.pl in order to report which oligomers were found in a
particular location within genomic subgenome sequence.
*
Source Code: LocFFC_PCG.pl
- input: ÒWhole_[Subject | Escape].txt with option
Ò-sÓ AND output of previous step (e.g., Diff_5_*) with option Ò-pÓ
- output: Location + name following with option Ò-pÓ
#!/usr/bin/perl
#LocFFP.pl
#Location
Fast Find PolyMer
#opens
up a FASTA file and reports which polymers are found at what BP
use
Getopt::Std;
sub
complement {
my
$basein = @_[0];
if
($basein =~ s/G/c/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}elsif
($basein =~ s/A/t/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}elsif
($basein =~ s/T/a/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}elsif
($basein =~ s/C/g/gi) {
$basein
=~ tr/a-z/A-Z/;
return
$basein;
}
}
sub
reversesite {
my
$sitein = @_[0];
my
@REsite = split //, $sitein;
my
@revsite; #Array to hold the reversed site
my
$countletter = 0;
while
($#REsite >= 0) {
$revsite[$countletter]
= pop(@REsite);
# print
"$revsite[$countletter]";
$revsite[$countletter]
= complement($revsite[$countletter]);
# print
"$revsite[$countletter]";
$countletter++;
}
my
$returnsite = "";
foreach
$revsite (@revsite){
$returnsite
= $returnsite . $revsite;
}
#print
"$sitein $returnsite\n";
return
$returnsite;
}
getopts('p:s:');
$SeqFile
= $opt_s;
$PolFile
= $opt_p;
$OutFile
= "Location" . $PolFile;
%TestPol
= (); #Initialize Hash for storing Test Polymers
#Grab
the polymers from files and throw it into an array
open
(POLFILE, "$PolFile") or die ("Could not open Polymer File
$PolFile\n");
while
($line = <POLFILE>) {
@PolInfo
= split /\t/,$line;
$PolInfo[0]
=~ s/\s//;
$TestPol{$PolInfo[0]}
= "NONE found";
}
close
(POLFILE);
print
"Read Polymers in.\n";
@PolymerSize
= split //,$PolInfo[0];
$PolSize
= scalar(@PolymerSize);
#file
is a flat text FASTA file, so skip lines starting with >
open(SEQFILE,
"$SeqFile") or die("Could not open Sequence File $SeqFile
\n");
#Open
a file to output the polymer count results
open(RESTFILE,
">$OutFile") or die ('Could not open Result File');
for
($i = 0; $i < ($PolSize - 1); $i++) {
$PolHold[$i]
= "N";
}
$NotFound
= 1;
$CurrBP=0;
while
($line = <SEQFILE>) {
chomp($line);
@bases
= split //,$line;
if
($bases[0] ne '>'){
for
($i = 0; $i < @bases; $i++){
$PolHold[($PolSize
- 1)] = $bases[$i];
$CurrBP++;
$hexout
= "";
for
($hexpos = 0; $hexpos <= ($PolSize -1); $hexpos++) {
############################
#
Modified #
############################
$PolHold[$hexpos]
=~ tr/a-z/A-Z/;
$hexout .= $PolHold[$hexpos];
}
#Code
to make sure that you're using the lowest version of $hexout
$testseq1
= reversesite($hexout);
if
($hexout gt $testseq1) {
# print
"$hexout $testseq1\n";
$hexout
= $testseq1;
}
if
(exists($TestPol{$hexout})){
print
RESTFILE "$CurrBP\t$hexout\n";
}
for
($j =0; $j <($PolSize - 1); $j++){
$PolHold[$j]
= $PolHold [($j + 1)];
}
}
}
}
close(SEQFILE);
close
(RESTFILE);
8.
Assign the overrepresent oligomers into repeat region or non-repeat region
(i.g., check whether a particular overrepresent oligomers lay on the region
which was defined as repetitive element.
*
Source Code: RM_alloc_From_Oligomer.pl
- input: None
- output: Result_[Subject | Escape]_[12 | 16 | 20 |
24]_Oligo_[50K | 100K | 250K].txt
- variables to be modified: $Table, $Oligomer,
$RepeatMasker
#!/usr/bin/perl
-w
use
strict;
use
warnings;
#####
For subject
my
$Table = "Diff_5_100K_Subject_Contig_24.txtWhole_Escape_100K.txt.24";
my
$Oligomer =
"LocationDiff_5_100K_Subject_Contig_24.txtWhole_Escape_100K.txt.24";
my
$RepeatMasker = "../100K_Data/Whole_Subject_100K.txt.out";
####
For escape
#my
$Table = "Diff_5_250K_Escape_Contig_20.txtWhole_Subject_250K.txt.20";
#my
$Oligomer =
"LocationDiff_5_250K_Escape_Contig_20.txtWhole_Subject_250K.txt.20";
#my
$RepeatMasker = "../250K_Data/Whole_Escape_250K.txt.out";
open(TABLE,
"< $Table") ||
die
"Sorry, I couldn't open the Table file: $!\n";
my
$cLine_Table;
while(defined
($cLine_Table = <TABLE>)){
my
@cArray_Table = $cLine_Table =~ /(\S+)\s*/g;
my
$Table_Oligomer = $cArray_Table[0];
####
For subject
my
$Table_number = $cArray_Table[2];
####
For escape
# my
$Table_number = $cArray_Table[3];
my
%Contents_RM = ();
my
$Count = 0;
my
$TempCount = 0;
open(OLIGOMER,
"< $Oligomer") ||
die
"Sorry, I couldn't open the Oligomer file: $!\n";
my
$cLine_Oligomer;
while(defined
($cLine_Oligomer = <OLIGOMER>)){
my
@cArray_Oligomer = $cLine_Oligomer =~ /(\S+)\s*/g;
my
$loc_Oligomer = $cArray_Oligomer[0];
my
$Seq_Oligomer = $cArray_Oligomer[1];
if
($Table_Oligomer eq $Seq_Oligomer){
open(REPEAT,
"< $RepeatMasker") ||
die
"Sorry, I couldn't open the RepeattMasker file: $!\n";
my
$cLine_Repeat;
while(defined
($cLine_Repeat = <REPEAT>)){
chomp($cLine_Repeat);
my
@cArray_Repeat = $cLine_Repeat =~ /(\S+)\s*/g;
if
($cArray_Repeat[0] =~ /[0-9]/){
my
$pos_begin = $cArray_Repeat[5];
my
$pos_end = $cArray_Repeat[6];
if
(($pos_begin + 16 <= ($loc_Oligomer-16)) && ($pos_end >=
$loc_Oligomer)){
$Contents_RM{$cArray_Repeat[10]}++;
# print
"$Table_Oligomer\t$cArray_Repeat[10]\n";
$Count++;
}
}
}
close(REPEAT);
$TempCount++;
}
}
close(OLIGOMER);
if
($TempCount eq $Table_number){
my
$myKey;
my
$diff = $Table_number - $Count;
print
"$Table_Oligomer\t$Table_number\t$Count\t$diff\t";
foreach
$myKey (sort keys %Contents_RM){
print
"$myKey\t$Contents_RM{$myKey}\t";
}
print
"\n";
}
else{ print
"Eroor occur...\n"; }
}
close(TABLE);
9.
Summary the prvious results by means of the repeat elements.
*
Source Code: Ordering_Result_files.pl
-
input: None
- output:
Order_Result_[Subject | Escape]_[8| 12 | 16 | 20 | 24]_Oligo_[50K | 100K |
250K].txt
#!/usr/bin/perl
-w
use
strict;
use
warnings;
my
$Result = "Result_Subject_24_Oligo_100K.txt";
open(RESULT,
"< $Result") ||
die
"Sorry, I couldn't open the Result file: $!\n";
my
$cLine_Result;
while(defined
($cLine_Result = <RESULT>)){
chomp($cLine_Result);
my
@cArray_Result = $cLine_Result =~ /(\S+)\s*/g;
my
$total_number = scalar(@cArray_Result);
my
%My_Hash = ();
my
$start_number = 4;
# print
"$cLine_Result \t \t \t $total_number\n";
for(;$start_number
< $total_number;){
$My_Hash{$cArray_Result[$start_number]}
= $cArray_Result[$start_number+1];
$start_number
+=2;
}
print
"$cArray_Result[0]\t$cArray_Result[1]\t";
my
$per_re = ($cArray_Result[2] / $cArray_Result[1]) * 100;
print
"$cArray_Result[2]\t";
printf
"%3.1f", $per_re;
my
$per_re_none = ($cArray_Result[3] / $cArray_Result[1]) * 100;
print
"%\t$cArray_Result[3]\t";
printf
"%3.1f", $per_re_none;
my
$my_key;
my
$my_value;
foreach
$my_key (sort {$My_Hash{$b} <=> $My_Hash{$a}} keys %My_Hash){
print
"%\t$my_key\t$My_Hash{$my_key}\t";
$my_value
= $My_Hash{$my_key};
last;
}
my
$per_class_by_all = ($my_value / $cArray_Result[1]) *100;
my
$per_class_by_re;
if
($cArray_Result[2] == 0){
$per_class_by_re
= 0;
}else{
$per_class_by_re
= ($my_value / $cArray_Result[2]) *100;
}
printf
"%3.1f", $per_class_by_all;
print
"%\t";
printf
"%3.1f", $per_class_by_re;
print
"%\n";
}
close(RESULT);
v I did two statistic tests (these will be called
random test and autosomal test) using oligomers from Xp22 / Whole
X to evaluate the significance of the overrepresented oligomers.
-
For autosomal test
:
The aim of autosomal test is to access the statistical significance of an
enriched oligomer that
were identified in either I or E subgenomes for the Xp22 (Whole X) as compared
to autosomes. The procedures for autosomal test are as follows. Dowonload all known genes annotated on autosomes
and TSS information from UCSC Table Browser; Generate sequence fragments 100kb,
200kb, and 500kb surrounding the TSS of each gene; Merge fragments of adjacent
genes if sequence fragments; Randomly pick up contigs with replacement to make
up new E and I subgenomes of the Xp22 (Whole X) subgenome length.
:
This process was repeated 1000 times; Calculate how many times an oligomer was
present in the autosomal subgenome. To determine the empirical p-value, I
calculate
the number of autosomal subgenomes (which were generated by above
procedures)
at which an oligomer was presented at the same or greater number of opies than in the original I or E
subgenomes.
1.
Extract gene information in Òknown_gene_list.txtÓ coming from UCSC table
Browser.
*
Source Code: Extract_gene_info.pl
-
input: known_gene_list.txt
-
output: input_for_Extract_sequence.txt
#!/usr/bin/perl
-w
use
strict;
use
warnings;
my
$Gene_Database = "known_gene_list.txt";
open(GENE_DB,
"< $Gene_Database") ||
die
"Sorry, I couldn't open the known_gene_list file: $!\n";
my
$cLine_Gene;
while(defined
($cLine_Gene = <GENE_DB>)){
my
@cArray_Gene = $cLine_Gene =~ /(\S+)\s*/g;
my
$List_Name = $cArray_Gene[0];
my
$List_Chrom = $cArray_Gene[1];
my
$List_Strand = $cArray_Gene[2];
my
$List_txStart = $cArray_Gene[3];
my
$List_txEnd = $cArray_Gene[4];
if
($List_Chrom =~ /chr\d{1,2}$/){
my
$length = 50000;
my
($contigStart, $contigEnd);
if
($List_Strand eq "+"){
$contigStart
= $List_txStart - $length;
$contigEnd
= $List_txStart + $length;
}elsif
($List_Strand eq "-"){
$contigStart
= $List_txEnd - $length;
$contigEnd
= $List_txEnd + $length;
}else
{
print
"Occur the error....\n";
}
if
($contigStart > 0 && $contigEnd > 0){
print
"$List_Name \t $List_Chrom \t $List_Strand \t $List_txStart \t $List_txEnd
\t $contigStart \t $contigEnd \n";
}
}
}
close(GENE_DB);
2.
Merge fragments of adjacent genes if sequence fragments overapped
*
Source Code: Extract_gene_info.pl
- input:
input_for_Extract_sequence.txt
- output: Merged_contigs_[50K | 100K | 250K].txt
#!/usr/bin/perl
-w
use
strict;
use
warnings;
my
($dbNum, $dbID);
my
%beforeMerge = ();
my
%afterMerge = ();
sub
merge_seq{
my
$func_first = $_[0];
my
$func_end = $_[1];
my
($Check_first, $Check_last);
delete
$beforeMerge{$func_first};
while
(($Check_first, $Check_last) = each(%beforeMerge)){
if
(($func_first < $Check_first) && ($func_end > $Check_last)){
delete
$beforeMerge{$Check_first};
}else{
if
(($func_first < $Check_first) && ($func_end > $Check_first)){
$func_end
= $Check_last;
delete
$beforeMerge{$Check_first};
}
if
(($func_first < $Check_last) && ($func_end > $Check_last)){
$func_first
= $Check_first;
delete
$beforeMerge{$Check_first};
}
}
}
$afterMerge{$func_first}
= $func_end;
}
foreach
$dbNum (1..22){ #for
starting of foreach
$dbID
= "chr".$dbNum;
my
$Gene_Database = "input_for_Extract_sequence.txt";
open(GENE_DB,
"< $Gene_Database") ||
die
"Sorry, I couldn't open the input_for_Extract_sequence.txt file:
$!\n";
my
$cLine_Gene;
while(defined
($cLine_Gene = <GENE_DB>)){
my
@cArray_Gene = $cLine_Gene =~ /(\S+)\s*/g;
my
$List_Chrom = $cArray_Gene[1];
my
$Contig_Start = $cArray_Gene[5];
my
$Contig_End = $cArray_Gene[6];
if
($List_Chrom eq $dbID){
$beforeMerge{$Contig_Start}
= $Contig_End;
}
}
close(GENE_DB);
my
($first, $last);
while
(($first, $last) = each(%beforeMerge)){
merge_seq($first,
$last);
}
my
$myCount = 0;
for(;
$myCount < 100; $myCount++){
%beforeMerge
= %afterMerge;
%afterMerge
= ();
my
@my_key_array = ();
while
(($first, $last) = each(%beforeMerge)){
push(@my_key_array,
$first);
}
foreach
my $my_key_val (sort {$a <=> $b} @my_key_array){
my
$my_value = $beforeMerge{$my_key_val};
if
($my_value){
merge_seq($my_key_val,
$my_value);
}
}
}
my
($a, $b);
while
(($a, $b) = each(%afterMerge)){
my
$length = $b - $a +1;
print
"$dbID \t $a \t $b \t $length \n";
}
%beforeMerge
= ();
%afterMerge
= ();
}
3.
Extract genomic sequences from ÒMerged_contig_[50K | 100K | 250K].txt.
*
Source Code: Seq_Extract_From_Position.pl
- input:
Merged_contigs_[50K | 100K | 250K].txt
-
output: Directory ./Sequences/[50K | 100K | 250K]_Sequence/
- variables to be
modified: $path, $outputfile
#!/usr/bin/perl
-w
use
strict;
use
warnings;
#
Define the variable for file analysis
my
@cDatabase = ();
my
$file;
my
$cTempDatabase = "";
my
$cTempFile = "";
my
@cFile = ();
my
$path = "Result_after_Merge.txt";
my
$databasePath = "/Users/chungoopark/Research/X-inactivation/Sequence_Analysis/Chromosome/chromFa/";
my
$outfile =
"/Users/chungoopark/Research/X-inactivation/Autosomal_Oligomer/Sequences/50K_Sequence/";
my
($DNA_ID, $chr_num);
my
@sequence = ();
my
$sequenceintoDB;
my
($dbID, $dbNum, $indexSize);
my
($chrom, $dStart, $dEnd);
my
$file_ID = 1;
foreach
$dbNum (1,10,11,12,13,14,15,16,17,18,19,2,20,21,22,3,4,5,6,7,8,9){ #for starting of foreach
$DNA_ID
= "";
$chr_num
= "";
$dStart
= "";
$dEnd
= "";
$cTempDatabase
= "";
@sequence
= ();
@cDatabase
= ();
undef
$/;
$dbID
= "chr".$dbNum.".fa";
my
$dbIDcheck = "chr".$dbNum;
open(DBFILE,
"< $databasePath$dbID") ||
die
"Sorry, I couldn't open the database file: $!\n";
while(defined
($cTempDatabase = <DBFILE>)){
@cDatabase
= split(/\n/, $cTempDatabase);
}
print
"upload $dbID done\n";
calPositionFile($dbIDcheck);
}
close(DBFILE);
sub
calPositionFile{
$/
= "\n";
#open
the novel.txt file
open(FILE,
"< $path") ||
die
"Sorry, I couldn't open the input file with respect to subgenome:
$!\n";
while(defined
($cTempFile = <FILE>)){
@cFile
= $cTempFile =~ /(\S+)\s*/g;
if
($cFile[0] eq $_[0]){
$chrom
= $cFile[0];
$dStart
= $cFile[1];
$dEnd
= $cFile[2];
calSequence();
$sequenceintoDB
= join(" ", @sequence);
$sequenceintoDB
=~ s/\s//g;
open(OUT,
">$outfile$file_ID") ||
die
"Sorry, I couldn't make outfile: $!\n";
print
OUT $sequenceintoDB;
$file_ID
++;
}
}
close(FILE);
}
sub
calSequence{
my
$iCount;
my
($StartArray, $StartPosition);
my
($EndArray, $EndPosition);
my
$NumArray;
my
@TempArray;
@sequence
= ();
($StartArray,
$StartPosition) = extractArray($dStart);
($EndArray,
$EndPosition) = extractArray($dEnd);
foreach
$NumArray ($StartArray .. $EndArray){
#Initiate
varable
@TempArray
=();
if
($NumArray == $StartArray || $NumArray == $EndArray){
@TempArray
= $cDatabase[$NumArray] =~ /([a-zA-Z])/g;
#
Error Check
if
(( $#TempArray + 1) != 50){
die
"$dStart $dEnd First_Parsing error ... Array Count...\n";
}
if
($NumArray == $StartArray){
$iCount
= $StartPosition;
while
($iCount < 50){ push(@sequence,
$TempArray[$iCount++]); }
}else{
$iCount
= 0;
while
($iCount < $EndPosition + 1){ push(@sequence,
$TempArray[$iCount++]); }
}
}elsif
(( $StartArray < $NumArray) && ($EndArray > $NumArray)){
@TempArray
= $cDatabase[$NumArray] =~ /(\S)/g;
#
Error Check
if
(( $#TempArray + 1) != 50){
die
"$dStart $dEnd Second_ Parsing error ... Array Count...\n";