Genomic Environment Predicts Expression Patterns

on the Human Inactive X Chromosome

 

Written by Chungoo Park

 

If you have any questions concerining the source codes, their usages, or need assistance, please feel free to contact me (cxp440@psu.edu).

 

 

v    Analyze I or E subgenomes in Xp22.

Step One: Define the Inactivated and Escape subgenome composition.

 

Extract gene lists from X-inactivation profile which came from a supplementary file in Carrel and Willard (2005) ÒX-inactivation profile reveals extensive variability in X-linked gene expression in femalesÓ Nature 17;434(7031):400-404. Genes were considered to be X inactivated if silenced in all nine inactive X containing hybrids assayed or if expressed in only a single hybrid (0/9 or 1/9); they will be a candidate members for I subgenome. Similarly, genes were scored as escaping XCI if expressed in eight or nine out of nine inactivate X hybrids tested (8/9 or 9/9); they will be used for E subgenome. Note especially to analyze the entire X chromsome, ESTs were not considered to make I or E subgenome.

 

1. Create two files (ÒWhole_X_new_subject_list.txtÓ and ÒWhole_X_new_escape_ 

    list.txtÓ). Each file has geneÕs profile which met the each criteria.

 

   * Source Code: Make_list_for_subject_escape.pl   

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

my $file = "../../SourceData/Expression_X.txt";

my $Out_Subject = "Whole_X_new_subject_list.txt";

my $Out_Escape = "Whole_X_new_escape_list.txt";

 

my $cLine_File;

 

open (FILE, "<$file")     ||                die "Sorry, I couldn't open the After_lists_known.txt file: $!\n";

while (defined($cLine_File = <FILE>)){

                  chomp($cLine_File);

                 

                  if (!($cLine_File =~ /Pseudoautosomal/)){

                                   

                                    if ($cLine_File =~ /0 \/ 9/){

                                                      open (SUB, ">>$Out_Subject")      ||                die "Sorry, I couldn't open the Whole_X_subject_list.txt file: $!\n";

                                                      print SUB "$cLine_File\n";

                                                      close (SUB);

                                    }

 

                                    if ($cLine_File =~ /1 \/ 9/){

                                                      open (SUB, ">>$Out_Subject")      ||                die "Sorry, I couldn't open the Whole_X_subject_list.txt file: $!\n";

                                                      print SUB "$cLine_File\n";

                                                      close (SUB);

                                    }

 

                                    if ($cLine_File =~ /8 \/ 9/){

                                                      open (ESC, ">>$Out_Escape")       ||                die "Sorry, I couldn't open the Whole_X_escape_list.txt file: $!\n";

                                                      print ESC "$cLine_File\n";

                                                      close (ESC);

                                    }

 

                                    if ($cLine_File =~ /9 \/ 9/){

                                                      open (ESC, ">>$Out_Escape")       ||                die "Sorry, I couldn't open the Whole_X_escape_list.txt file: $!\n";

                                                      print ESC "$cLine_File\n";

                                                      close (ESC);

                                    }

                  }

}

close(FILE);

 

2. Divide an input file (ÒWhole_X_new_subject_list.txtÓ or ÒWhole_X_new_escape_ 

list.txtÓ) into two files (ÒWhole_X_new_subject_list_none_ESTs.txtÓ and ÒWhole _X_new_subject_list_Ests.txtÓ OR ÒWhole_X_new_escape_list_none_ESTs.txtÓ and ÒWhole_X_new_escape_list_Ests.txtÓ) by the presence or absence of EST.

 

   * Source Code: Make_list_by_EST.pl   

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

my $file = "Whole_X_new_escape_list.txt";

my $Out_None_Ests = "Whole_X_new_escape_list_none_ESTs.txt";

my $Out_Ests = "Whole_X_new_escape_list_Ests.txt";

 

my $cLine_File;

 

open (FILE, "<$file")     ||                die "Sorry, I couldn't open the input file: $!\n";

while (defined($cLine_File = <FILE>)){

                  chomp($cLine_File);

                 

                                   

                  if ($cLine_File =~ /EST/){

                                    if ($cLine_File =~ /EST support/){

                                                      open (NON_EST, ">>$Out_None_Ests")        ||                die "Sorry, I couldn't open the output_none_EST file: $!\n";

                                                      print NON_EST "$cLine_File\n";

                                                      close (NON_EST);

                                    }else{

                                                      open (EST, ">>$Out_Ests")            ||                die "Sorry, I couldn't open the output_EST file: $!\n";

                                                      print EST "$cLine_File\n";

                                                      close (EST);

                                    }

                  }else{

                                    open (NON_EST, ">>$Out_None_Ests")        ||                die "Sorry, I couldn't open the output_none_EST file: $!\n";

                                    print NON_EST "$cLine_File\n";

                                    close (NON_EST);

                  }

 

}

close(FILE);

 

3. Build three different sized subgenomes (100kb, 200kb and 500kb) surrounding the transcription start sites (TSSs) of genes. Note that the 100/200/500kb contigs include 50/100/250kb upstream and downstream of the TSS.

 

         * Source Code for I subgenome: Make_Subgenome_subject_against_WholeX.pl

            - Output: Subgenome_Subject_50K_wo_EST.txt or

                            Subgenome_Subject_100K_wo_EST.txt or

                            Subgenome_Subject_250K_wo_EST.txt

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

my $subject = "Whole_X_new_subject_list_none_ESTs.txt";

my $escape = "Whole_X_new_escape_list_none_ESTs.txt";

 

my %unManaged = ();

my %Managed = ();

my $sequence_length = 250000;

 

sub check_escape{

                  my $func_start_seq = $_[0];

                  my $func_end_seq = $_[1];

 

                  open (CLON_ESC, "< $escape")     ||

                                    die "Sorry, I couldn't open the escape.txt for clone: $!\n";

                  my $func_cLine;

                  my ($esc_start_seq, $esc_end_seq);

                  while (defined($func_cLine = <CLON_ESC>)){

 

                                    my @func_Database = $func_cLine =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;

                                    my $func_gene_start = $func_Database[0];

                                    my $func_gene_end = $func_Database[1];

                                    my $func_strand = $func_Database[2];

 

                                    if ($func_strand eq "-"){

                                                      $esc_start_seq = $func_gene_end - $sequence_length;

                                                      $esc_end_seq = $func_gene_end + $sequence_length;

                                    }else{

                                                      $esc_start_seq = $func_gene_start - $sequence_length;

                                                      $esc_end_seq = $func_gene_start + $sequence_length;

                                    }

 

                                    if (($func_start_seq >= $esc_start_seq) && ($func_start_seq < $esc_end_seq)){

                                                      return 0;

                                    }

                                    if (($func_end_seq > $esc_start_seq) && ($func_end_seq <= $esc_end_seq)){

                                                      return 0;

                                    }

                  }

                  return 1;

                  close(CLON_ESC);

}

 

sub merge_seq{

                  my $func_first = $_[0];

                  my $func_end = $_[1];

                  my ($begin, $end);

                  delete $unManaged{$func_first};

                  while (($begin, $end) = each(%unManaged)){

                                    if (($func_first < $begin) && ($func_end > $begin)){

                                                      $func_end = $end;

                                                      delete $unManaged{$begin};

                                    }                

                                    if (($func_first < $end) && ($func_end > $end)){

                                                      $func_first = $begin;

                                                      delete $unManaged{$begin};

                                    }

                  }

                  $Managed{$func_first} = $func_end;

}

 

open (SUBJECT, "< $subject")        ||

                  die "Sorry, I couldn't open the subject.txt: $!\n";

 

my $cLine;

my @Database = ();

my ($first, $last);

 

while (defined($cLine = <SUBJECT>)){

                  my ($start_seq, $end_seq);

                  @Database = $cLine =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;

                  my $gene_start = $Database[0];

                  my $gene_end = $Database[1];

                  my $strand = $Database[2];

 

                  if ($strand eq "-"){

                                    $start_seq = $gene_end - $sequence_length;

                                    $end_seq = $gene_end + $sequence_length;

                  }else{

                                    $start_seq = $gene_start - $sequence_length;

                                    $end_seq = $gene_start + $sequence_length;

                  }

                 

                  my $result_escape = check_escape($start_seq, $end_seq);

                  if ($result_escape eq "1"){

                                    $unManaged{$start_seq} = $end_seq;

                  }else{

#                                  print "$gene_start $gene_end $start_seq $end_seq\n"

                  }

}

 

while (($first, $last) = each(%unManaged)){

                  merge_seq($first, $last);

}

 

my $myCount = 0;

for(; $myCount < 100; $myCount++){

                  %unManaged = %Managed;

                  %Managed = ();

                  while (($first, $last) = each(%unManaged)){

                                    merge_seq($first, $last);

                  }

}

 

my ($aa, $bb);

while (($aa, $bb) = each(%Managed)){

                  print "Managed \t $aa \t $bb\n";

}

close(SUBJECT);

 

         * Source Code for E subgenome: Make_Subgenome_escape_against_WholeX.pl

           - Output: Subgenome_Escape_50K_wo_EST.txt or

                          Subgenome_Escape_100K_wo_EST.txt or

                          Subgenome_Escape_250K_wo_EST.txt

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

my $subject = "Whole_X_new_subject_list_none_ESTs.txt";

my $escape = "Whole_X_new_escape_list_none_ESTs.txt";

 

my %unManaged = ();

my %Managed = ();

my $sequence_length = 250000;

 

sub check_subject{

                  my $func_start_seq = $_[0];

                  my $func_end_seq = $_[1];

 

                  open (CLON_SUB, "< $subject")    ||

                                    die "Sorry, I couldn't open the subject.txt for clone: $!\n";

                  my $func_cLine;

                  my ($esc_start_seq, $esc_end_seq);

                  while (defined($func_cLine = <CLON_SUB>)){

 

                                    my @func_Database = $func_cLine =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;

                                    my $func_gene_start = $func_Database[0];

                                    my $func_gene_end = $func_Database[1];

                                    my $func_strand = $func_Database[2];

 

                                    if ($func_strand eq "-"){

                                                      $esc_start_seq = $func_gene_end - $sequence_length;

                                                      $esc_end_seq = $func_gene_end + $sequence_length;

                                    }else{

                                                      $esc_start_seq = $func_gene_start - $sequence_length;

                                                      $esc_end_seq = $func_gene_start + $sequence_length;

                                    }

 

                                    if (($func_start_seq >= $esc_start_seq) && ($func_start_seq < $esc_end_seq)){

                                                      return 0;

                                    }

                                    if (($func_end_seq > $esc_start_seq) && ($func_end_seq <= $esc_end_seq)){

                                                      return 0;

                                    }

                  }

                  return 1;

                  close(CLON_SUB);

}

 

sub merge_seq{

                  my $func_first = $_[0];

                  my $func_end = $_[1];

                  my ($begin, $end);

                  delete $unManaged{$func_first};

                  while (($begin, $end) = each(%unManaged)){

                                    if (($func_first < $begin) && ($func_end > $begin)){

                                                      $func_end = $end;

                                                      delete $unManaged{$begin};

                                    }                

                                    if (($func_first < $end) && ($func_end > $end)){

                                                      $func_first = $begin;

                                                      delete $unManaged{$begin};

                                    }

                  }

                  $Managed{$func_first} = $func_end;

}

 

open (ESCAPE, "< $escape")          ||

                  die "Sorry, I couldn't open the escape.txt: $!\n";

 

my $cLine;

my @Database = ();

my ($first, $last);

 

while (defined($cLine = <ESCAPE>)){

 

                  my ($start_seq, $end_seq);

                  @Database = $cLine =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;

                  my $gene_start = $Database[0];

                  my $gene_end = $Database[1];

                  my $strand = $Database[2];

 

                  if ($strand eq "-"){

                                    $start_seq = $gene_end - $sequence_length;

                                    $end_seq = $gene_end + $sequence_length;

                  }else{

                                    $start_seq = $gene_start - $sequence_length;

                                    $end_seq = $gene_start + $sequence_length;

                  }

                 

                  my $result_subject = check_subject($start_seq, $end_seq);

                  if ($result_subject eq "1"){

                                    $unManaged{$start_seq} = $end_seq;

                  }else{

#                                  print "$gene_start $gene_end $start_seq $end_seq\n";

                  }

}

 

while (($first, $last) = each(%unManaged)){

                  merge_seq($first, $last);

}

 

my $myCount = 0;

for(; $myCount < 100; $myCount++){

                  %unManaged = %Managed;

                  %Managed = ();

                  while (($first, $last) = each(%unManaged)){

                                    merge_seq($first, $last);

                  }

}

 

my ($aa, $bb);

while (($aa, $bb) = each(%Managed)){

                  print "Managed \t $aa \t $bb\n";

}

close(ESCAPE);

 

Note: at the $sequence_length variable, the length of contig was assigned.

 

4. Confirm the outputs by the step 2: Whether the subgenomes include the ESTs.

 

         * Source Code for I subgenome: Check_Subject_Subgenome.pl

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

my $Subject_file = "Subgenome_Subject_250K_wo_EST.txt";

my $Escape_Ests = "Whole_X_new_escape_list_Ests.txt";

 

my $cLine_File;

 

open (ESCAPE, "<$Escape_Ests")   ||                die "Sorry, I couldn't open the Escape_Ests file: $!\n";

while (defined($cLine_File = <ESCAPE>)){

                  chomp($cLine_File);

 

                  my @EST_Database = $cLine_File =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;

                  my $est_start = $EST_Database[0];

                  my $est_end = $EST_Database[1];

 

                  my $cLine_Subject;

                  open (FILE, "<$Subject_file")         ||                die "Sorry, I couldn't open the Subject_file: $!\n";

                  while (defined($cLine_Subject = <FILE>)){

                                    chomp($cLine_Subject);

                                    my @Subject_Database = $cLine_Subject =~ /(\S+)\s*/g;

                                    my $contig_start = $Subject_Database[1];

                                    my $contig_end = $Subject_Database[2];

                                   

                                    if (($contig_start <= $est_start) && ($contig_end >= $est_start)){

                                                      if ($contig_end >= $est_end){

                                                                        print "$cLine_Subject \t Include_EST \t $cLine_File \n";

                                                      }else{         print "$cLine_Subject \t Overlap_Start \t $cLine_File \n";   }

                                    }else{

                                                       if (($contig_start >= $est_start) && ($contig_start <= $est_end)){

                                                                        if ($contig_end >= $est_end){

                                                                                          print "$cLine_Subject \t Overlap_End \t $cLine_File \n";

                                                                        }else{         print "$cLine_Subject \t Include_Contig \t $cLine_File \n";  }

                                                      }

                                    }

                  }

                  close(FILE);

}

close(ESCAPE);

 

         * Source Code for E subgenome: Check_Escape_Subgenome.pl

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

my $Subject_file = "Subgenome_Escape_250K_wo_EST.txt";

my $Escape_Ests = "Whole_X_new_subject_list_Ests.txt";

 

my $cLine_File;

 

open (ESCAPE, "<$Escape_Ests")   ||                die "Sorry, I couldn't open the Escape_Ests file: $!\n";

while (defined($cLine_File = <ESCAPE>)){

                  chomp($cLine_File);

 

                  my @EST_Database = $cLine_File =~ /^(\S+) - (\S+)\t(\S*|\s*)\t/;

                  my $est_start = $EST_Database[0];

                  my $est_end = $EST_Database[1];

 

                  my $cLine_Subject;

                  open (FILE, "<$Subject_file")         ||                die "Sorry, I couldn't open the Subject_file: $!\n";

                  while (defined($cLine_Subject = <FILE>)){

                                    chomp($cLine_Subject);

                                    my @Subject_Database = $cLine_Subject =~ /(\S+)\s*/g;

                                    my $contig_start = $Subject_Database[1];

                                    my $contig_end = $Subject_Database[2];

                                   

                                    if (($contig_start <= $est_start) && ($contig_end >= $est_start)){

                                                      if ($contig_end >= $est_end){

                                                                        print "$cLine_Subject \t Include_EST \t $cLine_File \n";

                                                      }else{         print "$cLine_Subject \t Overlap_Start \t $cLine_File \n";   }

                                    }else{

                                                       if (($contig_start >= $est_start) && ($contig_start <= $est_end)){

                                                                        if ($contig_end >= $est_end){

                                                                                          print "$cLine_Subject \t Overlap_End \t $cLine_File \n";

                                                                        }else{         print "$cLine_Subject \t Include_Contig \t $cLine_File \n";  }

                                                      }

                                    }

                  }

                  close(FILE);

}

close(ESCAPE);

 

5. Confirm the outputs by the step 3: (1) whether contigs are overlapped with contigs having opposite inactivation profile; (2) within the range of contigs whether there exist genes having opposite inactivation profile

 

         * Source Code for (1): Verify_Subgenome.pl

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

my $subject = "Subgenome_Subject_250K_wo_EST.txt";

my $escape = "Subgenome_Escape_250K_wo_EST.txt";

 

open (SUB, "< $subject")               ||

                  die "Sorry, I couldn't open the subject for clone: $!\n";

my $cLine_Sub;

while (defined($cLine_Sub = <SUB>)){

                  my @cSub_DB = $cLine_Sub =~ /(\S+)\s*/g;

                  my $Sub_begin = $cSub_DB[1];

                  my $Sub_end = $cSub_DB[2];

 

                  open (ESC, "< $escape")                ||

                                                      die "Sorry, I couldn't open the escape for clone: $!\n";

                  my $cLine_Esc;

                  while (defined($cLine_Esc = <ESC>)){

                                    my @cEsc_DB = $cLine_Esc =~ /(\S+)\s*/g;

                                    my $Esc_begin = $cEsc_DB[1];

                                    my $Esc_end = $cEsc_DB[2];

 

                                    if (($Sub_begin <= $Esc_begin) && ($Sub_end >= $Esc_begin)){

                                                      print "$Sub_begin \t $Sub_end \t $Esc_begin \t $Esc_end\n";

                                    }

                                    if (($Sub_begin <= $Esc_end) && ($Sub_end >= $Esc_end)){

                                                      print "$Sub_begin \t $Sub_end \t $Esc_begin \t $Esc_end\n";

                                    }

                  }

                  close(ESC);

}

close(SUB);

 

         * Source Code for (2): Verify_Subgenome_2.pl

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

#my $subject = "Subgenome_Subject_250K_wo_EST.txt";

#my $escape = "Whole_X_new_escape_list.txt";

 

my $subject = "Subgenome_Escape_250K_wo_EST.txt";

my $escape = "Whole_X_new_subject_list.txt";

 

open (SUB, "< $subject")               ||

                  die "Sorry, I couldn't open the subject for clone: $!\n";

my $cLine_Sub;

while (defined($cLine_Sub = <SUB>)){

                  my @cSub_DB = $cLine_Sub =~ /(\S+)\s*/g;

                  my $Sub_begin = $cSub_DB[1];

                  my $Sub_end = $cSub_DB[2];

 

                  open (ESC, "< $escape")                ||

                                                      die "Sorry, I couldn't open the escape for clone: $!\n";

                  my $cLine_Esc;

                  while (defined($cLine_Esc = <ESC>)){

                                    my @cEsc_DB = $cLine_Esc =~ /(\S+)\s*/g;

                                    my $Esc_begin = $cEsc_DB[0];

                                    my $Esc_end = $cEsc_DB[2];

 

                                    if (($Sub_begin <= $Esc_begin) && ($Sub_end >= $Esc_begin)){

                                                      print "$Sub_begin \t $Sub_end \t $Esc_begin \t $Esc_end\n";

                                    }

                                    if (($Sub_begin <= $Esc_end) && ($Sub_end >= $Esc_end)){

                                                      print "$Sub_begin \t $Sub_end \t $Esc_begin \t $Esc_end\n";

                                    }

                  }

                  close(ESC);

}

close(SUB);

 

 

Step Two: Assess the presence of each possible oligomer of specified size (8-, 12-, 16-, 20-, and 24-mers) and count these oligomers within each subgenomes. First of all, we recall the procedure which criteria was used for the Xp22 analysis; oligomers were considered significantly overrepresented in a subgenome if they were enriched at least five-fold and occurred at least ten times in that subgenome. Remarkable one was that the size of I subgenome was similar to that of E subgenome.

                  After finding overrepresented oligomer, they were assigned to either specific class of repeatitive elements or to none-repeatitive elements.

 

1. Extract genomic sequence using a coordinate of each contigs for I and E  Subgenome.

 

         * Source Code: GetSeq_From_Position_PCG.pl

           - input: ÒSubgenome_Subject_50K_wo_EST.txtÓ

           - output: ../[50k|100K|250K]_Data/[input_name].txtÓnÓ

           - variables to be modified: $outfile

 

#!/usr/bin/perl -w

 

#########################################################################

## Extract the genomic sequences from input files                      ##

## including locations of start and end with respect to subgenome      ##

#########################################################################

 

use Getopt::Std;

 

getopts('i:');

$path = $opt_i; #The list of files

 

# Define the variable for file analysis

my @cDatabase = ();

my $file;

my $cTempDatabase = "";

my $cTempFile = "";

my @cFile = ();

 

my $X_chr = "/Users/chungoopark/Research/X-inactivation/Sequence_Analysis/Chromosome/chromFa/chrX.fa";

my $outfile = "/Users/chungoopark/Research/X-inactivation/Whole_X_New_Oligomer/50K_Data/";

 

my ($dStart, $dEnd);

my $index = 1;

my @sequence = ();

my $sequenceintoDB;

my ($dbID, $dbNum, $indexSize);

 

undef $/;

open(DBFILE, "< $X_chr")              ||

                  die "Sorry, I couldn't open the database file: $!\n";

 

while(defined ($cTempDatabase = <DBFILE>)){

                  @cDatabase = split(/\n/, $cTempDatabase);                    

}

close(DBFILE);

 

calPositionFile();

 

sub calPositionFile{

 

                  $/ = "\n";   

 

                  #open the novel.txt file

                  open(FILE, "< $path")   ||

                                    die "Sorry, I couldn't open the input file with respect to subgenome: $!\n";

 

                  while(defined ($cTempFile = <FILE>)){

 

                                    @cFile = $cTempFile =~ /(\S+)\s*/g;

                                    $dStart = $cFile[1];

                                    $dEnd = $cFile[2];

                                                     

                                    calSequence();

 

                                    $sequenceintoDB = join(" ", @sequence);

                                    $sequenceintoDB =~ s/\s//g;

                                   

                                    open(OUT, ">$outfile$path$index")                ||

                                                      die "Sorry, I couldn't make outfile: $!\n";

                                    print OUT $sequenceintoDB;

 

                                    $index++;

                  }

                  close(FILE);

}

 

sub calSequence{

 

                  my $iCount;

                  my ($StartArray, $StartPosition);

                  my ($EndArray, $EndPosition);

                  my $NumArray;

                  my @TempArray;

                  @sequence = ();

                 

                  ($StartArray, $StartPosition) = extractArray($dStart);

                  ($EndArray, $EndPosition) = extractArray($dEnd);

 

                  foreach $NumArray ($StartArray .. $EndArray){

                                    #Initiate varable

                                    @TempArray =();

                                                     

                                    if ($NumArray == $StartArray || $NumArray == $EndArray){

                                                                       

                                                      @TempArray = $cDatabase[$NumArray] =~ /([a-zA-Z])/g;

                                                                       

                                                      # Error Check

                                                      if (( $#TempArray + 1) != 50){

                                                                        die "Parsing error ... Array Count...\n";

                                                      }

                                                                       

                                                      if ($NumArray == $StartArray){

                                                                        $iCount = $StartPosition;

                                                                        while ($iCount < 50){   push(@sequence, $TempArray[$iCount++]); }

                                                      }else{

                                                                        $iCount = 0;

                                                                        while ($iCount < $EndPosition + 1){                push(@sequence, $TempArray[$iCount++]); }

                                                      }

 

                                    }elsif (( $StartArray < $NumArray) && ($EndArray > $NumArray)){

 

                                                      @TempArray = $cDatabase[$NumArray] =~ /(\S)/g;

 

                                                      # Error Check

                                                      if (( $#TempArray + 1) != 50){

                                                                        die "Parsing error ... Array Count...\n";

                                                      }

                                                      $iCount = 0;

                                                      while ($iCount < 50){   push(@sequence, $TempArray[$iCount++]); }

                                    }

                  }

}

 

sub extractArray{

                  my ($Array, $Position);

                  my $Value = $_[0];

 

                  $Array = ($Value / 50) + 1;

                  $Array = sprintf("%d", $Array);

                  $Position = ($Value % 50) - 1;

 

                  return ($Array, $Position);

}

 

2. Concatenate all contigs representing either genes which were inactivated or escaping inactivation. Finally make two subgenomes that were call I subgenome and E subgenome, respectively.

 

         * Source Code: Make_Whole_Subgenome.pl

           - input: all files that were generated by the previous step (1)

          - output: ../[50k|100K|250K]_Data/Whole_[Subject|Escape]_[50K|100K|250K].txt[input_name].txt

          - variables to be modified: $file, $Out, $number_contig

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

#my $file = "../50K_Data/Subgenome_Subject_50K_wo_EST.txt";

#my $Out = "../50K_Data/Whole_Subject_50K.txt";

#my $number_contig = 172;

my $file = "../50K_Data/Subgenome_Escape_50K_wo_EST.txt";

my $Out = "../50K_Data/Whole_Escape_50K.txt";

my $number_contig = 21;

 

my $temp_number = 1;

for(; $temp_number <= $number_contig; $temp_number++){

                  my $contig_name = $file . $temp_number;

                  my @temp = ("cat $contig_name >> $Out");

                  system(@temp);

}

 

3. Count possible oligomers of specified size (8-, 12-, 16-, 20- and 24-mers) within each subgenome.

 

         * Source Code: CountMer_PCG.pl

- input: output of previous step(2) with option Ò-sÓ and number (denote the length of oligomer; e.g. 8, 12, 16, 20 and 24) with option Ò-pÓ

- output: name of first parameter +  name of second parameter

- variables to be modified: None

 

#!/usr/bin/perl

 

#CountMer_PCG.pl

 

use Getopt::Std;

 

sub complement {

                  my $basein = @_[0];

                  if ($basein =~ s/G/c/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }elsif ($basein =~ s/A/t/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }elsif ($basein =~ s/T/a/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }elsif ($basein =~ s/C/g/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }

}

 

sub reversesite {

                  my $sitein = @_[0];

                  my @REsite = split //, $sitein;

                  my @revsite; #Array to hold the reversed site

                  my $countletter = 0;

                  while ($#REsite >= 0) {

                                    $revsite[$countletter] = pop(@REsite);

                                    $revsite[$countletter] = complement($revsite[$countletter]);

                                    $countletter++;

                  }

                  my $returnsite = "";

                  foreach $revsite (@revsite){

                                    $returnsite = $returnsite . $revsite;

                  }

                  return $returnsite;

}

 

sub notpalin {

                  my $InHex = @_[0];

                  if ($InHex eq reversesite($InHex)) {

                                    return 0;

                  } else {

                                    return 1;

                  }

}

 

 

getopts('p:s:');

 

$SeqFile = $opt_s;

$PolSize = $opt_p;

$OutFile = $SeqFile . "." . $PolSize;

 

#Grab the sequence from the files and throw it into an array

 

#file is a flat text FASTA file, so skip lines starting with >

 

open(SEQFILE, "$SeqFile") or die("Could not open Sequence File $SeqFile \n");

 

for ($i = 0; $i < ($PolSize - 1); $i++) {

                  $PolHold[$i] = "N";

}

 

%PolyHash = ();

$ContigSize = 0;

while ($line = <SEQFILE>) {

                  chomp($line);

                  @bases = split //,$line;

                  if ($bases[0] ne '>'){

                                    for ($i = 0; $i < @bases; $i++){

                                                      $bases[$i] =~ tr/a-z/A-Z/;

                                                      unless ($bases[$i] =~ m/G|A|T|C/){

                                                                        $bases[$i] = "N";

                                                      }

                                                      $PolHold[($PolSize - 1)] = $bases[$i];

                                                      $hexout = "";

                                                      for ($hexpos = 0; $hexpos <= ($PolSize -1); $hexpos++) {

                                                                        $hexout  .= $PolHold[$hexpos];

                                                      }

                                                      $PolyHash{$hexout}++;

 

                                                      #print "$hexout \t $PolyHash{$hexout} \n";

                                                      for ($j =0; $j <($PolSize - 1); $j++){

                                                                        $PolHold[$j] = $PolHold [($j + 1)];

                                                      }

                                                      $ContigSize++;

                                    }

                  }                

}

close(SEQFILE);

print "Read Sequence. \n";

 

#Step through the Hash to eliminate duplicate entries and N-containing entries.

 

while (($key, $value) = each %PolyHash) {

#                print "$key \t $value\n";

                  if ($key =~ m/N/) {

                                    delete $PolyHash{$key};

                  } elsif (notpalin($key)) {

                                    $revkey = reversesite($key);

 

#                                  if (exists $PolyHash{$revkey}) {

                                                      ############################

                                                      #         Modified         #

                                                      ############################

                                                      if ($revkey gt $key) {

                                                                        $PolyHash{$key} += $PolyHash{$revkey};

                                                                        delete $PolyHash{$revkey};

                                                      }elsif ($key gt $revkey) {

                                                                        $PolyHash{$revkey} += $PolyHash{$key};

                                                                        delete $PolyHash{$key};

                                                      }

#                                  }

                  }

}

print "Cleaned up Escape. \n ";

 

#Open a file to output the polymer count results

 

open(RESTFILE, ">$OutFile") or die ('Could not open Result File');

 

while (($key, $value) = each %PolyHash) {

#                print "$key \t $value\n";

                  print RESTFILE "$key\t$value\n";

}

 

close (RESTFILE);

 

print "Reported Polymers. \n";

print "Contig is $ContigSize Base Pairs. \n";

 

4. Pick up oligomers that satisfy our criteria in terms of number of occurrence (10 or more instance).

 

   * Source Code: New_10Inst.pl

            - input: output of previous step(3) with option Ò-iÓ

- output: name of input file +  Ò10InstÓ

 

#!/usr/bin/perl

 

#10Inst.pl

#Opens up a Count output file and reprints just those Polymers with 10 or more occurences

#Output File = InputFile.10Inst

 

use Getopt::Std;

 

getopts('i:');

 

$InFile = $opt_i;

$OutFile = $InFile . ".10Inst";

 

open(INFILE, "$InFile") or die("Failed on Input $InFile $!\n");

open(OUTFILE,">$OutFile") or die("Failed on Output $OutFile $!\n");

 

while ($line = <INFILE>){

                  chomp($line);

                  @lineparts = split /\t/, $line;

                  if ($lineparts[1] > 9){

                                    print OUTFILE "$line\n";

#                                  print "$line\n";

                  }

}

close(INFILE);

close(OUTFILE);

 

 

5. Run shell scripts in order to find how many oligomers exist each contigs in I and E subgenomes. These shell scripts used the FFP2_PCG.pl

 

   * Source Code: [50K | 100K | 250K]_myScript_for10Inst_[8|12|16|20|24].sh

   perl FFP2_PCG.pl -s ./50K_Data/Subgenome_Escape_50K_wo_EST.txt1 -p Whole_Escape_50K.txt.8.10Inst

                                                            ****

 

         * Additional source Code: FFP2_PCG.pl

 

#!/usr/bin/perl

 

use Getopt::Std;

 

sub complement {

                  my $basein = @_[0];

                  if ($basein =~ s/G/c/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }elsif ($basein =~ s/A/t/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }elsif ($basein =~ s/T/a/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }elsif ($basein =~ s/C/g/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }

}

 

sub reversesite {

                  my $sitein = @_[0];

                  my @REsite = split //, $sitein;

                  my @revsite; #Array to hold the reversed site

                  my $countletter = 0;

                  while ($#REsite >= 0) {

                                    $revsite[$countletter] = pop(@REsite);

                  #                print "$revsite[$countletter]";

                                    $revsite[$countletter] = complement($revsite[$countletter]);

                  #                print "$revsite[$countletter]";

                                    $countletter++;

                  }

                  my $returnsite = "";

                  foreach $revsite (@revsite){

                                    $returnsite = $returnsite . $revsite;

                  }

                  #print "$sitein $returnsite\n";

                  return $returnsite;

}

 

 

getopts('p:s:');

 

$SeqFile = $opt_s;

$PolFile = $opt_p;

$OutFile = $SeqFile . $PolFile;

 

%TestPol = (); #Initialize Hash for storing Test Polymers

 

#Grab the polymers from files and throw it into an array

open (POLFILE, "$PolFile") or die ("Could not open Polymer File $PolFile\n");

while ($line = <POLFILE>) {

                  @PolInfo = split /\t/,$line;

                  $PolInfo[0] =~ s/\s//;

                  $TestPol{$PolInfo[0]} = "NONE found";

}

close (POLFILE);

 

print "Read Polymers in.\n";

 

@PolymerSize = split //,$PolInfo[0];

$PolSize = scalar(@PolymerSize);

 

#file is a flat text FASTA file, so skip lines starting with >

open(SEQFILE, "$SeqFile") or die("Could not open Sequence File $SeqFile \n");

 

for ($i = 0; $i < ($PolSize - 1); $i++) {

                  $PolHold[$i] = "N";

}

$NotFound = 1;          

 

%PolyHash = ();

 

while ($line = <SEQFILE>) {

                  chomp($line);

                  @bases = split //,$line;

                  if ($bases[0] ne '>'){

                                    for ($i = 0; $i < @bases; $i++){

                                                      $PolHold[($PolSize - 1)] = $bases[$i];

                                                      $hexout = "";

                                                      for ($hexpos = 0; $hexpos <= ($PolSize -1); $hexpos++) {

                                                                        ############################

                                                                        #         Modified         #

                                                                        ############################

                                                                        $PolHold[$hexpos] =~ tr/a-z/A-Z/;

 

                                                                        $hexout  .= $PolHold[$hexpos];

                                                      }

                                    #Code to make sure that you're using the lowest version of $hexout

                                                      $testseq1 = reversesite($hexout);

                                                      if ($hexout gt $testseq1) {

                                                      #                print "$hexout $testseq1\n";

                                                                        $hexout = $testseq1;

                                                      }

 

                                                      $PolyHash{$hexout}++;

                                                      #print "$hexout \t $PolyHash{$hexout} \n";

                                                      for ($j =0; $j <($PolSize - 1); $j++){

                                                                        $PolHold[$j] = $PolHold [($j + 1)];

                                                      }

                                    }

                  } else {

print "Error\n";

                                    #Code to compare the hashes

                                    while (($key, $value) = each %TestPol) {

                                                      if (exists $PolyHash{$key}) {

                                                                        if ($TestPol{$key} eq "NONE found") {

                                                                                          $TestPol{$key} = "$CurrNT\t$PolyHash{$key}\t"

                                                                        } else {

                                                                                          $TestPol{$key} .= "$CurrNT\t$PolyHash{$key}\t";

                                                                        }

                                                      }

                                    }

                                    @NTDesc = split /\s/,$line;

                                    @NTName = split /\W/, $NTDesc[0];

                                    $CurrNT = $NTName[4];

                                    %PolyHash = ();

                                    print "Starting NT $CurrNT\n";

                  }

}

 

while (($key, $value) = each %TestPol) {

 

                  #############################

                  #          Modified         #

                  #############################

                  $myRevkey = reversesite($key);

                  if (exists $PolyHash{$key}) {

                                    if ($TestPol{$key} eq "NONE found") {

                                                      $TestPol{$key} = "$CurrNT\t$PolyHash{$key}\t";

                                    } else {

                                                      $TestPol{$key} .= "$CurrNT\t$PolyHash{$key}\t";

                                    }

                  }elsif (exists $PolyHash{$myRevkey}) {

                                    if ($TestPol{$key} eq "NONE found") {

                                                      $TestPol{$key} = "$CurrNT\t$PolyHash{$myRevkey}\t";

                                    } else {

                                                      $TestPol{$key} .= "$CurrNT\t$PolyHash{$myRevkey}\t";

                                    }

                  }

                                   

}

 

 

close(SEQFILE);

 

#Open a file to output the polymer count results

 

open(RESTFILE, ">$OutFile") or die ('Could not open Result File');

 

while (($key, $value) = each %TestPol) {

#                unless ($value eq "NONE found"){

                                    print RESTFILE "$key\t$value\n";

#                }

}

 

close (RESTFILE);

 

6. Apply second criteria in terms of the fold difference (i.e., to be overrepresented in I subgenome, an oligomer has to be overrepsented 5-fold based on the difference of subgenome size between I and E).

 

  * Source Code: Comp_Subgenome_5_against_[subject | escape]_PCG.pl50K_[Subject |    

     Escape]_contig_PCG.pl

- input:

For subject, Ò[50K | 100K| 250K]_Subject_Contig_[12 | 16 | 20 | 24].txtÓ with option Ò-sÓ and ÒWhol_Escape_[50K | 100K | 250K].txt.[12 | 16 | 20 | 24]Ó with option Ò-eÓ.

For escape, Ò[50K | 100K| 250K]_Escape_Contig_[12 | 16 | 20 | 24].txtÓ with option Ò-sÓ and ÒWhol_Subject_[50K | 100K | 250K].txt.[12 | 16 | 20 | 24]Ó with option Ò-eÓ.

- output: Diff_5_ + name following with  option Ò-sÓ +  name following with option Ò-eÓ

 

For Comp_Subgenome_5_against_subject_PCG.pl

 

#!/usr/bin/perl

 

#DecMatchMer5.pl

 

use Getopt::Std;

 

getopts('e:s:');

 

$EscFile = $opt_e;

$SubFile = $opt_s;

$OutFile = "Diff_5_" . $opt_s .  $opt_e;

 

$key =0;

$line = 0;

$value = 0;

 

open(SEQFILE, "$EscFile") or die("Could not open Data File $EscFile\n");

while ($line = <SEQFILE>) {

                  chomp($line);

                  ($key, $value) = split /\t/,$line;

                  $key =~ s/\s//gis;

                  $value =~ s/\s//gis;

                  $EscCnt{$key} = $value;

}                

close(SEQFILE);

print "Read Escape Data. \n";

 

open(SEQFILE, "$SubFile") or die("Could not open Data File $SubFile\n");

while ($line = <SEQFILE>) {

                  chomp($line);

                  ($key, $value) = split /\t/,$line;

                  $key =~ s/\s//gis;

                  $value =~ s/\s//gis;

                  $SubCnt{$key} = $value;

}                

close(SEQFILE);

 

print "Read Subject Data. \n";

 

#Open a file to output the comparison results

open(RESTFILE, ">$OutFile") or die ('Could not open Result File');

 

while (($sub_oligo, $sub_number) = each(%SubCnt)){

 

                  if ($EscCnt{$sub_oligo}){

                                    my $number = $EscCnt{$sub_oligo};

                                    if($sub_number >= (5 * $number)){

                                                      print RESTFILE "$sub_oligo\tESCAPE\t$number\t$sub_number\n";

                                    }

 

                  }else{

                                    if ($sub_number >= 5){

                                                      print RESTFILE "$sub_oligo\tESCAPE\t0\t$sub_number\n";

                                    }

                  }

}                                  

close (RESTFILE);

 

7. Carry out LocFFP_PCG.pl in order to report which oligomers were found in a particular location within genomic subgenome sequence.

 

   * Source Code: LocFFC_PCG.pl

- input: ÒWhole_[Subject | Escape].txt with option Ò-sÓ AND output of previous step (e.g., Diff_5_*) with option Ò-pÓ

- output: Location + name following with  option Ò-pÓ

 

#!/usr/bin/perl

 

#LocFFP.pl

#Location Fast Find PolyMer

#opens up a FASTA file and reports which polymers are found at what BP

 

use Getopt::Std;

 

sub complement {

                  my $basein = @_[0];

                  if ($basein =~ s/G/c/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }elsif ($basein =~ s/A/t/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }elsif ($basein =~ s/T/a/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }elsif ($basein =~ s/C/g/gi) {

                                    $basein =~ tr/a-z/A-Z/;

                                    return $basein;

                  }

}

 

sub reversesite {

                  my $sitein = @_[0];

                  my @REsite = split //, $sitein;

                  my @revsite; #Array to hold the reversed site

                  my $countletter = 0;

                  while ($#REsite >= 0) {

                                    $revsite[$countletter] = pop(@REsite);

                  #                print "$revsite[$countletter]";

                                    $revsite[$countletter] = complement($revsite[$countletter]);

                  #                print "$revsite[$countletter]";

                                    $countletter++;

                  }

                  my $returnsite = "";

                  foreach $revsite (@revsite){

                                    $returnsite = $returnsite . $revsite;

                  }

                  #print "$sitein $returnsite\n";

                  return $returnsite;

}

 

 

getopts('p:s:');

 

$SeqFile = $opt_s;

$PolFile = $opt_p;

$OutFile = "Location" . $PolFile;

 

%TestPol = (); #Initialize Hash for storing Test Polymers

 

#Grab the polymers from files and throw it into an array

open (POLFILE, "$PolFile") or die ("Could not open Polymer File $PolFile\n");

while ($line = <POLFILE>) {

                  @PolInfo = split /\t/,$line;

                  $PolInfo[0] =~ s/\s//;

                  $TestPol{$PolInfo[0]} = "NONE found";

}

close (POLFILE);

 

print "Read Polymers in.\n";

 

@PolymerSize = split //,$PolInfo[0];

$PolSize = scalar(@PolymerSize);

 

#file is a flat text FASTA file, so skip lines starting with >

open(SEQFILE, "$SeqFile") or die("Could not open Sequence File $SeqFile \n");

#Open a file to output the polymer count results

open(RESTFILE, ">$OutFile") or die ('Could not open Result File');

 

for ($i = 0; $i < ($PolSize - 1); $i++) {

                  $PolHold[$i] = "N";

}

$NotFound = 1;          

 

$CurrBP=0;

 

while ($line = <SEQFILE>) {

                  chomp($line);

                  @bases = split //,$line;

                  if ($bases[0] ne '>'){

                                    for ($i = 0; $i < @bases; $i++){

                                                      $PolHold[($PolSize - 1)] = $bases[$i];

                                                      $CurrBP++;

                                                      $hexout = "";

                                                      for ($hexpos = 0; $hexpos <= ($PolSize -1); $hexpos++) {

 

                                                                        ############################

                                                                        #         Modified         #

                                                                        ############################

                                                                        $PolHold[$hexpos] =~ tr/a-z/A-Z/;

 

                                                                        $hexout  .= $PolHold[$hexpos];

                                                      }

                                                      #Code to make sure that you're using the lowest version of $hexout

                                                      $testseq1 = reversesite($hexout);

                                                      if ($hexout gt $testseq1) {

                                                      #                print "$hexout $testseq1\n";

                                                                        $hexout = $testseq1;

                                                      }

                                                      if (exists($TestPol{$hexout})){

                                                                        print RESTFILE "$CurrBP\t$hexout\n";

                                                      }

                                                      for ($j =0; $j <($PolSize - 1); $j++){

                                                                        $PolHold[$j] = $PolHold [($j + 1)];

                                                      }

                                    }

                  }

}

close(SEQFILE);

 

close (RESTFILE);

 

8. Assign the overrepresent oligomers into repeat region or non-repeat region (i.g., check whether a particular overrepresent oligomers lay on the region which was defined as repetitive element.

 

   * Source Code: RM_alloc_From_Oligomer.pl

            - input: None

- output: Result_[Subject | Escape]_[12 | 16 | 20 | 24]_Oligo_[50K | 100K | 250K].txt

- variables to be modified: $Table, $Oligomer, $RepeatMasker

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

##### For subject

my $Table = "Diff_5_100K_Subject_Contig_24.txtWhole_Escape_100K.txt.24";

my $Oligomer = "LocationDiff_5_100K_Subject_Contig_24.txtWhole_Escape_100K.txt.24";

my $RepeatMasker = "../100K_Data/Whole_Subject_100K.txt.out";

 

#### For escape

#my $Table = "Diff_5_250K_Escape_Contig_20.txtWhole_Subject_250K.txt.20";

#my $Oligomer = "LocationDiff_5_250K_Escape_Contig_20.txtWhole_Subject_250K.txt.20";

#my $RepeatMasker = "../250K_Data/Whole_Escape_250K.txt.out";

 

open(TABLE, "< $Table")               ||

                  die "Sorry, I couldn't open the Table file: $!\n";

 

my $cLine_Table;

while(defined ($cLine_Table = <TABLE>)){

 

                  my @cArray_Table = $cLine_Table =~ /(\S+)\s*/g;

                  my $Table_Oligomer = $cArray_Table[0];

#### For subject

                  my $Table_number = $cArray_Table[2];

#### For escape

#                my $Table_number = $cArray_Table[3];

 

                  my %Contents_RM = ();

                  my $Count = 0;

                  my $TempCount = 0;

                                                     

                  open(OLIGOMER, "< $Oligomer")  ||

                                    die "Sorry, I couldn't open the Oligomer file: $!\n";

 

                  my $cLine_Oligomer;

                  while(defined ($cLine_Oligomer = <OLIGOMER>)){

 

                                    my @cArray_Oligomer = $cLine_Oligomer =~ /(\S+)\s*/g;

                                    my $loc_Oligomer = $cArray_Oligomer[0];

                                    my $Seq_Oligomer = $cArray_Oligomer[1];

 

                                    if ($Table_Oligomer eq $Seq_Oligomer){

                                                      open(REPEAT, "< $RepeatMasker")                ||

                                                                        die "Sorry, I couldn't open the RepeattMasker file: $!\n";

 

                                                      my $cLine_Repeat;

                                                      while(defined ($cLine_Repeat = <REPEAT>)){

                                                                        chomp($cLine_Repeat);

                                                                        my @cArray_Repeat = $cLine_Repeat =~ /(\S+)\s*/g;

                                                                        if ($cArray_Repeat[0] =~ /[0-9]/){

                                                                                          my $pos_begin = $cArray_Repeat[5];

                                                                                          my $pos_end = $cArray_Repeat[6];

 

                                                                                          if (($pos_begin + 16 <= ($loc_Oligomer-16)) && ($pos_end >= $loc_Oligomer)){

                                                                                                            $Contents_RM{$cArray_Repeat[10]}++;

#                                                                                                          print "$Table_Oligomer\t$cArray_Repeat[10]\n";

                                                                                                            $Count++;

                                                                                          }

                                                                        }

                                                      }

                                                      close(REPEAT);

                                                      $TempCount++;

                                    }

                  }

                  close(OLIGOMER);

 

                  if ($TempCount eq $Table_number){

                                    my $myKey;

                                    my $diff = $Table_number - $Count;

                                    print "$Table_Oligomer\t$Table_number\t$Count\t$diff\t";

                                    foreach $myKey (sort keys %Contents_RM){

                                                      print "$myKey\t$Contents_RM{$myKey}\t";

                                    }

                                    print "\n";

                  }

                  else{          print "Eroor occur...\n";                 }

}

close(TABLE);

 

9. Summary the prvious results by means of the repeat elements.

 

   * Source Code: Ordering_Result_files.pl

            - input: None

            - output: Order_Result_[Subject | Escape]_[8| 12 | 16 | 20 | 24]_Oligo_[50K | 100K | 250K].txt

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

my $Result = "Result_Subject_24_Oligo_100K.txt";

 

open(RESULT, "< $Result")            ||

                  die "Sorry, I couldn't open the Result file: $!\n";

 

my $cLine_Result;

while(defined ($cLine_Result = <RESULT>)){

                  chomp($cLine_Result);

                  my @cArray_Result = $cLine_Result =~ /(\S+)\s*/g;

                  my $total_number = scalar(@cArray_Result);

                  my %My_Hash = ();

                  my $start_number = 4;

 

#                print "$cLine_Result \t \t \t $total_number\n";

 

                  for(;$start_number < $total_number;){

                                    $My_Hash{$cArray_Result[$start_number]} = $cArray_Result[$start_number+1];

                                    $start_number +=2;

                  }

                 

                  print "$cArray_Result[0]\t$cArray_Result[1]\t";

                  my $per_re = ($cArray_Result[2] / $cArray_Result[1]) * 100;

                  print "$cArray_Result[2]\t";

                  printf "%3.1f", $per_re;

                  my $per_re_none = ($cArray_Result[3] / $cArray_Result[1]) * 100;

                  print "%\t$cArray_Result[3]\t";

                  printf "%3.1f", $per_re_none;

                  my $my_key;

                  my $my_value;

                  foreach $my_key (sort {$My_Hash{$b} <=> $My_Hash{$a}} keys %My_Hash){

                                    print "%\t$my_key\t$My_Hash{$my_key}\t";

                                    $my_value = $My_Hash{$my_key};

                                    last;

                  }

                  my $per_class_by_all = ($my_value / $cArray_Result[1]) *100;

                  my $per_class_by_re;

                  if ($cArray_Result[2] == 0){

                                    $per_class_by_re = 0;

                  }else{

                                    $per_class_by_re = ($my_value / $cArray_Result[2]) *100;

                  }

                  printf "%3.1f", $per_class_by_all;

                  print "%\t";

                  printf "%3.1f", $per_class_by_re;

                  print "%\n";

 

}

close(RESULT);

 

 

v    I did two statistic tests (these will be called random test and autosomal test) using oligomers from Xp22 / Whole X to evaluate the significance of the overrepresented oligomers.

 

- For autosomal test

: The aim of autosomal test is to access the statistical significance of an enriched    oligomer that were identified in either I or E subgenomes for the Xp22 (Whole X) as compared to autosomes. The procedures for autosomal test are as follows. Dowonload  all known genes annotated on autosomes and TSS information from UCSC Table Browser; Generate sequence fragments 100kb, 200kb, and 500kb surrounding the TSS of each gene; Merge fragments of adjacent genes if sequence fragments; Randomly pick up contigs with replacement to make up new E and I subgenomes of the Xp22 (Whole X) subgenome length.

: This process was repeated 1000 times; Calculate how many times an oligomer was present in the autosomal subgenome. To determine the empirical p-value, I

calculate the number of autosomal subgenomes (which were generated by above

procedures) at which an oligomer was presented at the same or greater number of  opies than in the original I or E subgenomes.

 

1. Extract gene information in Òknown_gene_list.txtÓ coming from UCSC table Browser.

 

   * Source Code: Extract_gene_info.pl

            - input: known_gene_list.txt

            - output: input_for_Extract_sequence.txt

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

my $Gene_Database = "known_gene_list.txt";

 

open(GENE_DB, "< $Gene_Database")          ||

                  die "Sorry, I couldn't open the known_gene_list file: $!\n";

 

my $cLine_Gene;

while(defined ($cLine_Gene = <GENE_DB>)){

 

                  my @cArray_Gene = $cLine_Gene =~ /(\S+)\s*/g;

                  my $List_Name = $cArray_Gene[0];

                  my $List_Chrom = $cArray_Gene[1];

                  my $List_Strand = $cArray_Gene[2];

                  my $List_txStart = $cArray_Gene[3];

                  my $List_txEnd = $cArray_Gene[4];

 

                  if ($List_Chrom =~ /chr\d{1,2}$/){

 

                                    my $length = 50000;

                                    my ($contigStart, $contigEnd);

                                    if ($List_Strand eq "+"){

                                                      $contigStart = $List_txStart - $length;

                                                      $contigEnd = $List_txStart + $length;

                                    }elsif ($List_Strand eq "-"){

                                                      $contigStart = $List_txEnd - $length;

                                                      $contigEnd = $List_txEnd + $length;

                                    }else {

                                                      print "Occur the error....\n";

                                    }

 

                                    if ($contigStart > 0 && $contigEnd > 0){

 

                                                      print "$List_Name \t $List_Chrom \t $List_Strand \t $List_txStart \t $List_txEnd \t $contigStart \t $contigEnd \n";

                                    }

                  }                

 

}

close(GENE_DB);

 

2. Merge fragments of adjacent genes if sequence fragments overapped

 

   * Source Code: Extract_gene_info.pl

            - input: input_for_Extract_sequence.txt

- output: Merged_contigs_[50K | 100K | 250K].txt

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

my ($dbNum, $dbID);

my %beforeMerge = ();

my %afterMerge = ();

 

sub merge_seq{

                  my $func_first = $_[0];

                  my $func_end = $_[1];

                  my ($Check_first, $Check_last);

                  delete $beforeMerge{$func_first};

 

                  while (($Check_first, $Check_last) = each(%beforeMerge)){

 

                                    if (($func_first < $Check_first) && ($func_end > $Check_last)){

                                                      delete $beforeMerge{$Check_first};

                                    }else{

                                                      if (($func_first < $Check_first) && ($func_end > $Check_first)){

                                                                        $func_end = $Check_last;

                                                                        delete $beforeMerge{$Check_first};

                                   

                                                      }                

                                                      if (($func_first < $Check_last) && ($func_end > $Check_last)){

                                                                        $func_first = $Check_first;

                                                                        delete $beforeMerge{$Check_first};

                                                      }

                                    }

                  }

                  $afterMerge{$func_first} = $func_end;

}

 

foreach $dbNum (1..22){   #for starting of foreach

 

                  $dbID = "chr".$dbNum;

 

                  my $Gene_Database = "input_for_Extract_sequence.txt";

 

                  open(GENE_DB, "< $Gene_Database")          ||

                                    die "Sorry, I couldn't open the input_for_Extract_sequence.txt file: $!\n";

 

                  my $cLine_Gene;

                  while(defined ($cLine_Gene = <GENE_DB>)){

 

                                    my @cArray_Gene = $cLine_Gene =~ /(\S+)\s*/g;

                                    my $List_Chrom = $cArray_Gene[1];

                                    my $Contig_Start = $cArray_Gene[5];

                                    my $Contig_End = $cArray_Gene[6];

                 

                                    if ($List_Chrom eq $dbID){

                                                      $beforeMerge{$Contig_Start} = $Contig_End;

                                    }

                  }

                  close(GENE_DB);

 

                  my ($first, $last);

                  while (($first, $last) = each(%beforeMerge)){

                                    merge_seq($first, $last);

                  }

 

                  my $myCount = 0;

                  for(; $myCount < 100; $myCount++){

 

                                    %beforeMerge = %afterMerge;

                                    %afterMerge = ();

                                    my @my_key_array = ();

                                    while (($first, $last) = each(%beforeMerge)){

                                                      push(@my_key_array, $first);

                                    }

                                    foreach my $my_key_val (sort {$a <=> $b} @my_key_array){

                                                      my $my_value = $beforeMerge{$my_key_val};

                                                      if ($my_value){

                                                                        merge_seq($my_key_val, $my_value);

                                                      }

                                    }

                  }

 

                  my ($a, $b);

                  while (($a, $b) = each(%afterMerge)){

                                    my $length = $b - $a +1;

                                    print "$dbID \t $a \t $b \t $length \n";

                  }

 

                  %beforeMerge = ();

                  %afterMerge = ();

}

 

3. Extract genomic sequences from ÒMerged_contig_[50K | 100K | 250K].txt.

 

   * Source Code: Seq_Extract_From_Position.pl

            - input: Merged_contigs_[50K | 100K | 250K].txt

            - output: Directory ./Sequences/[50K | 100K | 250K]_Sequence/

            - variables to be modified: $path, $outputfile

 

#!/usr/bin/perl -w

 

use strict;

use warnings;

 

# Define the variable for file analysis

my @cDatabase = ();

my $file;

my $cTempDatabase = "";

my $cTempFile = "";

my @cFile = ();

my $path = "Result_after_Merge.txt";

my $databasePath = "/Users/chungoopark/Research/X-inactivation/Sequence_Analysis/Chromosome/chromFa/";

my $outfile = "/Users/chungoopark/Research/X-inactivation/Autosomal_Oligomer/Sequences/50K_Sequence/";

 

my ($DNA_ID, $chr_num);

my @sequence = ();

my $sequenceintoDB;

my ($dbID, $dbNum, $indexSize);

my ($chrom, $dStart, $dEnd);

my $file_ID = 1;

 

foreach $dbNum (1,10,11,12,13,14,15,16,17,18,19,2,20,21,22,3,4,5,6,7,8,9){   #for starting of foreach

 

                  $DNA_ID = "";

                  $chr_num = "";

                  $dStart = "";

                  $dEnd = "";

                  $cTempDatabase = "";

                  @sequence = ();

                  @cDatabase = ();

 

                  undef $/;

                  $dbID = "chr".$dbNum.".fa";

                  my $dbIDcheck = "chr".$dbNum;

 

                  open(DBFILE, "< $databasePath$dbID")          ||

                                    die "Sorry, I couldn't open the database file: $!\n";

 

                  while(defined ($cTempDatabase = <DBFILE>)){

                                    @cDatabase = split(/\n/, $cTempDatabase);                    

                  }

                  print "upload $dbID done\n";

 

                  calPositionFile($dbIDcheck);

 

}

close(DBFILE);

 

sub calPositionFile{

 

                  $/ = "\n";   

 

                  #open the novel.txt file

                  open(FILE, "< $path")   ||

                                    die "Sorry, I couldn't open the input file with respect to subgenome: $!\n";

 

                  while(defined ($cTempFile = <FILE>)){

 

                                    @cFile = $cTempFile =~ /(\S+)\s*/g;

                                    if ($cFile[0] eq $_[0]){

                                                      $chrom = $cFile[0];

                                                      $dStart = $cFile[1];

                                                      $dEnd = $cFile[2];

                                                     

                                                      calSequence();

 

                                                      $sequenceintoDB = join(" ", @sequence);

                                                      $sequenceintoDB =~ s/\s//g;

                                                      open(OUT, ">$outfile$file_ID")     ||

                                                                        die "Sorry, I couldn't make outfile: $!\n";

                                                      print OUT $sequenceintoDB;

                                                      $file_ID ++;

                                    }

 

                  }

                  close(FILE);

}

 

sub calSequence{

 

                  my $iCount;

                  my ($StartArray, $StartPosition);

                  my ($EndArray, $EndPosition);

                  my $NumArray;

                  my @TempArray;

                  @sequence = ();

                 

                  ($StartArray, $StartPosition) = extractArray($dStart);

                  ($EndArray, $EndPosition) = extractArray($dEnd);

 

                  foreach $NumArray ($StartArray .. $EndArray){

                                    #Initiate varable

                                    @TempArray =();

                                                     

                                    if ($NumArray == $StartArray || $NumArray == $EndArray){

                                                                       

                                                      @TempArray = $cDatabase[$NumArray] =~ /([a-zA-Z])/g;

                                                                       

                                                      # Error Check

                                                      if (( $#TempArray + 1) != 50){

                                                                        die "$dStart $dEnd First_Parsing error ... Array Count...\n";

                                                      }

                                                                       

                                                      if ($NumArray == $StartArray){

                                                                        $iCount = $StartPosition;

                                                                        while ($iCount < 50){   push(@sequence, $TempArray[$iCount++]); }

                                                      }else{

                                                                        $iCount = 0;

                                                                        while ($iCount < $EndPosition + 1){                push(@sequence, $TempArray[$iCount++]); }

                                                      }

 

                                    }elsif (( $StartArray < $NumArray) && ($EndArray > $NumArray)){

 

                                                      @TempArray = $cDatabase[$NumArray] =~ /(\S)/g;

 

                                                      # Error Check

                                                      if (( $#TempArray + 1) != 50){

                                                                        die "$dStart $dEnd Second_ Parsing error ... Array Count...\n";