R notes
From CCGB
This place is for R code we want to remember. Currently it contains various topics that might later be reorganized. There is also a separate section for empirical FDR.
More pages from the HardisonLab.
The Reality
- R is a powerful statistical language that has many useful functions, especially for plotting.
- R is nearly impossible to use.
- Grad students wrote this code: use at your own risk.
- The R project
Contents |
PDF plots
Symbols
Most font styles have a subset devoted to symbols, but making use of them can be difficult. An alternative is to use plotmath(), which will draw a given symbol in an expression. However, if you want to use a character from a font to express a mathematical symbol, you can make use of the "font" parameter in text().
- ∩ (intersect). text(x, y,"\307", font=5)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
|---|---|---|---|---|---|---|---|---|
| 240 | € | ϒ | ′ | ≤ | ⁄ | ∞ | ƒ | ♣ |
| 250 | • | ♥ | ♠ | ↔ | ← | ↑ | → | ↓ |
| 260 | ° | ± | ″ | ≥ | × | ∝ | ∂ | • |
| 270 | ÷ | ≠ | ≡ | ≈ | … | | ⎯ | ↵ |
| 300 | ℵ | ℑ | ℜ | ℘ | ⊗ | ⊕ | ∅ | ∩ |
| 310 | ∪ | ⊃ | ⊇ | ⊄ | ⊂ | ⊆ | ∈ | ∉ |
| 320 | ∠ | ∇ | ® | © | ™ | ∏ | √ | ⋅ |
| 330 | ¬ | ∧ | ∨ | ⇔ | ⇐ | ⇑ | *> | *v |
| 340 | ◊ | 〈 | | | | ∑ | ⎛ | ⎜ |
| 350 | ⎝ | ⎡ | ⎢ | ⎣ | ⎧ | ⎨ | ⎩ | ⎪ |
| 360 | | 〉 | ∫ | ⌠ | ⎮ | ⌡ | ⎞ | ⎟ |
| 370 | ⎠ | ⎤ | ⎥ | ⎦ | ⎫ | ⎬ | ⎭ |
Pasting data into R
Importing data into R can be tricky if the data are complex or of mixed type (see the R-data manual). Pasting simple data into the console, however, is quite easy.
scan() from the console
The default file argument for scan(file="") instructs it to read from the console:
> scan("")
1: 1
2: 2
3: 3
4: 4
5:
Read 4 items
[1] 1 2 3 4
Using textConnection with scan()
Paste or type data into quotes
s = "" s = "1 2 3 5 76 7 43"
Use textConnection() and scan() on a string.
x=scan(textConnection(s))
This makes scan read the string as if it were a file.
x=scan(textConnection("1 2 3 5 76 7 43"))
> x
[1] 1 2 3 5 76 7 43
> mean(x)
[1] 19.57143
This example wraps the output of textConnection without saving it to a variable. This may eventually produce a warning when R closes the connection automatically (see Warnings). To avoid this, use a variable for the connection.
s = "" s = "1 2 3 5 76 7 43" zz <- textConnection(s) x=scan(zz) close(zz)
paste.table()
This function will wrap read.table() with the textConnection(). If you have complex data that is inconvenient to save in a file, this function will suffice.
paste.table <- function(pastedstring, header = FALSE, sep = "", quote = "\"'",
dec = ".", row.names, col.names,
as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown")
{
input = textConnection(pastedstring);
output = read.table(input, header, sep, quote,
dec, row.names, col.names,
as.is, na.strings, colClasses, nrows,
skip, check.names, fill,
strip.white, blank.lines.skip,
comment.char,
allowEscapes, flush,
stringsAsFactors,
fileEncoding, encoding);
close(input);
output;
}
Usage example:
paste.table('TAG LABEL COLOR AREA AXIS_RATIO ANGLE POSITION_X POSITION_Y
+ A NoName 204,204,204 5,000 1 0 218 174
+ B NoName 204,204,204 4,000 1 0 176 178
+ C NoName 204,204,204 5,000 2 0 184 155', header=T)
TAG LABEL COLOR AREA AXIS_RATIO ANGLE POSITION_X POSITION_Y
1 A NoName 204,204,204 5,000 1 0 218 174
2 B NoName 204,204,204 4,000 1 0 176 178
3 C NoName 204,204,204 5,000 2 0 184 155
Warnings
8192 byte limit
There is a limit of 8192 bytes on the size of strings which can be parsed. Otherwise, you will get an error message:
Error: input buffer overflow.
Therefore, if you really want to read in a large string, save it in a file first, then use the following command (read.table etc.).
Closing the connection
The textConnection is like an open file. Use close() to avoid garbage collection warning messages:
Warning messages:
1: closing unused connection 8 ("1 2 3")
2: closing unused connection 7 ("1 2 3")
3: closing unused connection 6 ("1 2 3")
4: closing unused connection 5 (s)
5: closing unused connection 4 (s)
6: closing unused connection 3 (s)
==Reading Excel Spreadsheets into R==
Much of our data is kept in .xls format, and can be saved as a text file for use in R. However, R has very strict constraints on reading in a table of data from text file. Therefore, added parameters to '''read.table()''' and '''scan()''' may be necessary.
===Irregular spreadsheet format===
Single name column, followed by cells of data that might be different in each row:
:Rows of irregular length make an incomplete table of data. Use '''read.table()''' with ''fill'' and ''col.names'' to read-in the file with a fixed number of columns, replacing the blank cells with NA. This is just to get the data into R in a format we can work with.
[[Image:Spreadsheet.png|500x250px]]
<pre>
x = read.table(file, col.names=1:100, fill=T)
Now we want to assign all the values of a given row to the name in the first column. This is not a straightforward operation in R, so we have to take apart the dataframe and make a new one with the different values.
- Remove any columns that aren't numeric
- Turn rows to columns t() for transpose, and use c(x,recursive=T) to concatenate the columns into a single vector.
- Replicate the names in the first column to correspond to the vector just created.
xdat = x[2:100]
vals = c(t(xdat), recursive=T )
names = rep( x[[1]], rep(100-1, nrow(x) ))
Lastly, create a new dataframe using the two vectors just created and delete all the NAs we needed to for reading in the data.
df = data.frame(name=names, val=vals)
df = df[! is.na(df$val),]
readIntoVertical()
The following function readIntoVertical() will handle the above steps. It also uses factors to sort the order of data with respect to the names by group mean and median.
readIntoVertical = function(file, maxcols=100, ...)
{
x = read.table(file, col.names=1:maxcols, fill=T)
print(sprintf("Read in %d lines", nrow(x)));
xdat = x[2:maxcols]
vals = c(t(xdat), recursive=T)
names = rep(x[[1]], rep(maxcols-1, nrow(x)))
df = data.frame(factormean=names, factormedian=names, val=vals)
df = df[! is.na(df$val),]
print(sprintf("%d datapoints",nrow(df)))
medians = aggregate(df$val, by=list(names=df$factormean), median)
medians = medians[order( medians$x ),]
means = aggregate( df$val, by=list(names=df$factormedian), mean )
means = means[ order( means$x ), ]
df$factormedian = factor( df$factormedian, levels = medians$names )
df$factormean = factor( df$factormean, levels = means$names )
print(sprintf("%d factors",nrow(medians)))
return(df)
}
Examples
x = readIntoVertical('Desktop/MergeTFFallYCrh.txt')
Dataframe:
> x[1:10,]
factormean factormedian val
1 GHN322 GHN322 0.82
2 GHN322 GHN322 0.82
3 GHN322 GHN322 0.79
4 GHN322 GHN322 0.75
5 GHN322 GHN322 1.38
6 GHN322 GHN322 1.19
7 GHN322 GHN322 1.97
8 GHN322 GHN322 1.10
9 GHN322 GHN322 0.45
100 GHN534 GHN534 1.46
The factor values are repeated as many times as there is a value corresponding to it. Because these columns are factors, and not just strings of characters, they have a hidden numeric property.
> levels(x$factormean) [1] "GHP300" "GHP21" "GHP42" "GHP297" "GHN213" "GHP130" "GHP301" "GHP118" "GHP90" "GHP20" "GHP283" "GHP185" "GHP227" [14] "GHP0" "GHN240" "GHN419" "GHP310" "GHP279" "GHP8" "GHP200" "GHP6" "GHP101" "GHP100" "GHP202" "GHP194" "GHP122" [27] "GHP22" "GHP234" "GHP206" "GHN159" "GHP78" "GHP184" "GHP203" "GHN133" "GHP89" "GHP17" "GHP87" "GHP30" "GHP197" [40] "GHP19" "GHP191" "GHP183" "GHP160" "GHP128" "GHP24" "GHP11" "GHN322" "GHP32" "GHP276" "GHP164" "GHN6" "GHN534" [53] "GHP201" "GHN391" "GHP2" "GHP26" "GHP12" "GHP117" "GHP15" "GHP311" "GHP198" "GHP169" "GHP27" "GHP199" "GHP23" [66] "GHP3" "GHN478" "GHP243" "GHP16" "GHN37" "GHP161" "GHP14" "GHP170" "GHP291" "GHP31" "GHP82" "GHP106" "GHP29" [79] "GHP18" "GHN781" "GHP28" "GHP159" "GHP167" "GHP293" "GHP173" "GHP9" "GHP25" "GHP223" "GHP13" "GHP172" "GHP246" [92] "GHP75" "GHP216" "GHP193" "GHP84" "GHP308" "GHP7" "GHP105" "GHP186" "GHP163" "GHP74" "GHP219" "GHP204" "GHP196" [105] "GHP156" "GHP270" "GHP4" "GHP1" "GHP314" "GHP182" "GHP150" "GHP309" "GHP264" "GHP222" "GHP205" "GHP275" "GHP88" [118] "GHP296" "GHP68" "GHP10" "GHP53" "GHP304" "GHP147" "GHP221" "GHP181" > levels(x$factormedian) [1] "GHP42" "GHP130" "GHP90" "GHP300" "GHP310" "GHP21" "GHN213" "GHP297" "GHP301" "GHP283" "GHP101" "GHP227" "GHP279" [14] "GHN240" "GHP20" "GHP118" "GHP0" "GHP6" "GHP185" "GHP8" "GHP122" "GHN419" "GHP200" "GHP100" "GHP202" "GHN159" [27] "GHP206" "GHP89" "GHP194" "GHP17" "GHP234" "GHP22" "GHN322" "GHP19" "GHP203" "GHN133" "GHP184" "GHP24" "GHP87" [40] "GHP11" "GHP291" "GHP183" "GHP197" "GHP78" "GHP191" "GHP30" "GHP2" "GHP32" "GHP26" "GHP128" "GHN6" "GHP160" [53] "GHP276" "GHP15" "GHP23" "GHP27" "GHP311" "GHP201" "GHP117" "GHN391" "GHN534" "GHP12" "GHP164" "GHP199" "GHP29" [66] "GHN37" "GHP13" "GHP243" "GHP16" "GHP161" "GHP28" "GHP169" "GHP198" "GHP14" "GHP3" "GHP25" "GHP106" "GHN478" [79] "GHP18" "GHP82" "GHP31" "GHP308" "GHP75" "GHP170" "GHP216" "GHN781" "GHP293" "GHP173" "GHP223" "GHP159" "GHP246" [92] "GHP9" "GHP7" "GHP163" "GHP167" "GHP219" "GHP74" "GHP105" "GHP84" "GHP172" "GHP193" "GHP156" "GHP196" "GHP205" [105] "GHP204" "GHP1" "GHP186" "GHP314" "GHP296" "GHP4" "GHP270" "GHP150" "GHP222" "GHP309" "GHP264" "GHP182" "GHP275" [118] "GHP88" "GHP10" "GHP68" "GHP53" "GHP221" "GHP147" "GHP304" "GHP181"
These two commands give the factors sorted by mean and median. The order is not the same.
Boxplot
Plotting against the column factormean will group the data into each factor (such as "GHN534") in the order of group means. Likewise for factormedian.
boxplot( val ~ factormean, x, las=2)
Points plot
A more complicated example to show the individual values.
plot( jitter(as.integer(x$factormean)), x$val, pch=19, axes=F,xlab='',ylab='fold change'); axis(1,at=1:length(levels(x$factormean)), labels=levels(x$factormean),las=2); axis(2);
Histogram (and Multiple Histogram)
Another easy way to display overall distribution of the values is histogram.
hist(x$val,breaks=seq(0,18,0.05),freq=T,col="blue",border=F,ylab="counts",xlab="Fold Enhancement",main='') axis(1,at=seq(0,18,1))
One problem here is that y-axis in "R hist()" is the raw counts of each bin when we set "freq = T".
To make y-axis be the real frequency, you can use the following commands:
y=hist(x$val,breaks=seq(0,18,0.05), plot=F) #must specify "breaks" here y$counts=y$counts/sum(y$counts) plot(y,col="blue", border = F, main='',xlab="Fold Enhancement") axis(1,at=seq(0,18,1))
Also I hacked the original hist() code and made a modified hist.mod() to plot really frequency on y-axis. The source code is:Media:Hist mod.R
Two ways to use hist_mod.R:
1. Copy the text in hist_mod.R and paste it directly to R console.
2. Download hist_mod.R and save it to your working directory, then source("hist_mod.R") in R console.
source("hist_mod.R")
hist.mod(x$val,breaks=seq(0,18,0.05),freq=T,col="blue",border=F,xlab="Fold Enhancement",main='')
axis(1,at=seq(0,18,1))
Multi-histogram
In R, a multi-histogram can be generated by the function "multhist()" embedded in package "plotrix".
Or one can specify "barplot()" function to generate a multi-histogram.


