Rcode FDR
From CCGB
For more pages from the Hardison Lab see HardisonLab.
Example Data
Define False Discovery Rate (FDR) as the expected number of false positives divided by the expected number of positives in the joint distribution.
Background sample
X = 1800 x = rnorm(X) ecdfx = ecdf(X)
Sample of positives
Y = 200 y = rnorm(Y) ecdfy = ecdf(y)
Joint sample
all = c(x,y)
Joint distribution
ecdfall = ecdf(all)
FDR
Define the expected number of false positives.
E(x, p) = X * p; X is the number drawn from the null distribution in the background set, x.
Ex = function(x) X * (1-ecdfx(x))
Define the observed number of positives.
E(All, p) = N * p; N is the number drawn from the full set, All.
Eall = function(x) N *(1-ecdfall(x))
The false discovery rate at some p value threshold is defined as the proportion of false positives at that threshold.
FDR(p) = E(X,p) / E(All,p)
fdr = function(x) Ex(x)/Eall(x)
Drawing samples from two distributions x, and y.
N = 2000
Y = 100
X = 2000 -Y
x = rnorm(X)
y = rnorm(Y,3)
The relationship between False Discovery Rate of all the tests, and p-value within a test, depends on the relative sizes of x and y
By sampling larger and larger portions of Y, the false discovery rate decreases.
R Code
Functions to compute FDR
Ex = function(x) X * (1-ecdfx(x))
Eall = function(x) N *(1-ecdfall(x))
fdr = function(x) Ex(x)/Eall(x)
graph 1 code
quartz(width=8,height=5) faintred = hsv(0,1,1, alpha=0.1) faintblue = hsv(2/3, 1, 1, alpha=0.1) plot(dnorm, -6,6) alpha = 0.01 x_alpha = qnorm(1-alpha) # right-tail series = seq(-6,x_alpha,0.01) seriesright = seq( x_alpha, 6, 0.01) polygon( c( series, x_alpha), c(dnorm(series), 0), col=faintred,border=NA) polygon( c(x_alpha, seriesright),c(0,dnorm(seriesright)), col='red',border=NA) series2 = seq(-3,6,0.01) polygon( c( series2, 6), c(dnorm(series2,3), 0), col=faintblue, border=NA) segments(0,0,0,dnorm(0),col='red', lty=2) segments(3,0,3,dnorm(0),col='blue', lty=2)
graph 2 code
Ex = function(x) X * (1-ecdfx(x))
Eall = function(x) N *(1-ecdfall(x))
fdr = function(x) Ex(x)/Eall(x)
df = data.frame(
Y = c(100,200,500,1000),
col = c("red","blue","green","black"))
for (row in 1:nrow(df))
{
Y = df[row, 'Y']
col = as.character( df[row, 'col'] )
X = 2000 -Y
N = 2000
x = rnorm(X)
ecdfx = ecdf(x)
y = rnorm(Y,3)
all = c(x,y)
all = all[order(all)]
ecdfall = ecdf(all)
ptest = 1-pnorm(all) # our set of p-values
q = fdr(all)
if( is.null(dev.list())) plot(ptest,q,type='n', xlab="p-value from test statistic", ylab="observed FDR")
lines( ptest, q, col=col)
}
test script
Ex = function(x) X * (1-ecdfx(x))
Eall = function(x) N *(1-ecdfall(x))
fdr = function(x) Ex(x)/Eall(x)
df = data.frame(
Y = c(100,200,500,1000),
col = c("red","blue","green","black"))
for (row in 1:nrow(df))
{
Y = df[row, 'Y']
col = as.character( df[row, 'col'] )
X = 2000 -Y
N = 2000
x = rnorm(X)
ecdfx = ecdf(x)
y = rnorm(Y,3)
all = c(x,y)
all = all[order(all)]
ecdfall = ecdf(all)
ptest = 1-pnorm(all) # our set of p-values
q = fdr(all)
if( is.null(dev.list())) plot(ptest,q,type='n', xlab="p-value from test statistic", ylab="observed FDR")
lines( ptest, q, col=col)
}
legend(.6,.2,legend=sprintf("Y=%d",df$Y), col=as.character(df$col), lty=1)
for (row in 1:nrow(df))
{
Y = df[row, 'Y']
col = as.character( df[row, 'col'] )
X = 2000 -Y
N = 2000
x = rnorm(X)
ecdfx = ecdf(x)
y = rnorm(Y,3)
all = c(x,y)
all = all[order(all)]
ecdfall = ecdf(all)
ptest = 1-pnorm(all) # our set of p-values
q = fdr(all)
if( is.null(dev.list())) plot(q,p.adjust(ptest),type='n', xlab="observed FDR", ylab="q value from p.adjust(method='FDR')")
lines( q, p.adjust(ptest,"fdr"), col=col)
}
legend(0,.9,legend=sprintf("Y=%d",df$Y), col=as.character(df$col), lty=1)
FDR provided by the R package
Command for FDR correction
Assume, p is a vector of p-values and one would like to calculate a FDR q value for each p value.
q=p.adjust(p,"fdr")
q is the resultant q-value vector.
Details in p.adjust("fdr")
The FDR controlling procedure is:
Rank all p-values in increasing order,. And find the largest i for which satisfies:
![]()
Details of R code:
i <- n:1 o <- order(p, decreasing = TRUE) #Rank p values in decreasing order. ro <- order(o) q = pmin(1, cummin(n/i * p[o]))[ro] # Correct p value by a factor n/i (p' = n/i * p). # Since FDR only cares about the largest i, minimal p' for k<=i is used (cummin function). # Return pmin(1, cummin(n/i*p)) as the q value vector.
References
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300.
.
And find the largest i for which satisfies:

