Rcode FDR

From CCGB

Jump to: navigation, search

For more pages from the Hardison Lab see HardisonLab.

Contents

Example Data

Define False Discovery Rate (FDR) as the expected number of false positives divided by the expected number of positives in the joint distribution.

Background sample

X = 1800
x = rnorm(X)
ecdfx = ecdf(X)

Sample of positives

Y = 200
y = rnorm(Y)
ecdfy = ecdf(y)

Joint sample

all = c(x,y)

Joint distribution

ecdfall = ecdf(all)

FDR

Define the expected number of false positives.

E(x, p) = X * p; X is the number drawn from the null distribution in the background set, x.

Ex = function(x) X * (1-ecdfx(x))

Define the observed number of positives.

E(All, p) = N * p; N is the number drawn from the full set, All.

Eall = function(x) N *(1-ecdfall(x))

The false discovery rate at some p value threshold is defined as the proportion of false positives at that threshold.

FDR(p) = E(X,p) / E(All,p)

fdr = function(x) Ex(x)/Eall(x)

Drawing samples from two distributions x, and y.

   N = 2000
   Y = 100
   X = 2000 -Y
      
   x = rnorm(X)
   y = rnorm(Y,3)

graph code

The relationship between False Discovery Rate of all the tests, and p-value within a test, depends on the relative sizes of x and y

By sampling larger and larger portions of Y, the false discovery rate decreases.

#graph 2 code

R Code

Functions to compute FDR

Ex = function(x) X * (1-ecdfx(x))

Eall = function(x) N *(1-ecdfall(x))

fdr = function(x) Ex(x)/Eall(x)

graph 1 code

Enlarge
quartz(width=8,height=5)
faintred = hsv(0,1,1, alpha=0.1)
faintblue = hsv(2/3, 1, 1, alpha=0.1)
plot(dnorm, -6,6)
alpha = 0.01
x_alpha = qnorm(1-alpha) # right-tail
series = seq(-6,x_alpha,0.01)
seriesright = seq( x_alpha, 6, 0.01)
polygon( c( series, x_alpha), c(dnorm(series), 0), col=faintred,border=NA)
polygon( c(x_alpha, seriesright),c(0,dnorm(seriesright)), col='red',border=NA)

series2 = seq(-3,6,0.01)
polygon( c( series2, 6), c(dnorm(series2,3), 0), col=faintblue, border=NA)

segments(0,0,0,dnorm(0),col='red', lty=2)
segments(3,0,3,dnorm(0),col='blue', lty=2)

graph 2 code

Enlarge
Ex = function(x) X * (1-ecdfx(x))
Eall = function(x) N *(1-ecdfall(x))
fdr = function(x) Ex(x)/Eall(x)

df = data.frame(
Y = c(100,200,500,1000),
col = c("red","blue","green","black"))

for (row in 1:nrow(df))
{
    Y = df[row, 'Y']
    col = as.character( df[row, 'col'] )
    X = 2000 -Y
    N = 2000
    x = rnorm(X)
    ecdfx = ecdf(x)

    y = rnorm(Y,3)
    all = c(x,y)
    all = all[order(all)]
    ecdfall = ecdf(all)
    ptest = 1-pnorm(all) # our set of p-values
    q = fdr(all)

    if( is.null(dev.list())) plot(ptest,q,type='n', xlab="p-value from test statistic", ylab="observed FDR")
    lines( ptest, q, col=col)
}

test script

Ex = function(x) X * (1-ecdfx(x))
Eall = function(x) N *(1-ecdfall(x))
fdr = function(x) Ex(x)/Eall(x)

df = data.frame(
Y = c(100,200,500,1000),
col = c("red","blue","green","black"))

for (row in 1:nrow(df))
{
    Y = df[row, 'Y']
    col = as.character( df[row, 'col'] )
    X = 2000 -Y
    N = 2000
    x = rnorm(X)
    ecdfx = ecdf(x)

    y = rnorm(Y,3)
    all = c(x,y)
    all = all[order(all)]
    ecdfall = ecdf(all)
    ptest = 1-pnorm(all) # our set of p-values
    q = fdr(all)

    if( is.null(dev.list())) plot(ptest,q,type='n', xlab="p-value from test statistic", ylab="observed FDR")
    lines( ptest, q, col=col)
}
legend(.6,.2,legend=sprintf("Y=%d",df$Y), col=as.character(df$col), lty=1)

for (row in 1:nrow(df))
{
    Y = df[row, 'Y']
    col = as.character( df[row, 'col'] )
    X = 2000 -Y
    N = 2000
    x = rnorm(X)
    ecdfx = ecdf(x)

    y = rnorm(Y,3)
    all = c(x,y)
    all = all[order(all)]
    ecdfall = ecdf(all)
    ptest = 1-pnorm(all) # our set of p-values
    q = fdr(all)

    if( is.null(dev.list())) plot(q,p.adjust(ptest),type='n', xlab="observed FDR", ylab="q value from p.adjust(method='FDR')")
    lines( q, p.adjust(ptest,"fdr"), col=col)
}
legend(0,.9,legend=sprintf("Y=%d",df$Y), col=as.character(df$col), lty=1)

FDR provided by the R package

Command for FDR correction

Assume, p is a vector of p-values and one would like to calculate a FDR q value for each p value.

q=p.adjust(p,"fdr")

q is the resultant q-value vector.

Details in p.adjust("fdr")

The FDR controlling procedure is:

Rank all p-values in increasing order, p_1\le p_2\le p_3\le ...\le p_n. 
And find the largest i for which satisfies: p_i\le \frac{i}{n} *\alpha\,


Details of R code:

i <- n:1
o <- order(p, decreasing = TRUE) 
  #Rank p values in decreasing order.
ro <- order(o)
q = pmin(1, cummin(n/i * p[o]))[ro]
  # Correct p value by a factor n/i (p' = n/i * p).
  # Since FDR only cares about the largest i, minimal p' for k<=i is used (cummin function).
  # Return pmin(1, cummin(n/i*p)) as the q value vector.

References

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300.

Personal tools