library(ggplot2) # for graphics
library(MASS) # for maximum likelihood estimation
# quick and dirty, a truncated normal distribution to work on the solution set
z <- rnorm(n=3000,mean=0.2)
z <- data.frame(1:3000,z)
names(z) <- list("ID","myVar")
z <- z[z$myVar>0,]
str(z)
summary(z$myVar)
View(z)
I first ran all of the probability density plots on the sample data above as a tutorial to get familiar with what each chunk does. Now, I’m importing data from Bickerton et al 2012 and using those data for this assignment. In short, I will be looking at a quantitative estimate of drug-likeness (QED) for a set of molecules.
z <- read.table("Data/NIHMS50746-supplement-AZ_chemsurvey_QED.csv",header=TRUE,sep=",")
# Only want the QED values and the chemical ID
z <- z[,c('CPD_ID','QED')]
# Let's work with 3000 values instead of over 17,000
samp <- sample(z$CPD_ID,size=3000)
z <- z[samp,]
str(z)
## 'data.frame': 3000 obs. of 2 variables:
## $ CPD_ID: int 12797 4339 11754 14474 10378 15469 3088 4334 11045 815 ...
## $ QED : num 0.729 0.618 0.075 0.58 0.185 0.445 0.549 0.704 0.686 0.793 ...
summary(z)
## CPD_ID QED
## Min. : 5 Min. :0.0080
## 1st Qu.: 4374 1st Qu.:0.4027
## Median : 8776 Median :0.5850
## Mean : 8681 Mean :0.5478
## 3rd Qu.:13037 3rd Qu.:0.7270
## Max. :17115 Max. :0.9460
#View(z) # Now z is 3000s rows containing chemical ID and corresponding QED values
Let’s see what the distribution of this sample of the data looks like. Note that every time this rmarkdown chunk is knitted, all of the following plots will look slightly different, because we will be taking a unique random sample of 3000 from the larger data set of >17,000 every time.
p1 <- ggplot(data=z, aes(x=QED, y=..density..)) +
geom_histogram(color="grey60",fill="cornsilk",size=0.2)
print(p1)
This is basically a density plot overlaid on the histogram; remember, a desnity plot is best thought of as a smoothed-out histogram.
p1 <- p1 + geom_density(linetype="dotted",size=0.75)
print(p1)
Maximum likelihood parameters (MLPs), for a given distribution, are the parameters that are most likely to give us a distribution that fits our data. In other words, assuming that the distribution of these data is normal, what parameters of a normal distribution (mean and standard deviation) will result in a curve that best represents our data. In yet more words, if we were to sample all the possible normal distributions, the distribution with these parameters is most likely to give us a distribution that matches our actual data. Gosh I hope that makes sense.
Also, we are using the function fitdistr
to calculate
MLPs
normPars <- fitdistr(z$QED,"normal")
print(normPars)
## mean sd
## 0.547780667 0.229197577
## (0.004184556) (0.002958928)
str(normPars)
## List of 5
## $ estimate: Named num [1:2] 0.548 0.229
## ..- attr(*, "names")= chr [1:2] "mean" "sd"
## $ sd : Named num [1:2] 0.00418 0.00296
## ..- attr(*, "names")= chr [1:2] "mean" "sd"
## $ vcov : num [1:2, 1:2] 1.75e-05 0.00 0.00 8.76e-06
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "mean" "sd"
## .. ..$ : chr [1:2] "mean" "sd"
## $ n : int 3000
## $ loglik : num 163
## - attr(*, "class")= chr "fitdistr"
normPars$estimate["mean"] # note structure of getting a named attribute
## mean
## 0.5477807
# normPars is a list
Now to visualize my wordy explanation from above: this red curve is the normal distribution that best fits these data. The parameters of this red curve are the maximum likelihood parameters.
meanML <- normPars$estimate["mean"]
sdML <- normPars$estimate["sd"]
xval <- seq(0,max(z$QED),len=length(z$QED))
stat <- stat_function(aes(x = xval, y = ..y..), fun = dnorm, colour="red", n = length(z$QED), args = list(mean = meanML, sd = sdML))
p1 + stat
Now, rinse and repeat with an exponential distribution
expoPars <- fitdistr(z$QED,"exponential")
rateML <- expoPars$estimate["rate"]
stat2 <- stat_function(aes(x = xval, y = ..y..), fun = dexp, colour="blue", n = length(z$QED), args = list(rate=rateML))
p1 + stat + stat2
Now with a uniform distribution. There is no need to use
fitdistr
because the maximum likelihood parameters are just
the minimum and the maximum of the data:
stat3 <- stat_function(aes(x = xval, y = ..y..), fun = dunif, colour="darkgreen", n = length(z$QED), args = list(min=min(z$QED), max=max(z$QED)))
p1 + stat + stat2 + stat3
So on and so forth
gammaPars <- fitdistr(z$QED,"gamma")
shapeML <- gammaPars$estimate["shape"]
rateML <- gammaPars$estimate["rate"]
stat4 <- stat_function(aes(x = xval, y = ..y..), fun = dgamma, colour="brown", n = length(z$QED), args = list(shape=shapeML, rate=rateML))
p1 + stat + stat2 + stat3 + stat4
Since these data is already between 0 and 1, I don’t need to rescale and can therefore plot the beta probability density curve on the same plot as before:
betaPars <- fitdistr(x=z$QED,start=list(shape1=1,shape2=2),"beta")
shape1ML <- betaPars$estimate["shape1"]
shape2ML <- betaPars$estimate["shape2"]
stat5 <- stat_function(aes(x = xval, y = ..y..), fun = dbeta, colour="orchid", n = length(z$QED), args = list(shape1=shape1ML,shape2=shape2ML))
p1 + stat + stat2 + stat3 + stat4 + stat5
It seems the normal distribution fits these data best.
# getting our MLPs
normPars <- fitdistr(z$QED,"normal")
print(normPars)
## mean sd
## 0.547780667 0.229197577
## (0.004184556) (0.002958928)
# extracting the ML mean and ML std
meanML <- normPars$estimate["mean"]
sdML <- normPars$estimate["sd"]
s <- rnorm(n=3000,mean=meanML,sd=sdML)
s <- data.frame(1:3000,s)
names(s) <- c('ID','QED')
ps <- ggplot(data=s, aes(x=QED, y=..density..)) +
geom_histogram(color="grey60",fill="darkolivegreen1",size=0.2)+
xlim(c(0,1))
ps+stat
pActual <- ggplot(data=z, aes(x=QED, y=..density..)) +
geom_histogram(color="grey60",fill="cornsilk",size=0.2)
pActual + stat + xlim(c(0,1))
The simulated data (green) seems to be an OK model of the actual data (yellow), however it does not capture the left-skewness of the actual data very well. I think in reality, the distribution of QED values for chemical space is most likely heavily right-skewed, in that most chemicals are not drug-like at all (very low QED), and only a few are somewhat to mostly drug-like (medium to high QED). In which case, this simulation does not model the data very well.