1. Running the Sample Code

library(ggplot2) # for graphics
library(MASS) # for maximum likelihood estimation

# quick and dirty, a truncated normal distribution to work on the solution set

z <- rnorm(n=3000,mean=0.2)
z <- data.frame(1:3000,z)
names(z) <- list("ID","myVar")
z <- z[z$myVar>0,]
str(z)
summary(z$myVar)
View(z)

I first ran all of the probability density plots on the sample data above as a tutorial to get familiar with what each chunk does. Now, I’m importing data from Bickerton et al 2012 and using those data for this assignment. In short, I will be looking at a quantitative estimate of drug-likeness (QED) for a set of molecules.

2. Using QED Data

z <- read.table("Data/NIHMS50746-supplement-AZ_chemsurvey_QED.csv",header=TRUE,sep=",")
# Only want the QED values and the chemical ID
z <- z[,c('CPD_ID','QED')]
# Let's work with 3000 values instead of over 17,000
samp <- sample(z$CPD_ID,size=3000)
z <- z[samp,]
str(z)

## 'data.frame':    3000 obs. of  2 variables:
##  $ CPD_ID: int  12797 4339 11754 14474 10378 15469 3088 4334 11045 815 ...
##  $ QED   : num  0.729 0.618 0.075 0.58 0.185 0.445 0.549 0.704 0.686 0.793 ...

summary(z)

##      CPD_ID           QED        
##  Min.   :    5   Min.   :0.0080  
##  1st Qu.: 4374   1st Qu.:0.4027  
##  Median : 8776   Median :0.5850  
##  Mean   : 8681   Mean   :0.5478  
##  3rd Qu.:13037   3rd Qu.:0.7270  
##  Max.   :17115   Max.   :0.9460

#View(z) # Now z is 3000s rows containing chemical ID and corresponding QED values

Histogram

Let’s see what the distribution of this sample of the data looks like. Note that every time this rmarkdown chunk is knitted, all of the following plots will look slightly different, because we will be taking a unique random sample of 3000 from the larger data set of >17,000 every time.

p1 <- ggplot(data=z, aes(x=QED, y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2) 
print(p1)

Empirical Density Curve

This is basically a density plot overlaid on the histogram; remember, a desnity plot is best thought of as a smoothed-out histogram.

p1 <-  p1 +  geom_density(linetype="dotted",size=0.75)
print(p1)

Max Likelihood Parameters

Maximum likelihood parameters (MLPs), for a given distribution, are the parameters that are most likely to give us a distribution that fits our data. In other words, assuming that the distribution of these data is normal, what parameters of a normal distribution (mean and standard deviation) will result in a curve that best represents our data. In yet more words, if we were to sample all the possible normal distributions, the distribution with these parameters is most likely to give us a distribution that matches our actual data. Gosh I hope that makes sense.

Also, we are using the function fitdistr to calculate MLPs

normPars <- fitdistr(z$QED,"normal")
print(normPars)

##       mean           sd     
##   0.547780667   0.229197577 
##  (0.004184556) (0.002958928)

str(normPars)

## List of 5
##  $ estimate: Named num [1:2] 0.548 0.229
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ sd      : Named num [1:2] 0.00418 0.00296
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ vcov    : num [1:2, 1:2] 1.75e-05 0.00 0.00 8.76e-06
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "mean" "sd"
##   .. ..$ : chr [1:2] "mean" "sd"
##  $ n       : int 3000
##  $ loglik  : num 163
##  - attr(*, "class")= chr "fitdistr"

normPars$estimate["mean"] # note structure of getting a named attribute

##      mean 
## 0.5477807

# normPars is a list

Normal Probability Density

Now to visualize my wordy explanation from above: this red curve is the normal distribution that best fits these data. The parameters of this red curve are the maximum likelihood parameters.

meanML <- normPars$estimate["mean"]
sdML <- normPars$estimate["sd"]

xval <- seq(0,max(z$QED),len=length(z$QED))

 stat <- stat_function(aes(x = xval, y = ..y..), fun = dnorm, colour="red", n = length(z$QED), args = list(mean = meanML, sd = sdML))
 p1 + stat

Exponential Probability Density

Now, rinse and repeat with an exponential distribution

expoPars <- fitdistr(z$QED,"exponential")
rateML <- expoPars$estimate["rate"]

stat2 <- stat_function(aes(x = xval, y = ..y..), fun = dexp, colour="blue", n = length(z$QED), args = list(rate=rateML))
 p1 + stat + stat2

Uniform Probability Density

Now with a uniform distribution. There is no need to use fitdistr because the maximum likelihood parameters are just the minimum and the maximum of the data:

stat3 <- stat_function(aes(x = xval, y = ..y..), fun = dunif, colour="darkgreen", n = length(z$QED), args = list(min=min(z$QED), max=max(z$QED)))
 p1 + stat + stat2 + stat3

Gamma Probability Density

So on and so forth

gammaPars <- fitdistr(z$QED,"gamma")
shapeML <- gammaPars$estimate["shape"]
rateML <- gammaPars$estimate["rate"]

stat4 <- stat_function(aes(x = xval, y = ..y..), fun = dgamma, colour="brown", n = length(z$QED), args = list(shape=shapeML, rate=rateML))
 p1 + stat + stat2 + stat3 + stat4

Beta Probability Density

Since these data is already between 0 and 1, I don’t need to rescale and can therefore plot the beta probability density curve on the same plot as before:

betaPars <- fitdistr(x=z$QED,start=list(shape1=1,shape2=2),"beta")

shape1ML <- betaPars$estimate["shape1"]
shape2ML <- betaPars$estimate["shape2"]

stat5 <- stat_function(aes(x = xval, y = ..y..), fun = dbeta, colour="orchid", n = length(z$QED), args = list(shape1=shape1ML,shape2=shape2ML))
p1 + stat + stat2 + stat3 + stat4 + stat5

3. What’s the best fitting distribution?

It seems the normal distribution fits these data best.

4. Simulating New Data

# getting our MLPs
normPars <- fitdistr(z$QED,"normal")
print(normPars)

##       mean           sd     
##   0.547780667   0.229197577 
##  (0.004184556) (0.002958928)

# extracting the ML mean and ML std
meanML <- normPars$estimate["mean"]
sdML <- normPars$estimate["sd"]
s <- rnorm(n=3000,mean=meanML,sd=sdML)
s <- data.frame(1:3000,s)
names(s) <- c('ID','QED')

ps <- ggplot(data=s, aes(x=QED, y=..density..)) +
  geom_histogram(color="grey60",fill="darkolivegreen1",size=0.2)+
  xlim(c(0,1))

ps+stat

pActual <- ggplot(data=z, aes(x=QED, y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2) 
pActual + stat + xlim(c(0,1))

The simulated data (green) seems to be an OK model of the actual data (yellow), however it does not capture the left-skewness of the actual data very well. I think in reality, the distribution of QED values for chemical space is most likely heavily right-skewed, in that most chemicals are not drug-like at all (very low QED), and only a few are somewhat to mostly drug-like (medium to high QED). In which case, this simulation does not model the data very well.

Homework_06

Noah Beckage

2/23/2022