library(tidyverse)
library(boot)
In this assignment, I’ll be simulating quantitative estimate of drug-likeness (QED) data, based on the same paper that was my data source for homework #6. The data set from this paper contains molecular information on 17,117 “diverse” compounds, as well as survey responses from AstraZeneca medicinal chemists regarding their intuition about said compounds. One question these chemists were asked was whether or not a given molecule was aesthetically “attractive”. The data from these responses can be used to determine whether or not there is a significant difference in QED (druglikeness) between those chemicals deemed attractive vs unattractive. If there is a significant difference in these two classes of chemicals, we would expect to see visually distinct distributions as well as significant statistics.
To address the aforementioned question with simulated data, a larger sample size is all the better; the ideal sample size would be the 17,117, as that is the actual size of the dataset we are trying to simulate. However, I’ll start with an arbitrary, smaller number to begin. According to the Bickerton paper, attractive chemicals have a mean QED of 0.67 with sd=0.16. Large, complex “unattractive” compounds have a mean QED of 0.34 with sd=0.24. These are the parameters I will be using; though they are not technically hypothetical as they are actually based on real data, I think I will still be able to complete this exercise, as I am still trying to simulate real data after all.
# Simulating the "attractive" dataset
simAttractive <- inv.logit(rnorm(n=3000,mean=.8,sd=.8))
# Making sure our mean and sd match the desired values
mean(simAttractive)
## [1] 0.6702512
sd(simAttractive)
## [1] 0.1592273
# Simulating the "unattractive" dataset
simUnattractive <- inv.logit(rnorm(n=3000,mean=-.88,sd=1.35))
# Making sure our mean and sd match the desired values
mean(simUnattractive)
## [1] 0.3478901
sd(simUnattractive)
## [1] 0.2379727
df <- data.frame(simAttractive,simUnattractive)
# df <- df %>% pivot_longer(cols=1:2,
# names_to="dataset",
# values_to="value")
# testing for a significant difference between the two distributions
t.test(df$simAttractive,df$simUnattractive, alternative='greater',var.equal=F)
##
## Welch Two Sample t-test
##
## data: df$simAttractive and df$simUnattractive
## t = 61.665, df = 5235.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.3137609 Inf
## sample estimates:
## mean of x mean of y
## 0.6702512 0.3478901
# Plotting the density curves of the simulated data
dAttractive <- density(df$simAttractive)
dUnattractive <- density(df$simUnattractive)
par(mfrow=c(1,1))
plot(dAttractive, lwd = 2, col = "red",
main = "Simulated Data", xlab = "QED",ylab="density")
lines(dUnattractive, col = "blue", lwd = 2)
legend(.225, 2.4, legend=c("Attractive", "Unattractive"),
col=c("red", "blue"),lty=1:1)
As indicated by the results of the t-test and the density plot above, these two distribution are significantly different.
# Trying 100 simulations with minimal difference between means to see if any simulations still result in a significant difference
hits <- 0
for (i in c(1:100)) {
simAttractive <- inv.logit(rnorm(n=3000,mean=-.76,sd=.8))
simUnattractive <- inv.logit(rnorm(n=3000,mean=-.88,sd=1.35))
t <- t.test(simAttractive,simUnattractive, alternative='greater',var.equal=F)
if (t$p.value<0.05){
hits <- hits+1
# these lines save the last sig dif simulated data sets to be plotted later
dAttractive <- density(simAttractive)
dUnattractive <- density(simUnattractive)
}
}
hits
## [1] 3
# Plotting again to help visualize
par(mfrow=c(1,1))
plot(dAttractive, lwd = 2, col = "red",
main = "Minimal Mean", ylab = "density",xlab="QED")
lines(dUnattractive, col = "blue", lwd = 2)
legend(.65, 2.25, legend=c("Attractive", "Unattractive"),
col=c("red", "blue"),lty=1:1)
hits
represents the number of simulated data sets that
resulted in a significant t-test result, out of 100 simulations. The
accompanying plot shows the most recent (for-loop wise) significantly
different simulated data set. It seems like the minimal difference
necessary for the two means of the distributions to still be
significantly different is 0.12 (given that their standard deviations
are unchanged).
# Trying 100 simulations to get an idea of the minimal sample size required to still get a significant difference in distribution means
hits <- 0
for (i in c(1:100)) {
simAttractive <- inv.logit(rnorm(n=2,mean=.8,sd=.8))
simUnattractive <- inv.logit(rnorm(n=2,mean=-.88,sd=1.35))
t <- t.test(simAttractive,simUnattractive, alternative='greater',var.equal=F)
if (t$p.value<0.05){
hits <- hits+1
}
}
hits
## [1] 22
Even with the minimal sample size of n=2, the simulated data sets are still significantly different roughly 25% of the time. Note: I figured it wasn’t worth making a plot for these data with such a small sample size.