library(ggplot2)
library(RColorBrewer)
library(viridisLite)
library(paletteer)
library(ggpubr)
library(grid)
This data set contains 10 molecular properties from a random sample
of 88,878 chemicals from the PubChem database. First column
SMILES
contains a unique
smile
string for each chemical. There is also a column with G-protein
coupled receptor (GPCR) activity scores and one with qualitative
estimates of drug-likeness (QED).
For ease of computing, I’m going to take a random sample of 10,000 rows from this data frame.
all_data <- read.csv("C:/Users/nbeck/Desktop/Spring 2022/CS 287/Project/CS287-Project/data/PubChemData.csv")
small_data <- all_data[sample(1:nrow(all_data),10000),]
str(small_data)
## 'data.frame': 10000 obs. of 13 variables:
## $ SMILES : chr "[H]Cl.[H]Oc1c2c(c(O[H])c3c1C([H])([H])C(O[H])(C(=O)C([H])([H])C([H])([H])C([H])([H])C([H])([H])[H])C([H])([H])C"| __truncated__ "[H]O[C@@]12C([H])([H])C([H])([H])[C@]1(C([H])([H])[H])[C@@]1([H])C([H])([H])C(C([H])([H])[H])(C([H])([H])[H])C("| __truncated__ "[H]N([H])C([H])([H])C([H])([H])N1C([H])([H])C([H])([H])C([H])([H])C([H])([H])C1([H])[H]" "[H]c1c([H])c([H])c2c(sc3c(C([H])([H])[H])c4sc5c([H])c([H])c([H])c([H])c5n([H])c4nc3n2[H])c1[H]" ...
## $ GPCR_act: num 0.868 0.315 0.524 0.736 0.422 ...
## $ MR : num 144.9 58.3 39.3 101.1 82.7 ...
## $ ATOM : int 40 15 9 23 20 15 10 24 24 23 ...
## $ MW : num 576 210 128 335 298 ...
## $ ALOGP : num 2.612 1.945 0.431 5.906 2.473 ...
## $ HBA : num 10 2 2 1 3 2 1 3 4 2 ...
## $ HBD : num 5 2 1 2 2 1 1 1 0 0 ...
## $ PSA : num 176.6 40.5 29.3 44.5 43.7 ...
## $ ROTB : num 6 0 2 0 0 1 4 4 6 6 ...
## $ AROM : num 3 0 0 5 1 3 0 2 1 3 ...
## $ ALERTS : num 1 0 0 1 0 0 1 1 2 0 ...
## $ QED : num 0.274 0.641 0.584 0.34 0.773 ...
thermal <- colorRampPalette(c("black","blue", "#007FFF", "cyan",
"#7FFF7F", "yellow", "#FF7F00", "red",
"white"))
##################################################
# FUNCTION: MW_dplot -> Molecular weight Density Plot
# Makes nice density plots of the specified molecular property vs GPCR Activity
# input: df->source data frame; x(str)->col to use for x value; y(str)-> col to use for y values; palette(str)->name of palette to be used (must be in viridus palette); xlimit(int/double)->x limit for the graph; the rest of the input are strings specifying graph labels
# output: p -> prints the plot
#-------------------------------------------------
MW_dplot <- function(df=small_data,
x='MW',
y='GPCR_act',
palette="magma",
xlimit=NULL,
xmin=NULL,
xlab='Molecular Weight',
ylab='GPCR Activity Score',
title=NULL,
subtitle=NULL,
cap=NULL,
legend='Density') {
p <- ggplot(data=df,mapping=aes_string(x=x,y=y))+
stat_density_2d(geom = "raster", aes(fill = stat(density)), contour = FALSE)+
scale_fill_viridis_c(option=palette)+
theme_classic()+
labs(x=xlab,y=ylab,fill=legend)
if(!is.null(xlimit)&!is.null(xmin)){
p <- p+xlim(xmin,xlimit)
} else if(!is.null(xlimit)){
p <- p+xlim(0,xlimit)
}
if(!is.null(title)){
p <- p+labs(title=title)
}
if(!is.null(subtitle)){
p <- p+labs(subtitle=subtitle)
}
if(!is.null(cap)){
p <- p+labs(caption=cap)
}
return(p)
}
mw_inferno <- MW_dplot(x='MW',palette = 'inferno',xlimit=700,ylab=NULL,title='Inferno')
mw_magma <- MW_dplot(x='ALOGP',palette = 'magma',xlimit=10,xmin=-5,xlab='LogP',ylab=NULL,title='Magma')
mw_plasma <- MW_dplot(x='MR',palette = 'plasma',xlimit=200,xlab='Molar Refractivity',ylab=NULL,title='Plasma')
mw_thermal <- MW_dplot(x='PSA',xlimit=200,xlab='Polar Surface Area',ylab=NULL,title='Thermal (custom)')+
scale_fill_gradientn(colors=thermal(6))
figure <- ggarrange(mw_inferno,mw_magma,mw_plasma,mw_thermal,ncol=2,nrow=2)
figure <- annotate_figure(figure, left = textGrob("GPCR Activity", rot = 90, vjust = 1, gp = gpar(cex = 1.5)))
figure
# Ran the code once with the larger dataset, then saved it
# figure %>% ggexport(filename = "images/hw12_all_data_4feature_plot.png")
After many hours, that is a figure I can proudly display. The last line I commented out creates a pdf of the plot that would be ready for publication. That pdf is generated from the full dataset (n=88,878), but for this R markdown I’m running the smaller dataset (n=10,000) to save some computing time. It looks pretty, but all in all I don’t think this would be the most accurate visualization of these data. The intense color scales tend to obscure the sample size of the data and over-emphasize density; the density function and intense colors also mask outliers (from looking at a scatter plot, I know there were a few very heavy chemicals that don’t show up well in these plots). I’ve also been told that the density function uses a gaussian kernel, and although I don’t totally understand what that means, I know that we can’t assume these data are gaussian.
All in all though, I do find these plots satisfying. I’ll say that between all these weedy ggplot-based packages and Python’s matplotlib (which my lab tends to use for visualizations), making nice graphs, let alone figures, can easily become a headache. All the more reason this was a useful exercise.