Libraries

library(ggplot2)
library(RColorBrewer)
library(viridisLite)
library(paletteer)
library(ggpubr)
library(grid)

Read in Data

This data set contains 10 molecular properties from a random sample of 88,878 chemicals from the PubChem database. First column SMILES contains a unique smile string for each chemical. There is also a column with G-protein coupled receptor (GPCR) activity scores and one with qualitative estimates of drug-likeness (QED).

For ease of computing, I’m going to take a random sample of 10,000 rows from this data frame.

all_data <- read.csv("C:/Users/nbeck/Desktop/Spring 2022/CS 287/Project/CS287-Project/data/PubChemData.csv")

small_data <- all_data[sample(1:nrow(all_data),10000),]
str(small_data)

## 'data.frame':    10000 obs. of  13 variables:
##  $ SMILES  : chr  "[H]Cl.[H]Oc1c2c(c(O[H])c3c1C([H])([H])C(O[H])(C(=O)C([H])([H])C([H])([H])C([H])([H])C([H])([H])[H])C([H])([H])C"| __truncated__ "[H]O[C@@]12C([H])([H])C([H])([H])[C@]1(C([H])([H])[H])[C@@]1([H])C([H])([H])C(C([H])([H])[H])(C([H])([H])[H])C("| __truncated__ "[H]N([H])C([H])([H])C([H])([H])N1C([H])([H])C([H])([H])C([H])([H])C([H])([H])C1([H])[H]" "[H]c1c([H])c([H])c2c(sc3c(C([H])([H])[H])c4sc5c([H])c([H])c([H])c([H])c5n([H])c4nc3n2[H])c1[H]" ...
##  $ GPCR_act: num  0.868 0.315 0.524 0.736 0.422 ...
##  $ MR      : num  144.9 58.3 39.3 101.1 82.7 ...
##  $ ATOM    : int  40 15 9 23 20 15 10 24 24 23 ...
##  $ MW      : num  576 210 128 335 298 ...
##  $ ALOGP   : num  2.612 1.945 0.431 5.906 2.473 ...
##  $ HBA     : num  10 2 2 1 3 2 1 3 4 2 ...
##  $ HBD     : num  5 2 1 2 2 1 1 1 0 0 ...
##  $ PSA     : num  176.6 40.5 29.3 44.5 43.7 ...
##  $ ROTB    : num  6 0 2 0 0 1 4 4 6 6 ...
##  $ AROM    : num  3 0 0 5 1 3 0 2 1 3 ...
##  $ ALERTS  : num  1 0 0 1 0 0 1 1 2 0 ...
##  $ QED     : num  0.274 0.641 0.584 0.34 0.773 ...

How does molecular weight relate to GPCR activity scores?

thermal <- colorRampPalette(c("black","blue", "#007FFF", "cyan",
                                 "#7FFF7F", "yellow", "#FF7F00", "red", 
                                 "white"))
  
##################################################
# FUNCTION: MW_dplot -> Molecular weight Density Plot
# Makes nice density plots of the specified molecular property vs GPCR Activity
# input: df->source data frame; x(str)->col to use for x value; y(str)-> col to use for y values; palette(str)->name of palette to be used (must be in viridus palette); xlimit(int/double)->x limit for the graph; the rest of the input are strings specifying graph labels
# output: p -> prints the plot
#------------------------------------------------- 
MW_dplot <- function(df=small_data,
                           x='MW',
                           y='GPCR_act',
                           palette="magma",
                           xlimit=NULL,
                           xmin=NULL,
                           xlab='Molecular Weight',
                           ylab='GPCR Activity Score',
                           title=NULL,
                           subtitle=NULL,
                           cap=NULL,
                           legend='Density') {
  p <- ggplot(data=df,mapping=aes_string(x=x,y=y))+
  stat_density_2d(geom = "raster", aes(fill = stat(density)), contour = FALSE)+
  scale_fill_viridis_c(option=palette)+
  theme_classic()+
  labs(x=xlab,y=ylab,fill=legend)
  if(!is.null(xlimit)&!is.null(xmin)){
    p <- p+xlim(xmin,xlimit)
  } else if(!is.null(xlimit)){
    p <- p+xlim(0,xlimit)
  }
  if(!is.null(title)){
    p <- p+labs(title=title)
  }
  if(!is.null(subtitle)){
    p <- p+labs(subtitle=subtitle)
  }
  if(!is.null(cap)){
    p <- p+labs(caption=cap)
  }
 return(p)
}

mw_inferno <- MW_dplot(x='MW',palette = 'inferno',xlimit=700,ylab=NULL,title='Inferno')

mw_magma <- MW_dplot(x='ALOGP',palette = 'magma',xlimit=10,xmin=-5,xlab='LogP',ylab=NULL,title='Magma')

mw_plasma <- MW_dplot(x='MR',palette = 'plasma',xlimit=200,xlab='Molar Refractivity',ylab=NULL,title='Plasma')

mw_thermal <- MW_dplot(x='PSA',xlimit=200,xlab='Polar Surface Area',ylab=NULL,title='Thermal (custom)')+
  scale_fill_gradientn(colors=thermal(6))

figure <- ggarrange(mw_inferno,mw_magma,mw_plasma,mw_thermal,ncol=2,nrow=2)
figure <- annotate_figure(figure, left = textGrob("GPCR Activity", rot = 90, vjust = 1, gp = gpar(cex = 1.5)))
figure

# Ran the code once with the larger dataset, then saved it
# figure %>% ggexport(filename = "images/hw12_all_data_4feature_plot.png")

After many hours, that is a figure I can proudly display. The last line I commented out creates a pdf of the plot that would be ready for publication. That pdf is generated from the full dataset (n=88,878), but for this R markdown I’m running the smaller dataset (n=10,000) to save some computing time. It looks pretty, but all in all I don’t think this would be the most accurate visualization of these data. The intense color scales tend to obscure the sample size of the data and over-emphasize density; the density function and intense colors also mask outliers (from looking at a scatter plot, I know there were a few very heavy chemicals that don’t show up well in these plots). I’ve also been told that the density function uses a gaussian kernel, and although I don’t totally understand what that means, I know that we can’t assume these data are gaussian.

All in all though, I do find these plots satisfying. I’ll say that between all these weedy ggplot-based packages and Python’s matplotlib (which my lab tends to use for visualizations), making nice graphs, let alone figures, can easily become a headache. All the more reason this was a useful exercise.

Homework_12

Noah Beckage

4/13/2022

Libraries

Read in Data

How does molecular weight relate to GPCR activity scores?