Modelling Water Solubility - Part 1

Finding the right descriptors for a property prediction model.

To test some models for Cheméo Studio, I need a relatively good Water solubility prediction model. Water solubility is one of the properties that are relatively hard both to measure and to predict. Some good researchers get an $r^2$ of only 0.79 for a Water solubility model.

The Water solubility is usually given as $\log S = \log_{10}(S)$, with $S$ the solubility of the compound in Water in mol/l. An acceptable error in the prediction is $0.5\ \log S$, which is the error in the underlying data used to build the model.
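To make the definition concrete, here is a minimal Python sketch of the conversion from a solubility in mol/l to logS, together with a check against the 0.5 tolerance. The function names `log_s` and `within_tolerance` are made up for this illustration and are not part of any tool mentioned in this post.

```python
import math


def log_s(solubility_mol_per_l: float) -> float:
    """Return logS = log10(S) for a solubility S given in mol/l."""
    if solubility_mol_per_l <= 0:
        raise ValueError("solubility must be strictly positive")
    return math.log10(solubility_mol_per_l)


def within_tolerance(predicted: float, measured: float, tol: float = 0.5) -> bool:
    """True when a predicted logS is within `tol` log units of the measured one."""
    return abs(predicted - measured) <= tol


# Hypothetical values, only to show the calls.
measured = log_s(0.01)  # S = 0.01 mol/l
print(measured, within_tolerance(-2.3, measured))
```

Because the scale is logarithmic, an error of 0.5 in logS corresponds to a factor of about 3.2 in $S$ itself.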

The first step in building a model is collecting data. In my case, I have a nice dataset of about 4000 compounds, with some duplicate measurements. It is a nice dataset because it has already been used to build some water solubility models. Another bulk dataset of 35000+ compounds is also available, but it has been shown that it was not really possible to do anything with that data so, for the moment, it is not used.

Using a small tool, I can generate 200+ descriptors for the molecules and take a look at them (as partially shown in the picture above). For example, here is the correlation between the molecular weight and the Water solubility:

What is really interesting in the correlation between the molecular weight and the Water solubility is that the molecules clearly fall above and below the trend. That is, at a given molecular weight, you can split the compounds into two groups, the more soluble and the less soluble. So, we can cluster the molecules into two groups: the more soluble ones satisfy $\log S > -0.0125 \times mw + 2.5$, and all the others are the less soluble ones.
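The split can be written down in a few lines. The sketch below uses the slope and intercept of the inequality above; the compound identifiers and values are hypothetical and only illustrate the call.

```python
SLOPE = -0.0125    # slope of the dividing line, from the inequality above
INTERCEPT = 2.5    # intercept of the dividing line


def boundary_log_s(mw: float) -> float:
    """logS value of the dividing line at molecular weight `mw`."""
    return SLOPE * mw + INTERCEPT


def cluster(mw: float, log_s: float) -> str:
    """Label a compound as 'more soluble' or 'less soluble'."""
    return "more soluble" if log_s > boundary_log_s(mw) else "less soluble"


# Hypothetical compounds: (identifier, molecular weight, logS).
compounds = [
    ("cmpd-1", 160.0, 1.2),
    ("cmpd-2", 160.0, -2.0),
    ("cmpd-3", 200.0, -3.5),
]
for name, mw, value in compounds:
    print(name, cluster(mw, value))
```

At a molecular weight of 200 the dividing line sits at about logS = 0, so in the 150 to 200 range a compound has to be fairly soluble to land in the upper group.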

By clustering the compounds, we can analyse which characteristics are unique to each cluster, and this can help develop the right model.
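One simple way to look for such characteristics, sketched here under my own assumptions, is to compare the mean value of each descriptor between the two clusters and rank the descriptors by the size of the gap. The descriptor names `n_oh` and `n_aromatic_rings` and the records are hypothetical, not output of the descriptor tool.

```python
from statistics import mean


def mean_difference_by_cluster(records, descriptor_names):
    """Rank descriptors by |mean(more soluble) - mean(less soluble)|.

    `records` is a list of dicts with a "cluster" key ("more" or "less")
    and one numeric entry per descriptor name.
    """
    more = [r for r in records if r["cluster"] == "more"]
    less = [r for r in records if r["cluster"] == "less"]
    if not more or not less:
        raise ValueError("both clusters need at least one compound")
    diffs = {
        name: mean(r[name] for r in more) - mean(r[name] for r in less)
        for name in descriptor_names
    }
    # Largest absolute gap first.
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical records, for illustration only.
records = [
    {"cluster": "more", "n_oh": 2, "n_aromatic_rings": 0},
    {"cluster": "more", "n_oh": 1, "n_aromatic_rings": 0},
    {"cluster": "less", "n_oh": 0, "n_aromatic_rings": 2},
    {"cluster": "less", "n_oh": 0, "n_aromatic_rings": 1},
]
for name, gap in mean_difference_by_cluster(records, ["n_oh", "n_aromatic_rings"]):
    print(name, gap)
```

With 200+ descriptors on very different scales, the raw gaps are not directly comparable, so the values would probably need to be standardised first.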

Here is a selection from the less soluble compounds with a molecular weight between 150 and 200:

and from the more soluble ones:

Polarity is definitely playing a big role here, but the separation is not clear-cut. What is sure is that we need the OH and =O fragments, but we also need the amines, because we know that amino-acids are insoluble.

The real question is simply why a compound should be more or less soluble than the average of the compounds at the same molecular weight. Just from this binary information, more or less soluble, one gets a good trend on the solubility, as shown in this cluster comparison (OK, we are still far away from a $0.5\ \log S$ error).
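One possible way to put a number on how much the binary label carries, and not necessarily how the comparison above was produced, is to predict each compound's logS by the mean of its cluster and measure the root mean square error against the 0.5 target. The samples below are hypothetical.

```python
import math
from statistics import mean


def cluster_mean_baseline(samples):
    """Predict logS by the mean of each cluster and return (means, rmse).

    `samples` is a list of (cluster_label, log_s) pairs. The error is
    computed on the same data used for the means, so it is optimistic.
    """
    by_cluster = {}
    for label, value in samples:
        by_cluster.setdefault(label, []).append(value)
    means = {label: mean(values) for label, values in by_cluster.items()}
    squared_errors = [(value - means[label]) ** 2 for label, value in samples]
    return means, math.sqrt(mean(squared_errors))


# Hypothetical samples, for illustration only.
samples = [("more", -1.0), ("more", -2.0), ("less", -4.0), ("less", -6.0)]
means, rmse = cluster_mean_baseline(samples)
print(means, round(rmse, 3))
```

Such a baseline ignores the molecular weight inside each cluster, so it only gives a rough idea of what the label alone explains.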

This first dive into the data shows that it will not be easy, but at least we have gained an empirical feeling for the property and the available data. This is important before taking a closer look at what has been done in the field.

Read the second part of the series, which includes a literature review of the predictive models for Water solubility.
