This note is part of a series on Water solubility. You should read the first part before going further.
From the first exploration of the data, we found that building a good Water solubility prediction model is going to be hard.
A good review of the models for Water solubility is the 2006 paper by Dearden: "In silico prediction of aqueous solubility". Paraphrasing the explanation of equation (8) in the article, the most important contributions are the molecular size (not the weight, but the two are highly correlated) and the hydrogen bond acceptor ability. The first reduces the solubility and the latter aids it, because Water is a good Hydrogen bond donor. But Hydrogen bonding can increase or decrease the solubility depending on the 3D structure of the compound: the bonding can either strengthen the crystal cohesion or the bonding with Water. This is the link that brings the melting point into the property-based equations.
What it means is that you do not really have a linear model, because two descriptors present together can result in a molecule being less soluble. This is the amino acid effect. It shows well in the comparison of the different descriptors for the less and more soluble molecules: no single descriptor gives a clear answer as to whether a molecule belongs to the more soluble or less soluble set. In the best case, you get a different trend line, as for the maximum partial charge, but this is not sufficient to discriminate the molecules.
This is also why a good model, the one from Abraham and Le, incorporates a non-linear term. This result is really important: the standard group contribution methods are usually linear, and this shows that we cannot achieve good results with a linear model.
The end results of the different models are implemented in software, and the review paper from Dearden provides a nice table running predictions with different software packages against the dataset of Rytting. What is really nice is that the test set was published in 2005 and the review paper in 2006, so the probability that the software vendors used the results of the test set to regress their parameters is relatively low. The test set contains 122 compounds. The results are sorted by the percentage of compounds with a predicted solubility error within 1.0 logS unit, a bit larger than our target of $0.5\ logS$ unit. If you look at the $0.5\ logS$ unit error, the best software is only at 72% and the average is no better than a coin flip.
| Software | % within ±0.5 log unit | % within ±1.0 log unit | $r^2$ | s |
|---|---|---|---|---|
| Pharma Algorithms ADME Boxes | 59.0 | 86.9 | 0.74 | 0.62 |
| Cerius 2 ADME | 37.7 | 72.9 | 0.61 | 1.02 |
I suppose that s is the standard deviation, but I am not sure.
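To make the table's metrics concrete, here is a small sketch of how such a benchmark could be computed from paired experimental and predicted logS values. The sample numbers are made up for illustration (they are not the Rytting test set), and treating s as the standard deviation of the residuals is my assumption, since the review does not define it.

```python
import math

def solubility_metrics(experimental, predicted, tolerance=0.5):
    """Benchmark metrics: % within tolerance, r^2, and residual std dev."""
    errors = [p - e for e, p in zip(experimental, predicted)]
    n = len(errors)
    # Percentage of compounds predicted within +/- tolerance logS units
    within = 100.0 * sum(abs(err) <= tolerance for err in errors) / n
    # s taken as the sample standard deviation of the residuals (assumption)
    mean_err = sum(errors) / n
    s = math.sqrt(sum((err - mean_err) ** 2 for err in errors) / (n - 1))
    # r^2 of predicted vs. experimental values
    mean_e = sum(experimental) / n
    ss_tot = sum((e - mean_e) ** 2 for e in experimental)
    ss_res = sum(err ** 2 for err in errors)
    r2 = 1.0 - ss_res / ss_tot
    return within, r2, s

# Made-up logS values, just to exercise the function
exp_logs = [-1.2, -3.5, -2.0, -4.8, -0.5]
pred_logs = [-1.0, -3.1, -2.6, -4.9, -1.4]
within, r2, s = solubility_metrics(exp_logs, pred_logs)
```

Sorting models by `within` at a 1.0 logS tolerance reproduces the ordering used in the review's table.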
From the current implementations, we now have a good benchmark for the model if we want to develop one. A good idea is also to use the same test set of molecules, making sure not to include them in the regression, to check whether our model can intrinsically be better than the ones listed here.
From the current models, we know that we can use a linear model, but it will not be the best. Maybe a neural network approach could work, or we need to carefully select the non-linear parts of the model, like in the work by Abraham.
Update: To show the issue with a linear model, I created a small model using my dataset of 3733 compounds which can be described with the Joback and Reid descriptors; it has an RMSE of 2.7. The model is useless, but the parity plot (predicted vs. experimental logS) shows that it cannot express the more soluble/less soluble effect at a given molecular weight. What is really striking is the clear cut: not a single good prediction, and two clusters above and below the parity line. The absence of any good prediction is most likely due to issues in the dataset, but even if the dataset is good, getting a better fit with a linear model most likely means you are simply overfitting with more parameters without describing the physics behind the solubility.
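The group-contribution idea behind that small model can be sketched as follows: logS is a weighted sum of group counts, fitted by ordinary least squares. The two groups and all the numbers here are hypothetical (not the actual Joback and Reid set); the point is only that such a model is linear by construction, so it cannot separate two compounds with identical group counts but different 3D arrangements.

```python
def fit_two_groups(counts, logs):
    """Ordinary least squares for logS = w1*n1 + w2*n2 via the 2x2 normal equations."""
    a11 = sum(c[0] * c[0] for c in counts)
    a12 = sum(c[0] * c[1] for c in counts)
    a22 = sum(c[1] * c[1] for c in counts)
    b1 = sum(c[0] * y for c, y in zip(counts, logs))
    b2 = sum(c[1] * y for c, y in zip(counts, logs))
    # Solve (X^T X) w = X^T y by Cramer's rule for the 2x2 system
    det = a11 * a22 - a12 * a12
    w1 = (a22 * b1 - a12 * b2) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2

# Made-up group counts (say, -CH3 and -OH per molecule) and logS values
counts = [(3, 0), (2, 1), (1, 2), (4, 1), (2, 3)]
logs = [-3.1, -1.8, -0.4, -2.9, -0.2]
w1, w2 = fit_two_groups(counts, logs)
predicted = [w1 * c0 + w2 * c1 for c0, c1 in counts]
```

With these toy values the fit recovers the expected signs: the hydrophobic group gets a negative contribution and the hydrogen-bonding group a positive one.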
Update 2: It looks like my nice dataset is in fact composed of two independent datasets. This is really strange, because in each subset I have experimental data coming from two reliable sources with a close collaboration, namely Syngenta and DTU. So I would expect DTU to have included the data from Syngenta in their databank. I am going to explore a bit more and check the difference, maybe a unit problem or similar.
Indeed, the data in one dataset was in mg/l instead of mol/l. Not the best unit, because the solubility correlates with the volume of the molecule more than with its mass, but this explains the results I had. Nice to see that the modelling allowed me to work backwards and find an error in the dataset.
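The unit fix itself is a one-liner, but worth spelling out because the conversion from mg/l to logS in mol/l is compound-specific (it needs the molar mass), which is exactly why mixing the two units produces a systematic, mass-dependent offset between the subsets:

```python
import math

def logS_from_mg_per_l(s_mg_per_l, molar_mass_g_per_mol):
    """Convert a solubility in mg/l to logS in mol/l: mg/l -> g/l -> mol/l, then log10."""
    s_mol_per_l = (s_mg_per_l / 1000.0) / molar_mass_g_per_mol
    return math.log10(s_mol_per_l)

# Example: 180 mg/l of a compound with molar mass 180 g/mol is 1e-3 mol/l,
# i.e. logS = -3
```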
You can read part 3, a relatively good water solubility model using connectivity indices.