Modelling Water Solubility - Part 4

The final water solubility model.

This is the last part of the series on modelling water solubility. Read the other parts to get a sense of the process used to develop the model, or simply predict the water solubility with the online tool.

Finally, I spent a bit more time analysing the descriptors and their significance, and tried to reduce the number of parameters in the model as much as possible to limit overfitting. The final model, with only 16 descriptors plus the constant, is regressed on 1547 observations. The final $R^2$ of 0.838 is maybe not that wonderful, but nearly all the parameters are highly significant (only SlogP_VSA8, with a p-value of 0.080, misses the 5% level), and the model showed a very good MSE of 0.47 on the testing set. Note that the regression shown here uses the testing and training sets combined.
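To put the test-set MSE in perspective, it can be converted to an RMSE and then to a multiplicative error on the solubility itself. The short calculation below is only a sketch: it assumes the MSE of 0.47 is expressed in squared log10 units, which is what a log10 dependent variable would imply. The variable names are made up for the illustration.

```python
import math

# Test-set MSE quoted in the text; assumed to be in (log10 units)^2.
mse_test = 0.47

# Typical prediction error in log10 units.
rmse_test = math.sqrt(mse_test)

# The same error expressed as a factor on the solubility itself.
fold_error = 10 ** rmse_test

print(f"RMSE: {rmse_test:.2f} log10 units")
print(f"Typical fold error: about {fold_error:.1f}x")
```

Under that assumption, the model is typically off by roughly 0.69 log10 units, i.e. a factor of a bit less than 5 on the solubility.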

                            OLS Regression Results
==============================================================================
Dep. Variable:                log10SW   R-squared:                       0.838
Model:                            OLS   Adj. R-squared:                  0.837
Method:                 Least Squares   F-statistic:                     495.6
Date:                Mon, 20 Jun 2016   Prob (F-statistic):               0.00
Time:                        20:58:41   Log-Likelihood:                -1912.8
No. Observations:                1547   AIC:                             3860.
Df Residuals:                    1530   BIC:                             3951.
Df Model:                          16
Covariance Type:            nonrobust
=======================================================================================
                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
const                  -0.9119      0.079    -11.579      0.000        -1.066    -0.757
Chi1v                  -0.2530      0.038     -6.736      0.000        -0.327    -0.179
EState_VSA5            -0.0109      0.002     -6.091      0.000        -0.014    -0.007
Kappa1                 -0.2296      0.021    -10.762      0.000        -0.271    -0.188
MaxAbsPartialCharge     2.9939      0.233     12.876      0.000         2.538     3.450
MaxEStateIndex          0.0750      0.012      6.497      0.000         0.052     0.098
MinAbsPartialCharge    -2.5713      0.303     -8.496      0.000        -3.165    -1.978
NHOHCount               0.1119      0.021      5.360      0.000         0.071     0.153
PEOE_VSA8               0.0116      0.003      4.047      0.000         0.006     0.017
SlogP_VSA2              0.0505      0.002     21.350      0.000         0.046     0.055
SlogP_VSA3              0.0550      0.005     11.817      0.000         0.046     0.064
SlogP_VSA6              0.0088      0.002      3.928      0.000         0.004     0.013
SlogP_VSA8             -0.0084      0.005     -1.750      0.080        -0.018     0.001
VSA_EState10           -0.0443      0.004    -11.307      0.000        -0.052    -0.037
VSA_EState9            -0.0090      0.003     -2.976      0.003        -0.015    -0.003
fr_benzene             -0.5469      0.053    -10.253      0.000        -0.652    -0.442
fr_bicyclic            -0.0815      0.025     -3.222      0.001        -0.131    -0.032
==============================================================================
Omnibus:                       98.867   Durbin-Watson:                   1.649
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              320.577
Skew:                           0.259   Prob(JB):                     2.44e-70
Kurtosis:                       5.169   Cond. No.                         747.
==============================================================================
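
Since the model is a plain linear regression, a prediction is just the constant plus the sum of each coefficient times its descriptor value. Here is a minimal sketch of that calculation, with the coefficients copied from the summary above. The function name `predict_log10_sw` and the dictionary layout are illustrative choices, not the code behind the online tool. The descriptor names look like those exposed by RDKit's `Descriptors` module, but that is an assumption; the sketch only expects the values to be supplied in a dictionary.

```python
# Coefficients copied from the OLS summary above.
COEFFICIENTS = {
    "const": -0.9119,
    "Chi1v": -0.2530,
    "EState_VSA5": -0.0109,
    "Kappa1": -0.2296,
    "MaxAbsPartialCharge": 2.9939,
    "MaxEStateIndex": 0.0750,
    "MinAbsPartialCharge": -2.5713,
    "NHOHCount": 0.1119,
    "PEOE_VSA8": 0.0116,
    "SlogP_VSA2": 0.0505,
    "SlogP_VSA3": 0.0550,
    "SlogP_VSA6": 0.0088,
    "SlogP_VSA8": -0.0084,
    "VSA_EState10": -0.0443,
    "VSA_EState9": -0.0090,
    "fr_benzene": -0.5469,
    "fr_bicyclic": -0.0815,
}


def predict_log10_sw(descriptors):
    """Return the predicted log10SW from a dict of descriptor values.

    `descriptors` maps each descriptor name (every key of COEFFICIENTS
    except "const") to its numeric value for one molecule.
    """
    names = [name for name in COEFFICIENTS if name != "const"]
    missing = [name for name in names if name not in descriptors]
    if missing:
        raise KeyError(f"missing descriptor values: {missing}")
    # Linear model: intercept + sum(coefficient * descriptor value).
    return COEFFICIENTS["const"] + sum(
        COEFFICIENTS[name] * descriptors[name] for name in names
    )


# Hypothetical input: all descriptors set to zero except one benzene ring.
# Real values would come from a descriptor calculator, not from this example.
example = {name: 0.0 for name in COEFFICIENTS if name != "const"}
example["fr_benzene"] = 1.0
print(predict_log10_sw(example))
```

The example input is not a real molecule; it only shows that each benzene ring shifts the prediction by the `fr_benzene` coefficient relative to the constant.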

The datasets are available for download:

The sources of the data are:

  • Jarmo Huuskonen, J. Chem. Inf. Comput. Sci., 2000, 40, 773-777 (entries marked as train.smi, test1.smi and test2.smi).
  • J. Chem. Inf. Comput. Sci., 2004, 44 (3), pp 1000–1005, DOI: 10.1021/ci034243x (entries marked as ci034243xsi20040112_053635.txt).
  • Rytting et al. Aqueous and cosolvent solubility data for drug-like organic compounds (entries marked as rytting2005.smi).