Finally, the exploration of the data led us to discover that our first dataset was inconsistent. After cleaning the dataset and using 1305 data points as the training set and 74 as the validation set, the main results are:

- **Adjusted $R^2$**: 0.906
- **MSE on training set**: 0.444
- **MSE on test set**: 0.468
- **Number of parameters**: 18
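The split itself is simple; here is a minimal sketch with placeholder records standing in for the real cleaned (descriptors, logS) pairs, using the sizes quoted above:

```python
import random

random.seed(42)
# Placeholder records standing in for the cleaned (descriptor_vector, logS) pairs.
dataset = [(i, float(i % 7)) for i in range(1379)]  # 1379 = 1305 + 74

random.shuffle(dataset)  # randomize before splitting
train, validation = dataset[:1305], dataset[1305:]
print(len(train), len(validation))
```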

The methodology to find the best 18 parameters was a guided search with a genetic algorithm to select the significant parameters and exclude the non-significant ones. The best model was kept based on the MSE on the test set.
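A minimal sketch of that kind of genetic-algorithm feature selection, assuming the descriptors form the columns of a NumPy array. The synthetic data, population size, and mutation rate below are illustrative assumptions, not the actual settings used:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_mse(X_tr, y_tr, X_te, y_te, mask):
    """OLS fit on the masked descriptor columns; return held-out MSE."""
    if not mask.any():
        return np.inf
    A = np.column_stack([np.ones(len(X_tr)), X_tr[:, mask]])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    B = np.column_stack([np.ones(len(X_te)), X_te[:, mask]])
    return float(np.mean((B @ coef - y_te) ** 2))

def ga_select(X_tr, y_tr, X_te, y_te, pop=30, gens=40, p_mut=0.05):
    n = X_tr.shape[1]
    population = rng.random((pop, n)) < 0.3  # random bitmasks over descriptors
    for _ in range(gens):
        scores = np.array([fit_mse(X_tr, y_tr, X_te, y_te, m) for m in population])
        parents = population[np.argsort(scores)[: pop // 2]]  # keep fittest half
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)
            child = np.concatenate([a[:cut], b[cut:]])  # one-point crossover
            child ^= rng.random(n) < p_mut              # bit-flip mutation
            children.append(child)
        population = np.vstack([parents, children])
    scores = np.array([fit_mse(X_tr, y_tr, X_te, y_te, m) for m in population])
    best = population[np.argmin(scores)]
    return best, float(scores.min())

# Synthetic example: only the first 5 of 20 descriptors carry signal.
X = rng.normal(size=(400, 20))
y = X[:, :5] @ np.array([1.0, -0.7, 0.5, 0.3, -0.4]) + rng.normal(0, 0.1, 400)
mask, mse = ga_select(X[:300], y[:300], X[300:], y[300:])
print(int(mask.sum()), round(mse, 3))
```

Note that selecting on the test-set MSE, as done here and in the text, makes that set part of the model selection, which is exactly the limitation discussed in the conclusion.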

This is a good result: we do as well as the Marrero-Gani method, which uses 370 parameters, more than 20 times the number of parameters we are using.

What is important is that our MSE is about 0.5 logS units, both on the training set and on the test set. This is at the same level as the experimental error.

Here are the statistics of the model:

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                log10SW   R-squared:                       0.908
Model:                            OLS   Adj. R-squared:                  0.906
Method:                 Least Squares   F-statistic:                     702.9
Date:                Tue, 07 Jun 2016   Prob (F-statistic):               0.00
Time:                        15:38:23   Log-Likelihood:                -1312.6
No. Observations:                1305   AIC:                             2663.
Df Residuals:                    1286   BIC:                             2762.
Df Model:                          18
=======================================================================================
                          coef    std err          t      P>|t|    [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
const                  -0.4329      0.074     -5.822      0.000      -0.579    -0.287
Chi1                   -0.6769      0.057    -11.833      0.000      -0.789    -0.565
Chi2n                  -0.2260      0.057     -3.959      0.000      -0.338    -0.114
Chi4n                   0.3246      0.048      6.752      0.000       0.230     0.419
EState_VSA10            0.0153      0.006      2.721      0.007       0.004     0.026
EState_VSA2             0.0142      0.003      4.707      0.000       0.008     0.020
EState_VSA3             0.0062      0.003      1.929      0.054      -0.000     0.012
EState_VSA5            -0.0137      0.002     -7.482      0.000      -0.017    -0.010
MinAbsPartialCharge    -1.3721      0.267     -5.138      0.000      -1.896    -0.848
MinPartialCharge       -3.4735      0.181    -19.209      0.000      -3.828    -3.119
NOCount                 0.0676      0.031      2.158      0.031       0.006     0.129
NumRotatableBonds      -0.0415      0.015     -2.700      0.007      -0.072    -0.011
SMR_VSA7               -0.0074      0.004     -1.752      0.080      -0.016     0.001
SlogP_VSA1              0.0361      0.008      4.590      0.000       0.021     0.051
SlogP_VSA12            -0.0279      0.002    -14.202      0.000      -0.032    -0.024
SlogP_VSA2              0.0551      0.003     17.793      0.000       0.049     0.061
SlogP_VSA3              0.0379      0.005      7.566      0.000       0.028     0.048
SlogP_VSA6              0.0190      0.004      4.952      0.000       0.011     0.026
SlogP_VSA8             -0.0316      0.004     -7.348      0.000      -0.040    -0.023
==============================================================================
Omnibus:                       60.595   Durbin-Watson:                   1.679
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              190.451
Skew:                           0.087   Prob(JB):                     4.41e-42
Kurtosis:                       4.863   Cond. No.                         715.
==============================================================================
```

The interesting point for me is that we have both the absolute and non-absolute minimum partial charge as parameters, and they are both significant and not correlated (the condition number would be higher if there were collinearity between them). The coefficients do not span a very large range, which makes the model relatively well balanced.
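That collinearity argument can be checked numerically via the pairwise correlation and the design-matrix condition number. The toy columns below are only stand-ins for the real MinAbsPartialCharge and MinPartialCharge descriptors:

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-ins for the two partial-charge descriptors, drawn independently.
min_abs_pc = rng.uniform(0.0, 0.5, 300)   # hypothetical MinAbsPartialCharge
min_pc = rng.uniform(-0.8, -0.1, 300)     # hypothetical MinPartialCharge
X = np.column_stack([np.ones(300), min_abs_pc, min_pc])

corr = float(np.corrcoef(min_abs_pc, min_pc)[0, 1])  # near 0 => uncorrelated
cond = float(np.linalg.cond(X))                      # blows up under collinearity
print(round(corr, 2), round(cond, 1))
```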

And the parity plot, in green the training set and in blue the test set. The test set is independent of the training set.
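A plot like that can be produced with matplotlib along these lines; the synthetic values below stand in for the real experimental and predicted logS, and the colors follow the description above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for experimental logS and model predictions.
y_train = rng.uniform(-10, 1, 200)
y_test = rng.uniform(-10, 1, 40)
pred_train = y_train + rng.normal(0, 0.66, y_train.size)
pred_test = y_test + rng.normal(0, 0.68, y_test.size)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_train, pred_train, c="green", s=10, label="training set")
ax.scatter(y_test, pred_test, c="blue", s=10, label="test set")
lims = [-11, 2]
ax.plot(lims, lims, "k--", lw=1)  # perfect-prediction diagonal
ax.set_xlabel("experimental logS")
ax.set_ylabel("predicted logS")
ax.legend()
fig.savefig("parity_plot.png", dpi=150)
```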

The next question is whether we can test the model against the dataset in mg/l that I excluded when generating it. I helped the model a bit by regressing it with the previous test data included. This results in a training set of 1267 compounds (I improved the duplicate detection). The test set is composed of 1755 compounds; it can overlap with the training set, but at least 500 of its compounds are not in the training set.

The fraction of compounds predicted within a 0.5 logS unit error is 51%, and the fraction within a 1.0 logS unit error is 79%. The good news is that the MSE on the training set is 0.44, the same as before; on the test set it is 0.82, which is not as good. But one needs to be realistic: nobody leaves a third of the measurements on the table to generate a model.
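These error-band statistics are straightforward to compute; a small sketch with toy values in place of the real predictions:

```python
import numpy as np

def error_stats(y_true, y_pred):
    """Fraction of predictions within 0.5 and 1.0 logS units, plus MSE."""
    err = np.abs(np.asarray(y_pred) - np.asarray(y_true))
    return {
        "within_0.5": float(np.mean(err <= 0.5)),
        "within_1.0": float(np.mean(err <= 1.0)),
        "mse": float(np.mean(err ** 2)),
    }

# Toy experimental and predicted logS values for illustration only.
y_true = np.array([-3.2, -1.0, -5.4, 0.2])
y_pred = np.array([-3.0, -1.8, -5.5, 1.4])
stats = error_stats(y_true, y_pred)
print(stats)
```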

In the case of the new test dataset, many compounds have a high solubility, whereas most of the training dataset consists of compounds with low solubility. So we are operating the model partially in a different region of chemical space than the one used to perform the regression. This is always a bad thing.
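One quick way to quantify such a mismatch is the overlap between the training and test logS histograms; the toy distributions below only mimic the shift described above, they are not the real datasets:

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy logS distributions: training skewed low, test shifted toward high solubility.
logS_train = rng.normal(-4.0, 1.5, 1267)
logS_test = rng.normal(-2.0, 2.0, 1755)

bins = np.linspace(-10, 2, 13)  # 12 bins of 1 logS unit each
h_train, _ = np.histogram(logS_train, bins=bins, density=True)
h_test, _ = np.histogram(logS_test, bins=bins, density=True)
# Histogram intersection: 1.0 means identical coverage, low values mean
# the model is asked to extrapolate outside its training space.
overlap = float(np.sum(np.minimum(h_train, h_test)) * np.diff(bins)[0])
print(round(overlap, 2))
```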

**To conclude**: it is possible to generate a new water solubility
model based on easy-to-calculate molecular descriptors. One of the
issues in the current development is the use of a fixed test set to
select the parameters. A better approach would be to sample many test
sets for each set of parameters and ensure that the resulting variance
of the parameters is low and the test set MSE does not change. But
selecting the right test sample is hard, which is why I relied on a
known test set for this model.
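The resampling check described above could look like this: refit the same descriptor subset on many random train/test splits and verify that the coefficients and the test MSE are stable. Synthetic data replaces the real descriptor matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 6
X = rng.normal(size=(n, p))
true_coef = np.array([1.0, -0.5, 0.3, 0.0, 0.8, -0.2])
y = X @ true_coef + rng.normal(0, 0.2, n)

coefs, mses = [], []
for _ in range(50):  # 50 random train/test splits of the same data
    idx = rng.permutation(n)
    tr, te = idx[:450], idx[450:]
    A = np.column_stack([np.ones(len(tr)), X[tr]])
    c, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
    B = np.column_stack([np.ones(len(te)), X[te]])
    coefs.append(c)
    mses.append(np.mean((B @ c - y[te]) ** 2))

coef_std = np.std(coefs, axis=0)       # low values => stable parameters
mse_spread = float(np.std(mses))       # low value => MSE not split-dependent
print(coef_std.round(3), round(float(np.mean(mses)), 3), round(mse_spread, 3))
```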

The next step is to include this model in the prediction tool of Cheméo and to provide the datasets I used, so that you can create your own model or validate this one.

**Update**: I was not really happy with the results for the
experimental logS above 0, that is, for the compounds soluble at more than 1
mol/l. I tried to generate a model just for the compounds with a logS
above -1.0. The results are pretty bad, with 28 compounds in the test
set and 162 in the training set. I am not sure whether this comes from the
data, for example people missing the minus sign when writing down the
values, or from something else. This is something to explore.

**Update 2**: The new model is available; it is not exactly the
same as the one described here. I will upload the datasets used to
train/test and for the final training, together with the model parameters.