Big Data, Small Data or Bad Data?

Evaluate how good your data are before you start using them.

The use of machine learning algorithms is exploding. On technical news aggregators, where people share articles about software development and methods, you can see an article on the subject popping up nearly every day. The developments are impressive, but as someone put it very well:

Good CS expert says: Most firms that think they want advanced AI/ML really just need linear regression on cleaned-up data.

As long as the scope of application is well defined, you can build models with linear regression that predict nearly everything. But the real issue is that you need to work on "cleaned-up data".

If you speak with the people from NIST, they will tell you that 10 to 25% of the data in the chemical databanks are wrong. So, if you look up information about methanol, you will most likely run into some wrong data. In fact, a quick analysis of the data already shows that 35.30 ± 0.04 kJ/mol for the standard enthalpy of vaporization is most likely wrong, as it is more than two standard deviations away from the property mean.

So, if you have enough data, you have statistical tools to clean it; just take the time to do it. The simplest tool is the deviation analysis just mentioned. In Python with numpy, you can simply do something like:

import numpy as np

# Reported values for the property, the last one being the suspicious point;
# the ellipsis stands for the remaining values of the series.
raw = np.asarray([37.83, 37.80, 38.00, ..., 35.30])
mean = np.mean(raw)
std2 = np.std(raw) * 2.0
# Element-wise OR of the two comparisons: True where a value lies more
# than two standard deviations away from the mean.
np.any((raw > (mean+std2), raw < (mean-std2)), axis=0)

You get a nice array telling you which points are outliers:

array([False, False, False, False, False, False, False, False, False,
       False, False,  True], dtype=bool)

Sadly, sometimes you do not have enough redundancy in your data to run such a simple analysis. For example, you may have only a single water solubility data point per compound. What you can do then is use your domain knowledge to directly model and fit your data.

This is what I did for a water solubility model. Using my domain knowledge, I built a model, and a clear inconsistency in the results allowed me to pinpoint the errors in the data (an inconsistency in units, like the one that crashed the Mars Climate Orbiter; it is still easy to make embarrassing mistakes with units today).
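
To make this concrete, here is a minimal sketch of that kind of residual check, with invented numbers: the straight line of log solubility against molar mass is only a stand-in for the real model, and the data are hypothetical.

import numpy as np

# Hypothetical data: one log10(solubility) value per compound against a
# simple descriptor (molar mass in g/mol). The seventh value sits three
# log units too high, mimicking a mg/L versus g/L unit error (factor 1000).
molar_mass = np.array([32., 46., 60., 74., 88., 102., 116., 130., 144., 158.])
log_sol = np.array([0.56, 0.28, 0.0, -0.28, -0.56, -0.84, 1.88, -1.40, -1.68, -1.96])

# Fit a simple trend, then look at the residuals: a point several log units
# away from the model smells like a unit or sign error, not noise.
slope, intercept = np.polyfit(molar_mass, log_sol, 1)
residuals = log_sol - (slope * molar_mass + intercept)
suspects = np.abs(residuals) > 2.0 * np.std(residuals)
print(np.nonzero(suspects)[0])   # -> [6], the value entered in the wrong unit

A residual of three log units is far too large to be model error; a factor of 1000 immediately suggests a unit mix-up.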

In the water solubility example, the error was easy to spot, but most of the time errors are not. The sources of errors can be:

  • a unit mismatch on a very limited number of points;
  • a minus sign lost in a conversion;
  • an error at the experimental level;
  • an error in transferring the data from the experiment to the lab book;
  • an error in converting the data from whatever format to the next one;
  • an error in collecting data from unstructured sources.

For Cheméo, I collected gigabytes of data, both structured and unstructured. What is really funny is that even officially structured data can be totally messy and must often be treated as semi-structured.

At the end of the parsing, I have a set of records, and the big question is: how good are my records? Of course I used unit tests, but unit testing a lot of cases is not the same as running the parsers on millions of HTML pages, mol files and crappy Excel files.

First rule: you need to accept that your data will contain errors.

But because you did your job the right way, with unit tests and a robust parser, each time you look at a record and compare it with the source, you do not find any particular error. You are doing zero-defect sampling.

Suppose you want to be 95% sure that your parser works for at least 99.9% of the cases, that is, a maximum defect rate of 0.1%. How many randomly sampled parsed documents should you check?

You simply use the rule of 3, which states that $p = 3/n$, with $p$ the maximum defect rate (0.001) and $n$ the number of tests to run, which leads to 3000 tests.
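
The 3 is not magic: if the true defect rate were $p$, the probability of seeing zero defects in $n$ random samples would be $(1-p)^n$, and requiring this to be at most 5% gives $n \ge \ln(0.05)/\ln(1-p) \approx 3/p$ for small $p$.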

You can quickly run the calculation for different levels of success (still at 95% confidence):

[round(3/(1-x)) for x in [0.90, 0.95, 0.99, 0.995, 0.999]]

which gives:

[30.0, 60.0, 300.0, 600.0, 3000.0]

So, in practice, a sample of about 100 records is probably a good idea: with zero defects found, the rule of 3 gives you 95% confidence that the defect rate is below 3%. You can also sample by source, that is, you sample within the set of records coming from a given source, like the crawl of a website or a big ugly Excel file.
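
As a minimal sketch of per-source sampling, assuming the parsed records are dicts carrying a 'source' field (the field name and helper are just for illustration):

import random
from collections import defaultdict

def sample_by_source(records, n_per_source=100, seed=42):
    """Draw up to n_per_source random records from each source for manual review."""
    by_source = defaultdict(list)
    for record in records:
        by_source[record['source']].append(record)
    rng = random.Random(seed)
    return {source: rng.sample(recs, min(n_per_source, len(recs)))
            for source, recs in by_source.items()}

Each sampled record is then checked by hand against its original HTML page, mol file or Excel sheet.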

Of course, I used the example of a chemical database here, but this critical approach applies to all your data.

For example, National Geographic had a very nice presentation of commute data, and at the end, the journalist wondered whether the map could be improved:

The map is clearly a work in progress, and some areas still don’t look right—the division of the New York City tri-state area into two megaregions, for example. And I’m not convinced the Bay Area-Sacramento megaregion where I live should extend all the way to Nevada.

But one could also ask the question: do we have errors in the census data which are artificially linking points together? This could simply come from a geocoding routine making errors for some areas around New York or Reno.

Easy Conclusion

Have a critical look at the data you generate. At the scale of your project, it will cost you nearly nothing in time, but it can save your day, or your spacecraft.

Header cut from a photo by Xession, CC BY 3.0.
