High Quality Chemical Properties

Technology Behind Cheméo


Introduction

To run Cheméo, you basically need the same kind of approach you would use to run a search engine. Cheméo has been built from the end user's point of view, with scientific correctness being the only non-negotiable value.

The tasks performed by Cheméo are:

How is this done? It is very simple: find the best tool to perform each task and use it. A quick summary would be:

But the details are even more interesting.

Data Retrieval

Data retrieval is performed by a series of scripts written mainly in Python. The two main components are the mechanize and pyparsley modules.

The data are retrieved and stored on disk; the parsing is kept minimal, only enough to feed the crawler with new URLs. The real parsing of the data is performed independently. At the moment, the raw volume of data is around 10GB. This is not a lot, and it saves me the need to set up a smart way to store the data.

A time saver was the pyparsley module. For example, I can extract links having some special comments with a command like:

linkparser = PyParsley({
  "links(a)": [{
            "title": "contains(., 'Individual data points')",
            "url": "@href",
              }]
   })
out = linkparser.parse(string=content)

The only problem with pyparsley is that it leaks memory. This forces a restart of the agent based on memory consumption. So, it is great for a proof of concept, but in the long run I may switch to hand-tuned regular expressions as the complexity of the pages is low.

It can take days or weeks to retrieve the data, as you want to play nice with the server in front of you. You need to keep a history of your crawling to restart after a crash or a code update. Bonus: put your email in your user agent string, so the admins of the servers will be able to contact you if you do something bad for them (even if you tried to be nice).
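As a minimal sketch of these three habits (a crawl history that survives restarts, a pause between requests, and a contact address in the user agent), here is one way to do it with the Python standard library only. The class and function names, the JSON history file and the 5 second delay are assumptions made for illustration; this is not the actual Cheméo crawler, which is built on mechanize.

```python
import json
import os
import time
import urllib.request


class CrawlHistory:
    """Persist the set of already fetched URLs so a crawl can resume."""

    def __init__(self, path):
        self.path = path
        self.done = set()
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as handle:
                self.done = set(json.load(handle))

    def mark_done(self, url):
        self.done.add(url)
        # Write to a temporary file first so a crash cannot corrupt the history.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            json.dump(sorted(self.done), handle)
        os.replace(tmp_path, self.path)

    def pending(self, urls):
        """Return the URLs that have not been fetched yet, in order."""
        return [url for url in urls if url not in self.done]


def build_request(url, contact_email):
    """Build a request whose user agent tells the admins how to reach you."""
    user_agent = "example-crawler (contact: %s)" % contact_email
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def crawl(urls, history, contact_email, save, delay_seconds=5.0):
    """Fetch each pending URL, pausing between requests to stay polite.

    `save` is a hypothetical callback storing the raw page on disk.
    """
    for url in history.pending(urls):
        request = build_request(url, contact_email)
        with urllib.request.urlopen(request, timeout=30) as response:
            save(url, response.read())
        history.mark_done(url)
        time.sleep(delay_seconds)
```

After a crash or a code update, creating a new CrawlHistory on the same file and calling crawl() again skips everything already retrieved.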

Data Integration And Merging

Again, Python is the king for that game, especially because of the insane number of bindings in the scientific field.

If the data is a structured database, it is extracted using the right tool (SQL, SD file reader, whatever); if it is a webpage, the regular expression module is used to clean the file before feeding it to pyparsley (memory leaks included...). At the end, for each chemical component, a Python dictionary of data is created.

The merging is made with Open Babel. Open Babel is a chemical toolbox designed to speak the many languages of chemical data. It of course has nice Python bindings, and these are used to normalize the chemical representation of each molecule (SMILES, InChI) and to search the database for a matching component.

The merging uses the chemical structure of the components as much as possible. This is the best way to avoid typo spread, where typos from one database propagate into another one.
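To make the idea concrete, here is a small sketch of merging on a normalized structure key rather than on names. The record layout and the function names are assumptions for illustration. The canonicalization step is passed in as a function: in the real system that role is played by Open Babel (canonical SMILES or InChI), while the toy lookup table below only stands in for it so that the example runs without any chemistry library.

```python
def merge_by_structure(records, canonical_key):
    """Group source records by canonical structure key and merge their data.

    records: iterable of dicts with "name", "structure" and "properties"
             (a dict mapping a property key to a list of values).
    canonical_key: function turning a structure string into a unique key.
    """
    merged = {}
    for record in records:
        key = canonical_key(record["structure"])
        component = merged.setdefault(key, {"names": set(), "properties": {}})
        # Names are kept as aliases only; they never decide the match.
        component["names"].add(record["name"])
        for prop, values in record["properties"].items():
            component["properties"].setdefault(prop, []).extend(values)
    return merged


# Toy stand-in for a real canonicalizer: three ways of writing ethanol.
TOY_CANONICAL = {"OCC": "CCO", "C(O)C": "CCO", "CCO": "CCO"}


def toy_canonical_key(structure):
    return TOY_CANONICAL.get(structure, structure)


records = [
    {"name": "Ethanol", "structure": "CCO", "properties": {"tb": [351.4]}},
    # A typo in the name does not matter: the structure decides the match.
    {"name": "Etanol", "structure": "OCC", "properties": {"tb": [351.5]}},
]
components = merge_by_structure(records, toy_canonical_key)
```

Both records end up in the same component even though the names differ, which is the point of matching on structure.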

Side note: MongoDB makes it very easy to create a server side function with access to the documents in the database. When a component is not found in the database, this is used to create a new component with a corresponding id. Here is the piece of code generating the id:

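import pymongo

db = pymongo.Connection().yourdatabase
# Counter document holding the last numeric id handed out.
db.bk.insert({"_id": "last_cid", "last_id": 10000})
# Save a server side JavaScript function returning the next component id.
# The triple quotes let the JavaScript source span several lines.
db.eval('''db.system.js.save( { _id : "next_cid", value: function() {
     // Increment the counter and read it in one atomic operation.
     var incdoc = db.bk.findAndModify({query:{_id: "last_cid"}, update:{$inc: {last_id:1}}});
     var elts = incdoc.last_id.toString().split("");
     var sum = 0;  var i = 0;  var check = 0;
     for (i=0; i<elts.length; i++) {
         sum = sum + parseInt(elts[i], 10);
     }
     // The check digit brings the digit sum up to a multiple of 10.
     check = 10 - (sum % 10);
     if (check == 10) { check = 0; }
     return elts.slice(0, elts.length-3).join("")
           +"-"+elts.slice(elts.length-3).join("")+"-"+check.toString();
}});''')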

This is a naïve implementation of the POSTNET check digit algorithm added to a counter. The really good point is that getting a new id is atomic, as it locks the db. This is not a performance problem, as a new id is not created that often. To get a new id, just call:

db = pymongo.Connection().yourdatabase
compid = db.eval("return next_cid()")
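For readers who want to check an id on the client side, the formatting done by the stored function can be mirrored in plain Python. This is only a sketch of the same arithmetic: the function names are made up for this example, and it assumes the counter has at least four digits, as it does when starting from 10000.

```python
def postnet_check_digit(number):
    """Digit that brings the sum of the digits up to a multiple of 10."""
    digit_sum = sum(int(digit) for digit in str(number))
    return (10 - digit_sum % 10) % 10


def format_cid(counter):
    """Format a counter value as <leading digits>-<last 3 digits>-<check>."""
    digits = str(counter)
    return "%s-%s-%d" % (digits[:-3], digits[-3:], postnet_check_digit(counter))


def is_valid_cid(cid):
    """Check that the trailing digit matches the rest of the id."""
    parts = cid.split("-")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return False
    head, tail, check = parts
    return len(check) == 1 and postnet_check_digit(int(head + tail)) == int(check)


print(format_cid(10000))         # 10-000-9
print(is_valid_cid("10-000-8"))  # False
```

The check digit catches any single mistyped digit, which is handy when ids are copied by hand.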

Indexing For High Speed Search

This is simple:

MongoDB is a very actively developed but also very young database system. It does not have the years of fine tuning of PostgreSQL behind it, and it is a document store. This means that the indexing constraints are not the same, as it is indexing complex documents and not simple tables. If you push all your data in bulk into MongoDB, add a couple of indexes and expect it to work nicely, you may be very disappointed. Let me illustrate that with the Cheméo dataset.

When you search Cheméo, you can look for a given component (easy full text search index) or search for a range of components. For example, components with a boiling point below 400K, a critical temperature above 500K and a standard enthalpy of formation between 50 and 500kJ/mol. At the moment, it finds 32 components out of 100,000 in 20 ms.

The thing is, at the moment, for each component, the system is storing 70 properties, but each property can in fact be a set of values. If you look at benzene in the results, the enthalpy of formation is between 79.9 and 82.98 kJ/mol. It is clear that you cannot index that without a bit of thinking, especially because a lot of components have nearly no data while some others have a lot. So, if you think in terms of matrices, this is a very sparse 3-dimensional matrix of data.

The solution was to build for each component document a special index key with this format:

{i: [
  {k: 'myprop',
   n: 10.1, // Min value
   x: 123.0}, // Max value
  {k: 'otherprop',
   n: -1234.1,
   x: 254.0},
]}
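Building this key from the stored values is a simple reduction to a minimum and a maximum per property. A minimal sketch, assuming the properties of a component are available as a dictionary mapping a property key to its list of values (that layout and the function name are assumptions; the "tb" key and its value are illustrative):

```python
def build_index_key(properties):
    """Reduce each property to its min and max for the range index.

    properties: dict mapping a property key to a list of numeric values.
    Properties without any value are skipped, which keeps the key sparse.
    """
    entries = []
    for key in sorted(properties):
        values = properties[key]
        if not values:
            continue
        entries.append({"k": key, "n": min(values), "x": max(values)})
    return {"i": entries}


benzene = {"hf": [79.9, 82.98], "tb": [353.2], "tc": []}
print(build_index_key(benzene))
```

A property with a single value simply gets the same number for n and x, so range queries treat both cases the same way.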

Then you need to think about the query. It will always need to know about the key and then min or max or both. So we need two indexes:

This means that now, when looking for a component with the $all and $elemMatch operators, you will always hit the indexes, yeah! But then, someone will run a search that translates to something like this:

{ i: { $all: [ { $elemMatch: { k: "mw", n: { $lte: 400.0 } } },
               { $elemMatch: { k: "tc", x: { $gte: 500.0 } } },
               { $elemMatch: { k: "hf", n: { $lte: 500.0 }, x: { $gte: 50.0 } } }
] } }

And your server will fall over, because mw is the molecular weight and Mongo will take the first hit in the $all query and then do a standard scan for the other properties without using the index. In that case, even if only 50 components match the hf value, if mw matches 50,000 components, Mongo will scan 50,000 components. Oops, the wrong part of the index is used. You need to know your data to order your query in the best way and correctly hit your index. This I still need to implement in Cheméo.
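One possible way to do this ordering, sketched here under assumptions and not taken from the Cheméo code, is to keep a rough count of how many components carry each property (or match a typical range) and to sort the $elemMatch clauses so that the most selective one comes first. The function name and the match_counts table are hypothetical; the counts for mw and hf reuse the numbers above, the one for tc is invented.

```python
def build_range_query(criteria, match_counts):
    """Build the $all query with the most selective clause first.

    criteria: list of (key, low, high) tuples; low or high may be None.
    match_counts: estimated number of matching components per key.
    Keys without an estimate are treated as the least selective.
    """
    def clause(key, low, high):
        condition = {"k": key}
        if high is not None:
            condition["n"] = {"$lte": high}  # stored minimum below the upper bound
        if low is not None:
            condition["x"] = {"$gte": low}   # stored maximum above the lower bound
        return {"$elemMatch": condition}

    ordered = sorted(criteria,
                     key=lambda item: match_counts.get(item[0], float("inf")))
    return {"i": {"$all": [clause(*item) for item in ordered]}}


criteria = [("mw", None, 400.0), ("tc", 500.0, None), ("hf", 50.0, 500.0)]
match_counts = {"mw": 50000, "tc": 3000, "hf": 50}
query = build_range_query(criteria, match_counts)
```

With these estimates the hf clause comes first, so the index narrows the candidates to about 50 components before the remaining clauses are checked.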

As said, the web interface runs on PHP with Pluf, an extremely optimized and very Django-like framework.

Why PHP? As you have understood, one language or another is not really my problem, my question is more, can it match my requirements? In that case, PHP has the fastest library to access MongoDB and as the bulk of the work is done by MongoDB, I tried to have the thinnest layer between the users and the database. The main search view is composed of 50 lines of code without the template, simple and clean.

Source Tracking For Reference

This one is easy: each property stored is a small dict, and in it I have "source" and "id" keys to keep track of the source. I also have a collection holding, for each source and id, the corresponding unique id in the component collection and the last update time. This is where the document store format shines: it is really easy to keep track of a lot of attributes.

Data Analysis For QSPR and QSAR Models

Learn R, really. R is the tool for data analysis, it will change your life if you manage large sets of data. And again, you have a simple and efficient access to R from Python with rpy2. So, when you request a regression, the following is done:

  1. the webapplication in PHP defines the job and stores it in Mongo;
  2. the webapplication pings Node.js running as job coordinator;
  3. Node.js starts a Python script to perform the calculations with R;
  4. the Python script stores the results back into Mongo;
  5. the visitor will know that the job completed through Ajax polling.

One could ask: why all these components, when doing everything in Python would mean only one language and none of these messages? Simple: you cannot get everything into a monolithic system.

First, the Python/R bindings are not thread safe. This basically excludes running them within a Python webserver, as these are all thread based. Second, R loads your complete dataset in memory to operate on it (it is possible to use PostgreSQL with numpy to avoid that, but it is not worth the trouble in our case). This can eat up to 200MB of RAM for one run. Imagine that one thread in your webserver is claiming 200MB, then goes back to normal, and that again and again. You end up with a nice memory fragmentation at the Python VM level (if you are lucky and if the bindings are not leaking).

So, this means that R must be run in batch mode. But batch mode does not mean waiting too long either. So, the Node.js controller is used to directly launch the job via a simple in-memory queue system. The workflow of the controller is simple. You GET the URL of a job, it pushes it in the queue, which is a simple array, and runs the job with a spawn call. The called script is doing all the work; Node.js is just pushing the work.
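The controller logic is small enough to sketch in a few lines. The version below is written in Python rather than Node.js to stay in the language used for the rest of the processing, and it is an illustration only: the class name is made up, the HTTP part is left out, and subprocess.run blocks until the job ends, whereas a Node.js spawn call returns immediately.

```python
import subprocess
import sys
from collections import deque


class JobController:
    """In-memory queue that runs each job as a separate process."""

    def __init__(self, command_prefix):
        # Command used to start a worker; the job id is appended to it.
        self.command_prefix = list(command_prefix)
        self.queue = deque()

    def push(self, job_id):
        self.queue.append(job_id)

    def run_next(self):
        """Run the oldest queued job; return (job_id, exit code, output)."""
        if not self.queue:
            return None
        job_id = self.queue.popleft()
        completed = subprocess.run(
            self.command_prefix + [job_id],
            capture_output=True,
            text=True,
        )
        return job_id, completed.returncode, completed.stdout.strip()


# Hypothetical worker: a one-line script standing in for the R calculation.
worker = [sys.executable, "-c", "import sys; print('ran ' + sys.argv[1])"]
controller = JobController(worker)
controller.push("42")
result = controller.run_next()
```

Because each job runs in its own process, the memory claimed by R is returned to the system when the process exits, which is the reason for running in batch mode in the first place.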

A simple GET call in PHP, which does not care about the response body, can be triggered this way:

function triggerJob($server, $job, $params)
{
    $pctx = array('http' => array(
                  'method' => 'GET',
                  'user_agent' => 'Mazout (http://www.chemeo.com)',
                  'max_redirects' => 0, 
                  'timeout' => 1, // better pushing in the queue
                                  // than making a page too slow.
                                    )
                    );
    $url = $server.'/'.$job.'/'
        .implode('/', $params).'/';
    $ctx = stream_context_create($pctx);
    $fp = @fopen($url, 'rb', false, $ctx);
    if (!$fp) {
        return false; // This can be a 401 error.
    }
    $meta = stream_get_meta_data($fp);
    @fclose($fp);
    if (!isset($meta['wrapper_data'][0]) or $meta['timed_out']) {
        return false;
    }
    if (0 === strpos($meta['wrapper_data'][0], 'HTTP/1.1 2') or 
        0 === strpos($meta['wrapper_data'][0], 'HTTP/1.1 3')) {
        return true;
    }
    return false;
}

What I do is let it fail fast (maximum 1 second for an AJAX call); when it fails, the job request is pushed into a queue which is checked every minute by a cron job. This way, one can restart or stop the controller without affecting the service too much.

Management Of The Code And Deployment

We manage all the code and documentation with Indefero and the deployment is made using Fabric.

The deployment targets are a set of OpenVZ VMs running on dedicated servers. As soon as one needs more than 4 VMs, this is way cheaper than Amazon EC2 or any cloud computing offer, while having better predictability in terms of performance. Also, most of the active users of the service are located in Europe and Amazon EC2 is in Ireland.
The problem with Ireland is that, while it is fiscally very interesting to run a datacenter there (and this is why Amazon has set up the European zone there), it is a bad decision with respect to the speed of light and the interconnection with the main peering points in Europe (Paris, Amsterdam, Frankfurt). So our servers are located a bit north of Paris, and our provider has an incredibly good private network in Europe (one line is 10Gbps) and is peering with nearly everybody, so that we have users in Finland telling us that Cheméo is faster than their local websites.

Checklist For Large Systems

Or said another way: each time I followed another way I lost both my time and my way.

Conclusion

Every day I am learning something new, every day the Cheméo users are pushing us to go ahead, every day the community at Hacker News is providing me with new sparks of good ideas to do what I do better and faster. I am a researcher, but I found that what I am good at is not having new breakthrough ideas, but being able to bring together a lot of different concepts from different horizons under a new roof.

Do you want to do the same? First, yes you can whatever your field is, second, be here for the long run, be here now to still be here in 10 years. Of course you can go for a quick win, but in the scientific world, people are marathon runners, you build today for tomorrow and the day after tomorrow.

We will need the chemical properties of components today, tomorrow and in 50 years from now. You need to find information related to your work today and in the future.

Yes, this conclusion is not really technical, so to go back to the subject: all this work would not have been possible without the thousands of hours of work by passionate people creating the free software at the backend of this system. Thank you!

PS: Do not hesitate to contact me (loic @ ceondo.com) if you have questions about this infrastructure.
