Please, No SQL!

For a change, this post is not about ML but about data storage and access. In the dark ages, people were forced to store their data in bloated formats like XML, or in some binary format that is hard to read and to parse. Text formats were always popular, but some of them are neither lightweight nor really human-readable, and XML is a good example. It is a very ambitious format, but with all its overhead it often does not fit, and some ways of processing it, DOM parsing for instance, are very memory-hungry.

Thus, we tried to find something that is lightweight but still human-readable, to help with debugging, and that can, in the worst case, be adjusted by hand. JSON is such a format. Its idea is to provide a compact format with a simple encoding for exchanging data between applications. And yes, we had a strong bias, since the integration into our favorite language, Python of course, is seamless.

Now for the practical part. Let’s say we trained a neural network and we want to store it on disk. Sure, we could pickle it, but that is not as transparent as a JSON file. The only thing we have to do is convert the numpy arrays into lists. The format is free-style, but some description won’t hurt: {"weights": [0.1, 0.2, 0.3, 0.4], "bias_hidden": 3.4, "bias_visible": -0.1}. This little snippet is actually valid JSON and allows us to transfer the model over the web or to use it in some other application.
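To make the conversion concrete, here is a minimal sketch of the round trip. It assumes the model is nothing more than the three fields from the snippet above; the variable names are ours, chosen for illustration, and a real network would of course carry more parameters.

```python
import json

import numpy as np

# Hypothetical model parameters, mirroring the snippet in the text.
weights = np.array([0.1, 0.2, 0.3, 0.4])

model = {
    "weights": weights.tolist(),  # numpy array -> plain Python list
    "bias_hidden": 3.4,
    "bias_visible": -0.1,
}

# Serialize to a JSON string; json.dump(model, fp) would write to a file instead.
encoded = json.dumps(model)

# Parse the string again and turn the list back into a numpy array.
decoded = json.loads(encoded)
restored = np.asarray(decoded["weights"])
```

The call to tolist() is the important part: the json module does not know how to encode numpy arrays, so passing the array directly would raise a TypeError.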

This post is not about the best format on Earth, but about one that makes our life much easier. With the seamless integration into Python and the available data types, we don’t have to care much about how to store objects in JSON. All we need are lists and dictionaries to store simple as well as complex objects, like a movie: {"title": "Doom", "features": [0.19, 0.31, 0.99, …], "genre": "sci-fi-horror", "persons": ["A", "B", …]}.

Of course, without a real back-end, JSON does not make much sense for us. But in combination with a NoSQL database, we can read and write JSON documents and use them directly like Python objects. Furthermore, the flexibility of such databases allows us to toy with new features, either in a new collection or by just updating documents with a new field. Combined with the fact that such DBs are schema-less, working with the data is painless.
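The schema-less idea can be sketched without any database at all. The toy collection below is only a stand-in: insert and find are hypothetical helpers we made up for illustration, not the API of any real NoSQL product, and the second movie and the "watched" field are invented example data.

```python
import json

# A toy stand-in for a document collection: just a list of dicts.
collection = []


def insert(doc):
    # Round-trip through JSON so that only JSON-compatible data is stored.
    collection.append(json.loads(json.dumps(doc)))


def find(**criteria):
    # Return all documents whose fields match every given key/value pair.
    # Documents that lack a field simply do not match; nothing breaks.
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


insert({"title": "Doom", "genre": "sci-fi-horror"})
insert({"title": "Example Movie", "genre": "sci-fi-horror"})

# Toy with a new feature: add a field to one document only.
doom = find(title="Doom")[0]
doom["watched"] = True

same_genre = find(genre="sci-fi-horror")
watched = find(watched=True)
```

Both movies are still found by genre, while only the updated document matches the new field. No migration step was needed for the document that lacks it.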

Bottom line: if we do not have to model relations between the data, why should we bother with a relational, table-based storage model? And if we are mostly working with free-style text, schemata do not make much sense.

