Data, Data Everywhere…

For a project which is essentially about storing data, we’ve not actually done that much talking about it. This may seem sensible to some — after all, everybody knows what data is, don’t they?

It turns out that what people define as ‘data’ is a hugely wide ranging topic (you can find a myriad of research on how different people define it), and what we’re trying to do is basically trying to fit mis-shapen data into a one-size-fits-nothing storage system. Allow me to elaborate.

First of all we had to look at what data was currently available to us. Fortunately we have some awesome project partners in the School of Engineering who provided us with some of what they’re researching on, and thus presented the first problem: The data doesn’t exist in any kind of standardised format. We’ve got to content with flat text database formats, weird (often invalid) XML, Excel spreadsheets, CSV files (again often invalid), folders of images or audio files, proprietary binary formats, non-binary flat files which nonetheless need parsing to be made understandable, plain strings of data, and the occasional random file format which even the source of the data can’t explain.

The solution to this problem is fairly simple in principle, yet complex in practice. First of all when it comes to archive storage of files (ie without any pre-processing) Orbital is designed to be file type agnostic — if you give it a random stream of bytes and say it’s a file then a Orbital will duly store the file as provided, with no further work needed. It doesn’t care if your XML file has no DTD and has unclosed tags, since it doesn’t do any work inside the stream. You will later be able to retrieve the file exactly as it was first loaded into the system without any changes or alterations. It’s worth pointing out, however, this does mean that if Orbital is given a corrupt file to store then it will do so blindly without any attempt at validation.

Continue reading “Data, Data Everywhere…”