Data, Data Everywhere…

For a project which is essentially about storing data, we haven't actually talked about the data itself all that much. This may seem sensible to some — after all, everybody knows what data is, don't they?

It turns out that what people mean by 'data' varies hugely (you can find a myriad of research on how different people define it), and what we're trying to do is essentially fit mis-shapen data into a one-size-fits-nothing storage system. Allow me to elaborate.

First of all we had to look at what data was currently available to us. Fortunately we have some awesome project partners in the School of Engineering who provided us with some of the data they're researching, and this presented the first problem: the data doesn't exist in any kind of standardised format. We've got to contend with flat text database formats, weird (often invalid) XML, Excel spreadsheets, CSV files (again, often invalid), folders of images or audio files, proprietary binary formats, non-binary flat files which nonetheless need parsing to be made understandable, plain strings of data, and the occasional random file format which even the source of the data can't explain.

The solution to this problem is fairly simple in principle, yet complex in practice. First of all, when it comes to archival storage of files (ie without any pre-processing), Orbital is designed to be file-type agnostic — if you give it a random stream of bytes and say it's a file then Orbital will duly store the file as provided, with no further work needed. It doesn't care if your XML file has no DTD and has unclosed tags, since it doesn't do any work inside the stream. You will later be able to retrieve the file exactly as it was first loaded into the system, without any changes or alterations. It's worth pointing out, however, that this does mean that if Orbital is given a corrupt file to store then it will do so blindly, without any attempt at validation.
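To make that concrete, here's a minimal sketch of what type-agnostic archival storage boils down to. The function name and the content-addressing scheme are my own invention for illustration, not Orbital's actual API:

```python
import hashlib
from pathlib import Path

def archive_file(raw_bytes: bytes, archive_root: Path) -> str:
    """Store a file exactly as received, with no parsing or validation.

    Because we never look inside the stream, a corrupt or invalid file
    is stored (and later retrieved) byte-for-byte unchanged.
    """
    archive_root.mkdir(parents=True, exist_ok=True)
    # Content-address the file so retrieval returns exactly what was stored
    digest = hashlib.sha256(raw_bytes).hexdigest()
    (archive_root / digest).write_bytes(raw_bytes)
    return digest  # the caller keeps this key for later retrieval
```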

The complex bit comes second, where we want to turn those weird file types into something usable by our 'smarties not tubes' storage approach. Orbital is built around the ability to programmatically access both archival storage and the working data store, meaning that we can write custom interpreters for every single file type (or filesystem structure) we come across which extract the file from storage, perform any necessary interpretation, and dump the resultant data into the working store. Where a file format is understood to belong to a standard type (such as JSON or standard CSV) we can also provide standardised interpreters to make everybody's life easier.
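As a rough illustration (the interfaces here are hypothetical; Orbital's real plugin API will differ), a standardised CSV interpreter might be little more than this:

```python
import csv
import io

def interpret_csv(raw_bytes: bytes, working_store: list) -> int:
    """Pull a file out of archival storage, parse it as CSV, and dump
    the resulting rows into the working data store."""
    # Tolerate dodgy encodings rather than rejecting the whole file
    text = raw_bytes.decode("utf-8", errors="replace")
    rows = 0
    for row in csv.DictReader(io.StringIO(text)):
        working_store.append(row)  # a plain list stands in for the real store
        rows += 1
    return rows
```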

All this is well and good, but it does pose some interesting problems which need further work — should these plugins be restricted to system-level, institution-approved ones, or should they be provided via an extensible scripting interface which allows any researcher to write interpreters ad hoc? My personal tendency at this point is towards the former (although this may change), given that academics smart enough to write an interpreter can probably interface directly with the Core APIs to load data instead.

By this point you might be thinking "yay, data!", but you'd be wrong. We've taken our mis-shapen data and done just enough work to understand it and throw it into the data store, but the data still has no meaning attached – it could very well be (and often is) an Excel spreadsheet labelled "Mar12_SystemAnalysis_final_proc.xls" which contains nothing more than a column of identifiers and another column or two of values with no headings. This is the second problem we bump into — although we have data, we don't actually know what it is. Sometimes not even the source of the data can explain what they're looking at. Often the best we can hope for is somebody saying "oh yeah, the third column is the rotational velocity of the outer flange inhibitor measured in radians per minute". The trick here is to capture this obscure knowledge, and we're considering doing so using what some may consider quite a harsh method.

Whenever a new data set is uploaded to Orbital it will begin to prompt the user to complete the necessary metadata, explaining exactly what each field in the data set is, including (where necessary) things like units. We can also start to ask for things like provenance information for the entire set, licences and more. We're doing a lot of work on making this process as painless as possible through clever UI design, integrated workflows, passive feedback, and even going as far as using various machine intelligence techniques to preemptively suggest metadata.
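To give a flavour of what gets captured (the structure and field names here are my own sketch, not a finalised schema), the metadata for the flange-inhibitor example above might end up looking something like this:

```python
# Hypothetical metadata for one uploaded data set; illustrative only,
# not Orbital's finalised schema.
dataset_metadata = {
    "title": "System analysis, March 2012",
    "provenance": "Provided by the School of Engineering",
    "licence": "CC-BY",
    "columns": [
        {
            "index": 3,
            "name": "outer_flange_inhibitor_velocity",
            "description": "Rotational velocity of the outer flange inhibitor",
            "units": "radians per minute",
        },
    ],
}
```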

Fortunately for both the researchers working on the data and those looking at it downstream, we're going to make Orbital produce nice, sane data at the point of consumption. This means that any data which is output from a data store (as opposed to a file store) will be neatly formatted in line with the requested standards, packed full of all the metadata we can muster, and ready to be used. Take the resultant data elsewhere and the above issues will have (mostly, we hope) disappeared.
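In sketch form (again with hypothetical names), a data-store export boils down to bundling the rows together with everything we know about them into one self-describing package:

```python
import json

def export_dataset(rows: list, metadata: dict) -> str:
    """Emit a data set at the point of consumption: neatly formatted
    and carrying all the metadata we can muster."""
    return json.dumps({"metadata": metadata, "data": rows}, indent=2)
```

Whether the requested standard is JSON, CSV with a metadata sidecar, or something else entirely, the principle is the same: the consumer never has to see the mis-shapen originals.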