You may remember that a while back I blogged about how Orbital thinks of research data, using our “Smarties not tubes” approach. We then went away for a bit, and our first functional release included a way of storing your tubes but nothing about the Smarties. This understandably caused some confusion amongst those who had paid close attention to our poster and were expecting a bit more than a file uploader.
The reason behind this seeming disconnect in our plans was simple: a researcher had a bunch of files they wanted to publish through Orbital, and it made a lot more sense to build something that would let them do what they wanted straight out of the gate rather than devote our efforts to breaking up their nice datasets again and storing individual bits and pieces. Fortunately, our next functional release is planned to include our magical Smarty storage system. Here’s a quick overview.
Dynamic Data (as we’re calling it) uses a document storage database to keep tabs on individual data points within a research project. It’s designed specifically so that you can fill it up with fine-grained information as individual records rather than storing a single monolithic file. We think this is the best way to go about storing and managing research data during the lifetime of a research project for a few reasons:
- It’s easier to find relevant stuff. Instead of trying to remember if the data you were looking for was in 2011-Nov-15_04_v2.xls or 2011-Nov-15_04_v3.xls you can instead just search the Dynamic Dataset.
- It’s an awful lot easier for us to ensure a Dynamic Dataset is stored reliably than a bunch of files, because databases tend to come with good replication and resilience options.
- Individual files only scale up to a few tens of gigabytes each before things start to get silly, although we can store a lot of files at that size. A single Dynamic Dataset can keep growing until we run out of resources.
- Data can be reproduced in a number of standard ways from a single source. The same source can easily be turned into a CSV, or XML document, or JSON, or any other format we can write a structured description for.
- With a little work, data can be reproduced in a number of non-standard ways from a single source. Templating engines can allow researchers to describe their own output formats.
- Data can be interfaced with at a much more ‘raw’ level with only basic programming skills. Equipment such as sensors can automatically load data into a Dynamic Dataset, survey responses can be captured automatically, and more. Data can be retrieved and analysed in the same way, for example by scheduling analysis of a week’s worth of data.
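To make the “single source, many formats” point a bit more concrete, here is a rough sketch in Python using only the standard library. It is not Orbital’s code: the record fields (`sensor`, `reading`) and the function names are made up for illustration, and it assumes every record is a flat set of fields with names that are valid XML element names.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical records: the field names are invented for this example.
RECORDS = [
    {"sensor": "A1", "reading": 20.5},
    {"sensor": "B2", "reading": 19.0},
]


def to_csv(records):
    """Render flat records as CSV, using the first record's keys as the header."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Render the same records as a JSON array."""
    return json.dumps(records)


def to_xml(records):
    """Render the same records as XML, one <record> element per record."""
    root = ET.Element("dataset")
    for record in records:
        node = ET.SubElement(root, "record")
        for key, value in record.items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

All three functions read the same list of records, so adding another output format means writing one more small function rather than converting files by hand.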
The data in release v0.2 is manipulated purely at an API level through Orbital Core, although upcoming versions will have cool ways of manually entering and querying the data through the web interface. Incoming data is quickly processed to add things like tracking metadata (upload time, etc.) and then shovelled off to our storage cluster of MongoDB servers.
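As an illustration of the metadata step, the sketch below stamps a record with an upload time before it would be handed to storage. This is an assumption about the general shape of the idea, not a description of Orbital Core: the `_meta` and `uploaded_at` names and the `add_tracking_metadata` function are hypothetical, and the actual write to MongoDB is left out.

```python
from datetime import datetime, timezone


def add_tracking_metadata(record, uploaded_at=None):
    """Return a copy of the record with tracking metadata attached.

    The "_meta" / "uploaded_at" layout is hypothetical, chosen for this example.
    """
    if uploaded_at is None:
        uploaded_at = datetime.now(timezone.utc)
    stamped = dict(record)  # copy, so the caller's record is left untouched
    stamped["_meta"] = {"uploaded_at": uploaded_at.isoformat()}
    return stamped
```

Keeping the metadata under its own key means it cannot collide with whatever field names a researcher chooses for their own data.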