Have Data, Will CKAN

One thing that Orbital is focussed on is the notion that data should remain as raw and accessible as possible throughout the research cycle, with as few steps as possible between the source of the data and its storage. We don’t want our rich sensor data being turned into Excel spreadsheets unnecessarily, and we don’t want to have to manually run reports, extract data and then load the data into something else just to get work done.

When we were building our own platform this was achieved with something we called Dynamic Datasets: a MongoDB cluster with a RESTful API bolted on top which would accept massive amounts of data and re-expose it on demand, including powerful filtering and output options. In CKAN we’re using the DataStore API, currently powered by ElasticSearch (although this will change in the future), to do the same thing. The DataStore API was originally intended to provide a searchable view on data such as CSV files as part of the Recline preview, but we’re using it for a slightly different purpose: as the sole repository of data, without any corresponding originating file. Data comes straight from the source, through any sanitisation process which we need to run, and into DataStore.

To test the principle whilst we finish evaluating security for our more confidential research data (where we get to play with big engineering data), we’ve started loading data used by On Course into CKAN. Specifically, we’re ingesting our course data and our organisational structure, and we’re doing it all automatically.

Once a day we run a set of queries against our Nucleus institutional data warehouse to gather the required data. This is our ‘sensor’ in the model, representing the source of the data prior to any analysis or further work. The data is then manipulated into ElasticSearch’s bulk update API format, and injected directly into DataStore. Several of our resources are now maintained automatically by this direct interaction between our source data and CKAN’s DataStore.
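As a rough illustration of the "manipulate into bulk format" step, here is a minimal Python sketch. It is not our actual import script: the function name `to_bulk_payload`, the `id` field and the sample rows are all made up for the example. The only thing it relies on is the shape of ElasticSearch's bulk format, which is newline-delimited JSON with an action line followed by the document itself, and a trailing newline at the end of the body.

```python
import json


def to_bulk_payload(records, id_field="id"):
    """Turn a list of dicts into an ElasticSearch bulk request body.

    Each record becomes two lines: an "index" action line carrying the
    document id, then the document source. The id_field name is a
    hypothetical choice for this sketch.
    """
    lines = []
    for record in records:
        action = {"index": {"_id": record[id_field]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(record))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical rows, standing in for the output of a warehouse query.
rows = [
    {"id": "CMP1001", "title": "Games Design"},
    {"id": "CMP1002", "title": "Algorithms"},
]
payload = to_bulk_payload(rows)
```

The resulting `payload` string would then be sent to the resource's bulk endpoint. We've left the HTTP call out, since the URL and authentication depend on the CKAN installation.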

We can also run ‘real-time’ queries against this data, something researchers are likely to find useful. For example, we can ask for the first 100 modules taught by the School of Computing with “Games” in their title. The result is always as current as the last update to the DataStore, which means that (as an engineering example) researchers can ask for data such as “every sensor value for machines recorded in the last day where the temperature was over 40”, and the data will be hot off the press every single time, with no requirement to re-export the information.
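To give a flavour of what the modules query might look like, here is a small Python sketch that builds a query body in ElasticSearch's query DSL. It assumes the DataStore endpoint passes that DSL straight through to ElasticSearch. The field names `school` and `title` and the helper `build_module_query` are hypothetical; the real resource may well use different ones.

```python
import json


def build_module_query(school, title_term, limit=100):
    """Build a search body: modules in a given school whose title
    matches a term, capped at `limit` results.

    The "school" and "title" field names are assumptions for this sketch.
    """
    return {
        "size": limit,  # return at most this many hits
        "query": {
            "bool": {
                # Both conditions must hold for a module to match.
                "must": [
                    {"match": {"school": school}},
                    {"match": {"title": title_term}},
                ]
            }
        },
    }


query = build_module_query("School of Computing", "Games")
body = json.dumps(query)
```

The sensor example would follow the same pattern, swapping the `match` clauses for range conditions on the timestamp and temperature fields.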

There are still a few rough edges we’re going to do our best to iron out, but all in all I’m fairly impressed with the ease of getting going.