The Development Goes On…

It’s been a while since I gave you an update on the technical side of Orbital, so here’s a lightning-fast overview of what’s going on.

CKAN

We’re still working on fine-tuning CKAN for our needs. Although we’ve made progress on theming, the datastore, HTTPS and a few other tweaks, we’re still plagued by mixed HTTP/HTTPS resources, plugins which are difficult to install, broken sign-in through our OAuth 2 SSO service, broken search, and a Recline preview which flatly refuses to work. I suspect a lot of this is down to unfamiliarity with the codebase and with Python in general, although some areas of CKAN do feel like they’re a collection of hacks built on top of some more hacks built on a framework which is built on another framework which is built on a collection of libraries which is built on a hack.

In short, CKAN is still in need of a lot of work before our deployment can be considered production ready (hence the “beta” tag). That said, we are already using it to store some research data and the aspects which we’ve managed to get working are working well. We’re going easy though, because CKAN 1.8 and 2.0 are apparently due to land in the next couple of months.

Orbital Bridge

Our awesomely named Orbital Bridge will serve as the central point for all RDM activity around a project, as well as helping people through the process of general project management by being a springboard to our existing policy and training documentation.

Currently Bridge’s public-facing side is in a very basic state, with only static content, but it is serving as a test of our deployment toolchain. However, behind the scenes Harry has been working on ways of shuffling data around between systems using abstraction layers for aspects such as datasets, files, people and projects. Today we sat down with Paul and went through the minimal metadata required to describe these things to an acceptable standard, which will lead to additional work on both CKAN and our existing ePrints repository to smooth the transfer of records between them.
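
To make that idea a little more concrete, here’s a rough sketch of the shape one of those abstraction layers might take. The class and field names are purely illustrative (this isn’t our actual code), but the principle is a system-agnostic record plus a backend interface that each underlying system implements:

```python
# Hypothetical sketch of a 'dataset' abstraction layer: a neutral record
# plus a backend interface that CKAN, ePrints and friends each implement.
from dataclasses import dataclass

@dataclass
class Dataset:
    identifier: str
    title: str
    creators: list        # the people responsible for the data
    project_id: str       # the research project this dataset belongs to
    description: str = ""

class DatasetBackend:
    """Interface each system (CKAN, ePrints, ...) would implement."""

    def fetch(self, identifier: str) -> Dataset:
        raise NotImplementedError

    def push(self, dataset: Dataset) -> None:
        raise NotImplementedError
```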

AMS

The University’s new Awards Management System is designed to help researchers plan their funded research, walking them through the process of building their bid. The system itself has begun its roll-out across the University, and as soon as we’re given access to the APIs we’ll be integrating the AMS with Orbital Bridge, allowing seamless creation of a research project based on the data in the AMS.

This work also helps to inform what we’re doing in Bridge around abstracting the notion of a ‘project’ across all our different systems.

Kumo

Our ongoing OpenStack project, which will provide the bed for Orbital’s technical infrastructure, is slowly moving closer to a state we can begin to develop on. Tied in with this effort is our continued work on automating our provisioning, configuration, deployment, maintenance, monitoring and scaling.

Have Data, Will CKAN

One thing that Orbital is focussed on is the notion that data should remain as raw and accessible as possible throughout the research cycle, with as few steps as possible between the source of the data and its storage. We don’t want our rich sensor data being turned into Excel spreadsheets unnecessarily, and we don’t want to have to manually run reports, extract data and then load the data into something else just to get work done.

When we were building our own platform this was achieved with something we called Dynamic Datasets: a MongoDB cluster with a RESTful API bolted on top which would accept massive amounts of data and re-expose it on demand, including powerful filtering and output options. In CKAN we’re using the DataStore API, powered by ElasticSearch (at the moment; this will change in the future), to do the same thing. The DataStore API was originally intended to provide a searchable view of data such as CSV files as part of the Recline preview, but we’re using it for a slightly different purpose: as the sole repository of data, without any corresponding originating file. Data comes straight from the source, through any sanitisation process we need to run, and into DataStore.
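
As a rough illustration of what that means in practice, pushing a record straight into an ElasticSearch-backed DataStore is just an HTTP request. The endpoint path, resource ID and auth header below are assumptions for the sake of the sketch, not CKAN’s documented API:

```python
# Hypothetical sketch: write one record directly into the DataStore.
import json
import requests

DATASTORE_URL = "https://orbital.example.org/api/data/my-resource-id"  # illustrative
API_KEY = "my-api-key"  # illustrative

record = {
    "sensor": "rig-1",
    "temperature": 42.7,
    "recorded_at": "2012-07-10T09:00:00Z",
}

response = requests.post(
    DATASTORE_URL,
    data=json.dumps(record),
    headers={"Authorization": API_KEY, "Content-Type": "application/json"},
)
response.raise_for_status()
```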

To test the principle whilst we finish evaluating security for our more confidential research data (where we get to play with big engineering data), we’ve started loading the data used by On Course into CKAN. Specifically, we’re ingesting our course data and our organisational structure — and we’re doing it all automatically.

Once a day we run a set of queries against our Nucleus institutional data warehouse to gather the required data. This is our ‘sensor’ in the model, representing the source of the data prior to any analysis or further work. The data is then manipulated into ElasticSearch’s bulk update API format and injected directly into DataStore, giving us a set of resources maintained automatically by direct interaction between our source data and CKAN’s DataStore.
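
For the curious, ElasticSearch’s bulk format pairs an action line with the document itself, one JSON object per line. Something along these lines (the index and type names, module codes and rows are made up for illustration):

```python
# Build a bulk request body from the rows returned by our daily queries.
import json

rows = [
    {"module_code": "CMP101", "title": "Introduction to Games Computing",
     "school": "School of Computing"},
    {"module_code": "CMP202", "title": "Games Programming",
     "school": "School of Computing"},
]

lines = []
for row in rows:
    # action line telling ElasticSearch where to put the next document
    lines.append(json.dumps({"index": {"_index": "nucleus", "_type": "module"}}))
    lines.append(json.dumps(row))

bulk_body = "\n".join(lines) + "\n"  # the bulk API requires a trailing newline
# bulk_body is then POSTed to the DataStore's _bulk endpoint
```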

We can also run ‘real-time’ queries against this data — something researchers are likely to find useful. For example, we can ask for the first 100 modules taught by the School of Computing with “Games” in their title. Such a query is always as accurate as the last update of the DataStore API, which means that (as an engineering example) researchers can ask for data such as “every sensor value for machines recorded in the last day where the temperature was over 40”, and the data will be hot off the press every single time, with no requirement to re-export the information.
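
In ElasticSearch’s query DSL, that “first 100 Games modules” question comes out roughly like this (the field names are assumptions carried over from the sketch above):

```python
# Illustrative query: first 100 School of Computing modules with "Games"
# in the title, POSTed to the DataStore's _search endpoint.
query = {
    "size": 100,  # first 100 matches
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "Games"}},
                {"match": {"school": "School of Computing"}},
            ]
        }
    },
}
```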

There are still a few rough edges we’re going to do our best to iron out, but all in all I’m fairly impressed with the ease of getting going.

A Bridge To The Skies

Following on from our meeting with Team CKAN, we’ve had to make a few sweeping architectural changes. Here’s what’s happening:

  • We’re totally scrapping Orbital Core and Orbital Manager in their current form, since they mostly replicate functionality which already exists in ownCloud and CKAN.
  • We’re developing a brand new application, codenamed Orbital Bridge, which sits between various systems which make up the Orbital platform (ownCloud, CKAN, and in the future our Staff Directory and Awards Management System).
  • Orbital Bridge acts to orchestrate aspects of the various systems, and provides a high-level concept of research projects and project team members. The relevant aspects of these concepts (such as group membership, folder sharing and security permissions) are then managed in the individual systems.
  • Orbital Bridge will also include options for moving files and data (via the CKAN DataStore API) through conversion tools, for example taking a CSV and loading it directly into a datastore (there’s a quick sketch of this after the list), converting binary files to open standards, or taking a datastore and converting it to something more useful based on given search parameters.
  • CKAN exposes a variety of data using its APIs (also delicious RDF). Orbital Bridge takes this data to boost our Nucleus institutional data store (and our upcoming institutional triple store), and through that our other integrated services.
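
To give a flavour of the CSV-to-datastore conversion mentioned above, here’s a minimal sketch. The file name and the push helper are hypothetical:

```python
# Turn a CSV file into one record per row, ready for the DataStore.
import csv

def csv_to_records(path):
    """Yield each row of the CSV as a dict keyed by the header row."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield row

# for record in csv_to_records("results.csv"):
#     push_to_datastore(record)  # hypothetical helper
```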

Information which Orbital Bridge uses to determine things like projects and users will come initially from our University SSO service, coupled with manual creation of projects and external users, and in the future (APIs permitting) partially from our Awards Management System for automatic population of project information. Check out our shiny graphic for a visual overview.

Release The Releases!

One of the things that Orbital set out to do is to prove that agile development of software and solutions can arrive at the same outcome as a more traditional ‘waterfall’ method of project planning. A big part of this is the “release often” approach to development, shipping new versions far more quickly than is usually the case in academia. We’re actually on the slightly slow side of agile development (some companies push updates several times a day), but still aiming to ship a major point release every month for our users to have a play with and comment on.

On top of the point releases we’re also churning out additional maintenance releases with bug fixes and minor features roughly every week, our latest being 0.2.1, which shipped yesterday. These follow a fairly common pattern: odd-numbered releases (0.1.1, 0.2.1, 0.2.3 etc) are patch releases which fix bugs, and even-numbered ones (0.2.2, 0.2.4 etc) add evolutionary updates to existing functionality.

There are a few major benefits to doing things this way:

  • Orbital never gets the chance to stray far from user requirements – we can’t go off and spend 6 months developing something that doesn’t do what people want, because they can tell us if we’re going wrong at least once a month (and often more frequently than that).
  • Users who report bugs don’t have to wait several months for the next ‘service patch’ which rolls up thousands of changes; the next maintenance release will fix things, and it’s usually only a few days out.
  • The gap between a feature request and its implementation is dramatically reduced, so the majority of feature requests are delivered whilst the requirement is fresh in users’ minds. This results in more immediate usage and feedback.
  • Our code is refactored and refined more often. Instead of building a massive codebase all at once and never going back to improve things, we spend a small amount of time each release making sure that code is clean and sane, interfaces are well defined, and so on.
  • Our continuous integration server won’t let us ship a product which doesn’t meet minimum requirements of code quality, documentation and testing. Being forced through this process on a regular basis means that we never get a chance to build up a significant backlog of problems.
  • The use of our code repository and feature branches (there’s a post on this coming up later) means that every ‘merge’ of development code with our staging code is checked over by a developer other than the one who wrote the feature. When we ‘merge’ the staged and tested code with our production code the changes are checked yet again.
  • More granular releases make it easier to roll back when things go wrong. Moving from v0.2.1 to v0.2.3 doesn’t need any database changes, so if something isn’t working as expected it’s a simple matter to move back to an earlier release. In contrast, if we only ever moved between major releases v1 and v2 (which will almost inevitably involve changing a database schema) then performing a rollback becomes much more challenging.

Thus far Orbital has made four distinct releases (v0.1, v0.1.1, v0.2 and v0.2.1), with v0.2.2 due out next week. If you’re interested in seeing (roughly) what’s in the pipeline don’t forget our Pivotal Tracker can tell you more.

And Now… Dynamic Data!

You may remember a while back that I blogged about how Orbital thinks of research data, using our “Smarties not tubes” approach. We then went away for a bit, and in our first functional release included a way of storing your tubes, but nothing about the Smarties. This understandably caused some confusion amongst those who had paid close attention to our poster and were expecting a bit more than a file uploader.

The reason behind this seeming disconnect in our plans was simple: a researcher had a bunch of files they wanted to publish through Orbital, and it made a lot more sense to build something that would let them do what they wanted straight out of the gate rather than devote our efforts to breaking up their nice datasets again and storing individual bits and pieces. Fortunately, our next functional release is planned to include our magical Smarty storage system. Here’s a quick overview.

Dynamic Data (as we’re calling it) uses a document storage database to keep tabs on individual data points within a research project. It’s designed specifically so that you can fill it up with fine-grained information as individual records rather than storing a single monolithic file. We think this is the best way to go about storing and managing research data during the lifetime of a research project for a few reasons:

  • It’s easier to find relevant stuff. Instead of trying to remember if the data you were looking for was in 2011-Nov-15_04_v2.xls or 2011-Nov-15_04_v3.xls you can instead just search the Dynamic Dataset.
  • It’s an awful lot easier for us to ensure a Dynamic Dataset is stored reliably than a bunch of files, because databases tend to have good replication and resiliency options.
  • For individual files we can scale up to a few tens of gigabytes per file at most before things start to get silly (although we can store a lot of files at that size); a single Dynamic Dataset, on the other hand, can keep scaling until we run out of resources.
  • Data can be reproduced in a number of standard ways from a single source. The same source can easily be turned into a CSV, or XML document, or JSON, or any other format we can write a structured description for (there’s a quick sketch of this after the list).
  • With a little work, data can be reproduced in a number of non-standard ways from a single source. Templating engines can allow researchers to describe their own output formats.
  • Data can be interfaced with at a much more ‘raw’ level with only basic programming skills. Equipment such as sensors can automatically load data to a Dynamic Dataset, survey responses can be captured automatically and more. Data can be retrieved and analysed in the same way, for example scheduling analysis of a week’s worth of data.
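
Here’s the “single source, many formats” idea from the list above in miniature. The records and field names are made up for illustration:

```python
# Reproduce the same records as both JSON and CSV from one source.
import csv
import io
import json

records = [
    {"sensor": "rig-1", "temperature": 41.2},
    {"sensor": "rig-2", "temperature": 38.9},
]

# JSON: trivially the records themselves
as_json = json.dumps(records)

# CSV: same source, different structured description
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sensor", "temperature"])
writer.writeheader()
writer.writerows(records)
as_csv = buffer.getvalue()
```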

The data in release v0.2 is manipulated purely at an API level through Orbital Core, although upcoming versions will have cool ways of manually entering and querying the data through the web interface. Data is then quickly processed to add things like tracking metadata (upload time etc) and shovelled off to our storage cluster of MongoDB servers.
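
That ingest step boils down to something like the sketch below. The host, database and collection names are illustrative, and the real pipeline does rather more than this:

```python
# Stamp tracking metadata onto an incoming record and store it in MongoDB.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://mongo.example.org")  # illustrative host
collection = client["orbital"]["dynamic_datasets"]

def ingest(record):
    record["_uploaded_at"] = datetime.now(timezone.utc)  # tracking metadata
    collection.insert_one(record)

ingest({"sensor": "rig-1", "temperature": 41.2})
```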