How the National Archives use MongoDB

We spoke to staff at 10gen yesterday about our choice of MongoDB for Orbital and were reminded that the National Archives (UK) are using it as the basis of their discovery platform. We plan to stay in touch with 10gen throughout the project and provide a Case Study on our use of Mongo for Managing Research Data. Here are some slides from a recent conference where the National Archives spoke about their work with MongoDB.


Forecast: Cloudy

With the Orbital project we’re looking at taking a leap into the brave new world (well, at least as far as university projects go) of cloud hosting. Now, I hate the word “cloud” when used to describe most services, because people bandy it about like it’s some magical world-fixing technology when in actual fact all they mean is “it’s on the internet”. We have “cloud” services which are fundamentally no different from doing the same thing on a server kept on a desk somewhere; but Orbital isn’t going to be like that.

I hope.

Instead, Orbital will be a true ‘cloud’ service in that it’s a resource which end users can tap into with no care at all for the underlying technologies. It’ll scale up and down with demand, extending both processing power and storage space as needed. Should one of our servers fail for any reason it won’t be met with a week of downtime whilst we rebuild things, but with a seamless transition of work to one of the redundant, load-balanced alternatives. If a process stops working then instead of the entire system crashing down it’ll adapt, queueing tasks until things are restored. Alongside this, the use of common standards is essential to development: RESTful APIs follow well-understood principles for interacting with data, and authentication using OAuth (the same method used by Twitter, Facebook, Google and Microsoft) is core to how things behave.
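To make the REST-and-OAuth point concrete, here’s a minimal sketch of what talking to an API like Orbital’s could look like. The endpoint, resource names and token are all invented for illustration – the real API is still being designed – but the bearer-token pattern is the standard one for OAuth-protected REST services:

```python
import requests  # widely used third-party HTTP library

# Hypothetical base URL and access token, purely for illustration.
API_BASE = "https://orbital.example.ac.uk/api/v1"
ACCESS_TOKEN = "example-oauth-access-token"

def list_datapoints(project_id):
    """Fetch a project's data points via plain REST, authenticating with OAuth."""
    response = requests.get(
        f"{API_BASE}/projects/{project_id}/datapoints",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    )
    response.raise_for_status()  # surface HTTP errors rather than swallowing them
    return response.json()
```

Because the interface is just HTTP verbs on URLs plus a standard auth header, any client – ours or a third party’s – can consume it without caring what runs behind it.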

Whilst the Orbital application itself is built to run in a cloudy manner using these loosely coupled methods and Rambo architecture (each component designed to keep fighting on its own if everything around it fails), we’re also going to be hosting the thing in the cloud. This helps us with a few things, including the aforementioned scalability, improved resilience and the ability to properly analyse how the cloud works for higher education.

There’s also an unexpected benefit of this cloudy approach to Orbital: we gain the ability to pin a real-world cost on the storage of research data, since we are quite literally being charged by the GB. At the moment researchers tend to treat storage as a one-off cost – for example buying a pile of hard disks – with less understanding of what it actually costs to keep them spinning. Since Orbital will know more about the intricacies of the stored data than the researchers do, we will be able (for the first time) to offer figures for both how much it is costing to store the data and the estimated carbon impact of doing so.
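As a back-of-the-envelope illustration of how those figures fall out once you’re billed per GB (the rates below are invented for the example, not real tariffs or audited carbon factors):

```python
# Invented rates, for illustration only.
PRICE_PER_GB_MONTH = 0.10    # cost per GB stored per month
KG_CO2_PER_GB_MONTH = 0.005  # estimated emissions per GB stored per month

def storage_report(gigabytes, months):
    """Estimate the running cost and carbon impact of keeping data spinning."""
    cost = gigabytes * months * PRICE_PER_GB_MONTH
    carbon = gigabytes * months * KG_CO2_PER_GB_MONTH
    return cost, carbon

# e.g. 500 GB of research data kept for three years
cost, carbon = storage_report(500, 36)
print(f"~£{cost:.2f} and ~{carbon:.1f} kg of CO2")
```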

We want to give researchers both of these numbers, to help them understand that hanging on to research data has an ongoing cost, but also that it’s probably more efficient to hang on to it in a central, cloud-based platform. Of course, we also want to give people a clean exit strategy, so we’ll be looking at ways of easily creating ‘hard’ copies for offline, non-cloud storage whilst still maintaining a virtual presence for the purposes of referencing and metadata.

Research Data vs Research Data

As I’ve been looking closer at various requirements for Orbital, as well as other research data management projects, it’s become increasingly apparent that Orbital has taken a different tack when it comes to defining what research data actually is. Whilst not a problem in itself, it does lead to a certain disconnect when talking to people with a different idea about what data means. When it comes to storing data the disconnect is even bigger, because people struggle to separate the format the data travels around in from the data itself. In true engineering/computing style, it’s time for an analogy. I’m using sweets because hey, sweets are awesome.

Sugar! Sugar!

Imagine a tube of Smarties (or sugar-coated chocolate beans of choice). When I talk about research data I’m talking about the individual Smarties, the individual nuggets of information. You could tip 100 tubes of Smarties into a bowl and you’d just end up with a big pile of Smarties. You could then go through and sort the Smarties by colour, or perform some other type of organisation. Since you’ve got the individual Smarties out of their containers it’s a lot easier to see a whole overview and work with them all at once.


Taking this approach makes sense to me, because if I want to throw in a couple of bags of Peanut M&Ms I can do so without suddenly having a tube saying “Smarties” which contains nuts. I can still sort my pile of sweets into colours, or into types. I can orient them by the little letters on top. I could throw in a handful of jelly beans and a bar of chocolate broken into squares, and then order by sugar content, colour, and number of artificial flavours. The possibilities are quite literally limited only by my tolerance for sugar highs.
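Since we’re already leaning towards MongoDB for Orbital (see the earlier post), here’s a rough sketch of what this ‘pile of sweets’ approach looks like in a document store. The connection details, collection and field names are all invented for the example:

```python
from pymongo import MongoClient

# Local connection and invented field names, for illustration only.
client = MongoClient("mongodb://localhost:27017")
sweets = client["orbital_demo"]["sweets"]

# Tip several "tubes" into one pile: documents needn't share a schema.
sweets.insert_many([
    {"type": "smartie", "colour": "red", "sugar_g": 0.5},
    {"type": "peanut_mandm", "colour": "red", "sugar_g": 0.9, "contains_nuts": True},
    {"type": "jelly_bean", "colour": "green", "sugar_g": 0.7, "flavourings": 3},
])

# Sort the whole pile however we like, regardless of which tube anything came from.
for sweet in sweets.find().sort("sugar_g", 1):
    print(sweet["type"], sweet["colour"])
```

The container (the tube, or the spreadsheet the data arrived in) stops mattering; each nugget carries its own description and can be regrouped at will.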


Pivoting Around

As part of Orbital’s development we need to keep what we’re doing on track, and ensure that what is produced is actually what people are after. We’re building the project using agile development methods, which mean that instead of generating a load of documentation and exacting requirements up front and then building software, we generate a basic set of requirements, start developing and then return to look at new or changed requirements at regular intervals.

Keeping tabs on this kind of thing requires a management tool; in our case that tool is the wonderful Pivotal Tracker, and here’s why.

Pivotal allows us to break down user requirements (gathered through a variety of means, including meetings, surveys, observation and so on) into discrete bundles called ‘stories’, each of which represents something that a user needs (or wants) to be able to do with the final product. An example might be “project administrators must be able to assign roles to project users”, or “users must be able to manually add a data point”. By creating these stories it starts to become clearer what actually needs to be done.

From there we can start to fully analyse each of these stories, providing them with information such as a ‘score’ of how difficult each story will be to achieve, or sub-tasks for actual development purposes. Stories can be assigned to various people based on who needs to be involved, and go through a clearly defined workflow of being started, being finished, being delivered in a product version and being approved by the customer.
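As a rough sketch of the shape of a story (the field names and workflow states below are our own paraphrase for illustration, not Pivotal Tracker’s actual data model):

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    """The workflow a story moves through, as described above."""
    UNSTARTED = "unstarted"
    STARTED = "started"
    FINISHED = "finished"
    DELIVERED = "delivered"
    APPROVED = "approved"

@dataclass
class Story:
    title: str                  # e.g. "users must be able to manually add a data point"
    points: int                 # the difficulty 'score' agreed by the team
    owner: str | None = None    # who needs to be involved
    state: State = State.UNSTARTED
    tasks: list[str] = field(default_factory=list)  # sub-tasks for development

story = Story("project administrators must be able to assign roles", points=3)
story.state = State.STARTED  # and so on through the workflow
```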

On top of this management of user stories we can also pack out Pivotal with higher-level package deliverables and deadlines, along with bug reporting and general project chores. Once we’ve got all these things into the Tracker we’re able to re-order them as priorities shift, giving us an instant overview of what’s happening in the current iteration (a 2-week-long development cycle) as well as what’s going to be happening in future iterations. At this point, Pivotal Tracker comes into its own with something called ‘emergent planning’.

Emergent planning looks at how we’re actually performing in terms of crunching through our list of user stories and dynamically adjusts which stories we’re going to be tackling in upcoming iterations. If we’re doing well we begin to see more points’ worth of development per iteration, and if we’re slipping then Tracker gives us fewer. Since we’ve told Pivotal what needs to happen before certain deadlines are met (when we ordered stories and tasks), and since Pivotal knows roughly how fast we’re working, it’s easy to see if we’re predicted to hit or miss development milestones.
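The arithmetic behind this is pleasingly simple – the sketch below is our own illustration of the idea, not Pivotal’s actual algorithm:

```python
# A hand-rolled illustration of emergent planning, not Pivotal's real algorithm.

def velocity(points_per_iteration, window=3):
    """Average points completed over the last few iterations."""
    recent = points_per_iteration[-window:]
    return sum(recent) / len(recent)

def iterations_remaining(backlog_points, points_per_iteration):
    """Predict how many more 2-week iterations the remaining stories will take."""
    v = velocity(points_per_iteration)
    # Round up: a partially used iteration still occupies a whole one.
    return -(-backlog_points // v)

# e.g. we completed 8, 12 and 10 points in the last three iterations,
# and 55 points of stories stand between us and the next milestone.
print(iterations_remaining(55, [8, 12, 10]))  # -> 6.0 iterations (~12 weeks)
```

Feed in a slower recent velocity and the predicted milestone date slips immediately, which is exactly the early warning we want.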

Want to see what we’re up to? Our Pivotal Tracker project is open for you all to see.