We’re listening.

We love hearing feedback, so as part of our v0.1 release of Orbital we decided to make giving feedback super easy. We’ve done this using the tools provided by UserVoice, giving us a single box which lets you complete a simple sentence:

I need Orbital to…

That’s it. No questionnaires to draw up complicated statistical analysis of how people feel about a list of stuff we plucked out of thin air, just a single box for you to share your requirements with us.

UserVoice is designed to be really useful when it comes to gathering user feedback, so we’re taking advantage of that usefulness and adding a “feedback” button to every single page. When inspiration strikes, or you discover that Orbital is missing something, or you realise that life would be easier if it did something differently, you can just click the button and let us know.

You can also see a list of everything that people have said already, to see if your thought has already been shared. If you think a thought is particularly important you can vote on it, to add your own voice to what you reckon we should do next. We’ll also let you know how we’re getting on with looking at or implementing your ideas, telling you when we’re looking at feasibility, planning it and working (or not) on it.

So there you have it. No focus groups, workshops or questionnaires to carefully filter and manage your thoughts and suggestions. Just a simple box and a hotline to the development team.

It’s like having a whole new research partner…

Orbital, as you know, has many cool features in the pipeline which are designed to make research easier. From keeping tabs on your data and helping you find what you’re looking for, through to helping you build your data management plan, Orbital is going to be an essential tool in your daily research work. Today we’re pleased to announce that it’s being made even better, making use of some really rather clever machine learning and artificial intelligence to actually do parts of the research for you.

To start with, all you need to do is to upload your research data in whatever format you’ve got it in. If you don’t have research data then just give Orbital a few keywords and it’ll generate research data for you based on over 200 individual variables gathered from (amongst other things) the news, weather, stock markets and punctuality of the rail network. Once your data is loaded Orbital will begin to sift through it, sorting it into a more easily understood form which can be searched and queried with ease. From there Orbital will begin to look for statistically significant patterns of data, pick them out for further analysis and finally output a conclusion for you – complete with any necessary citations – ready for inclusion in your paper.

Yet another example of how Orbital isn’t just a place to keep your data, but an active part of your day-to-day work.

Data, Data Everywhere…

For a project which is essentially about storing data, we’ve not actually done that much talking about it. This may seem sensible to some — after all, everybody knows what data is, don’t they?

It turns out that what people define as ‘data’ is a hugely wide-ranging topic (you can find a myriad of research on how different people define it), and what we’re basically trying to do is fit misshapen data into a one-size-fits-nothing storage system. Allow me to elaborate.

First of all we had to look at what data was currently available to us. Fortunately we have some awesome project partners in the School of Engineering who provided us with some of what they’re researching, and thus presented the first problem: the data doesn’t exist in any kind of standardised format. We’ve got to contend with flat text database formats, weird (often invalid) XML, Excel spreadsheets, CSV files (again often invalid), folders of images or audio files, proprietary binary formats, non-binary flat files which nonetheless need parsing to be made understandable, plain strings of data, and the occasional random file format which even the source of the data can’t explain.

The solution to this problem is fairly simple in principle, yet complex in practice. First of all, when it comes to archive storage of files (i.e. without any pre-processing) Orbital is designed to be file type agnostic: if you give it a random stream of bytes and say it’s a file then Orbital will duly store the file as provided, with no further work needed. It doesn’t care if your XML file has no DTD and has unclosed tags, since it doesn’t do any work inside the stream. You will later be able to retrieve the file exactly as it was first loaded into the system without any changes or alterations. It’s worth pointing out, however, that this does mean that if Orbital is given a corrupt file to store then it will do so blindly without any attempt at validation.
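To make the ‘file type agnostic’ idea a bit more concrete, here is a minimal Python sketch of storing bytes without ever looking inside them. It is not Orbital’s actual code: the BlobStore class, its put and get methods and the use of a SHA-256 digest as the key are all assumptions made purely for illustration.

```python
import hashlib


class BlobStore:
    """Hypothetical in-memory stand-in for file-type-agnostic archive storage."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The bytes are never parsed or validated; they are stored exactly as given.
        # Using a SHA-256 digest as the key is an assumption made for this sketch.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = bytes(data)
        return key

    def get(self, key: str) -> bytes:
        # Retrieval hands back the same bytes that were stored, unchanged.
        return self._blobs[key]


store = BlobStore()
broken_xml = b"<data><row>1</row>"  # no DTD and an unclosed tag: stored anyway
key = store.put(broken_xml)
assert store.get(key) == broken_xml
```

Because nothing in the sketch inspects the stream, the invalid XML goes in and comes back out byte for byte, which is exactly why a corrupt file would be stored just as happily.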


Eating your own dog food: Building a repository with API-driven development

This is a proposal for a paper at the Open Repositories 2012 conference in July.

The JISC-funded Orbital project is building on earlier work at the University of Lincoln to develop a state-of-the-art research data management infrastructure, piloted with the first purpose-built School of Engineering in the UK in over 20 years.

Orbital (figure c) differs from traditional database applications in three significant ways:

  1. Orbital Core uses MongoDB, a document-oriented, schema-less, so-called ‘NoSQL’ database. MongoDB offers flexibility in that it is capable of accepting an object representing any kind of data (e.g. tabular data, survey results, images) without the need to develop a schema beforehand. MongoDB also includes useful features which can boost performance and resiliency, namely sharding (slicing data across multiple servers so a request may be processed by multiple servers in parallel) and replication (keeping multiple identical copies of data on different servers in case one of them fails). Orbital is also designed to be able to spread the ‘core’ (the application which does the heavy lifting) and the ‘manager’ (the front-end user interface) across multiple servers without causing stress. In our experience MongoDB, combined with the Sphinx search engine to perform full-text searching, is also extremely fast and allows us to develop simple, attractive APIs which we can expose to user applications.
  2. Orbital Core mediates access to the data via an open source OAuth 2 server we have developed and implemented at Lincoln. The use of OAuth 2 allows access to the data from multiple authorised systems, provided that the owner of the data has given permission, instantly opening the Orbital application to third-party extension. This method establishes the identity, authentication and authorisation of users, providing direct access to individual data sets or portions of data sets (e.g. specific rows/columns) through APIs on Orbital Core.
  3. The design and development of Orbital Core is API-driven, resulting in an application that offers 100% of its functionality through APIs, whether to our own Orbital Manager or a third-party application, each of which is treated equally by Orbital Core (figure c). As far as Orbital Core is concerned there is no functional difference between Orbital Manager (the front-end) and an application that a researcher has developed to meet a specific need; they are subject to the exact same access controls, restrictions, sanity checking and limitations. We have therefore eschewed some of the traditional approaches of building a database application, where access to the database is either provided via a stand-alone application (figure a) or via an API bolted on to the database (figure b). Orbital is also designed to be stateless: all of the API functions are RESTful, so each call represents a complete transaction with no requirement for session affinity. The API functions are not reliant on SQL features like transactions and joins, and have a reduced requirement for referential integrity.
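The second and third points can be illustrated with a short Python sketch. It is a toy model rather than Orbital Core itself: the Core and Request names, the token strings and the grants table are all hypothetical, and a real OAuth 2 deployment would issue and validate tokens through an authorisation server rather than a dictionary. What it does show is that every call carries its own token (so no session is needed) and that the front-end and a third-party application pass through exactly the same check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    token: str      # bearer token sent with every call, so no session state is needed
    dataset: str    # name of the data set being read
    columns: tuple  # the portion of the data set being requested


class Core:
    """Hypothetical API-only core; an illustration, not Orbital's real code."""

    def __init__(self, datasets, grants):
        self._datasets = datasets  # dataset name -> list of row dicts
        self._grants = grants      # token -> {dataset name: set of permitted columns}

    def read(self, request):
        # Every caller is checked the same way, whoever wrote the client.
        allowed = self._grants.get(request.token, {}).get(request.dataset)
        if allowed is None or not request.columns or not set(request.columns) <= allowed:
            return 403, None
        rows = [
            {column: row[column] for column in request.columns}
            for row in self._datasets[request.dataset]
        ]
        return 200, rows


core = Core(
    datasets={"strain": [{"t": 0, "load": 1.5, "operator": "abc"}]},
    grants={
        "manager-token": {"strain": {"t", "load", "operator"}},
        "thirdparty-token": {"strain": {"t", "load"}},
    },
)

ok_status, ok_rows = core.read(Request("thirdparty-token", "strain", ("t", "load")))
denied_status, denied_rows = core.read(Request("thirdparty-token", "strain", ("operator",)))
```

In this sketch the third-party token can read the two columns its owner has granted and is refused the third, while the front-end token is subject to the very same code path with a wider grant.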

Under this design, the API is the only way to interface with the data and functionality of the system. This API-driven approach offers several benefits:

  • Architecture is better: We are forced to think about data types and methods early on. Consistent behaviour across the application is easier to achieve.
  • Development is easier: Calling a well designed API is simple; error messages become cleanly captured by design; APIs encourage code reuse at both API and application end.
  • Updates become simpler: We can run two or more API versions concurrently; tweak the API back-end and all front-end applications (‘official’ and 3rd party) benefit at once.
  • The APIs are better: The APIs must include everything we want our application to be able to do. Reliability of the API is now critical which encourages better design of resiliency and error handling; and usability of the API is essential which encourages better documentation.
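As a rough illustration of running two API versions side by side, the Python sketch below routes a request path to a version-specific handler. The paths, handler names and response shapes are assumptions invented for the example and do not describe Orbital’s real API.

```python
def list_datasets_v1(store):
    # Version 1 (hypothetical): a bare list of data set names.
    return sorted(store)


def list_datasets_v2(store):
    # Version 2 (hypothetical): the same information wrapped with a count.
    return {"count": len(store), "datasets": sorted(store)}


# Both versions are registered at once, so older clients keep working
# while newer clients move to the newer response shape.
HANDLERS = {
    ("v1", "datasets"): list_datasets_v1,
    ("v2", "datasets"): list_datasets_v2,
}


def dispatch(path, store):
    """Return (status, body) for a path such as '/v1/datasets'."""
    parts = tuple(path.strip("/").split("/"))
    handler = HANDLERS.get(parts)
    if handler is None:
        # Errors come back in one predictable shape, whichever client is calling.
        return 404, {"error": "unknown endpoint: " + path}
    return 200, handler(store)


example_store = {"strain": [], "acoustics": []}
v1_response = dispatch("/v1/datasets", example_store)
v2_response = dispatch("/v2/datasets", example_store)
```

Since every client goes through the same dispatch function, a fix made behind either handler reaches the official front-end and any third-party application at the same moment.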

The challenges of this approach are that every time we want to build user-facing functionality we have to assess our APIs and work out where the functionality belongs as well as ensuring that we have lightweight data transfer and reliable error handling designed into the application. We also have to double up on some areas of development, writing both the respective Core and Manager parts of the system.

Illustrations

Figure a: The only way to interact with this application is to either be a user, or pretend to be one (for example via screen-scraping).
Figure b: The most common form of API, consisting of a ‘second view’ on the data and functionality of an application. This style of API often exposes a limited subset of the application’s functionality.
Figure c: In an API-driven model the API is the only way to interface with the application.

Hello? Is it me you’re looking for?

Orbital loves APIs, to the extent that the entire project hangs off them working as expected. One thing that Orbital must also do is ensure that it keeps potentially sensitive or confidential data secure during the process of slinging it over the ether. Fortunately, the fact we’re using HTTP as our transport mechanism of choice means that we can leverage something which has proven to be pretty darned reliable thus far: HTTPS!

The pretty green HTTPS icon in Chrome means that your communication with Orbital is secured.

HTTPS is the secure version of the HTTP standard which drives browsing the web — the ‘green padlock’ which appears on secured websites when you’re browsing. Behind the scenes it’s full of plenty of technological awesomeness including various protocols (SSL and TLS if you’re curious) and complex encryption and certification policies which aim to ensure three things:

  1. Traffic between the client (in most cases your browser) and the server is encrypted, i.e. it’s useless to anybody who may be able to intercept it.
  2. The connection between the client and the server is secure, i.e. the only parties who can communicate over that channel are the client and server who initiated it.
  3. The server you’re talking to is verifiably the right one, and not simply pretending in order to intercept data.
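For anybody writing their own client rather than using a browser, the Python sketch below shows how these guarantees map onto the standard library’s ssl module. It is a generic illustration, not something taken from Orbital’s code: the open_secure_connection helper is hypothetical, and the TLS 1.2 minimum is our own assumption about a sensible floor.

```python
import socket
import ssl

# create_default_context() turns on certificate verification and hostname
# checking, which is what makes the server "verifiably the right one".
context = ssl.create_default_context()

# Assumption for this sketch: refuse anything older than TLS 1.2.
context.minimum_version = ssl.TLSVersion.TLSv1_2


def open_secure_connection(host, port=443):
    """Hypothetical helper: open an encrypted, verified socket to host."""
    raw = socket.create_connection((host, port), timeout=10)
    try:
        # The handshake fails if the certificate is invalid or does not match host.
        return context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
```

The helper is defined but not called here, since doing so needs a live server to talk to; the important part is that verification is left switched on rather than disabled for convenience.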

Orbital makes use of HTTPS throughout, and offers no option at all for unsecured access. Simply put, this means that all your communication with Orbital is guaranteed to be secure whilst it’s in transit. In fact, we decided to find a tool which can quantifiably express just how secure it is, so we had a look at Qualys. You can view our SSL reports below:

Whilst we always made sure that both aspects of Orbital offered a standard level of security, our initial reports highlighted a few flaws. Our servers were vulnerable to the BEAST attack, and offered some low-security ciphers in their list of supported ones. Based on these reports we were able to plug those holes, improving our overall security. We also decided to implement the HSTS draft specification, providing an explicit instruction to browsers that communication with Orbital must always take place over HTTPS with a valid certificate, and that any attempts to do otherwise should be blocked as a security risk.
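HSTS works by sending browsers a Strict-Transport-Security response header. The Python sketch below shows one way of adding that header using a small WSGI wrapper. It is an illustration only: the wrapper, the demo application and the one-year max-age policy are assumptions, and the source does not say how Orbital’s own servers set the header.

```python
from wsgiref.util import setup_testing_defaults

# Assumed policy for this sketch: remember the HTTPS-only rule for one year.
HSTS_VALUE = "max-age=31536000; includeSubDomains"


def with_hsts(app):
    """Wrap a WSGI app so every response carries the HSTS header.

    The header is only meaningful on responses served over HTTPS.
    """
    def wrapped(environ, start_response):
        def start_with_hsts(status, headers, exc_info=None):
            # Drop any existing copy so the header appears exactly once.
            headers = [(k, v) for k, v in headers
                       if k.lower() != "strict-transport-security"]
            headers.append(("Strict-Transport-Security", HSTS_VALUE))
            return start_response(status, headers, exc_info)
        return app(environ, start_with_hsts)
    return wrapped


def hello_app(environ, start_response):
    # Tiny demo application standing in for a real one.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def call(app):
    """Run a WSGI app once in-process and return (status, headers, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {}
    setup_testing_defaults(environ)
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call(with_hsts(hello_app))
```

A browser which has seen this header will refuse to talk to the site over plain HTTP until the max-age period runs out.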

Just a few of the ways we’re making sure that Orbital can be trusted with your sensitive or commercial data.