A JISC-funded Managing Research Data project

Posts tagged OAuth

This is a post about our first release of Orbital.

About a month ago, Dr. Tom Duckett, Reader in The Department of Computing and Informatics approached the Orbital project because he urgently wanted to publish around 20GB of data for Long-term mobile robot operations. That afternoon, we gave Tom and Feras Dayoub, his Research Assistant, space on one of our servers and they uploaded a bunch of HTML pages and the zipped up data. We minted a proxy URL for them and advised them on an appropriate data license to choose.   We also set up Google Analytics, so they could see what interest in his data there was.

Job done. For the time being.

What Tom really wanted was to be able to email a link to his data to a robotics mailing list and tell an international community of likeminded researchers and manufacturers that the data was available to use. He says that long-term datasets for mobile robots are quite rare in his community, so there was a good chance people would be interested in them. He also wanted to be able to demonstrate his work when writing an EU bid. There will be a follow up blog post about what impact this has had on Tom’s research.

That afternoon got us thinking: What is the minimal set of functions that a researcher like Tom requires of a Research Data Management tool?

Tom wanted access (sign in) to a server (hosting) where he could upload his data (storage) and describe it so that other people could understand and download it (publish) under an appropriate license. The URL pointing to the data should be persistent, even if the data itself is migrated from one system to another. The impact (analytics) of the data should also be measurable.

Tom’s chance intervention in our project made us focus on Orbital v0.1 as the ‘minimum viable product‘ for researchers who need to publish open data. We thought his requirements were a great opportunity to release something early and start getting direct user feedback on our product. We decided to set a release date for Orbital v0.1 a month ahead and aim to deliver everything that Tom asked of us in this first release.

A Minimum Viable Product has just those features that allow the product to be deployed, and no more.

Today, we released Orbital v0.1 and it does everything described above. It’s an alpha release, but we’ve been testing it like crazy, we also had Feras test it and we’ve been pushing code through Jenkins since the beginning of the project so we know it passes our QA checks and we think it’s stable enough for use. From this point forward, Orbital and the URIs it mints will persist, too.

From today, a researcher at the University of Lincoln can sign in to Orbital, create and describe a project, upload their data to the project, choose a license for the data and add a Google Analytics code to measure project analytics (we’re also tracking each button click to better understand how people use Orbital). The data is published at a id.lincoln.ac.uk URI, which will persist indefinitely. At this stage, until we’ve got an approved business case for scaling it up and out to all academics, we’ll be limiting uploads on a case-by-case basis. You can view and request what other features we develop for Orbital on UserVoice, or in more detail on our project tracker. We’ve also written a basic development roadmap.

For developers, here are the basic technical details. You might also want to trawl through our implementation plan and the collected blog posts at the bottom of the plan.

Orbital is written in PHP using the CodeIgniter development framework.  It’s split into two main pieces of functionality. Orbital Core (database and APIs) is currently hosted on a Linux box on Rackspace’s cloud. Orbital Manager (the User Interface) is likewise hosted on Rackspace. A user signs in to Orbital Manager via OAuth 2.0 using their university credentials. Orbital Manager is using Twitter’s Bootstrap framework. The project metadata is stored in a MySQL database. Files are uploaded to Rackspace’s cloud files storage using Andrew Valums’s AJAX Uploader. APIs are exposed using Phil Sturgeon’s CodeIgniter REST server.

Orbital is licensed under the GNU Affero GPL 3 license and you can download, fork it and create pull requests on Github:

Orbital Core

Orbital Manager

New contributors to Orbital will be ritually applauded each weekday morning :-) Thanks.

Introduction

The Orbital Implementation Plan (WP6) is intended to be a synthesis of our initial user requirements gathering (WP5), an assessment of Engineering research data (WP9), an evaluation of standards and technologies (WP10), informed by a literature review of previous work relevant to the Research Data Management (RDM) domain as it relates the discipline of Engineering (WP4).

Therefore, appended to this Implementation Plan is: i) a Technical Specification based on user requirements; ii) a Literature Review; iii) a summary of an institution-wide survey based on the Data Asset Framework; iv) and a draft Research Data Management Policy for the University of Lincoln (WP7), which is currently under-going internal review.

The Implementation Plan has been written at exactly a third of the way into the Orbital project (six months), allowing for a further year of development based on the work brought together in this document. It is worth repeating the objectives of the project, as stated in the Project Plan:

We intend to build on our previous work around the deposit, management and access to university research as well as further existing work in which we are building a platform for data-driven services at the university.

Throughout this undertaking, we aim to improve our understanding of the issues around research data management; develop the requisite skills among the university community to better manage research data; re-use and develop some of the underlying tools we have built to provide an institution-wide service for the ingest, description, preservation and dissemination of research data; improve the way we work on such projects, refining our use of agile methods; build capacity for the local development of academic technologies at the university; develop and implement appropriate institutional policy for the deposit, management and sharing of research data; and develop a Business Plan for the university for the long-term sustainability of our research data.

Our work to-date has pursued many of these objectives closely, reflecting continued effort over the last six months, both inside and outside the project, to build on previous work by using institutional data to drive application development; to improve our methods of access and identity management; and develop an environment that fosters and supports in-house innovation.

This planning document is primarily intended to support the technical implementation of the Orbital application to manage research data at the University of Lincoln. What it does not address is the training to support the use of the application (WP11), nor the Business Case for sustaining the pilot service (WP13), which we are implementing. However, some preliminary work is underway to consider appropriate business models for sustaining Orbital as open source software and we believe that the technical decisions laid out in this Implementation Plan will support the development of a sustainable Business Case for Orbital. This area of work continues and the outcomes are due to be delivered towards the end of the project.

What follows is a brief summary of the appended Technical Specification and Literature Review. I would like to thank Nick Jackson and Paul Stainthorp for their work on these documents, which have brought clarity to the Orbital project and contributed to a much better understanding of RDM at the University of Lincoln.

Joss Winn, Orbital Project Manager, 2nd April 2012.

Literature Review

The management of research data is recognised as one of the most pressing challenges facing the higher education and research sectors. Research data generated by publicly-funded research is seen as a public good and should be available for verification and re-use. In recognition of this principle, all UK Research Councils require their grant holders to manage and retain their research data for re-use, unless there are specific and valid reasons not to do so. (JISC Managing Research Data Programme 2011-13).

To gain a clearer understanding of the more complex and unfamiliar concepts in the emerging discipline of Research Data Management, the Orbital project conducted a review of published literature on the subject (mainly web sites, project reports and guidance documents), with particular reference to RDM in the discipline of engineering.

An online Research Data Management bibliography is being maintained at: http://lncn.eu/bcf6

The project team identified the following nine themes in the literature – for each theme, a recommendation is made which will support the development of RDM infrastructure at the University of Lincoln.

1. Fundamentals of research data and RDM

Researchers are not a homogeneous group, and their data needs are changing as the research landscape becomes more complex. Recommendation: the Orbital project continue its work to assess the storage and other requirements of Lincoln researchers using surveys and interviews.

2. Particular requirements of the discipline of engineering

The ERIM (Engineering Research Information Management) project at the University of Bath has specified the first ever set of RDM principles and terminology designed specifically for engineers. Recommendation: the Orbital continue to work with the Bath team on implementing ERIM’s findings.

3. The behaviour of researchers

What motivates researchers to invest in RDM is not the same as what motivates their institutions. Recommendation: Orbital to use surveys and interviews to understand researchers’ requirements and develop appropriate advocacy materials.

4. RDM policies and legal aspects

All UK Research Councils are introducing mandates for data curation, and in some cases data publication. Recommendation: the Orbital team to support the University’s response to the imminently required EPSRC data policy roadmap and to help develop institutional policies.

5. Data sharing

Research data are at their most useful when they are interoperable with other data. Sharing data leads to a range of real and measurable benefits, and researchers’ interests are protected by a principle of ‘proprietary period’ of privileged access. Recommendation: Orbital work with Research & Enterprise to formulate clear policies on data sharing and licensing.

6. Costs and benefits

The most significant RDM costs for the institution occur at the data acquisition/ingest stage. Institutions that invest in RDM can expect significant benefits including new, unforeseen research activities made possible through the re-use and aggregation of data. Recommendation: Orbital provide guidance to researchers on ensuring RDM is costed into future research funding bids.

7. Curation standards, metadata and citation

Without a system for assigning citations to research data, further curation and sharing is impossible. Recommendation: Orbital incorporate the functionality of DataCite to allow Lincoln researchers to secure a DOI (Digital Object Identifier) for their data objects.

8. Technical considerations

The range of file formats involved in engineering research is a significant area of complexity. Recommendation: Orbital continue to work with Siemens, the School of Engineering, the University of Bath and the DCC to develop expertise in handling engineering data formats.

9. Tools, support and training

A range of immediately re-purposable RDM training kits and planning tools already exists. Recommendation: Orbital review the available material, and use them to design a RDM training programme for the University of Lincoln – also incorporate Data Management Planning (DMP) tools within the Orbital application.

In light of this, the initial objectives of the Orbital project were on the mark, but indicate a broad area of institutional responsibility that goes beyond scholarly communication to affect strategic areas such as recruitment and training, business intelligence and continuity, IP and income generation, as well as future curriculum design and our corresponding investment in infrastructure and estate. No small task.

Technical Approach

Our Project Plan outlined the technical approach that we originally anticipated and six months later this has not fundamentally changed. As detailed in the Technical Specification, we remain convinced of the benefits of pursuing a data-driven, API-centric model of development, using storage and access control methods that support the creation of a modular and scaleable web application that is attractive to both Users and Developers.

As we have learned from our requirements gathering and literature review, Research practices both within and across subject disciplines are varied, suggesting that over the next 12 months, the Orbital project should concentrate on developing an application that remains open and attractive to further development, rather than seeking to design a single workflow for all users’ needs – an impossible task.

We believe this approach best supports the sustained development of Orbital beyond the life of the pilot project, allowing both Researchers and software Developers to create applications for Orbital to suit the requirements of specific research disciplines at a given point in time. Likewise, an API-centric approach will also ensure that our existing and related applications, such as institutional repository software and research information systems can equally be treated as ‘users’ (producers and consumers) of Orbital.

As we outlined in our Project Plan, this approach allows us to benefit from work which continues outside the Orbital project such as that around Access and Identity Management and academic profiles, and the development of data.lincoln.ac.uk. It is also a suitable approach for the development of Orbital as open source software, which should remain simple to develop for specific user’s needs if it is to receive interest and contributions from developers outside the university.

The Technical Specification contains five core functional requirements: Projects, Workspace, Archives, Working Dataset, and Publication. A Project may result in a specific Publication(s), while the Workspace, Archives and Working Dataset allow for three non-sequential methods of data storage, manipulation and analysis. These requirements are loosely coupled to one another, but do not represent a publication workflow. Orbital is not simply intended to be a data repository, but the basis of a flexible collaborative environment for working Researchers.

Each Project acts as a conceptual container for all data and represents the ‘space’ in which administrative, descriptive and contextual metadata is captured and stored, as well as the datasets themselves. It is at the level of a Project that Orbital will interface with other systems, such as an institutional repository or research information system by storing, exchanging and publishing information according to recognised standards, such as CERIF, SWORD2, DOI, etc.

Finally, a core requirement from Orbital is that data should be stored, accessed and transported securely. Being a native web application, we have opted to implement the OAuth 2 protocol to provide secured access to all API functions over HTTPS. As such, all user applications will be treated equally and will be required to access the core Orbital APIs via this popular and mature standard for application authentication on the web. OAuth is increasingly being deployed at the University of Lincoln and work continues outside of the Orbital project to implement it as part of an institution-wide Single Sign On (SSO) architecture.

Related project blog posts

Chosen Methodology

Jenkins, build my software

Pivoting Around

Project Planning: Quality Assurance

Understanding and participating in open source culture

The Toolchain: First Pass

Tracking progress

Literature Review

An Orbital project reading list

Initial User Requirements

Meeting our users, the Engineers

Assessment of Data Sources

Research Data vs Research Data

Let’s Look At Data

Data, Data Everywhere

Gluing people together

Evaluation of standards and technologies

How the National Archives use MongoDB

Forecast: Cloudy

Piloting the cloud

Why Orbital is all about the API

Servers, Servers Everywhere

Eating your own dog food: Building a repository with API-driven development

Hello? Is it me you’re looking for?

Orbital and the OAIS reference model

This is a proposal for a paper at the Open Repositories 2012 conference in July.

The JISC-funded Orbital project is building on earlier work at the University of Lincoln to develop a state-of-the-art research data management infrastructure, piloted with the first purpose-built School of Engineering in the UK in over 20 years.

Orbital (figure c) differs from traditional database applications in three significant ways:

  1. Orbital Core uses MongoDB, a document-oriented, schema-less, so-called ‘NoSQL’ database. MongoDB offers flexibility in that it is capable of accepting an object representing any kind of data (e.g. tabular data, survey results, images) without the need to develop a schema beforehand. MongoDB also includes useful features which can boost performance and resiliency, namely sharding – slicing data across multiple servers so a request may be processed by multiple servers in parallel – and replication — keeping multiple identical copies of data on different servers in case one of them fails. Orbital is also designed to be able to spread the ‘core’ – the application which does the heavy lifting – and the ‘manager’ – the front-end user interface – across multiple servers without causing stress. In our experience MongoDB, combined with the Sphinx search engine to perform full-text searching, is also extremely fast and allows us to develop simple, attractive APIs which we can expose to user applications.
  2. Orbital Core mediates access to the data via an open source OAuth 2 server we have developed and implemented at Lincoln.  The use of OAuth 2 allows access to the data from multiple authorised systems providing that the owner of the data has given permission, instantly opening the Orbital application to third-party extension. This method establishes the identity, authentication and authorisation of users, providing direct access to individual data sets or portions of data sets (e.g. specific rows/columns) through APIs on Orbital Core.
  3. The design and development of Orbital Core is API-driven, resulting in an application that offers 100% of its functionality through APIs, whether to our own Orbital Manager or a third-party application, each of which are treated equally by Orbital Core (figure c). As far as Orbital Core is concerned there is no functional difference between Orbital Manager (the front-end) and an application that a researcher has developed to meet a specific need; they are subject to the exact same access controls, restrictions, sanity checking and limitations. We have therefore eschewed some of the traditional approaches of building a database application, where access to the database is either provided via a stand-alone application (figure a) or via an API bolted on to the database (figure b). Orbital is also designed to be both stateless, i.e. all of the API functions are RESTful and thus represent a complete transaction with no requirement for session affinity, are not reliant on SQL features like transactions and joins, and have a reduced requirement for referential integrity.

Under this design, the API is the only way to interface with the data and functionality of the system. This API-driven approach offers several benefits:

  • Architecture is better: We are forced to think about data types and methods early on. Consistent behaviour across the application is easier to achieve.
  • Development is easier: Calling a well designed API is simple; error messages become cleanly captured by design; APIs encourage code reuse at both API and application end.
  • Updates become simpler:  We can run two or more API versions concurrently; tweak the API back-end and all front-end applications (‘official’ and 3rd party) benefit at once.
  • The APIs are better: The APIs must include everything we want our application to be able to do. Reliability of the API is now critical which encourages better design of resiliency and error handling; and usability of the API is essential which encourages better documentation.

The challenges of this approach are that every time we want to build user-facing functionality we have to assess our APIs and work out where the functionality belongs as well as ensuring that we have lightweight data transfer and reliable error handling designed into the application. We also have to double up on some areas of development, writing both the respective Core and Manager parts of the system.

Illustrations

Figure a: The only way to interact with this application is to either be a user, or pretend to be one (for example via screen-scraping).
Figure b: The most common form of API, consisting of a ‘second view’ on the data and functionality of an application. This style of API often exposes a limited subset of the application’s functionality.
Figure c: In an API-driven model the API is the only way to interface with the application.

With the Orbital project we’re looking at taking a leap into the brave new world (well, at least as far as university projects go) of cloud hosting. Now, I hate the word “cloud” when used to describe most services because people bander it about like it’s some magical world-fixing technology when in actual fact all they mean is “it’s on the internet”. We have “cloud” services which are fundamentally no different from doing the same thing on a server kept on a desk somewhere; but Orbital isn’t going to be like that.

I hope.

Instead, Orbital will be a true ‘cloud’ service in that it’s a resource which end users can tap into with no care at all for the underlying technologies. It’ll scale up and down with demand, extending both processing power and storage space as needed. Should one of our servers fail for any reason it won’t be met with a week of downtime whilst we rebuild things, but instead a seamless transition of work to one of the redundant, load balanced alternatives. If a process stops working then instead of the entire system crashing down it’ll adapt, queueing tasks until things are restored. Alongside this, the use of common standards is something that is essential to development. RESTful APIs follow well understood principles for interacting with data, and authentication using OAuth (the same authentication method used by Twitter, Facebook, Google and Microsoft) is core to how things behave.

Whilst the Orbital application itself is built to run in a cloudy manner using these loosely coupled methods and Rambo architecture, we’re also going to be hosting the thing in the cloud. This helps us with a few things including the aforementioned scalability, improved resiliency and the ability to properly analyse how the cloud works for higher education.

There’s also an unexpected benefit of this cloudy approach to Orbital: we gain the ability to pin a real-world cost on the storage of research data since we are quite literally being charged by the GB. At the moment researchers tend to treat storage as a one-off cost – for example buying a pile of hard disks – with less understanding of what it actually costs to keep them spinning. Since Orbital will know more about the intricacies of the stored data than the researchers we will be able (for the first time) to offer a number both in terms of how much it is costing to store data and also the estimated carbon impact.

Both these numbers are something that we want to be able to give researchers to help them understand that hanging on to research data has a cost, but also that it’s probably more efficient to hang on to it in a central, cloud-based platform. Of course, we also want to give people a clean exit strategy so we’re also going to be looking at ways of easily creating ‘hard’ copies for offline, non-cloud storage whilst still maintaining a virtual presence for the purposes of referencing and metadata.