The thigh bone’s connected to the hip bone…

One thing that has been high on the list of considerations during the development of Orbital is how it integrates with the rest of the University. Whilst a lot of new services tend to exist in relative isolation, making use of scheduled batch imports to keep themselves in step with things like staff lists, Orbital Bridge is designed to tie into the very core of the University’s data platform. The benefits are numerous enough to make the additional development and network overhead worthwhile, since we’re able to provide a truly continuous experience between previously disparate systems.

Orbital revolves around research projects as its basic unit of data, something we already had the capacity to store within our Nucleus data model as part of work on the Staff Directory. Whilst there is an awful lot more that Orbital wants to know about research than is relevant to the Directory, it made no sense to create yet another list of research projects and introduce a second place to keep things updated. Instead we extended Nucleus’s understanding of a research project to include the new aspects, such as linking multiple researchers to a single project and a more complex model for funding. What this means is that Orbital and the Directory share the same data, and when a staff member adds a new project to their research dashboard in Bridge it will appear seamlessly on their Staff Profile.
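
To make that a little more concrete, here is a purely illustrative sketch of the kind of structure the extended project record supports; none of these field names are taken from the real Nucleus schema:

    <?php

    // Purely illustrative: a rough shape for an extended project record,
    // showing multiple researchers and a richer funding model. These field
    // names are invented for this post and are not the real Nucleus schema.
    $project = [
        'title'       => 'An Example Research Project',
        'researchers' => [
            ['staff_id' => 'abc123', 'role' => 'principal_investigator'],
            ['staff_id' => 'def456', 'role' => 'co_investigator'],
        ],
        'funding'     => [
            [
                'funder'          => 'Example Research Council',
                'award_reference' => 'EX/000001/1',
                'value'           => 100000,
            ],
        ],
    ];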

Since we’re using Nucleus to provide more and more data, as well as sending data in both directions, we took the opportunity to start building a more robust solution for the sending and receiving. What we came up with is a PHP library built on the Guzzle HTTP client framework. Although it’s very early in development (your contributions to the code are welcome), it gives us a controllable, standardised platform which we can use both to request data from Nucleus and to send data back, taking care of issues such as formatting and encoding. Even better, since the library is ready to go with Composer, we (and anybody else interacting with Nucleus over PHP) can include it in a project with a single line of configuration.
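
To give a flavour of what such a wrapper looks like, here is a minimal sketch of a Guzzle-backed client. The class name, endpoint paths and token-based authentication are placeholders invented for this post, not the real library’s API:

    <?php

    use GuzzleHttp\Client;

    // Minimal illustrative wrapper around Guzzle. The class name, endpoint
    // paths and authentication scheme are placeholders, not the API of the
    // actual library.
    class NucleusClient
    {
        private $http;

        public function __construct($baseUri, $apiToken)
        {
            $this->http = new Client([
                'base_uri' => $baseUri,
                'headers'  => ['Authorization' => 'Bearer ' . $apiToken],
            ]);
        }

        // Fetch a resource from Nucleus and decode the JSON response.
        public function fetch($path, array $query = [])
        {
            $response = $this->http->get($path, ['query' => $query]);

            return json_decode((string) $response->getBody(), true);
        }

        // Send data back to Nucleus as JSON, leaving the encoding to Guzzle.
        public function push($path, array $data)
        {
            $response = $this->http->post($path, ['json' => $data]);

            return json_decode((string) $response->getBody(), true);
        }
    }

Pulling something like this into another project is then just a matter of adding the package to the require block of its composer.json and letting Composer resolve the rest.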

This brings us back full circle to the Staff Directory, which as of the next version will be making use of this library to communicate with Nucleus. As new solutions are built on the Nucleus platform this library will be extended further, until it’s our standard way of getting and updating data, adding a layer of abstraction so that we no longer care how the data arrives at the application or how it makes its way back to Nucleus.

The upshot of all this interconnectivity? We can build a brand new application off the back of our research project data very quickly, changing what would have taken weeks or even months into a matter of days.

Open Resources and Open Standards

The Orbital project is about a lot more than just developing a cool bit of software. In fact, the majority of the project impact is to do with policy and training rather than development. However, we think there are some good practices in software development which apply equally to the development of documentation around policy and training. Specifically, revision control.

Throughout the day, as we make changes to the source code which makes up Orbital Bridge, we record significant states of the codebase in our revision control software (specifically Git). We can then rewind the entire codebase to any one of these states, compare the differences between any two of them, and even pick and choose specific changes to move between states on a line-by-line basis. We can create diverging versions to test new features in isolation and merge them together again with no fear of messing up the working version.

Given that we’re planning to release all of our RDM policy and documentation under an Open licence (specifically CC-BY), it made a lot of sense to use a platform for revision control which makes the most of the community, and which both allows and encourages people to view our stuff, take it, make changes and even propose changes back to us. Enter GitHub, the most popular source code sharing site in the world. GitHub provides us with a ready-to-go Git hosting platform, as well as a load of really easy-to-use tools to help us and other people make the most of our resources.

At the University of Lincoln we already use GitHub for Open Source software projects from both the Online Services Team and the LNCD development group, so it made sense to use it for our RDM documentation as well. The definitive copy of our RDM policy and training materials can now be viewed in the state it was in at any given point in time, branched, merged and so on. But there’s a problem with making documents the Old Fashioned Way that people in the University may be used to. Namely, storing a document in Microsoft Word causes all kinds of problems for revision management, because Word doesn’t just keep the text but a whole load of other stuff as well, all compressed down into a single binary blob. Using Word would mean that, although the main features of revision control (versions, branching and so on) would technically work, we’d lose some of the more elegant solutions to problems such as line-by-line comparison of versions and merging of different branches.

A better solution was needed for writing documents, and we ended up with a shortlist of three potential plain-text markup standards. These are ways of marking up a plain text document (such as you’d write in Notepad) with semantic structure and styling, so that we can take the document and re-render it in a number of different places. Our three contenders were LaTeX, Markdown and reStructuredText. All three have pros and cons, but share the same basic idea behind the scenes – plain text is surrounded with bits of other plain text that give it meaning. All three result in a document that is fundamentally human-readable without the need for any proprietary software, and all three allow the document to be re-rendered in a form appropriate for the audience.

LaTeX is by far the most powerful of the three, having a background in typesetting complex academic papers. It would allow policy documents to be rendered for both the web and print, but has the downside of being the most complex to use, with the least user-friendly syntax. We want the policy to be as accessible as possible, without readers needing to understand what a set of tags means.

Markdown and reStructuredText both take a much simpler approach, and use almost identical syntax for most things. However, reStructuredText has a bundle of extra markup which makes it better suited to long, structured documents with nested lists. reStructuredText would be ideal if we ever decided to convert the University’s Regulations to a plain text format, but for a simple document such as the RDM policy it doesn’t really have any advantage over Markdown.

The tipping point for our decision lay in the technical implementation of Markdown versus reStructuredText. Fortunately this was an easy call: reStructuredText is very tightly tied to the Python ecosystem, whereas Bridge is built entirely in PHP. We could easily drop a PHP library for Markdown rendering into Bridge, whereas reStructuredText would need additional work to call out to an external Python library to do the best job of rendering. Should we decide in the future that we need the extra capability of reStructuredText, the migration as far as the documents themselves are concerned is virtually non-existent.
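
As a rough illustration of how little glue that involves (Parsedown is used here purely as an example; any PHP Markdown implementation pulled in via Composer would slot into Bridge in much the same way):

    <?php

    require 'vendor/autoload.php';

    // Example only: Parsedown is one of several PHP Markdown implementations.
    $markdown = <<<MD
    # Research Data Management Policy

    A *draft* policy, written in plain text so that it can be:

    - versioned line by line in Git
    - rendered as HTML wherever it is needed
    MD;

    $parser = new Parsedown();
    echo $parser->text($markdown); // outputs the document as HTML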

You can view our current draft RDM policy in Markdown in our RDM repository on GitHub, as well as fork it and submit pull requests if you want to use it as a basis for your own policy or propose changes back to us. We will be moving all our training presentations to a Markdown-based in-browser format in the near future.

Orbital deposit of dataset records to the Lincoln Repository: workflow

Further to yesterday’s blog post about linking our CKAN datastore with our EPrints Repository (to allow researchers to deposit permanent, public, citable records of their datasets), here’s a fleshed-out diagram of the proposed dataset deposit workflow process.

At the moment, this assumes a one-time “fire and forget” deposit. At some point, we’re going to have to tackle versioning.

The original diagram is available on Lucidchart. See the table in my previous blog post for details of which data fields are involved in the process (i.e. passed between CKAN, Orbital Bridge, the DataCite API, and EPrints).

This is a proposal and still has to be road-tested. Comments welcome.

Diagram of the dataset deposit process

Stages in the proposed deposit process (a rough sketch of the Bridge-side steps follows the list):

  1. User enters project metadata in AMS
  2. AMS creates project container in CKAN
  3. User creates dataset record in CKAN
  4. Nucleus adds user metadata to CKAN
  5. User deposits data in CKAN
  6. User presses “DEPOSIT DATASET” button in CKAN
  7. Orbital Bridge requests DOI
  8. DataCite API returns DOI
  9. Orbital Bridge adds DOI to dataset record in CKAN
  10. User reviews and approves dataset metadata (making changes if necessary)
  11. Orbital Bridge writes changes back to dataset record in CKAN
  12. Orbital Bridge creates a new EPrints record via SWORD
  13. EPrints confirms existence of new record
  14. Orbital Bridge writes EPrints record URL back to CKAN dataset record
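
To make the Bridge-side half of that list (steps 7 to 14) a bit more concrete, here is a rough sketch of the orchestration. Every helper function below is a hypothetical placeholder for a call out to the relevant service; none of them are real CKAN, DataCite or SWORD client APIs:

    <?php

    // Rough sketch of steps 7-14 as Orbital Bridge might orchestrate them.
    // Each helper is a hypothetical placeholder for a call out to the
    // relevant service, not a real client API.
    function depositDataset($ckanDatasetId)
    {
        // Steps 7-8: ask DataCite for a DOI for this dataset
        $doi = requestDoiFromDataCite($ckanDatasetId);

        // Step 9: record the DOI against the dataset in CKAN
        updateCkanDataset($ckanDatasetId, ['doi' => $doi]);

        // Steps 10-11: the user reviews and approves the metadata, and any
        // changes are written back to the CKAN record
        $metadata = reviewAndApproveMetadata($ckanDatasetId);
        updateCkanDataset($ckanDatasetId, $metadata);

        // Steps 12-13: create the EPrints record over SWORD and wait for
        // EPrints to confirm that it exists
        $eprintsUrl = createEprintsRecordViaSword($metadata, $doi);

        // Step 14: close the loop by pointing the CKAN record at EPrints
        updateCkanDataset($ckanDatasetId, ['eprints_url' => $eprintsUrl]);
    }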

Orbital: AMS–CKAN–EPrints–DataCite

One important piece of work that we’re undertaking at the moment in Orbital is the facility to deposit a record of a dataset’s existence, from CKAN and the University’s new Awards Management System (AMS), into our EPrints repository via SWORD – at the same time requesting a DOI for the dataset via the DataCite API. The software at the centre of this operation is what we refer to as Orbital Bridge.

Here’s a diagram of how the various systems will need to link together.

Diagram of data flow between systems

The table below shows how fields may be mapped between systems. DataCite properties are taken from the DataCite Metadata Schema (v2.2). This is very much a work in progress! In particular, the red question marks (?) in the “CKAN field” column indicate fields that may not yet exist in the source system (CKAN). The table is in no particular order yet.

The following DataCite properties are optional, and we don’t intend to use them at the moment.

  • 3.1 – TitleType
  • 9 – Language
  • 12 – RelatedIdentifier
  • 12.1 – relatedIdentifierType
  • 12.2 – relationType
  • 13 – Size
  • 15 – Version


The Development Goes On…

It’s been a while since I gave you an update on the technical side of Orbital, so here’s a lightning-fast overview of what’s going on.

CKAN

We’re still working on fine-tuning CKAN for our needs. Although we’ve made advances in theming, the datastore, HTTPS and a few other areas, we’re still plagued by mixed HTTP/HTTPS resources, plugins which are difficult to install, broken sign-in using our OAuth 2 SSO service, broken search and a complete unwillingness of the Recline preview to work. I suspect a lot of this is down to unfamiliarity with the codebase and with Python in general, although some areas of CKAN do feel like they’re a collection of hacks built on top of some more hacks, built on a framework which is built on another framework, which is built on a collection of libraries, which is built on a hack.

In short, CKAN is still in need of a lot of work before our deployment can be considered production ready (hence the “beta” tag). That said, we are already using it to store some research data and the aspects which we’ve managed to get working are working well. We’re going easy though, because CKAN 1.8 and 2.0 are apparently due to land in the next couple of months.

Orbital Bridge

Our awesomely named Orbital Bridge will serve as the central point for all RDM activity around a project, as well as helping people through the process of general project management by being a springboard to our existing policy and training documentation.

Currently Bridge’s public-facing side is in a very basic state, with only static content, but it is serving as a test of our deployment toolchain. Behind the scenes, however, Harry has been working on ways of shuffling data around between systems using abstraction layers for aspects such as datasets, files, people and projects. Today we sat down with Paul and went through the minimal metadata required to construct records to an acceptable standard, which will lead to additional work on both CKAN and our existing EPrints repository to smooth the transfer of data between them.
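
As an illustration of the kind of shape those abstraction layers take (the interface below is invented for this post rather than lifted from Bridge’s actual code), a project layer might look something like this:

    <?php

    // Invented for illustration: a minimal interface for one of Bridge's
    // abstraction layers, so that the application doesn't care whether a
    // project ultimately lives in Nucleus, CKAN or the AMS.
    interface ProjectStore
    {
        // Fetch a single project, however the backing system stores it.
        public function find($projectId);

        // List the projects a given member of staff is attached to.
        public function forResearcher($staffId);

        // Persist changes back to the backing system.
        public function save($projectId, array $data);
    }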

AMS

The University’s new Awards Management System is designed to help researchers plan their funded research, walking them through the process of building their bid. The system itself has begun its roll-out across the University, and as soon as we’re given access to the APIs we’ll be integrating the AMS with Orbital Bridge, allowing seamless creation of a research project based on the data in the AMS.

This work also helps to inform stuff we’re doing in Bridge around abstracting the notion of a ‘project’ between all our different systems.

Kumo

Our ongoing OpenStack project, which will provide the technical infrastructure underneath all of this, is slowly moving closer to a state we can begin to develop on. Tied in with this effort is our continued work on automating our provisioning, configuration, deployment, maintenance, monitoring and scaling.