Open Data Protocols

For those of you playing on the technical side of research data, did you know that Open Data Protocols has done a load of work on standardised ways of storing data and metadata? The standards at Open Data Protocols (and its sister specification, Open Catalogs) will inform our future work on long-term preservation of data packages (and feature quite a bit in future work on CKAN). If you’re bundling data up at the end of a project, why not take a look at them?

The Importance of Useful Data

During the development of Orbital (specifically the Researcher Dashboard) we’ve been trying (with mixed success) to make it integrate smoothly with various other University systems. Fortunately, a design decision made by some of the LNCD team a couple of years ago means that we’ve got our own institutional data store (codename Nucleus) which we can interact with almost exclusively to get hold of everything we need. Where we’ve been integrating with new systems such as the University’s Awards Management System we’ve taken the approach of hooking the data into Nucleus first, so that it’s available not only to the Researcher Dashboard but also to any other system which needs it.

Nucleus has quite a powerful framework for managing, structuring and presenting data in a rigorously controlled format. It validates data at various points during entry to make sure it isn’t gibberish, and at the point of rendering it’s passed through another set of functions which ensure it’s presented consistently and in as useful a manner as possible. As a result (using Nucleus, our PHP library, the CWD and our OAuth 2 authorisation server) we can go from a standing start to a fully featured, integrated application in a couple of days. A big part of the reason we can do this is that we make extensive use of dogfooding to ensure that our data is useful.
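To make that pattern concrete, here’s a minimal sketch of the validate-on-entry, format-on-render idea for a single date field. The function names (and the choice of field) are purely illustrative and not Nucleus’s actual API.

    <?php
    // Illustrative only: these function names are not part of Nucleus's real API;
    // they just show the validate-then-render pattern described above.

    function validateStartDate(string $input): DateTimeImmutable
    {
        // At the point of entry, reject anything that isn't a genuine ISO 8601 date.
        $date = DateTimeImmutable::createFromFormat('!Y-m-d', $input);
        if ($date === false || $date->format('Y-m-d') !== $input) {
            throw new InvalidArgumentException("Not a valid ISO date: {$input}");
        }
        return $date;
    }

    function renderStartDate(DateTimeImmutable $date): string
    {
        // At the point of rendering, the stored value is formatted one consistent way.
        return $date->format('j M Y'); // e.g. "4 Feb 2013"
    }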

It saddens me, therefore, that during integration with some other applications both inside and outside the University we are forced to tackle data – often purported to be “machine readable” or “ready for reuse” – which has clearly not been looked at through the eyes of somebody who wants to reuse it. As an example, one source of data provides a date range which is stored internally (as far as I can gather) as two distinct values; there is a “start date” and there is an “end date”. These are provided through the UI as structured inputs (a date picker) which ensures they’re entered (and presumably then stored) in an expected format which can be manipulated as necessary. The API then chooses to express this date range not as a distinct “start date” and “end date”, but instead as a single “dates” value.

You may think that this isn’t such a big problem – after all, how difficult can it be to parse 04/02/2013 - 07/03/2014? In that example it’s actually pretty easy once you’ve decided whether you’re dealing with UK or US style dates. The ISO date format solves even that ambiguity, giving us 2013-02-04 - 2014-03-07. Sadly, this isn’t what we get. In fact, here are the four (yes, four) distinct ways that “dates” can be represented:

  • 2013-02-04 to 2013-02-04 becomes 4 Feb 2013
  • 2013-02-04 to 2014-03-07 becomes 4 Feb 2013 - 7 Mar 2014
  • 2013-02-04 to 2013-03-07 becomes 4 Feb - 7 Mar 2013
  • 2013-02-04 to 2013-02-07 becomes 4 Feb 2013 - 7 Feb 2013

So, the rule becomes: if the dates are the same you just show the single date, but if the dates are different then you show two dates, unless they are in the same year in which case you only show the year on the final date, unless they are in the same year and the same month, in which case you show both dates in full. And then you format all of the dates with a locale-specific short form of the month name.
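To make the pain concrete, here’s a rough sketch (in PHP, since that’s what we work in) of what a consumer of this API ends up writing just to get back the start and end dates the provider already had. The function name is ours, not theirs, and it assumes the English month abbreviations shown above.

    <?php
    // Recover the original start and end dates from the "dates" string described above.
    function parseDates(string $dates): array
    {
        $parts = array_map('trim', explode(' - ', $dates));

        if (count($parts) === 1) {
            // "4 Feb 2013" – start and end are the same day
            $start = DateTime::createFromFormat('!j M Y', $parts[0]);
            return [$start, clone $start];
        }

        // The end date is always given in full, e.g. "7 Mar 2014"
        $end = DateTime::createFromFormat('!j M Y', $parts[1]);

        // The start date may or may not include a year ("4 Feb" vs "4 Feb 2013");
        // if parsing it on its own fails, borrow the year from the end date.
        $start = DateTime::createFromFormat('!j M Y', $parts[0])
            ?: DateTime::createFromFormat('!j M Y', $parts[0] . ' ' . $end->format('Y'));

        return [$start, $end];
    }

    // parseDates('4 Feb - 7 Mar 2013') => [2013-02-04, 2013-03-07]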

Parsing this is understandably more difficult than it should be. Please, think about how your data will actually be used when building outputs.

The Power of Open Policy

One of the outcomes from the Orbital project that I’m part of is a set of new policies on the subject of research data management. Early on it was decided that this would – in the spirit of open research – be made available under an open licence along with the rest of our resources on the subject (such as training and support materials).

Being the technically minded folk that we are, we wanted to make sure that several of us could work on documentation at the same time without running the risk of overwriting each other’s changes. We also wanted a comprehensive versioning system in place from the moment the first words hit the keyboard, so that we could see every single change and who made it, something that we think is a big part of making a resource truly open. Finally, we also wanted a mechanism which could allow other people indirectly connected to the project to propose changes. Given our history of using similar systems to manage code there was an obvious choice – the Git source control system.

Git is a system which primarily relies on tracking line-by-line changes, meaning that we’d want to write in a file format which behaves on a line-by-line basis. This made binary formats such as Microsoft Word or even PDF unsuitable, since a small change could result in a huge set of changes spanning hundreds (or more) of lines. We also wanted to use an open standard which didn’t have prohibitive licence restrictions and which was simple enough to be read and understood by anybody with a basic text editor. There are quite a few standards out there which meet these requirements, but again based on past experience we’re using Markdown for our RDM Policy.

Finally, inspired in no small way by the efforts of the Bundestag to convert their entire body of law to Git, we wanted to store policy on a platform which not only allowed community involvement, but positively encouraged it. GitHub is the world’s largest repository of open development, covering every language under the sun and projects ranging from hardcore low-level programming through documentation to communal story writing. Even better, they provide free hosting space for open projects. We already had a University of Lincoln user account kicking around from past work, so it was a logical place to stick our Git repository. If you’re interested you can take a look at what we’ve got.

What’s interesting about using open text-based standards to write policy, Git for managing revisions and GitHub as a storage provider is that we’ve inadvertently made it very easy for people to do things that they couldn’t do before.

Where previously the process of creating policy was a bit mysterious, the entire world can now see not only the published version of the document – and, by digging through the archives, previously published versions – but every single change made in the history of the policy and who made it. Every change from the beginning of the document is trackable on a line-by-line basis, and using Git it’s even possible to see who is responsible for any single line of content. This gives us the immediate benefit of accountability, which is great for finding out exactly who wrote a particular bit of a document if it needs clarification.

Another huge benefit of doing things this way is in the maintenance of different versions of a document. We’re no longer restricted to having one ‘published’ copy; thanks to the ability of Git repositories to create new branches of documents we can do things like create a draft branch, or have a branch purely for spelling fixes. We can tag our ‘definitive’ versions which people should refer to whilst simultaneously working on the next edition or submitting changes to our ‘working’ version. To help illustrate quite how useful this is, I’ve copied our ICT Acceptable Use Policy to GitHub and reformatted it using an open standard (reStructuredText to be specific, more on this later). You can take a look at the tags which show distinct historical versions, or look at the different branches of the policy. The “draft” branch contains some changes I’ve proposed for the next version (visible for the world to see, but not yet in our “master” copy), and you can even see exactly what I’ve changed or do a line-by-line comparison of the latest “master” and “draft” copies.

Where Git and GitHub really come to the fore, however, is in their encouragement of open collaboration. Although there is the official copy of the ICT policies, since it’s an open repository anybody in the world can make their own copy (such as this one I made earlier) and start making changes. The official repository remains untouched by this (no point in letting the whole world change it willy-nilly), but through an awesome feature known as pull requests it’s possible for people to propose changes back to the official copy, and through the awesomeness of GitHub these can be discussed before being either accepted or rejected. Suddenly you’re not just sharing the policy openly, but actively allowing the entire world to suggest changes to it. To illustrate this I’ve made some changes in my own version which I’m proposing. Again, you can see each individual set of changes I’m proposing and make a line-by-line comparison. Pull requests can even be used internally within the same repository, which makes it possible to add mechanisms such as approval and sign-off for a large set of changes (such as might be made when a document is published as a new version).

This all neatly leads on to the subject of publishing – when a document such as a policy is published it’s usually disseminated through a number of channels, most commonly print and digital. The University of Lincoln has a mixed history when it comes to publishing digitally, and more than once I’ve had to go through an unnecessary amount of trouble to read something simply because it was released as a Word 2010 document with a bunch of weird macros rather than a simple PDF. Fortunately, since we’re using standards such as Markdown or reStructuredText (much the same as Markdown, but with more power for more complex documents), publishing becomes really simple. Through freely available tools such as pandoc we can take our source document in its clean, open, text-only format and quickly get it formatted as HTML for presentation on the web, PDF for digital archiving and print, an eBook format for those who want to dig through it on their iPad, and even as a Word document for those who feel an inexplicable need to read things in Word.
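As a rough illustration of how light that publishing step can be (and since the rest of our tooling is PHP), a publish script need be little more than a loop over output files; pandoc works out the target format from each file’s extension. The filenames here are hypothetical.

    <?php
    // Sketch of a publish step: one pandoc call per output format.
    // Assumes pandoc is installed; PDF output also needs a LaTeX engine on the system.
    $source  = 'rdm-policy.md'; // hypothetical filename
    $outputs = [
        'rdm-policy.html', // web presentation
        'rdm-policy.pdf',  // digital archiving and print
        'rdm-policy.epub', // eBook readers
        'rdm-policy.docx', // for those who must have Word
    ];

    foreach ($outputs as $output) {
        passthru(sprintf('pandoc %s -o %s', escapeshellarg($source), escapeshellarg($output)));
    }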

Hopefully the University will start to see more benefit in doing things this way – I’m going to be chatting to ICT to see if they want to consider using this method as their new definitive way of maintaining the AUP, and then hopefully I can bring it up with the people in Registry and Secretariat who are responsible for the University Regulations. In the meantime, please feel free to go wild with our RDM repository.

The thigh bone’s connected to the hip bone…

One thing that has been high up the list of considerations during the development of Orbital has been how it integrates with the rest of the University. Whilst a lot of new services tend to exist in relative isolation, making use of scheduled batch imports to keep themselves in step with things like staff lists, Bridge is designed to tie in to the very core of the University’s data platform. The benefits are numerous enough to make it worth the additional development and network overhead, since we’re able to provide a truly continuous experience between previously disparate systems.

Orbital revolves around research projects as its basic unit of data, something which we already had the capacity to store within our Nucleus data model as part of work on the Staff Directory. Whilst there is an awful lot more that Orbital wants to know about research than is relevant to the Directory, it made no sense to create yet another list of research projects and introduce a second place to keep things updated. Instead we extended Nucleus’s understanding of a research project to include new aspects such as linking multiple researchers to a single project, a more complex model for funding, and so on. What this now means is that both Orbital and the Directory share the same data, and when a staff member adds a new project to their research dashboard in Bridge it will appear seamlessly on their Staff Profile.

Since we’re using Nucleus to provide more and more data, as well as sending data in both directions, we took the opportunity to start building a more robust solution for the sending and receiving. What we came up with was a PHP library built on the Guzzle HTTP client framework. Although this is very early in development (your contributions to the code are welcome) it gives us a controllable, standardised platform which we can use both to request data from Nucleus and to send data back, taking care of issues such as formatting and encoding. Even better, since the library is ready to go with Composer, we (and anybody else interacting with Nucleus from PHP) can include it in a project with a single line of configuration.
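To give a flavour of what the library wraps up (the base URI, endpoint paths and token handling below are placeholders rather than Nucleus’s real API), the underlying Guzzle calls look something like this:

    <?php
    // Illustrative only: the host, endpoints and token handling are placeholders,
    // not the real Nucleus API. Our library wraps this sort of boilerplate up for us.
    use GuzzleHttp\Client;

    require 'vendor/autoload.php'; // e.g. after `composer require guzzlehttp/guzzle`

    $accessToken = getenv('NUCLEUS_TOKEN'); // token from the OAuth 2 authorisation server (placeholder)

    $client = new Client([
        'base_uri' => 'https://nucleus.example.lincoln.ac.uk/', // placeholder host
        'headers'  => ['Authorization' => 'Bearer ' . $accessToken],
    ]);

    // Fetch a research project and decode the JSON body
    $response = $client->get('projects/123');
    $project  = json_decode((string) $response->getBody(), true);

    // Push a change back, letting Guzzle take care of encoding and the content type
    $client->put('projects/123', ['json' => $project]);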

This brings us back full circle to the Staff Directory, which as of the next version will be making use of this library to communicate with Nucleus. As new solutions are put together which rely on the Nucleus platform this library will be extended further until it’s our standard way of getting and updating data, adding a layer of abstraction where we no longer care how the data arrives at the application or how it makes its way back to Nucleus.

The upshot of all this interconnectivity? We can build a brand new application off the back of our research project data very quickly, changing what would have taken weeks or even months into a matter of days.

Open Resources and Open Standards

The Orbital project is about a lot more than just developing a cool bit of software. In fact, the majority of the project impact is to do with policy and training rather than development. However, we think there are some good practices in software development which apply equally to the development of documentation around policy and training. Specifically, revision control.

Throughout the day, as we make changes to the source code which makes up Orbital Bridge, we record significant states in the development using our revision control software (specifically Git). We can then rewind the entire codebase to any one of these states, compare the differences between any two of them, and even pick and choose specific changes to move between states on a line-by-line basis. We can create diverging versions to test new features in isolation and merge them together again with no fear of messing up the working version.

Given that we’re planning to release all of our RDM policy and documentation under an Open licence (specifically CC-BY) it made a lot of sense to use a platform for revision control which makes the most of the community and both allows and encourages people to view our stuff, take it, make changes and even propose changes back to us. Enter GitHub, the most popular source code sharing site in the world. GitHub provides us with a ready-to-go Git hosting platform, as well as a load of really easy-to-use tools to help us and other people make the most of our resources.

At the University of Lincoln we already use GitHub for Open Source software projects from both the Online Services Team and the LNCD development group, so it made sense to use it for our RDM documentation as well. The definitive copy of our RDM policy and training materials can now be viewed in the state it was in at any given point in time, branched, merged and so on – but there’s a problem with making documents the Old Fashioned Way that people in the University may be used to. Namely, using Microsoft Word to store a document causes all kinds of problems for revision management, because Word doesn’t just keep the text but a whole load of other stuff which is then compressed down into a single binary blob. Using Word would mean that, although technically the main features of revision control (versions, branching and so on) would still work, we’d lose some of the more elegant capabilities such as line-by-line comparison of versions and merging of different branches.

A better solution was needed for writing documents, and we ended up with a shortlist of three potential plain-text markup standards. These are ways of marking up a plain text document (such as you’d write in Notepad) with semantic structure and styling so that we can take the document and re-render it in a number of different places. Our three contenders were LaTeX, Markdown and reStructuredText. All three have pros and cons, but have the same basic idea behind the scenes – plain text is surrounded with bits of other plain text that give it meaning. All three result in a document that is fundamentally human readable without the need for any proprietary software, and all three allow for the document to be re-rendered in a form appropriate for the audience.

LaTeX is by far the most powerful of the three, having a background in typesetting complex scientific papers. It would allow for policy documents to be rendered for both the web and print, but has the downside of being the most complex to use and having a less user-friendly syntax. We want the policy to be as accessible as possible, without readers needing to understand what a set of tags means.

Markdown and reStructuredText both take a much simpler approach, and use almost identical syntax for most things. However, reStructuredText has a bundle of extra markup which makes it better suited to long, structured documents with nested lists. reStructuredText would be ideal if we ever decided to convert the University’s Regulations to a plain text format, but for a simple document such as the RDM policy it doesn’t really have any advantage over Markdown.

The tipping point for our decision lay in the technical implementation of Markdown versus reStructuredText. Fortunately this was an easy call, as reStructuredText is very tightly tied into the Python ecosystem whereas Bridge is built entirely in PHP. We could easily drop a PHP library for Markdown rendering into Bridge, whereas reStructuredText would need additional work to call out to an external Python library to do the best job of rendering. Should we decide in the future that we need the extra capability of reStructuredText, the migration effort as far as the documents themselves are concerned would be virtually non-existent.
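As an example of how small that drop-in really is, here’s a sketch using Parsedown, one of several Composer-installable PHP Markdown libraries – we’re not saying it’s the one Bridge will settle on, and the filename is hypothetical.

    <?php
    // Render a Markdown policy document to HTML using Parsedown (shown as an
    // example of a drop-in PHP Markdown library, not necessarily the one Bridge uses).
    require 'vendor/autoload.php'; // e.g. after `composer require erusev/parsedown`

    $markdown = file_get_contents('rdm-policy.md'); // hypothetical filename

    $parser = new Parsedown();
    echo $parser->text($markdown); // HTML, ready to drop into a page template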

You can view our current draft RDM policy in Markdown in our RDM repository on GitHub, as well as fork it and submit pull requests if you want to use it as a basis for your own or propose changes back to us. We will be moving all our training presentations to a Markdown-based in-browser format in the near future.