The Power of Open Policy

One of the outcomes from the Orbital project that I’m part of is a set of new policies on the subject of research data management. Early on it was decided that this would – in the spirit of open research – be made available under an open licence along with the rest of our resources on the subject (such as training and support materials).

Being the technically minded folk that we are, we wanted to make sure that several of us could work on documentation at the same time without running the risk of overwriting each other’s changes. We also wanted a comprehensive versioning system in place from the moment we typed the first words, so that we could see every single change and who made it – something we think is a big part of making a resource truly open. Finally, we wanted a mechanism which could allow other people indirectly connected to the project to propose changes. Given our history of using similar systems to manage code there was an obvious choice – the Git source control system.

Git is a system which primarily relies on tracking line-by-line changes, meaning that we’d want to write in a file format which behaves on a line-by-line basis. This made binary formats such as Microsoft Word or even PDF unsuitable, since a small change could result in a huge set of changes spanning hundreds (or more) of lines. We also wanted to use an open standard which didn’t have prohibitive licence restrictions and which was simple enough to be read and understood by anybody with a basic text editor. There are quite a few standards out there which meet these requirements, but again based on past experience we’re using Markdown for our RDM Policy.
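To see why this matters, here’s what Git shows when a single word changes in a Markdown file – one tidy line-level change (the policy line below is made up for illustration):

```diff
--- a/policy.md
+++ b/policy.md
@@ -12 +12 @@
-Researchers must deposit data within six months of collection.
+Researchers should deposit data within six months of collection.
```

The same one-word edit inside a Word document would change the binary file in ways Git can’t meaningfully display.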

Finally, inspired in no small way by the efforts of the Bundestag to convert their entire body of law to Git, we wanted to store policy on a platform which not just allowed community involvement, but which positively encouraged it. GitHub is the world’s largest repository of open development, covering every language under the sun and projects ranging from hardcore low-level programming through documentation to communal story writing. Even better, they provide free hosting space for open projects. We already had a University of Lincoln user kicking around from past work, so it was a logical place to stick our Git repository. If you’re interested you can take a look at what we’ve got.

What’s interesting about using open text-based standards to write policy, Git for managing revisions and GitHub as a storage provider is that we’ve inadvertently made it very easy for people to do things that they couldn’t do before.

Where previously the process of creating policy was a bit mysterious, the entire world can now see not only the published version of the document – and by digging through the archives previously published versions – but every single change made in the history of the policy and who made it. Every change from the beginning of the document is trackable on a line-by-line basis, and using Git it’s even possible to see who is responsible for any single line of content. This gives us the immediate benefit of accountability, something which is great for finding out exactly who wrote a bit of a document if it needs clarification.

Another huge benefit of doing things this way is in the maintenance of different versions of a document. We’re no longer restricted to having one ‘published’ copy; because Git repositories can create new branches of documents we can do things like create a draft branch, or have a branch purely for spelling fixes. We can tag our ‘definitive’ versions which people should refer to whilst simultaneously working on the next edition or submitting changes to our ‘working’ version. To help illustrate quite how useful this is, I’ve copied our ICT Acceptable Use Policy to GitHub and reformatted it using an open standard (reStructuredText to be specific, more on this later). You can take a look at the tags which show distinct historical versions, or look at the different branches of the policy. The “draft” branch contains some changes I’ve proposed for the next version (visible for the world to see, but not yet in our “master” copy), and you can even see exactly what I’ve changed or do a line by line comparison of the latest “master” and “draft” copies.
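The tagging-and-branching workflow above can be sketched with plain Git commands (the repository and file names here are made up for illustration, not the actual policy repository):

```shell
# Create a toy repository for a policy document
git init policy-demo && cd policy-demo
git config user.name "Policy Author"
git config user.email "author@example.com"

# Commit a first published edition and tag it as the definitive version
echo "# Acceptable Use Policy" > policy.md
git add policy.md
git commit -m "First published edition"
git tag -a v1.0 -m "Definitive version 1.0"

# Propose changes on a draft branch, leaving the published copy untouched
git checkout -b draft
echo "A proposed new clause." >> policy.md
git commit -am "Propose a new clause for the next edition"

# Compare the published tag and the draft line by line
git diff v1.0 draft -- policy.md
```

Readers wanting the definitive text check out the `v1.0` tag; everyone else can watch (or join) the work in progress on `draft`.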

Where Git and GitHub really come to the fore however is their encouragement of open collaboration. Although there is the official copy of the ICT policies, since it’s an open repository anybody in the world can make their own copy (such as this one I made earlier) and start making changes. The official repository remains untouched by this (no point in letting the whole world change it willy-nilly), but through an awesome feature of Git known as pull requests it’s possible for people to propose changes back to the official copy, and through the awesomeness of GitHub these can be discussed before being either accepted or rejected. Suddenly you’re not just sharing the policy openly, but actively allowing the entire world to suggest changes to it. To illustrate this I’ve made some changes in my own version which I’m proposing. Again, you can see each individual set of changes I’m proposing and make a line-by-line comparison. Pull requests can even be used internally within the same repository, adding mechanisms such as approval and sign-off of a large set of changes (such as might be made when a document is published as a new version).

This all neatly leads on to the subject of publishing – when a document such as a policy is published it’s usually disseminated through a number of channels, most commonly print and digital. The University of Lincoln has a mixed history when it comes to publishing digitally, and more than once I’ve had to go through unnecessary amounts of trouble to read something simply because it’s been released as a Word 2010 document with a bunch of weird macros rather than a simple PDF. Fortunately, since we’re using standards such as Markdown or reStructuredText (much the same as Markdown, but with more power for more complex documents), publishing becomes really simple. Through freely available tools such as pandoc we can take our source document in our clean, open, text-only format and quickly get it formatted as HTML for presentation on the web, PDF for digital archiving and print, an eBook format for those who want to dig through it on their iPad and even as a Word document for those who feel an inexplicable need to read things in Word.

Hopefully the University will start to see more benefit in doing things this way – I’m going to be chatting to ICT to see if they want to consider using this method as their new definitive way of maintaining the AUP, and then hopefully I can bring it up with the people in Registry and Secretariat who are responsible for the University Regulations. In the meantime, please feel free to go wild with our RDM repository.

Orbital project team meeting: notes

Here are the notes of the most recent Orbital project team meeting (31 January 2013).

Present: Nick Jackson, Harry Newton, Paul Stainthorp, Joss Winn.

The project team discussed the following development tasks. The aim is for the following to be completed by the end of February 2013:

  • Demonstrable AMS-CKAN-EPrints workflow in Orbital Bridge (a minimal but operational RDM infrastructure);
  • Researcher dashboard to include projects and project metadata;
  • Users able to display and create datasets in CKAN from within Orbital Bridge (N.B. need to check changes to CKAN APIs between versions);
  • Demonstrator using the DataCite test API (until a budget is agreed for use of the live DataCite service);
  • Ability to publish dataset metadata to EPrints Repository, with a complete ‘publish’ UI in Orbital Bridge (to be tested on the University’s upgraded EPrints 3.3 Repository in March) – questions over versioning/locking of deposited metadata to be resolved;
  • Researcher dashboard to include analytics from EPrints, CKAN, AMS, and bibliometric/citation services – add links to external profiles (Scopus, WoS, ORCID, Google Scholar) in the first instance. ACTION: JW to contact Planning to discuss reporting from the researcher dashboard (also data.lincoln.ac.uk; bibliometrics).

JW presented the Orbital business case to the University Senior Management Team on 14th January 2013. JW to work with the Dean of Research (Lisa Mooney) / Deputy Vice-Chancellor (Ieuan Owen) to discuss ongoing resourcing for RDM.

ICT are undertaking a major cloud scoping study, including RDM storage requirements.

The draft RDM policy is to be presented to the Research & Enterprise committee in April.

NJ, HN and PS are working on the display of RDM training and documentation in Orbital Bridge, with versioned text stored as Markdown in GitHub. Pages in Orbital can be linked to GitHub.

The next RDM training for postgraduate students will take place on 6th March 2013. ACTION: PS to embed a calendar feed of training events on the Orbital website.

CKAN trending

Last summer, we adopted CKAN as our data store/repository/catalogue. At that time, I noted that much had happened in the CKAN project in the few months since the start of the Orbital project in November 2011 that made CKAN a more attractive proposition for managing research data.

Recently, someone on the CKAN mailing list pointed to the graph below, which shows that interest in CKAN has exploded. In November 2011, interest in CKAN stood at just a quarter of its current peak – which in turn is double the level of September 2012, when we made the switch to CKAN. Following the European Commission and the UK government, the recent decision by the US government to adopt CKAN for the next version of data.gov will only drive interest in CKAN, and its development, even further.

It is an exciting time to be observing, and to be part of, this explosion of interest. However, it is worth remembering that the interest in CKAN and data management is still very small compared to interest in other, more generic, content management systems. Publishing structured open data remains a niche interest compared to other open practices on the web, such as blogging. Here’s the graph comparing CKAN to WordPress.

Perhaps a fairer comparison would be that of CKAN with open access repository software, such as ePrints and DSpace.

Of course, the cumulative interest of DSpace and of ePrints over the years is greater than that of CKAN, but right now, there is clearly more interest in CKAN and publishing open data, than there is in open access repository software. The open access movement has matured, while the open data movement is growing rapidly. It will be interesting to follow these trends to measure (in part) the maturity of the open data movement, too.

CKAN and ePrints APIs

Each application that Orbital interfaces with – be it CKAN, ePrints or anything else – is abstracted through a ‘bridge_application’ library. Orbital is built predominantly in PHP. Using CKAN as an example, we have a Ckan.php file in the ‘bridge_applications’ folder containing all the functions needed to interface with CKAN. If one of these functions is needed, it is called on the page where its result is used.

When a dataset is read, the function returns an object which can be stored in a variable. It can then be output to the page in Orbital to show what the dataset contains, or passed on to another function.

Example:

$this->load->library('../bridge_applications/ckan');
$datasets = $this->ckan->read_datasets();

$datasets is set to the result of the CKAN function. Exactly what it contains depends on the datasets in CKAN. In this example, it returns:

array(1) {
  [0]=>
  object(Dataset_Object)#362 (6) {
    ["_title":protected]=>
    string(11) "********"
    ["_uri_slug":protected]=>
    string(38) "********"
    ["_creators":protected]=>
    array(1) {
      [0]=>
      string(17) "********"
    }
    ["_subjects":protected]=>
    array(0) {
    }
    ["_date":protected]=>
    int(1358507313)
    ["_keywords":protected]=>
    array(3) {
      [0]=>
      object(stdClass)#95 (6) {
        ["vocabulary_id"]=>
        NULL
        ["display_name"]=>
        string(12) "********"
        ["name"]=>
        string(12) "********"
        ["revision_timestamp"]=>
        string(26) "2013-01-18T11:16:59.137985"
        ["state"]=>
        string(6) "active"
        ["id"]=>
        string(36) "********"
      }
    }
  }
}

*Some results are starred out.

As this example only includes one dataset, the result is an array with the dataset as its only element.

This is converted to the standard format used in Orbital. The standard format means that every application Orbital links to has a standard input for data, so any application can theoretically talk to any other application through Orbital.
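To make that conversion step concrete, here is a hypothetical sketch of the kind of mapping involved. The function name and the standard-format keys below are made up for illustration (loosely based on the dump above); the real Orbital code differs in detail:

```php
<?php
// Map a decoded CKAN package (a stdClass, as returned by json_decode on the
// CKAN API's output) into a plain associative array - the "standard format"
// that any bridge application could consume. Illustrative only.
function ckan_to_standard_format($package)
{
    $keywords = array();
    if (isset($package->tags)) {
        // CKAN tags come back as objects; keep only their display names
        foreach ($package->tags as $tag) {
            $keywords[] = $tag->display_name;
        }
    }

    return array(
        'title'    => $package->title,
        'uri_slug' => $package->name,
        'creators' => isset($package->author) ? array($package->author) : array(),
        'date'     => isset($package->metadata_modified) ? $package->metadata_modified : null,
        'keywords' => $keywords,
    );
}

// Example with a stubbed CKAN package rather than a live API call
$package = json_decode('{
    "title": "Example Dataset",
    "name": "example-dataset",
    "author": "A. Researcher",
    "metadata_modified": "2013-01-18T11:16:59",
    "tags": [{"display_name": "climate"}]
}');

print_r(ckan_to_standard_format($package));
```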

The SWORD library, used for SWORD endpoint data entry into ePrints, takes this standard format as input and converts it to the appropriate format before sending it to the ePrints endpoint. The principle is the same as before: it is a PHP library for a bridge application. It takes the data and uses the endpoint to create a record via SWORD.

Example:

$this->sword->create_sword($dataset);

The dataset taken from CKAN is fed into the SWORD library and sent to ePrints to create a new EPrint from the dataset. This is done by using SimpleXML to build a SWORD-compliant XML object that can be sent via an HTTP cURL request to the ePrints SWORD endpoint. The result is a new entry in ePrints, created via SWORD from the data retrieved from CKAN.
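The gist of the XML-building step can be sketched with SimpleXML. Note this is a simplified illustration, not Orbital’s actual deposit code: a real SWORD deposit is a full Atom entry (typically with Dublin Core extensions), and the element set here is cut down:

```php
<?php
// Build a cut-down Atom entry from a standard-format dataset array.
// Element names are simplified for illustration.
$dataset = array(
    'title'    => 'Example Dataset',
    'creators' => array('A. Researcher'),
    'keywords' => array('climate', 'rainfall'),
);

$entry = new SimpleXMLElement('<entry xmlns="http://www.w3.org/2005/Atom"/>');
$entry->addChild('title', $dataset['title']);

foreach ($dataset['creators'] as $creator) {
    $author = $entry->addChild('author');
    $author->addChild('name', $creator);
}

foreach ($dataset['keywords'] as $keyword) {
    $category = $entry->addChild('category');
    $category->addAttribute('term', $keyword);
}

$xml = $entry->asXML();
// $xml would then be POSTed to the SWORD endpoint via cURL, e.g.
// curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
echo $xml;
```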

The code is hosted on GitHub and can be found here:

https://github.com/lncd/Orbital-Bridge/tree/develop/src/application/bridge_applications

The thigh bone’s connected to the hip bone…

One thing that has been high up the list of considerations during the development of Orbital has been how it integrates with the rest of the University. Whilst a lot of new services tend to exist in relative isolation, making use of scheduled batch imports to keep themselves in step with things like staff lists, Bridge is designed to tie in to the very core of the University’s data platform. The benefits are numerous enough to make it worth the additional development and network overhead, since we’re able to provide a truly continuous experience between previously disparate systems.

Orbital revolves around research projects as its basic unit of data, something which we already had the capacity to store within our Nucleus data model as part of work on the Staff Directory. Whilst there is an awful lot more that Orbital wants to know about research than is relevant to the Directory, it made no sense to create yet another list of research projects and introduce a second place to keep things updated. Instead we extended Nucleus’s understanding of a research project to include the new aspects, such as linking multiple researchers to a single project, a more complex model for funding and so on. What this now means is that both Orbital and the Directory share the same data, and when a staff member adds a new project to their research dashboard in Bridge it will appear seamlessly on their Staff Profile.

Since we’re using Nucleus to provide more and more data, as well as sending data in both directions, we took the opportunity to start building a more robust solution for the sending and receiving. What we came up with was a PHP library built on the Guzzle HTTP client framework. Although this is very early in development (your contributions to the code are welcome) it gives us a controllable, standardised platform which we can use to both request data from Nucleus and send data back, taking care of issues such as formatting and encoding. Even better, since the library is ready to go with Composer, we (and anybody else interacting with Nucleus over PHP) can include it in a project with a single line of configuration.
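By way of illustration, pulling a Composer-ready library into a project amounts to one requirement line in the project’s composer.json – the package name and version constraint below are placeholders, not necessarily the library’s real Composer name:

```json
{
    "require": {
        "lncd/nucleus-client": "dev-master"
    }
}
```

Running `composer install` then fetches the library and its dependencies (including Guzzle) automatically.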

This brings us back full circle to the Staff Directory, which as of the next version will be making use of this library to communicate with Nucleus. As new solutions are put together which rely on the Nucleus platform this library will be extended further until it’s our standard way of getting and updating data, adding a layer of abstraction where we no longer care how the data arrives at the application or how it makes its way back to Nucleus.

The upshot of all this interconnectivity? We can build a brand new application off the back of our research project data very quickly, changing what would have taken weeks or even months into a matter of days.