Open Data and the Academy: An Evaluation of CKAN for Research Data Management

In August 2012, the Orbital project made the decision to adopt CKAN as part of a technical infrastructure for RDM. In February, due to interest from other universities, we held a workshop in London that looked more closely at ‘CKAN4RDM’. At the end of May, I presented a paper on the use of CKAN for RDM at two conferences: OpenAIRE/LIBER (Ghent, Belgium) and IASSIST (Cologne, Germany).

The paper has been available for comment on Google Docs for over two weeks and has been viewed by over 200 people. I specifically asked the CKAN-DEV, CKAN-DISCUSS and CKAN4RDM mailing lists, as well as conference attendees, to provide feedback on the paper. The conference sessions were very well attended (I guess 400 people must have heard me present the paper across both occasions) and, from conversations afterwards, the paper seems to have generated a decent amount of interest as well as raising awareness of CKAN among data librarians and data archivists.

The paper can now be downloaded in its final form from the Lincoln Repository and my conference presentation slides are below. I hope that my recommendations lead to the further use and development of CKAN for the management of research data. Do join the CKAN4RDM mailing list if you’d like to discuss this further.

http://vimeo.com/68118601

Open Resources and Open Standards

The Orbital project is about a lot more than just developing a cool bit of software. In fact, the majority of the project impact is to do with policy and training rather than development. However, we think there are some good practices in software development which apply equally to the development of documentation around policy and training. Specifically, revision control.

Throughout the day, as we make changes to the source code which makes up Orbital Bridge, we record significant states in the development in our revision control software (specifically Git). We can then rewind the state of the entire codebase to any one of these recorded states, compare the differences between any two of them, and even pick and choose specific changes to move between states on a line-by-line basis. We can create diverging versions to test new features in isolation and merge them together again with no fear of messing up the working version.

Given that we’re planning to release all of our RDM policy and documentation under an Open licence (specifically CC-BY) it made a lot of sense to use a platform for revision control which makes the most of the community and both allows and encourages people to view our stuff, take it, make changes and even propose changes back to us. Enter GitHub, the most popular source code sharing site in the world. GitHub provides us with a ready to go Git hosting platform, as well as a load of really easy to use tools to help us and other people make the most of our resources.

At the University of Lincoln we already use GitHub for Open Source software projects from both the Online Services Team and the LNCD development group, so it made sense to use it for our RDM documentation as well. The definitive copy of our RDM policy and training materials can now be viewed in the state it was at any given point in time, branched, merged and so on. But there’s a problem with making documents the Old Fashioned Way that people in the University may be used to. Namely, using Microsoft Word to store a document causes all kinds of problems for revision management, in that Word doesn’t just keep the text but a whole load of other stuff, which is then compressed down into a single binary blob. Using Word would mean that, although the main features of revision control (versions, branching etc.) would technically work, we’d lose some of the more elegant solutions to problems, such as line-by-line comparisons of versions and merging of different branches.
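To make the point about line-by-line comparison concrete, here is a minimal sketch using Python's standard `difflib` module. It is only an illustration of the idea and not how Git works internally; the two policy revisions, the file names and the `line_diff` helper are all invented for the example.

```python
import difflib

# Two hypothetical revisions of a plain-text policy document (invented text).
old_version = """# RDM Policy

Research data must be stored securely.
Data should be deposited within 12 months.
""".splitlines(keepends=True)

new_version = """# RDM Policy

Research data must be stored securely.
Data should be deposited within 6 months.
Researchers are encouraged to use an open licence.
""".splitlines(keepends=True)


def line_diff(old, new):
    """Return a unified diff of two lists of lines as a single string."""
    return "".join(
        difflib.unified_diff(old, new, fromfile="policy_v1.md", tofile="policy_v2.md")
    )


# Lines starting with "-" were removed, lines starting with "+" were added.
print(line_diff(old_version, new_version))
```

Because every line of a plain text file is visible to the tool, a change to one sentence shows up as a change to one line. A compressed binary file gives a diff tool nothing comparable to work with.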

A better solution was needed for writing documents, and we ended up with a shortlist of three potential plain-text markup standards. These are ways of marking up a plain text document (such as you’d write in Notepad) with semantic structure and styling so that we can take the document and re-render it in a number of different places. Our three contenders were LaTeX, Markdown and reStructuredText. All three have pros and cons, but have the same basic idea behind the scenes – plain text is surrounded with bits of other plain text that give it meaning. All three result in a document that is fundamentally human readable without the need for any proprietary software, and all three allow for the document to be re-rendered in a form appropriate for the audience.
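As a rough illustration of what "plain text surrounded with bits of other plain text" means in practice, the sketch below renders a tiny subset of Markdown (headings, paragraphs and `*emphasis*`) to HTML. It is a toy written for this post, assuming blank lines separate blocks; the function name `render_markdown_subset` is made up, and a real Markdown library handles far more syntax than this.

```python
import html
import re


def render_markdown_subset(text):
    """Render a tiny subset of Markdown (headings, paragraphs, *emphasis*) to HTML."""
    rendered = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        # Join wrapped lines and escape characters that are special in HTML.
        block = html.escape(" ".join(line.strip() for line in block.splitlines()))
        # *text* becomes emphasised text.
        block = re.sub(r"\*(.+?)\*", r"<em>\1</em>", block)
        # One to six leading "#" characters mark a heading of that level.
        heading = re.match(r"(#{1,6})\s+(.*)", block)
        if heading:
            level = len(heading.group(1))
            rendered.append(f"<h{level}>{heading.group(2)}</h{level}>")
        else:
            rendered.append(f"<p>{block}</p>")
    return "\n".join(rendered)


source = "# Policy\n\nData must be *open* by default."
print(render_markdown_subset(source))
```

The source string stays perfectly readable on its own, and the same text could just as easily be passed through a different renderer for print or slides.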

LaTeX is by far the most powerful of the three, having a background in typesetting complex scientific academic papers. It would allow for policy documents to be rendered for both the web and print, but has the downside of being the most complex to use and having a less user-friendly syntax. We want the policy to be as accessible as possible, without readers needing to understand what a set of tags means.

Markdown and reStructuredText both take a much simpler approach, and use almost identical syntax for most things. However, reStructuredText has a bundle of other markup which makes it better suited to long, structured documents with nested lists. reStructuredText would be ideal if we ever decided to convert the University’s Regulations to a plain text format, but for a simple document such as the RDM policy it doesn’t really have any advantage over Markdown.

The tipping point for our decision then lay in the technical implementation of Markdown over reStructuredText. Fortunately this was an easy call, as reStructuredText is very tightly linked into the Python ecosystem whereas Bridge is built entirely in PHP. We could easily drop a PHP library to do Markdown rendering into Bridge, whereas reStructuredText would need additional work to call an external Python library to do the best job of rendering. Should we decide in the future that we need the extra capability of reStructuredText, the migration effort, as far as the document is concerned, would be minimal.

You can view our current draft RDM policy in Markdown in our RDM repository on GitHub, as well as fork it and submit pull requests if you want to use it as a basis for your own or propose changes. We will be moving all our training presentations to use a Markdown based in-browser format in the near future.

Presentations from the JISC MRD Programme Progress Meeting

Below are two short presentations I gave at the JISC programme meeting today. Both concern different aspects and advantages of using CKAN to manage research data. They simply link through to blog posts that have been written here which offer more detailed information. During the presentations, I gave demonstrations of using CKAN in practice.

Eating Your Own Dog Food: Building a repository with API-driven development

We’re in Edinburgh, at Open Repositories 2012, and will be presenting our paper at 9am tomorrow morning (yes, that’s right, the morning after the conference dinner!). Here’s the paper we’ll be discussing.

As part of its project to develop a new research data management system the University of Lincoln is embracing development practices built around APIs – interfaces to the underlying data and functions of the system which are explicitly designed to make life easy for developers by being machine readable and programmatically accessible.
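The idea of "eating your own dog food" can be sketched in a few lines: the front end gets its data only through the same machine-readable interface offered to outside developers. The example below is a hypothetical, in-process illustration in Python and not code from the Lincoln system; the data store, the function names and the response shape are all assumptions made for the sketch.

```python
import json

# Hypothetical in-memory store standing in for the real system's data.
_DATASETS = {"ds-1": {"title": "Example dataset", "licence": "CC-BY"}}


def api_get_dataset(dataset_id):
    """API layer: return a JSON string describing a dataset, or an error."""
    record = _DATASETS.get(dataset_id)
    if record is None:
        return json.dumps({"status": 404, "error": "dataset not found"})
    return json.dumps({"status": 200, "data": {"id": dataset_id, **record}})


def render_dataset_page(dataset_id):
    """Front end: builds its output only from the API response, never from the store."""
    response = json.loads(api_get_dataset(dataset_id))
    if response["status"] != 200:
        return f"Error: {response['error']}"
    data = response["data"]
    return f"{data['title']} (licence: {data['licence']})"


print(render_dataset_page("ds-1"))
print(render_dataset_page("missing"))
```

Because the page has no other route to the data, any gap or awkwardness in the API is felt by the system's own developers first, before an external developer runs into it.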

http://eprints.lincoln.ac.uk/5962/

Eating Your Own Dog Food (presentation slides by Nick Jackson)