Research Data Management Policy approved

I am very pleased to announce that our Research Data Management policy, which was one of the main objectives of the Orbital project, has been approved by the university’s Research Committee. The process of drafting the policy began in April 2012 as a collaborative effort between Orbital team members from the Centre for Educational Research and Development, The Library and the Research and Enterprise Office. Comments were then solicited from the Director of ICT, the Director of Research and Enterprise and the University Librarian. The draft was then presented to the Research Committee, which requested that the policy be discussed with the Senior Management Team due to its resourcing implications. This meeting took place in October 2012 and as a result, SMT requested that the College Research Directors were consulted on the Policy and agreed that a Business Case for Research Data Management (to effectively ‘underwrite’ the policy) should be put together. The Business Case was presented to SMT and accepted in January 2013. Following discussion with the Research Directors and further re-drafting, the policy was approved today.

This completes the formal objectives of the Orbital project and places us in a position where we have a Business Case, Policy and a new ‘Research Information Services’ team that is being formed to meet the expectations and aspirations of our researcher community and our funders.

The Power of Open Policy

One of the outcomes from the Orbital project that I’m part of is a set of new policies on the subject of research data management. Early on it was decided that this would – in the spirit of open research – be made available under an open licence along with the rest of our resources on the subject (such as training and support materials).

Being the technically minded folk that we are, we wanted to make sure that several of us could work on documentation at the same time without running the risk of overwriting each other’s changes. We also wanted a comprehensive versioning system to be in place from the moment we typed the first words, so that we could see every single change and who made it, something that we think is a big part of making a resource truly open. Finally, we also wanted a mechanism which could allow other people indirectly connected to the project to propose changes. Given our history of using similar systems to manage code there was an obvious choice – the Git source control system.

Git is a system which primarily relies on tracking line-by-line changes, meaning that when we wrote stuff we’d want to use a file format which behaved on a line-by-line basis. This made compiled binary formats such as Microsoft Word or even PDF a bit unsuitable, since a small change could result in a huge set of changes spanning hundreds (or more) of lines. We also wanted to use an open standard which didn’t have prohibitive licence restrictions and which was simple enough to be read and understood by anybody with a basic text editor. There are quite a few standards out there which meet this requirement, but again based on past experience we’re using Markdown for our RDM Policy.
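To illustrate why a line-oriented plain-text format suits this kind of tracking, here is a minimal sketch using Python’s standard difflib module rather than Git itself. The two policy snippets and the file labels are made up for the example; the point is only that a one-clause edit shows up as a single changed line.

```python
import difflib

# Two hypothetical revisions of a Markdown policy file, held as lists of lines.
old_policy = """# Research Data Management Policy

1. Research data is a key asset.
2. This policy applies to all academic disciplines.
""".splitlines(keepends=True)

new_policy = """# Research Data Management Policy

1. Research data is a key asset.
2. This policy applies to all academic disciplines and all forms of data.
""".splitlines(keepends=True)


def line_diff(old, new):
    """Return a unified, line-by-line diff of two revisions as one string."""
    return "".join(
        difflib.unified_diff(
            old, new, fromfile="policy.md (old)", tofile="policy.md (new)"
        )
    )


if __name__ == "__main__":
    print(line_diff(old_policy, new_policy))
```

Only the second numbered item appears as removed and re-added; the heading and the first item are shown as unchanged context. A binary format gives a line-based tool nothing comparable to work with.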

Finally, inspired in no small way by the efforts of the Bundestag to convert their entire body of law to Git, we wanted to store policy on a platform which not only allowed community involvement, but which positively encouraged it. GitHub is the world’s largest repository of open development, covering every language under the sun and projects ranging from hardcore low level programming through writing documentation to communal story writing. Even better, they provide free hosting space for open projects. We already had a University of Lincoln user kicking around from past work, so it was a logical place to stick our Git repository. If you’re interested you can take a look at what we’ve got.

What’s interesting about using open text-based standards to write policy, Git for managing revisions and GitHub as a storage provider is that we’ve inadvertently made it very easy for people to do things that they couldn’t do before.

Where previously the process of creating policy was a bit mysterious, the entire world can now see not only the published version of the document (and, by digging through the archives, previously published versions) but every single change made in the history of the policy and who made it. Every change from the beginning of the document is trackable on a line-by-line basis, and using Git it’s even possible to see who is responsible for any single line of content. This gives us the immediate benefit of accountability, something which is great for finding out exactly who wrote a bit of a document if it needs clarification.

Another huge benefit of doing things this way is in the maintenance of different versions of a document. We’re no longer restricted to having one ‘published’ copy: because Git repositories can create new branches of documents, we can do things like create a draft branch, or have a branch purely with spelling fixes. We can tag our ‘definitive’ versions which people should refer to whilst simultaneously working on the next edition or submitting changes to our ‘working’ version. To help illustrate quite how useful this is, I’ve copied our ICT Acceptable Use Policy to GitHub and reformatted it using an open standard (reStructuredText to be specific, more on this later). You can take a look at the tags which show distinct historical versions, or look at the different branches of the policy. The “draft” branch contains some changes I’ve proposed for the next version (visible for the world to see, but not yet in our “master” copy), and you can even see exactly what I’ve changed or do a line by line comparison of the latest “master” and “draft” copies.

Where Git and GitHub really come to the fore however is their encouragement of open collaboration. Although there is the official copy of the ICT policies, since it’s an open repository anybody in the world can make their own copy (such as this one I made earlier) and start making changes. The official repository remains untouched by this (no point in letting the whole world change it willy-nilly), but through an awesome feature of GitHub known as pull requests it’s possible for people to propose changes back to the official copy, and these can be discussed on GitHub before being either accepted or rejected. Suddenly you’re not just sharing the policy openly, but actively allowing the entire world to suggest changes to it. To illustrate this I’ve made some changes in my own version which I’m proposing. Again, you can see each individual set of changes I’m proposing and make a line-by-line comparison. Pull requests can even be used internally within the same repository, which can be used to add mechanisms such as approval and signing off of a large set of changes (such as might be made when a document is published as a new version).

This all neatly leads on to the subject of publishing – when a document such as a policy is published it’s usually disseminated through a number of channels, most commonly print and digital. The University of Lincoln has a mixed history when it comes to publishing digitally, and more than once I’ve had to go through unnecessary amounts of trouble to read something simply because it’s been unnecessarily released as a Word 2010 document with a bunch of weird macros rather than a simple PDF. Fortunately, since we’re using standards such as Markdown or reStructuredText (much the same as Markdown, but with more power for more complex documents), publishing becomes really simple. Through freely available tools such as pandoc we can take our source document in our clean, open, text-only format and quickly get it formatted as HTML for presentation on the web, PDF for digital archiving and print, an eBook format for those who want to dig through it on their iPad and even as a Word document for those who feel an inexplicable need to read things in Word.
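As a rough sketch of what that publishing step could look like, the Python below builds one pandoc command per output format and runs them only if pandoc is installed. The file name rdm-policy.md and the TARGETS mapping are assumptions made for the example, not our actual build setup; the only pandoc feature relied on is the standard form `pandoc INPUT -o OUTPUT`, where the output format is inferred from the file extension.

```python
import shutil
import subprocess

# Output channels mentioned above, mapped to file extensions pandoc recognises.
# (Hypothetical mapping for illustration.)
TARGETS = {"web": "html", "print": "pdf", "ebook": "epub", "word": "docx"}


def pandoc_command(source, extension):
    """Build the pandoc invocation that converts `source` to `extension`."""
    stem = source.rsplit(".", 1)[0]
    return ["pandoc", source, "-o", f"{stem}.{extension}"]


def publish(source):
    """Run pandoc once per target and return the commands involved.

    If pandoc is not on the PATH the commands are returned without being run.
    Note that PDF output also needs a PDF engine (such as a LaTeX
    installation) to be available to pandoc.
    """
    commands = [pandoc_command(source, ext) for ext in TARGETS.values()]
    if shutil.which("pandoc") is None:
        return commands
    for command in commands:
        subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    for command in publish("rdm-policy.md"):
        print(" ".join(command))
```

The same plain-text source feeds every output, so there is only ever one copy of the policy to maintain.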

Hopefully the University will start to see more benefit in doing things this way – I’m going to be chatting to ICT to see if they want to consider using this method as their new definitive way of maintaining the AUP, and then hopefully I can bring it up with the people in Registry and Secretariat who are responsible for the University Regulations. In the meantime, please feel free to go wild with our RDM repository.

Open Resources and Open Standards

The Orbital project is about a lot more than just developing a cool bit of software. In fact, the majority of the project impact is to do with policy and training rather than development. However, we think there are some good practices in software development which apply equally to the development of documentation around policy and training. Specifically, revision control.

Throughout the day as we make changes to the source code which makes up Orbital Bridge we record significant states in the development against our revision control software (specifically Git). We can then rewind the state of the entire codebase to any one of these conditions, compare differences between the two, and even pick and choose specific changes to move between states on a line-by-line basis. We can create diverging versions to test new features in isolation and merge them together again with no fear of messing up the working version.

Given that we’re planning to release all of our RDM policy and documentation under an Open licence (specifically CC-BY) it made a lot of sense to use a platform for revision control which makes the most of the community and both allows and encourages people to view our stuff, take it, make changes and even propose changes back to us. Enter GitHub, the most popular source code sharing site in the world. GitHub provides us with a ready to go Git hosting platform, as well as a load of really easy to use tools to help us and other people make the most of our resources.

At the University of Lincoln we already use GitHub for Open Source software projects from both the Online Services Team and the LNCD development group, so it made sense to use it for our RDM documentation as well. The definitive copy of our RDM policy and training materials can now be viewed in the state it was at any given point in time, branched, merged and so on. But there’s a problem with making documents the Old Fashioned Way that people in the University may be used to. Namely, using Microsoft Word to store a document will cause all kinds of problems for revision management in that Word doesn’t just keep the text, but a whole load of other stuff which is then compressed down into a single binary blob. Using Word would mean that although technically the main features of revision control (versions, branching etc) would work, we’d lose some of the more elegant solutions to problems such as line-by-line comparisons of versions and merging of different branches.

A better solution was needed for writing documents, and we ended up with a shortlist of three potential plain-text markup standards. These are ways of marking up a plain text document (such as you’d write in Notepad) with semantic structure and styling so that we can take the document and re-render it in a number of different places. Our three contenders were LaTeX, Markdown and reStructuredText. All three have pros and cons, but have the same basic idea behind the scenes – plain text is surrounded with bits of other plain text that give it meaning. All three result in a document that is fundamentally human readable without the need for any proprietary software, and all three allow for the document to be re-rendered in a form appropriate for the audience.
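To make the “plain text surrounded with bits of other plain text” idea concrete, here is a deliberately tiny Python sketch that re-renders two Markdown-style constructs (`#` headings and `**strong**` text) as HTML. It is a toy written for this illustration, not a real Markdown implementation and not the renderer used in Bridge.

```python
import html
import re


def render(text):
    """Render a tiny subset of Markdown-style markup to HTML (toy example)."""
    out = []
    for line in text.splitlines():
        # Escape characters that are special in HTML before adding our own tags.
        line = html.escape(line, quote=False)
        # **strong** text: the asterisks are plain text that carries meaning.
        line = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", line)
        # One to six leading '#' characters mark a heading of that level.
        heading = re.match(r"(#{1,6}) (.*)", line)
        if heading:
            level = len(heading.group(1))
            out.append(f"<h{level}>{heading.group(2)}</h{level}>")
        elif line.strip():
            out.append(f"<p>{line}</p>")
    return "\n".join(out)


if __name__ == "__main__":
    print(render("# RDM Policy\n\nResearch data is a **key asset**."))
```

The source stays readable in Notepad, while the same text can be re-rendered for whichever audience needs it.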

LaTeX is by far the most powerful of the three, having a background in typesetting complex scientific academic papers. It would allow for policy documents to be rendered for both the web and print, but has the downside of being the most complex to use and having a less user-friendly syntax. We want the policy to be as accessible as possible, without needing to understand what a set of tags means.

Markdown and reStructuredText both take a much simpler approach, and use almost identical syntax for most things. However, reStructuredText has a bundle of other markup which makes it better suited to long, structured documents with nested lists. reStructuredText would be ideal if we ever decided to convert the University’s Regulations to a plain text format, but for a simple document such as the RDM policy it doesn’t really have any advantage over Markdown.

The tipping point for our decision then lay in the technical implementation of Markdown over reStructuredText. Fortunately this was an easy call, as reStructuredText is very tightly linked into the Python ecosystem whereas Bridge is built entirely in PHP. We could easily drop a PHP library to do Markdown rendering into Bridge, whereas reStructuredText would need additional work to call an external Python library to do the best job of rendering. Should we decide in the future that we need the extra capability of reStructuredText then the migration as far as the document is concerned is virtually non-existent.

You can view our current draft RDM policy in Markdown in our RDM repository on GitHub, as well as fork it and submit pull requests if you want to use it as a basis for your own or propose changes. We will be moving all our training presentations to use a Markdown based in-browser format in the near future.

EPSRC Research Data Management Roadmap

As part of their Policy Framework on Research Data, the EPSRC have requested that all institutions in receipt of their funding develop a clear roadmap for research data management, which should be implemented by May 1st 2015. At the University of Lincoln, the Orbital project is a vehicle for this work and we expect much of the infrastructure (technical, policy and training) to be identified and piloted by April 2013.

The following is a draft policy statement which indicates our intent. We have also examined the EPSRC’s expectations and considered them in light of guidance offered by the Digital Curation Centre. This document should be understood in the context of the Orbital Project Plan, which sets out our aims and objectives until April 2013 and beyond.

The University of Lincoln Research Data Management Policy (draft)

  1. The University of Lincoln recognises that the curation and sharing of research data is key to its mission to create knowledge. Research data is a key asset. Its correct management brings benefits to the university, its members and the public through greater opportunities for access and re-use.
  2. This policy sets out the university’s expectations for the management of research data across all academic disciplines and in all forms.
  3. The development of this policy aims to satisfy the requirements of researchers, funding and statutory bodies and commercial partners, that research data will be managed to the highest standards throughout the research data lifecycle. The University recognises and supports the UK Research Councils’ mandates for data curation and sharing.
  4. Principal Investigators (PI) are required to consider data creation, management or sharing in the development of their research proposals and grant applications. All new research proposals must include research data management plans that explicitly address data capture, management, integrity, confidentiality, retention, sharing and publication.
  5. Staff can expect to receive training, guidance and support from Research & Enterprise Development and the Library for the development of research data management plans and their implementation.
  6. In accordance with the Research Councils’ timeframes for preserving access to research outputs, Principal Investigators should record the existence of research data upon creation and deposit it according to their plan (often within six months of publication of research findings). The University will preserve access for as long as specified by the data management plan (which can be up to ten years after it was last accessed). Any data which is retained elsewhere, for example in an international data service or domain repository, should be registered with the University.
  7. Research metadata will be published for permanent citation alongside conventional outputs e.g. journal articles and conference papers. Open access to research data will be granted under appropriate safeguards according to conditions and timeframes specified by researchers, commercial partners and funding bodies. The University will support this through the provision of a data repository. Conditions for the licensing of research data must be made in consultation with Research & Enterprise Development.
  8. The Library and ICT Services will provide the infrastructure and expertise for long-term curation, preservation and access to research data. This will include mechanisms and services for storage, backup, registration, deposit and retention of research data assets in support of current and future access, during and after completion of research projects.
  9. Costs to meet the specific requirements of data management plans should be included in grant applications, where permitted. The University will develop appropriate plans for meeting the costs of long-term storage, preservation and curation of research data.
  10. This policy will be reviewed jointly by the Research & Enterprise Development and Library on an annual basis. Recommendations for amendment will be submitted to the appropriate management committee. Research & Enterprise Development will monitor compliance with this and other related policies, and the effect of such policies on the operation of the University. This policy should be considered alongside other university policies e.g. research ethics, IP policy, and disciplinary procedures.

The current version of this document is maintained here.

Implementation Plan

Introduction

The Orbital Implementation Plan (WP6) is intended to be a synthesis of our initial user requirements gathering (WP5), an assessment of Engineering research data (WP9), an evaluation of standards and technologies (WP10), informed by a literature review of previous work relevant to the Research Data Management (RDM) domain as it relates to the discipline of Engineering (WP4).

Therefore, appended to this Implementation Plan are: i) a Technical Specification based on user requirements; ii) a Literature Review; iii) a summary of an institution-wide survey based on the Data Asset Framework; and iv) a draft Research Data Management Policy for the University of Lincoln (WP7), which is currently undergoing internal review.

The Implementation Plan has been written at exactly a third of the way into the Orbital project (six months), allowing for a further year of development based on the work brought together in this document. It is worth repeating the objectives of the project, as stated in the Project Plan:

We intend to build on our previous work around the deposit, management and access to university research as well as further existing work in which we are building a platform for data-driven services at the university.

Throughout this undertaking, we aim to improve our understanding of the issues around research data management; develop the requisite skills among the university community to better manage research data; re-use and develop some of the underlying tools we have built to provide an institution-wide service for the ingest, description, preservation and dissemination of research data; improve the way we work on such projects, refining our use of agile methods; build capacity for the local development of academic technologies at the university; develop and implement appropriate institutional policy for the deposit, management and sharing of research data; and develop a Business Plan for the university for the long-term sustainability of our research data.

Our work to-date has pursued many of these objectives closely, reflecting continued effort over the last six months, both inside and outside the project, to build on previous work by using institutional data to drive application development; to improve our methods of access and identity management; and develop an environment that fosters and supports in-house innovation.

This planning document is primarily intended to support the technical implementation of the Orbital application to manage research data at the University of Lincoln. What it does not address is the training to support the use of the application (WP11), nor the Business Case for sustaining the pilot service (WP13), which we are implementing. However, some preliminary work is underway to consider appropriate business models for sustaining Orbital as open source software and we believe that the technical decisions laid out in this Implementation Plan will support the development of a sustainable Business Case for Orbital. This area of work continues and the outcomes are due to be delivered towards the end of the project.

What follows is a brief summary of the appended Technical Specification and Literature Review. I would like to thank Nick Jackson and Paul Stainthorp for their work on these documents, which have brought clarity to the Orbital project and contributed to a much better understanding of RDM at the University of Lincoln.

Joss Winn, Orbital Project Manager, 2nd April 2012.

Literature Review

The management of research data is recognised as one of the most pressing challenges facing the higher education and research sectors. Research data generated by publicly-funded research is seen as a public good and should be available for verification and re-use. In recognition of this principle, all UK Research Councils require their grant holders to manage and retain their research data for re-use, unless there are specific and valid reasons not to do so. (JISC Managing Research Data Programme 2011-13).

To gain a clearer understanding of the more complex and unfamiliar concepts in the emerging discipline of Research Data Management, the Orbital project conducted a review of published literature on the subject (mainly web sites, project reports and guidance documents), with particular reference to RDM in the discipline of engineering.

An online Research Data Management bibliography is being maintained at: http://lncn.eu/bcf6

The project team identified the following nine themes in the literature – for each theme, a recommendation is made which will support the development of RDM infrastructure at the University of Lincoln.

1. Fundamentals of research data and RDM

Researchers are not a homogeneous group, and their data needs are changing as the research landscape becomes more complex. Recommendation: the Orbital project continue its work to assess the storage and other requirements of Lincoln researchers using surveys and interviews.

2. Particular requirements of the discipline of engineering

The ERIM (Engineering Research Information Management) project at the University of Bath has specified the first ever set of RDM principles and terminology designed specifically for engineers. Recommendation: the Orbital project continue to work with the Bath team on implementing ERIM’s findings.

3. The behaviour of researchers

What motivates researchers to invest in RDM is not the same as what motivates their institutions. Recommendation: Orbital to use surveys and interviews to understand researchers’ requirements and develop appropriate advocacy materials.

4. RDM policies and legal aspects

All UK Research Councils are introducing mandates for data curation, and in some cases data publication. Recommendation: the Orbital team to support the University’s response to the imminently required EPSRC data policy roadmap and to help develop institutional policies.

5. Data sharing

Research data are at their most useful when they are interoperable with other data. Sharing data leads to a range of real and measurable benefits, and researchers’ interests are protected by a principle of ‘proprietary period’ of privileged access. Recommendation: Orbital work with Research & Enterprise to formulate clear policies on data sharing and licensing.

6. Costs and benefits

The most significant RDM costs for the institution occur at the data acquisition/ingest stage. Institutions that invest in RDM can expect significant benefits including new, unforeseen research activities made possible through the re-use and aggregation of data. Recommendation: Orbital provide guidance to researchers on ensuring RDM is costed into future research funding bids.

7. Curation standards, metadata and citation

Without a system for assigning citations to research data, further curation and sharing is impossible. Recommendation: Orbital incorporate the functionality of DataCite to allow Lincoln researchers to secure a DOI (Digital Object Identifier) for their data objects.

8. Technical considerations

The range of file formats involved in engineering research is a significant area of complexity. Recommendation: Orbital continue to work with Siemens, the School of Engineering, the University of Bath and the DCC to develop expertise in handling engineering data formats.

9. Tools, support and training

A range of immediately re-purposable RDM training kits and planning tools already exists. Recommendation: Orbital review the available material and use it to design an RDM training programme for the University of Lincoln, and also incorporate Data Management Planning (DMP) tools within the Orbital application.

In light of this, the initial objectives of the Orbital project were on the mark, but indicate a broad area of institutional responsibility that goes beyond scholarly communication to affect strategic areas such as recruitment and training, business intelligence and continuity, IP and income generation, as well as future curriculum design and our corresponding investment in infrastructure and estate. No small task.

Technical Approach

Our Project Plan outlined the technical approach that we originally anticipated and six months later this has not fundamentally changed. As detailed in the Technical Specification, we remain convinced of the benefits of pursuing a data-driven, API-centric model of development, using storage and access control methods that support the creation of a modular and scaleable web application that is attractive to both Users and Developers.

As we have learned from our requirements gathering and literature review, Research practices both within and across subject disciplines are varied, suggesting that over the next 12 months, the Orbital project should concentrate on developing an application that remains open and attractive to further development, rather than seeking to design a single workflow for all users’ needs – an impossible task.

We believe this approach best supports the sustained development of Orbital beyond the life of the pilot project, allowing both Researchers and software Developers to create applications for Orbital to suit the requirements of specific research disciplines at a given point in time. Likewise, an API-centric approach will also ensure that our existing and related applications, such as institutional repository software and research information systems can equally be treated as ‘users’ (producers and consumers) of Orbital.

As we outlined in our Project Plan, this approach allows us to benefit from work which continues outside the Orbital project such as that around Access and Identity Management and academic profiles, and the development of data.lincoln.ac.uk. It is also a suitable approach for the development of Orbital as open source software, which should remain simple to develop for specific users’ needs if it is to receive interest and contributions from developers outside the university.

The Technical Specification contains five core functional requirements: Projects, Workspace, Archives, Working Dataset, and Publication. A Project may result in a specific Publication(s), while the Workspace, Archives and Working Dataset allow for three non-sequential methods of data storage, manipulation and analysis. These requirements are loosely coupled to one another, but do not represent a publication workflow. Orbital is not simply intended to be a data repository, but the basis of a flexible collaborative environment for working Researchers.

Each Project acts as a conceptual container for all data and represents the ‘space’ in which administrative, descriptive and contextual metadata is captured and stored, as well as the datasets themselves. It is at the level of a Project that Orbital will interface with other systems, such as an institutional repository or research information system by storing, exchanging and publishing information according to recognised standards, such as CERIF, SWORD2, DOI, etc.
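One way to picture the Project as a container is the following Python sketch. The class names, field names and storage labels are hypothetical and chosen only for illustration; they are not taken from the Technical Specification or from the Orbital codebase.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the three non-sequential storage methods.
STORAGE_KINDS = ("workspace", "archive", "working_dataset")


@dataclass
class Dataset:
    name: str
    storage: str  # one of STORAGE_KINDS


@dataclass
class Project:
    """A conceptual container: metadata plus the datasets themselves."""
    title: str
    # Administrative, descriptive and contextual metadata.
    metadata: dict = field(default_factory=dict)
    datasets: list = field(default_factory=list)
    # A Project may result in zero or more Publications.
    publications: list = field(default_factory=list)

    def add_dataset(self, name, storage):
        """Attach a dataset to this project under one of the storage kinds."""
        if storage not in STORAGE_KINDS:
            raise ValueError(f"unknown storage kind: {storage}")
        dataset = Dataset(name, storage)
        self.datasets.append(dataset)
        return dataset


if __name__ == "__main__":
    project = Project("Example engineering project", metadata={"funder": "EPSRC"})
    project.add_dataset("run-01.csv", "workspace")
    print(project)
```

Nothing here imposes an order on the storage kinds, which mirrors the point that the requirements are loosely coupled and do not form a publication workflow.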

Finally, a core requirement from Orbital is that data should be stored, accessed and transported securely. Being a native web application, we have opted to implement the OAuth 2 protocol to provide secured access to all API functions over HTTPS. As such, all user applications will be treated equally and will be required to access the core Orbital APIs via this popular and mature standard for application authentication on the web. OAuth is increasingly being deployed at the University of Lincoln and work continues outside of the Orbital project to implement it as part of an institution-wide Single Sign On (SSO) architecture.
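From a client application’s point of view, a call to an OAuth 2 protected API over HTTPS might be prepared roughly as in the Python sketch below. The URL and token are placeholders, and the use of a bearer token in the Authorization header is an assumption made for this example (it is a common way of presenting OAuth 2 access tokens), not a description of Orbital’s actual endpoints.

```python
from urllib.parse import urlparse
from urllib.request import Request


def api_request(url, access_token):
    """Prepare (but do not send) an authenticated request to an API endpoint.

    Refuses plain HTTP so that the access token is never sent unencrypted.
    """
    if urlparse(url).scheme != "https":
        raise ValueError("API calls must use HTTPS")
    return Request(
        url,
        headers={
            # Assumed bearer-token style of presenting the OAuth 2 access token.
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder endpoint and token, for illustration only.
    request = api_request("https://orbital.example/api/projects", "example-token")
    print(request.full_url, request.get_header("Authorization"))
```

Because every client goes through the same check, a command-line script, a web front end and another institutional system would all be treated as equal ‘users’ of the API.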

Related project blog posts

Chosen Methodology

Jenkins, build my software

Pivoting Around

Project Planning: Quality Assurance

Understanding and participating in open source culture

The Toolchain: First Pass

Tracking progress

Literature Review

An Orbital project reading list

Initial User Requirements

Meeting our users, the Engineers

Assessment of Data Sources

Research Data vs Research Data

Let’s Look At Data

Data, Data Everywhere

Gluing people together

Evaluation of standards and technologies

How the National Archives use MongoDB

Forecast: Cloudy

Piloting the cloud

Why Orbital is all about the API

Servers, Servers Everywhere

Eating your own dog food: Building a repository with API-driven development

Hello? Is it me you’re looking for?

Orbital and the OAIS reference model