A JISC-funded Managing Research Data project

The researcher Dashboard has been expanded to interface with the Lincoln Repository, ePrints. From it, a researcher can deposit their datasets directly to the repository, complete with DOI.

In previous posts, I spoke about how the CKAN and ePrints APIs can interface. We have finally implemented both APIs for use with the Researcher Dashboard and created a useable workflow for depositing datasets from CKAN to ePrints via the dashboard.

The workflow goes as follows:

  1. Hit ‘Publish’
  2. Get latest metadata from CKAN
  3. Prompt user to complete form
  4. Generate DOI
  5. Send metadata to Datacite
  6. Mint DOI
  7. Post SWORD2 to ePrints
  8. Get ePrints ID from response
  9. Add ePrint to SQL database as minimal data
  10. Update dataset in database with ePrint link

When a researcher views their project, they are presented with a list of datasets lifted from the project environment in CKAN. If they want to deposit one into ePrints, they can select the deposit button and are prompted to finish the dataset metadata. ePrints requires a minimal set of metadata before the dataset can be deposited. It can be put into a users inbox with merely a title, but requires a minimal specific set before depositing.

The DOI is minted for a unique identifier, by sending the metadata to Datacite along with the generated DOI. A DOI has to be generated first before it can be minted. Again, this is another field that is input to ePrints via the metadata.

The inclusion of ePrints metadata gives an all in one approach to the Research Dashboard. As otherwise, users would have to go into ePrints and fill in the data there. An annoyance easily avoided by having all the necessary steps taken care of on one site. This completes the toolset, so projects now have a central hub of activity. Data is brought into Orbital via the AMS (Awards Management System) for importing funded projects and CKAN for datasets, and exported to ePrints for the depositing into the Lincoln repository.

The original plan for this workflow was published by Paul Stainthorp. The workflow as it stands currently is as written in this post. It is, however, still in the finishing stages and polishing to make sure the process is solid.

For those of you playing on the technical side of research data, did you know that Open Data Protocols has done a load of work on standardised ways of storing data and metadata? The standards at Open Data Protocols (and its sister specification on Open Catalogs) are the ones that our future work on long-term preservation of data packages will be informed by (and are involved quite a bit in future work on CKAN). If you’re bundling data up at the end of a project, why not take a look at them?

On the 18th February, we ran a workshop in London which focused on the use of CKAN for research data management. The Orbital project made the decision to use CKAN last summer and was soon followed by Bristol’s data.bris project, which is using CKAN for its discovery catalogue. Simon Price from Bristol, gave a very interesting presentation of their work with CKAN, which you can read about on their project blog.

The #CKAN4RDM workshop was fully booked with 40 delegates attending – many more than we originally anticipated. It was facilitated by Simon Hodson, the Programme Manager of JISC’s Managing Research Data programme. Following presentations from Lincoln and Bristol on our respective uses of CKAN (ours was a live demo of ‘Orbital Bridge‘), we spent the later part of the morning undertaking a requirements gathering exercise, where tables of around 8-10 people acted as different users, providing ‘stories’ (requirements) for a research data management system. The exercise was introduced in the following few slides.

This was a useful exercise regardless of the software used, but after collating all 70+ stories over lunch, we then returned to our user groups and each table worked with a CKAN expert from the Open Knowledge Foundation to discuss the existing constraints for each requirement and started to develop a gap analysis so as to identify work to be done. The output of this work can be viewed on Google docs.

Types of users
Types of users
The 'researcher' user group
The ‘researcher’ user group

 

There was quite a positive buzz about the day and general feedback suggested that delegates got a lot out of the event. You can read write ups from the DCC, LSE and the Datapool project at Southampton.

One of the original purposes of the workshop was research for a conference paper that I (Joss) am giving at the IASSIST conference in Cologne, in May. The abstract I submitted to the conference was as follows:

This paper offers a full and critical evaluation of the open source CKAN software <http://ckan.org> for use as a Research Data Management (RDM) tool within a university environment. It presents a case study of CKAN’s implementation and use at the University of Lincoln, UK, and highlights its strengths and current weaknesses as an institutional Research Data Management tool. The author draws on his prior experience of implementing a mixed media Digital Asset Management system (DAM), Institutional Repository (IR) and institutional Web Content Management System (CMS), to offer an outline proposal for how CKAN can be used effectively for data analysis, storage and publishing in academia. This will be of interest to researchers, data librarians, and developers, who are responsible for the implementation of institutional RDM infrastructure. This paper is presented as part of the dissemination activities of the JISC-funded Orbital project <http://orbital.blogs.lincoln.ac.uk>.

As well as using last week’s outputs of the CKAN4RDM workshop, I’ll also be working closely with OKF staff to ensure that the evaluation is as thorough, accurate and up-to-date as possible by the time of the conference. It will focus on version 2.0 of CKAN, which is due for release soon.

I’d also like to appeal to other JISC MRD projects to send me any existing requirements documents you have produced during the course of your project. I will use the anonymised data to enrich the requirements we gathered last week. If you have such documents, please email me.

Finally, we have set up a CKAN4RDM mailing list, which anyone is welcome to join to discuss the use of CKAN within academia. One thing is clear to me: the academic community cannot expect OKF and existing CKAN developers to meet all of our requirements for research data management. We need to contribute developer time and other resource and effort to the overall CKAN open source project, just as other public sector organisations are doing.

 

During the development of Orbital (specifically the Researcher Dashboard) we’ve been trying (with mixed success) to make it integrate smoothly with various other University systems. Fortunately, a design decision made by some of the LNCD team a couple of years ago means that we’ve got our own institutional data store (codename Nucleus) with which we can almost exclusively interact to get hold of everything we needed. Where we’ve been integrating with new systems such as the University’s Awards Management System we’ve taken the approach of hooking the data into Nucleus first, so that it’s not only available to Researcher Dashboard but also to any other system which needs it.

Nucleus has quite a powerful framework for managing, structuring and presenting data in a rigorously managed format. It validates things at various points during data entry to make sure that it’s not gibberish, and then at the point of rendering it’s put through another set of functions which ensure it’s presented consistently and in as useful a manner as possible. As a result (using Nucleus, our PHP library, the CWD and our OAuth 2 authorisation server) we can go from a standing start to a fully featured, integrated application in a couple of days. A big part of the reason we can do this is that we make extensive use of dogfooding to ensure that our data is useful.

It saddens me, therefore, that during integration with some other applications both inside and outside the University we are forced to tackle data – often purported to be “machine readable” or “ready for reuse” which has clearly not been looked at by the eye of somebody who wants to reuse it. As an example, one source of data provides a date range which is stored internally (as far as I can gather) as two distinct values; there is a “start date” and there is an “end date”. These are provided through the UI as structured inputs (a date picker) which ensures they’re entered (and presumably then stored) in an expected format which can be manipulated as necessary. The API then chooses to express this date range not as a distinct “start date” and “end date”, but instead as a single “dates”.

You may think that this isn’t such a big problem – after all, how difficult can it be to parse 04/02/2013 - 07/03/2014? In that example it’s actually pretty easy once you’ve decided if you’re using UK or US style dates. The ISO date format can solve this though, giving us 2013-02-04 - 2014-03-07. Sadly, this isn’t what we get. In fact, here are the four (yes, four) distinct ways that “dates” can be represented:

  • 2013-02-04 to 2013-02-04 becomes 4 Feb 2013
  • 2013-02-04 to 2014-03-07 becomes 4 Feb 2013 - 7 Mar 2014
  • 2013-02-04 to 2013-03-07 becomes 4 Feb - 7 Mar 2013
  • 2013-02-04 to 2013-02-07 becomes 4 Feb 2013 - 7 Feb 2013

So, the rule becomes that if the dates are the same you just show the single date, but if the dates are different then you show two dates, unless they are in the same year in which case you only show the year in the final date, unless they are in the same year and the same month, in which case you show two dates. And then you format all the dates with a locale-specific short form of the month name.

Parsing this is understandably more difficult than it should be. Please, think about how your data will actually be used when building outputs.

The following workshop is taking place tomorrow. Here’s the ‘All staff message’ inviting researchers to attend.

Staff from the Digital Curation Centre will be leading a workshop this Thursday, focused on ‘data management planning’ which is increasingly required by funding bodies. Researchers and research students are welcome to attend.

Funding bodies increasingly require grant-holders to develop and implement Data Management and Sharing Plans (DMPs). Plans typically state what data will be created and how, and outline the plans for sharing and preservation, noting what is appropriate given the nature of the data and any restrictions that may need to be applied.

This workshop will provide an overview of research data management and the role of data management planning. There are a small number of places available for researchers to attend.

Please contact Joss Winn if you wish to attend.

Workshop details:

Library UL102. Thursday 28th February.

13:00 – 13:05  Welcome and introductions
13:05 – 13:25  Research Data Management – an overview
13:25 – 13:40  An introduction to DMP Online
13:40 – 14:15  Practical exercise to identify and map research workflows using DMP template (part one)
14:15 – 14:30  Feedback
14:30 – 14:45  Coffee break
14:45 – 15:15  Practical exercise to identify and map research support services using DMP template (part two)
15:15 – 15:30  Feedback, wrap-up