A JISC-funded Managing Research Data project

Posts tagged Open Data

I’ve had to delay this post until confirmation of Tom’s project funding came through, but I’m pleased to be able to say that we’ve published our first complete research dataset(s) on CKAN.

Some months ago, two researchers, Tom Duckett and Feras Dayoub, came to us asking if we could host their data to support two publications and an EU grant application they were about to submit. We quickly put the data on one of our servers, they knocked up some HTML pages and we advised them on licensing the data so that it could be re-used. It was a temporary solution, but we assured them that their root domain name would always act as a proxy to the final resting place of their data, and so they started to tell the world about it. I’m told there was much interest in their data on specialist mailing lists and we were invited to submit a paper which discussed the data and the process of its publication. Their consortium bid for EU funding was also successful. Here’s what Tom had to say:

I believe that publishing our datasets for long-term robotic mapping has helped us: 1) to achieve greater awareness of our work (we were among the first groups in the world to study long-term mapping by mobile robots, in research from 2004-present), enabling other researchers worldwide to use our data, 2) to increase citations to our REF-able research papers in this area, and 3) to play our part in successfully applying for a 4-year FP7 IP project in collaboration with 7 other partners, by showing that we already have a track record in hosting such datasets. (STRANDS project – joint PIs at Lincoln: Marc Hanheide and Tom Duckett). One of the requirements of this project will be to publish even larger datasets of robot data, so we look forward to collaborating with Joss and colleagues again in future to address the challenges of hosting and curating “big data” for robotics research.

Prior to switching to CKAN, we were just about to move Tom and Feras’ data across to our own Orbital software, which met their minimal requirements, but having now switched to integrating with CKAN, we’ve moved the datasets to their permanent home at https://ckan.lincoln.ac.uk.

Just as we promised, Tom and Feras are still able to direct people to the original web address we gave them which points to their research pages, but the data itself is now hosted on CKAN. Having seen Tom’s data presented in this way, his colleague Greg published his data in the same way, using our WordPress platform to build a site explaining the data and CKAN as the actual data store.

This all happened before we had our Orbital Bridge publishing workflow in place (a post on that in a couple of weeks) and in the absence of a working Orbital application, I uploaded the data on Tom and Feras’ behalf. I spent quite some time using CKAN and can make the following observations about version 1.7.x, which is what we currently use.

  • Batch uploads: The data was zipped up into four collections of zip files. My task was to reproduce the organisation of the data that made sense to the researchers. This was possible, as you can see, but uploading each of the 29 zip files, many of which were over 1GB, was tedious. There were no problems doing so; better batch upload/edit operations in CKAN would simply make this much easier. Ideally, I’d like to have uploaded the zip files from each of the four collections of data, catalogued them by batch where they shared the same information and then individually edited attributes such as the title of each zip file. Having been an Archivist on and off for the last decade, I know this is one of the main gripes we have with library and archive systems. When dealing with collections of things, we need to be able to operate on them as collections and not have to deal with each object individually. I’ve spoken to CKAN developers about this and there are work-arounds, using scripts and a form extension, but it’s not something CKAN offers to most users with ease. Yet! :-)
  • Research Groups and projects: The v1.7.x version of CKAN understands the concept of a ‘dataset’, e.g. https://ckan.lincoln.ac.uk/en/dataset/ltmro-1, and of that dataset containing discrete resources, e.g. https://ckan.lincoln.ac.uk/en/dataset/ltmro-1/resource/92cbf22b-3293-45a3-b1de-f7782e581fe8 CKAN also understands the concept of ‘groups’, e.g. https://ckan.lincoln.ac.uk/en/group/lincoln-centre-for-autonomous-systems, which datasets can be attached to. Groups are simply a label you apply to a dataset. You can add people to a group with specific read/write permissions over the group and you can add datasets to the group, too. CKAN also maintains a history of the actions of that group, e.g. https://ckan.lincoln.ac.uk/en/group/history/lincoln-centre-for-autonomous-systems However, CKAN does not (yet) understand ‘projects’, i.e. an organisational concept that is role-based and allows a user to administer other users and work. Groups are not synonymous with projects, but we think that a new feature in CKAN v2.0, due for release in a month or so, will resolve this. As I understand it, CKAN organisations will work like Github organisations and if so, that’s good. On Github, our research group, LNCD, is an ‘organisation’ and within that organisation I can add/remove people, give them roles and create private and public repositories (‘datasets’). We can be members of more than one organisation, too, e.g. http://github.com/lncd and http://github.com/josswinn There is already a CKAN extension that implements organisations, but we’re waiting for this work to be merged into the core code.
  • Citations: If you look at Tom’s original web pages for their data, they are pretty clear in providing details about how to cite their data. This is so important to academics. CKAN does not offer a way to automatically generate a suggested citation for people who use the data. EPrints, on the other hand, offers the citation details of a research paper right at the top of the publication record, e.g. http://eprints.lincoln.ac.uk/6046/ Some work on citations for CKAN has happened – there were conversations a few weeks ago on the IRC channel – but it’s something we need to work on, too. As a temporary solution, I have added the paper citation details as additional fields in the dataset record. CKAN is nice in that it allows you to add ad hoc key-value pairs when cataloguing. However, this only covers the citation details for the publications, not for the actual datasets themselves.
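To make the batch cataloguing wish above a little more concrete, here is a minimal sketch in Python of what "catalogue by batch, then edit individually" could look like. It does not call CKAN at all; it only builds the metadata records you would then hand to whatever upload script or form extension is available. The function name `catalogue_batch`, the field names and the file names are all hypothetical and chosen for illustration only.

```python
from copy import deepcopy


def catalogue_batch(filenames, shared, overrides=None):
    """Build one metadata record per file in a collection.

    Shared fields are applied to every file first; per-file overrides
    (for example an edited title) are applied afterwards.
    All field names here are illustrative, not CKAN's own schema.
    """
    overrides = overrides or {}
    records = []
    for name in filenames:
        record = deepcopy(shared)          # never mutate the shared template
        record["name"] = name
        record["title"] = name.rsplit(".", 1)[0]  # default title from the file name
        record.update(overrides.get(name, {}))    # individual edits win
        records.append(record)
    return records


# Hypothetical collection: three zip files sharing the same description.
shared = {"format": "ZIP", "collection": "collection-1"}
files = ["run-01.zip", "run-02.zip", "run-03.zip"]
records = catalogue_batch(
    files,
    shared,
    overrides={"run-02.zip": {"title": "Second run (edited title)"}},
)

for r in records:
    print(r["name"], "->", r["title"], "|", r["collection"])
```

The point of the design is simply that the shared information is entered once per collection, and only the attributes that genuinely differ are edited per file.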

In the near future, our ‘Researcher Dashboard’ application (codenamed ‘Orbital Bridge’) will handle the data deposit workflow from project creation to grabbing a DataCite DOI to setting up a CKAN environment, to depositing a record of the data in ePrints for curation and preservation by the university. However, the upload and cataloguing of data will still be done by the researcher using CKAN, with Orbital aggregating information about the project, publications and data into a ‘dashboard’ for the researcher. Something like this below, which is an actual screenshot of another project that we’re using to test the ‘Researcher Dashboard’. More on this soon…

Example research project overview

On the 18th February, we ran a workshop in London which focused on the use of CKAN for research data management. The Orbital project made the decision to use CKAN last summer and was soon followed by Bristol’s data.bris project, which is using CKAN for its discovery catalogue. Simon Price from Bristol gave a very interesting presentation of their work with CKAN, which you can read about on their project blog.

The #CKAN4RDM workshop was fully booked with 40 delegates attending, many more than we originally anticipated. It was facilitated by Simon Hodson, the Programme Manager of JISC’s Managing Research Data programme. Following presentations from Lincoln and Bristol on our respective uses of CKAN (ours was a live demo of ‘Orbital Bridge’), we spent the latter part of the morning undertaking a requirements gathering exercise, where tables of around 8-10 people acted as different users, providing ‘stories’ (requirements) for a research data management system. The exercise was introduced in the following few slides.

This was a useful exercise regardless of the software used. After collating all 70+ stories over lunch, we returned to our user groups and each table worked with a CKAN expert from the Open Knowledge Foundation to discuss the existing constraints for each requirement and to start developing a gap analysis that identifies the work to be done. The output of this work can be viewed on Google Docs.

Types of users
The ‘researcher’ user group

 

There was quite a positive buzz about the day and general feedback suggested that delegates got a lot out of the event. You can read write-ups from the DCC, LSE and the Datapool project at Southampton.

One of the original purposes of the workshop was research for a conference paper that I (Joss) am giving at the IASSIST conference in Cologne, in May. The abstract I submitted to the conference was as follows:

This paper offers a full and critical evaluation of the open source CKAN software <http://ckan.org> for use as a Research Data Management (RDM) tool within a university environment. It presents a case study of CKAN’s implementation and use at the University of Lincoln, UK, and highlights its strengths and current weaknesses as an institutional Research Data Management tool. The author draws on his prior experience of implementing a mixed media Digital Asset Management system (DAM), Institutional Repository (IR) and institutional Web Content Management System (CMS), to offer an outline proposal for how CKAN can be used effectively for data analysis, storage and publishing in academia. This will be of interest to researchers, data librarians, and developers, who are responsible for the implementation of institutional RDM infrastructure. This paper is presented as part of the dissemination activities of the JISC-funded Orbital project <http://orbital.blogs.lincoln.ac.uk>.

As well as using last week’s outputs of the CKAN4RDM workshop, I’ll also be working closely with OKF staff to ensure that the evaluation is as thorough, accurate and up-to-date as possible by the time of the conference. It will focus on version 2.0 of CKAN, which is due for release soon.

I’d also like to appeal to other JISC MRD projects to send me any existing requirements documents you have produced during the course of your project. I will use the anonymised data to enrich the requirements we gathered last week. If you have such documents, please email me.

Finally, we have set up a CKAN4RDM mailing list, which anyone is welcome to join to discuss the use of CKAN within academia. One thing is clear to me: the academic community cannot expect OKF and existing CKAN developers to meet all of our requirements for research data management. We need to contribute developer time and other resource and effort to the overall CKAN open source project, just as other public sector organisations are doing.

 

Last summer, we adopted CKAN as our data store/repository/catalogue. At that time, I noted that much had happened in the CKAN project in the few months since the start of the Orbital project in November 2011 that made CKAN a more attractive proposition for managing research data.

Recently, someone on the CKAN mailing list pointed to the graph below, which shows that the interest in CKAN has exploded. In November 2011, interest in CKAN was at just a quarter of its current peak, which is double that of September 2012, when we made the switch to CKAN. Following the European Commission and the UK government, the recent decision by the US government to adopt CKAN for the next version of data.gov will only drive interest in and the development of CKAN even further.

It is an exciting time to be observing, and to be part of, this explosion of interest. However, it is worth remembering that the interest in CKAN and data management is still very small compared to interest in other, more generic, content management systems. Publishing structured open data remains a niche interest compared to other open practices on the web, such as blogging. Here’s the graph comparing CKAN to WordPress.

Perhaps a fairer comparison would be that of CKAN with open access repository software, such as ePrints and DSpace.

Of course, the cumulative interest in DSpace and ePrints over the years is greater than that in CKAN, but right now there is clearly more interest in CKAN and publishing open data than there is in open access repository software. The open access movement has matured, while the open data movement is growing rapidly. It will be interesting to follow these trends to measure (in part) the maturity of the open data movement, too.

On Wednesday, we hosted three people from the Open Knowledge Foundation to discuss the Orbital project and their software, CKAN. It was a very engaging and productive day spent with Peter Murray-Rust (on the Advisory Board of OKFN), Mark Wainwright (community co-ordinator) and Ross Jones (core developer). At the start of the day, we asked them to challenge us about our technical work on Orbital, and I described the day to them as an opportunity to evaluate our development of the Orbital software so far. We didn’t touch on the other aspects of the Orbital project, such as policy development and training for researchers.

To cut to the chase, the Orbital project will be adopting CKAN as the primary platform for further development of the technical infrastructure for RDM at Lincoln. This is subject to approval by the Steering Group, but the reasons are compelling in many ways and I am confident that the Steering Group will accept this recommendation. More importantly, the Implementation Plan that was approved by the Steering Group and submitted to JISC remains unchanged.

The raw notes from our meeting are available here. Remember these are raw notes written throughout the day, primarily for our own record. They probably mean more to us than they do to you! Thanks to Paul Stainthorp for his fanatical note taking :-)

Here’s the list of attendees and our agenda:

Present

Peter Murray-Rust (OKFN)
Mark Wainwright (OKFN)
Ross Jones (OKFN)
Joss Winn (University of Lincoln, CERD)
Nick Jackson (University of Lincoln, CERD)
Harry Newton (University of Lincoln, CERD)
Jamie Mahoney (University of Lincoln, CERD)
Alex Bilbie (University of Lincoln, ICT services)
Paul Stainthorp (University of Lincoln, Library)

Agenda

09.30 Introductions
10.00 Orbital introduction and context: Student as Producer, LNCD; Orbital bid and pilot project; Discussion of Orbital approach, the data we’re using, user needs etc.
10.30 CKAN introduction and context
11.00 Technical discussion – Orbital
12.00 LUNCH
12.30 Technical discussion – CKAN
13.30 Discussion – should Orbital adopt CKAN?
14.00 data[.lincoln].ac.uk
15.00 Next steps; Opportunities for collaboration/funding?

What is probably of most interest to people reading this are the pros & cons of the Orbital project adopting CKAN. I’ll provide more context further into the post, but here’s a summary copied from our notes:

(more…)