Robotics data now stored in CKAN

I’ve had to delay this post until confirmation of Tom’s project funding came through, but I’m pleased to be able to say that we’ve published our first complete research dataset(s) on CKAN.

Some months ago, two researchers, Tom Duckett and Feras Dayoub, came to us asking if we could host their data to support two publications and an EU grant application they were about to submit. We quickly put the data on one of our servers, they knocked up some HTML pages and we advised them on licensing the data so that it could be re-used. It was a temporary solution, but we assured them that their root domain name would always act as a proxy to the final resting place of their data, and so they started to tell the world about it. I’m told there was much interest in their data on specialist mailing lists and we were invited to submit a paper which discussed the data and the process of its publication. Their consortium bid for EU funding was also successful. Here’s what Tom had to say:

I believe that publishing our datasets for long-term robotic mapping has helped us: 1) to achieve greater awareness of our work (we were among the first groups in the world to study long-term mapping by mobile robots, in research from 2004-present), enabling other researchers worldwide to use our data, 2) to increase citations to our REF-able research papers in this area, and 3) to play our part in successfully applying for a 4-year FP7 IP project in collaboration with 7 other partners, by showing that we already have a track record in hosting such datasets. (STRANDS project – joint PIs at Lincoln: Marc Hanheide and Tom Duckett). One of the requirements of this project will be to publish even larger datasets of robot data, so we look forward to collaborating with Joss and colleagues again in future to address the challenges of hosting and curating “big data” for robotics research.

Prior to switching to CKAN, we were just about to move Tom and Feras’ data across to our own Orbital software, which met their minimal requirements, but having now switched to integrating with CKAN, we’ve moved the datasets to their permanent home at https://ckan.lincoln.ac.uk.

Just as we promised, Tom and Feras are still able to direct people to the original web address we gave them which points to their research pages, but the data itself is now hosted on CKAN. Having seen Tom’s data presented in this way, his colleague Greg published his data in the same way, using our WordPress platform to build a site explaining the data and CKAN as the actual data store.

This all happened before we had our Orbital Bridge publishing workflow in place (a post on that in a couple of weeks) and in the absence of a working Orbital application, I uploaded the data on Tom and Feras’ behalf. I spent quite some time using CKAN and can make the following observations about version 1.7.x, which is what we currently use.

  • Batch uploads: The data was zipped up into four collections of zip files. My task was to duplicate the organisation of the data which made sense to the researchers. This was possible, as you can see, but it was tedious uploading each of the 29 zip files, many of which were over 1GB. There were no problems doing so; it was just tedious, and better batch upload/edit operations in CKAN would make this much easier. Ideally, I’d like to have uploaded the zip files from each of the four collections of data, catalogued them by batch where they shared the same information and then individually edited attributes like the title of each zip file, for example. Having been an Archivist on and off for the last decade, I know this is one of the main gripes we have with library and archive systems. When dealing with collections of things, we need to be able to operate on them as collections and not have to deal with each object individually. I’ve spoken to CKAN developers about this and there are work-arounds, using scripts and a form extension, but it’s not something CKAN offers to most users with ease. Yet! 🙂
  • Research Groups and projects: The v1.7.x version of CKAN understands the concept of a ‘dataset’, e.g. https://ckan.lincoln.ac.uk/en/dataset/ltmro-1 and of that dataset containing discrete resources, e.g. https://ckan.lincoln.ac.uk/en/dataset/ltmro-1/resource/92cbf22b-3293-45a3-b1de-f7782e581fe8 CKAN also understands the concept of ‘groups’, e.g. https://ckan.lincoln.ac.uk/en/group/lincoln-centre-for-autonomous-systems which datasets can be attached to. Groups are simply a label you apply to a dataset. You can add people to a group with specific read/write permissions over the group and you can add datasets to the group, too. CKAN also maintains a history of the actions of that group, e.g. https://ckan.lincoln.ac.uk/en/group/history/lincoln-centre-for-autonomous-systems However, CKAN does not (yet) understand ‘projects’, i.e. an organisational concept that is role-based and allows a user to administer other users and work. Groups are not synonymous with projects, but we think that a new feature in CKAN v2.0, due for release in a month or so, will resolve this. As I understand it, CKAN organisations will work like Github organisations and if so, that’s good. On Github, our research group, LNCD, is an ‘organisation’ and within that organisation I can add/remove people, give them roles and create private and public repositories (‘datasets’). We can be members of more than one organisation, too, e.g. http://github.com/lncd and http://github.com/josswinn There is already a CKAN extension that implements organisations, but we’re waiting for this work to be merged into the core code.
  • Citations: If you look at Tom’s original web pages for their data, they are pretty clear in providing details about how to cite their data. This is so important to academics. CKAN does not offer a way to automatically generate a suggested citation for people who use the data. EPrints, on the other hand, offers the citation details of a research paper right at the top of the publication record, e.g. http://eprints.lincoln.ac.uk/6046/ Some work on citations for CKAN has happened – there were conversations a few weeks ago on the IRC channel – but it’s something we need to work on, too. As a temporary solution, I have added the paper citation details as additional fields in the dataset record. CKAN is nice in that it allows you to add ad hoc key-value pairs when cataloguing. However, this only covers the citation details for the publications, not for the actual datasets themselves.
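The batch cataloguing wished for in the first observation above can be sketched in a few lines of Python. This is only an illustration of the idea, not something CKAN provides: the function name catalogue_batch, the example file names and the description text are all hypothetical, and the field names (name, format, description) are assumptions loosely modelled on the fields shown on a CKAN resource page. The sketch only builds the metadata records; it does not upload anything or talk to CKAN.

```python
from pathlib import PurePosixPath


def catalogue_batch(filenames, shared, overrides=None):
    """Build one metadata record per file: shared fields first, then per-file edits.

    All field names here are illustrative assumptions, not a CKAN schema.
    """
    overrides = overrides or {}
    records = []
    for filename in filenames:
        record = dict(shared)  # copy the batch-level fields for this file
        record["name"] = PurePosixPath(filename).stem  # default title: file name without extension
        record.update(overrides.get(filename, {}))  # apply any individual edits last
        records.append(record)
    return records


# Hypothetical batch: two zip files sharing the same format and description.
shared_fields = {"format": "ZIP", "description": "Hypothetical shared description"}
batch = catalogue_batch(
    ["session-01.zip", "session-02.zip"],
    shared_fields,
    overrides={"session-02.zip": {"name": "Session 2 (edited title)"}},
)

for record in batch:
    print(record["name"], "|", record["format"])
```

The point of the design is the order of operations: information common to the collection is entered once, and only the attributes that really differ, such as a title, are edited per file.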

In the near future, our ‘Researcher Dashboard’ application (codenamed ‘Orbital Bridge’) will handle the data deposit workflow, from project creation to grabbing a DataCite DOI, to setting up a CKAN environment, to depositing a record of the data in ePrints for curation and preservation by the university. However, the upload and cataloguing of data will still be done by the researcher using CKAN, with Orbital aggregating information about the project, publications and data into a ‘dashboard’ for the researcher. It will look something like the image below, which is an actual screenshot of another project that we’re using to test the ‘Researcher Dashboard’. More on this soon…

Example research project overview

CKAN for RDM workshop

On the 18th February, we ran a workshop in London which focused on the use of CKAN for research data management. The Orbital project made the decision to use CKAN last summer and was soon followed by Bristol’s data.bris project, which is using CKAN for its discovery catalogue. Simon Price from Bristol gave a very interesting presentation of their work with CKAN, which you can read about on their project blog.

The #CKAN4RDM workshop was fully booked with 40 delegates attending, many more than we originally anticipated. It was facilitated by Simon Hodson, the Programme Manager of JISC’s Managing Research Data programme. Following presentations from Lincoln and Bristol on our respective uses of CKAN (ours was a live demo of ‘Orbital Bridge’), we spent the latter part of the morning undertaking a requirements-gathering exercise, where tables of around 8-10 people acted as different users, providing ‘stories’ (requirements) for a research data management system. The exercise was introduced in the following few slides.

This was a useful exercise regardless of the software used. After collating all 70+ stories over lunch, we returned to our user groups and each table worked with a CKAN expert from the Open Knowledge Foundation to discuss the existing constraints for each requirement and to start developing a gap analysis identifying the work to be done. The output of this work can be viewed on Google Docs.

Types of users
The ‘researcher’ user group

 

There was quite a positive buzz about the day and general feedback suggested that delegates got a lot out of the event. You can read write-ups from the DCC, LSE and the Datapool project at Southampton.

One of the original purposes of the workshop was research for a conference paper that I (Joss) am giving at the IASSIST conference in Cologne, in May. The abstract I submitted to the conference was as follows:

This paper offers a full and critical evaluation of the open source CKAN software <http://ckan.org> for use as a Research Data Management (RDM) tool within a university environment. It presents a case study of CKAN’s implementation and use at the University of Lincoln, UK, and highlights its strengths and current weaknesses as an institutional Research Data Management tool. The author draws on his prior experience of implementing a mixed media Digital Asset Management system (DAM), Institutional Repository (IR) and institutional Web Content Management System (CMS), to offer an outline proposal for how CKAN can be used effectively for data analysis, storage and publishing in academia. This will be of interest to researchers, data librarians, and developers, who are responsible for the implementation of institutional RDM infrastructure. This paper is presented as part of the dissemination activities of the JISC-funded Orbital project <https://orbital.blogs.lincoln.ac.uk>.

As well as using last week’s outputs of the CKAN4RDM workshop, I’ll also be working closely with OKF staff to ensure that the evaluation is as thorough, accurate and up-to-date as possible by the time of the conference. It will focus on version 2.0 of CKAN, which is due for release soon.

I’d also like to appeal to other JISC MRD projects to send me any existing requirements documents you have produced during the course of your project. I will use the anonymised data to enrich the requirements we gathered last week. If you have such documents, please email me.

Finally, we have set up a CKAN4RDM mailing list, which anyone is welcome to join to discuss the use of CKAN within academia. One thing is clear to me: the academic community cannot expect OKF and existing CKAN developers to meet all of our requirements for research data management. We need to contribute developer time and other resource and effort to the overall CKAN open source project, just as other public sector organisations are doing.

 

Research Data Management Planning workshop

The following workshop is taking place tomorrow. Here’s the ‘All staff message’ inviting researchers to attend.

Staff from the Digital Curation Centre will be leading a workshop this Thursday, focused on ‘data management planning’ which is increasingly required by funding bodies. Researchers and research students are welcome to attend.

Funding bodies increasingly require grant-holders to develop and implement Data Management and Sharing Plans (DMPs). Plans typically state what data will be created and how, and outline the plans for sharing and preservation, noting what is appropriate given the nature of the data and any restrictions that may need to be applied.

This workshop will provide an overview of research data management and the role of data management planning. There are a small number of places available for researchers to attend.

Please contact Joss Winn if you wish to attend.

Workshop details:

Library UL102. Thursday 28th February.

13:00 – 13:05  Welcome and introductions
13:05 – 13:25  Research Data Management – an overview
13:25 – 13:40  An introduction to DMP Online
13:40 – 14:15  Practical exercise to identify and map research workflows using DMP template (part one)
14:15 – 14:30  Feedback
14:30 – 14:45  Coffee break
14:45 – 15:15  Practical exercise to identify and map research support services using DMP template (part two)
15:15 – 15:30  Feedback, wrap-up

CKAN trending

Last summer, we adopted CKAN as our data store/repository/catalogue. At that time, I noted that much had happened in the CKAN project in the few months since the start of the Orbital project in November 2011 that made CKAN a more attractive proposition for managing research data.

Recently, someone on the CKAN mailing list pointed to the graph below, which shows that the interest in CKAN has exploded. In November 2011, interest in CKAN was at just a quarter of its current peak, which is double that of September 2012, when we made the switch to CKAN. Following the European Commission and the UK government, the recent decision by the US government to adopt CKAN for the next version of data.gov will only drive interest in and the development of CKAN even further.

It is an exciting time to be observing, and to be part of, this explosion of interest. However, it is worth remembering that the interest in CKAN and data management is still very small compared to interest in other, more generic, content management systems. Publishing structured open data remains a niche interest compared to other open practices on the web, such as blogging. Here’s the graph comparing CKAN to WordPress.

Perhaps a fairer comparison would be that of CKAN with open access repository software, such as ePrints and DSpace.

Of course, the cumulative interest of DSpace and of ePrints over the years is greater than that of CKAN, but right now, there is clearly more interest in CKAN and publishing open data, than there is in open access repository software. The open access movement has matured, while the open data movement is growing rapidly. It will be interesting to follow these trends to measure (in part) the maturity of the open data movement, too.

Orbital Team meeting 13-12-12

Present

Joss
Melanie
Harry
Nick

Apologies

Annalisa
Paul

Previous Actions

  • JW to circulate draft documents for business case and SMT presentation to Orbital team. DONE
  • PS to talk to NJ and HN about ingestion of this content to Bridge. DONE
  • MB to send NJ/HN information on impact recording systems. DONE

Agenda

Policy and Business Case

Business case presentation to SMT has moved to Jan 14th
JW has distributed draft documents for SMT presentation to project team.
Documents & presentation aim to secure the groundwork for a research data management ‘road map’ over next 2 years from end of Orbital. Includes Research Services developer, supporting ePrints, Orbital, bibliometrics, RDM, etc. Also to raise awareness of Data Scientist position.
ACTION: MB/JW to contact Lisa Mooney regarding review of committee structure.

Training/Documentation

PS and JW met with Mike Neary at Graduate School, with agreement that Orbital would run training workshop with graduates on RDM to refine workshops and documentation. PS has blogged an outline of this training.
ACTION: PS/MB/JW/AJ to meet regarding training materials.
Joss spoke to Martin Donnelly at the DCC about RDM training and a branded version of DMPOnline. Will arrange a DCC workshop at Lincoln at the end of February.
ACTION: Joss to contact Martin about requirements for a branded version of DMPOnline.

Technical

Orbital Bridge is ‘Researcher Dashboard’. v0.2 released yesterday. Will collect metrics from ePrints, Scopus, Web of Science, Google Analytics, CKAN, etc. Provides researcher with overview of their research profile and impact. Aggregates metrics for the institution.
ACTION: NJ to discuss bibliometrics with PS/HN
Still waiting for access to the AMS. NJ has met with ICT. Still issues around user permissions. John Bark will talk to Worktribe.
ACTION: NJ to organise conference call with Worktribe/ICT
Waiting to hear from DCC about DMPOnline APIs. HN has written an Orbital library for DMPOnline.
ICT Cloud Scoping Study includes Research Data Management requirement. Reports back May 2013.
Nucleus v2 (N2) is ready for production use. Will be a source of data for Researcher profile and store metrics, etc.
OpenStack not yet built. Will spend a day before Christmas looking at this. Joss is being interviewed by David Flanders (ANDS) for a podcast about academic uses of OpenStack.

Dissemination/Outreach/External

Joss is meeting with JISC and OKFN 14th December to discuss CKAN.
Carlos Silva (KAPTUR) is visiting Lincoln in January to discuss our use of CKAN.
Paul has booked to attend the DCC conference in Amsterdam in January. Theme: “What is a data scientist?”
Joss attended MRD Benefits and Impact event and discussed the Orbital project.

Budget

Joss is meeting with Jill Hubbard to get budget update.