I’ve had to delay this post until confirmation of Tom’s project funding came through, but I’m pleased to be able to say that we’ve published our first complete research dataset(s) on CKAN.
Some months ago, two researchers, Tom Duckett and Feras Dayoub, came to us asking if we could host their data to support two publications and an EU grant application they were about to submit. We quickly stuck the data on one of our servers, they knocked up some HTML pages and we advised them on licensing the data so that it could be re-used. It was a temporary solution, but we assured them that their root domain name would always act as a proxy to the final resting place of their data, and so they started to tell the world about it. I’m told there was much interest in their data on specialist mailing lists and we were invited to submit a paper which discussed the data and the process of its publication. Their consortium bid for EU funding was also successful. Here’s what Tom had to say:
I believe that publishing our datasets for long-term robotic mapping has helped us: 1) to achieve greater awareness of our work (we were among the first groups in the world to study long-term mapping by mobile robots, in research from 2004-present), enabling other researchers worldwide to use our data, 2) to increase citations to our REF-able research papers in this area, and 3) to play our part in successfully applying for a 4-year FP7 IP project in collaboration with 7 other partners, by showing that we already have a track record in hosting such datasets. (STRANDS project – joint PIs at Lincoln: Marc Hanheide and Tom Duckett). One of the requirements of this project will be to publish even larger datasets of robot data, so we look forward to collaborating with Joss and colleagues again in future to address the challenges of hosting and curating “big data” for robotics research.
Prior to switching to CKAN, we were just about to move Tom and Feras’ data across to our own Orbital software, which met their minimal requirements, but having now switched to integrating with CKAN, we’ve moved the datasets to their permanent home at https://ckan.lincoln.ac.uk.
Just as we promised, Tom and Feras are still able to direct people to the original web address we gave them which points to their research pages, but the data itself is now hosted on CKAN. Having seen Tom’s data presented in this way, his colleague Greg published his data in the same way, using our WordPress platform to build a site explaining the data and CKAN as the actual data store.
This all happened before we had our Orbital Bridge publishing workflow in place (a post on that in a couple of weeks) and in the absence of a working Orbital application, I uploaded the data on Tom and Feras’ behalf. I spent quite some time using CKAN and can make the following observations about version 1.7.x, which is what we currently use.
- Batch uploads: The data was zipped up into four collections of zip files. My task was to duplicate the organisation of the data which made sense to the researchers. This was possible as you can see, but it was tedious uploading each of the 29 zip files, many of which were over 1GB. There were no problems doing so; it was just tedious, and better batch upload/edit operations in CKAN would make this much easier. Ideally, I’d like to have uploaded the zip files from each of the four collections of data, catalogued them by batch where they shared the same information and then individually edited attributes like the title of each zip file, for example. Having been an archivist on and off for the last decade, I know this is one of the main gripes we have with library and archive systems. When dealing with collections of things, we need to be able to operate on them as collections and not have to deal with each object individually. I’ve spoken to CKAN developers about this and there are work-arounds, using scripts and a form extension, but it’s not something CKAN offers to most users with ease. Yet!
- Research Groups and projects: The v1.7.x version of CKAN understands the concept of ‘dataset’ e.g. https://ckan.lincoln.ac.uk/en/dataset/ltmro-1 and of that dataset containing discrete resources, e.g. https://ckan.lincoln.ac.uk/en/dataset/ltmro-1/resource/92cbf22b-3293-45a3-b1de-f7782e581fe8 CKAN also understands the concept of ‘groups’ e.g. https://ckan.lincoln.ac.uk/en/group/lincoln-centre-for-autonomous-systems which datasets can be attached to. Groups are simply a label you apply to a dataset. You can add people to a group with specific read/write permissions over the group and you can add datasets to the group, too. CKAN also maintains a history of the actions of that group e.g. https://ckan.lincoln.ac.uk/en/group/history/lincoln-centre-for-autonomous-systems However, currently, CKAN does not (yet) understand ‘projects’, i.e. an organisational concept that is role-based and allows a user to administer other users and work. Groups are not synonymous with projects, but we think that a new feature in CKAN v2.0, due for release in a month or so, will resolve this. As I understand it, CKAN organisations will work like GitHub organisations and if so, that’s good. On GitHub, our research group, LNCD, is an ‘organisation’ and within that organisation I can add/remove people, give them roles, create private and public repositories (‘datasets’) and we can be members of more than one organisation, too, e.g. http://github.com/lncd and http://github.com/josswinn There is already a CKAN extension that implements organisations, but we’re waiting for this work to be merged into the core code.
- Citations: If you look at Tom’s original web pages for their data, they are pretty clear in providing details about how to cite their data. This is so important to academics. CKAN does not offer a way to automatically generate a suggested citation for people who use the data. EPrints, on the other hand, offers the citation details of a research paper right at the top of the publication record e.g. http://eprints.lincoln.ac.uk/6046/ Some work on citations for CKAN has happened (there were conversations a few weeks ago on the IRC channel), but it’s something we need to work on, too. As a temporary solution, I have added the paper citation details as additional fields in the dataset record. CKAN is nice in that it allows you to add ad hoc key-value pairs when cataloguing. However, this only covers the citation details for the publications, not for the actual datasets themselves.
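To make the batch cataloguing wish from the first point above a bit more concrete, here is a minimal sketch in plain Python. It is not CKAN code and does not call the CKAN API; the function name `catalogue_batch`, the field names and the file names are all invented for illustration. It only shows the idea of describing a collection once and then editing individual items.

```python
from pathlib import PurePath


def catalogue_batch(filenames, shared, overrides=None):
    """Return one metadata record per file: shared fields first, then per-file edits."""
    overrides = overrides or {}
    records = []
    for name in filenames:
        record = dict(shared)                   # copy the batch-level fields
        record["filename"] = name
        record["title"] = PurePath(name).stem   # default title: file name without extension
        record.update(overrides.get(name, {}))  # individual edits take precedence
        records.append(record)
    return records


# Hypothetical collection: file names and field values are made up for this example.
collection = ["session-01.zip", "session-02.zip", "session-03.zip"]
shared_fields = {
    "format": "zip",
    "licence": "example-licence",
    "collection": "Example collection",
}
records = catalogue_batch(
    collection,
    shared_fields,
    overrides={"session-02.zip": {"title": "Session 2 (re-recorded)"}},
)

for record in records:
    print(record["filename"], "->", record["title"])
```

The shared fields are entered once per collection and copied into every record, so only the exceptions (here, a single title) need individual attention. A real script would still have to send each record to CKAN, which is outside the scope of this sketch.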
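On the citations point, the key-value pairs that CKAN lets you attach to a dataset could in principle feed a suggested citation for the data itself. The sketch below is only an illustration of that idea in plain Python: the function `suggest_citation`, the field names and the citation layout are my assumptions, not a CKAN feature or an agreed citation standard, and the example values are invented.

```python
def suggest_citation(extras):
    """Build a suggested citation string from a dataset's key-value fields."""
    required = ("authors", "year", "title", "publisher", "url")
    missing = [key for key in required if not extras.get(key)]
    if missing:
        # Refuse to guess: an incomplete citation is worse than none.
        raise ValueError("missing citation fields: " + ", ".join(missing))
    return "{authors} ({year}). {title} [dataset]. {publisher}. {url}".format(
        authors="; ".join(extras["authors"]),
        year=extras["year"],
        title=extras["title"],
        publisher=extras["publisher"],
        url=extras["url"],
    )


# Invented example values; these do not describe any real dataset.
example_extras = {
    "authors": ["Example, A.", "Sample, B."],
    "year": "2013",
    "title": "Demo dataset",
    "publisher": "Example University",
    "url": "https://example.org/dataset/demo",
}
citation = suggest_citation(example_extras)
print(citation)
```

Failing loudly on missing fields is a deliberate choice here: it would push whoever catalogues the data to fill in the dataset-level details rather than leaving only the paper citation.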
In the near future, our ‘Researcher Dashboard’ application (codenamed ‘Orbital Bridge’) will handle the data deposit workflow from project creation to grabbing a DataCite DOI to setting up a CKAN environment, to depositing a record of the data in EPrints for curation and preservation by the university. However, the upload and cataloguing of data will still be done by the researcher using CKAN, with Orbital aggregating information about the project, publications and data into a ‘dashboard’ for the researcher. Something like this below, which is an actual screenshot of another project that we’re using to test the ‘Researcher Dashboard’. More on this soon…