Choosing CKAN for research data management

The switch to CKAN was an important decision for the Orbital project and I’d like to think that it will help raise the profile of CKAN within the academic community. We’d been keeping an eye on CKAN development from earlier on in the year, but it was the opportunity to talk to Mark Wainwright, OKFN Community Co-ordinator, at the Open Repositories 2012 conference that prompted us to really look at the potential of using CKAN as part of Lincoln’s Research Data Management infrastructure. Mark’s OR2012 poster (PDF) provides a nice overview of what CKAN currently offers.

Before I go into more detail about why we think CKAN is suitable for academia, here are some of the feature highlights that we like:

  • Data entry via web UI, APIs or spreadsheet import
  • versioned metadata
  • configurable user roles and permissions
  • data previewing/visualisation
  • user extensible metadata fields
  • a license picker
  • quality assurance indicator
  • organisations, tags, collections, groups
  • unique IDs and cool URIs
  • comprehensive search features
  • geospatial features
  • social: comments, feeds, notifications, sharing, following, activity streams
  • data visualisation (tables, graphs, maps, images)
  • datastore (‘dynamic data’) + file store + catalogue
  • extensible through over 60 extensions and a rich API for all core features
  • can harvest metadata and is harvestable, too
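As a rough illustration of the ‘rich API’ point above, the sketch below builds a search request for CKAN’s Action API and pulls the dataset names out of the JSON reply. It assumes a CKAN version that serves the Action API under /api/3/action; the base URL and the sample response are invented for the example, so nothing here touches a real server.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL; substitute the address of a real CKAN instance.
CKAN_BASE = "https://ckan.example.ac.uk"


def search_url(query, rows=10, base=CKAN_BASE):
    """Build a package_search URL for CKAN's Action API."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base}/api/3/action/package_search?{params}"


def dataset_names(response_text):
    """Extract dataset names from a package_search JSON response."""
    body = json.loads(response_text)
    if not body.get("success"):
        raise RuntimeError(f"CKAN reported an error: {body.get('error')}")
    return [dataset["name"] for dataset in body["result"]["results"]]


# Made-up response in the general shape the Action API returns,
# so the sketch runs without network access.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [{"name": "ltmro-1"}, {"name": "example-dataset"}],
    },
})

if __name__ == "__main__":
    print(search_url("engineering"))
    print(dataset_names(SAMPLE_RESPONSE))
```

To try it against a live instance, fetch the result of search_url(...) with urllib.request.urlopen and pass the decoded body to dataset_names.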

You can take a tour or demo CKAN to get a better idea of its current features. The demo site is running the new/next UI design, too, which looks great.

CKAN’s impact

In its five years of development, CKAN has achieved significant impact across the world. Despite web-scale open data publishing being a relatively recent initiative, CKAN, through the efforts of OKFN, is the de facto standard for publishing open data, with more than 40 instances running around the world. How do the UK, Dutch, Norwegian and Brazilian governments make their data publicly accessible? The European Commission? They use CKAN.

On the flip side, CKAN has attracted significant interest from developers with 53 code contributors over 5 years and 60+ extensions.

Major CKAN changes since Orbital project began

When we first bid for the JISC MRD programme funding, CKAN was a less attractive offering to us. Our bid focused on an approach we’ve taken on a number of projects, using MongoDB as a datastore over which we built an application that adds/edits/reads data via a set of APIs we would write. Our bid also focused on security and the confidentiality of commercial engineering data. Since starting the Orbital project, these concerns have been addressed or are being addressed by CKAN. Features that researchers asked for during our engagement with them, such as activity streams and data visualisation, have also been integrated into CKAN. Reading through the CKAN changelog shows just how much work is going into CKAN, and with each release it’s developing into a better tool for RDM. Here are some of the headline features, in order of priority, that have turned our attention to CKAN over the course of the Orbital project.

CKAN in an academic environment

We’ve discussed the idea of a Minimum Viable Product for RDM, and consider it to be authentication, data storage, hosting/publishing, licensing, a persistent URI and analytics. These features alone allow an academic to reliably and permanently publish data to support their research findings and help measure its impact. CKAN meets these requirements ‘out of the box’. Other requirements of a tool for managing research data include the following (you can add more in the comment box – these are based on our own discussions with researchers and a quick scan of other JISC MRD projects):

  • Integration with the institutional research environment (e.g. hooks into CRIS system, Institutional Repository, DMPOnline, networked storage)
  • Capturing the research process/context/activity; notation, not just data
  • Controlled access to non-Lincoln staff e.g. research partners
  • Good, comprehensive search tools
  • Version control for data and metadata
  • Customisable, extensible metadata
  • Adherence to data standards e.g. RDF
  • Multi-level access policies
  • Secure, backed up, scalable file storage for anywhere access to files and file sharing (e.g. Dropbox)
  • Command-line tools and good web UI for deposit/update of data
  • Permanent URIs for citation e.g. DOIs
  • Import/export of common data formats
  • Linking datasets (by project, type, research output, person, etc.)
  • Rights/license management
  • Commercial support/widely used, popular platform (‘community’)

RDM features that are currently lacking in CKAN

During our meeting with OKFN staff last month we identified several areas that need addressing for CKAN to meet our wider requirements for RDM. These are:

  • Security: CKAN is not lacking in security measures, but we need to look at CKAN’s security model more closely (roles, permissions, access, authentication) and also tie it into the university’s Single Sign On environment
  • ‘Projects’ concept: We think that the new ‘organisations’ feature might work conceptually in the same way as this.
  • Academic terminology + documentation for academic use: We need to review CKAN and write documentation for an academic use case as well as provide a modified language file that ‘translates’ certain terminology into that more appropriate for the academic context.
  • Batch edit/upload controls. Certain batch functions are available on the command line, but out of the box, there’s no way to upload and batch edit multiple files, for example.
  • ownCloud integration: CKAN doesn’t provide the network drive storage that researchers (actually pretty much everyone) rely on to organise their files. Increasingly people are using Dropbox because of the synchronisation and sharing features. These are important to researchers, too, and moving data from such a drive to CKAN will be key to researchers adopting it.
  • EPrints integration (SWORD2): A way to create a record of CKAN data in EPrints, thereby joining research outputs with research data.

It’s these features that we’ll be concentrating on in our development on the Orbital project.
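To give a feel for the batch gap mentioned above: because CKAN’s functions are reachable through its API, a short script can stand in for the missing batch controls in the web UI. The following is a minimal sketch, not our implementation. It assumes the Action API’s package_create call is served under /api/3/action and that the API key is sent in an Authorization header; the base URL, the key, the dataset names and the licence identifier are all placeholders. It only builds the requests and does not send them.

```python
import json
from urllib.request import Request

# Placeholders: substitute a real CKAN address and a real API key.
CKAN_BASE = "https://ckan.example.ac.uk"
API_KEY = "replace-with-your-api-key"


def create_request(record, base=CKAN_BASE, api_key=API_KEY):
    """Build (but do not send) a package_create request for one dataset."""
    payload = json.dumps(record).encode("utf-8")
    return Request(
        f"{base}/api/3/action/package_create",
        data=payload,
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def batch_requests(records, shared_fields):
    """Apply the same fields to every record and build one request each.

    Values in shared_fields override values already present in a record,
    which is the 'batch edit' part of the sketch.
    """
    return [create_request({**record, **shared_fields}) for record in records]


if __name__ == "__main__":
    # Hypothetical dataset records; "cc-by" is an assumed licence identifier.
    records = [
        {"name": "turbine-run-001", "title": "Turbine run 1"},
        {"name": "turbine-run-002", "title": "Turbine run 2"},
    ]
    for request in batch_requests(records, {"license_id": "cc-by"}):
        print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Sending each request is then a matter of passing it to urllib.request.urlopen and checking the ‘success’ flag in the JSON that comes back.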

Harry and I are attending the Open Knowledge Festival in Helsinki later this month and will talk more about our choice of CKAN for research data. I’d be interested to hear from anyone working in a university who has looked at CKAN in detail and decided against using it for RDM. It seems odd to me that it has such a low profile in academia (or maybe I’m just clueless??) and I do think that the time has come to embrace CKAN and acknowledge the efforts of OKFN more widely. I know there are people like Peter Murray Rust and Mark MacGillivray who are actively trying to do this, and OKFN’s presence at Dev8D and OR2012 this year demonstrates its eagerness to work more closely with the university sector. Perhaps we’re near a tipping point?

10 Replies to “Choosing CKAN for research data management”

  1. Hi Joss,
    CKAN appears to be an interesting alternative to the repository software that is typically used by universities and I’m interested in seeing how it develops in the future.

    The various demos show CKAN being used for structured datasets, but there doesn’t appear to be anything for other content types. Are you aware of anyone using it for qualitative data (e.g. interview transcripts and other documents) or non-textual resources (still and moving images, 3D objects)?

    1. Hi Gareth,

      CKAN is a datastore (for storing data in an SQL database) and a repository for data blobs, such as documents, images, XLS files, etc. It can also simply be a catalogue which points at data held elsewhere.

      Its web interface lacks the batch operation features that would make it useful for large collections of blob data such as images – it is not a digital asset management tool – but it could be used as part of such a system, as its functions are fully accessible via its APIs.

      It would be worth joining the CKAN user mailing list to ask about specific uses of it. Over the next few months, I intend to look at CKAN in detail and produce a critical evaluation of it, which should answer your questions in more detail.

  2. Thanks very much for this thoughtful post, Joss!

    Our team is also looking closely at using CKAN for research dataset management. Our current short-list of issues (which I think overlaps yours) is:

    (1) Ability to extend CKAN metadata model to include DataCite schema
    (2) Ability to export or otherwise integrate CKAN metadata persistence in RDF endpoint
    (3) Ability to sensibly integrate dataset ingestion with PID registration. We’ll probably obtain our own Handle System prefix and create a parallel universe to EPIC and DOI

    We would love to “dialog” with you about this; I’ll send you an intro off-blog…

    John

    1. Hi John,

      Great to hear that you’re thinking of using CKAN, too. I heard that Bristol have adopted it for research data management as well, so hopefully there will soon be a good sized user group in academia. To answer your specific questions:

      1) We’ve not tackled this yet but there is a CKAN extension that has extended the CKAN schema for IATI, which shows how it can be extended to meet the needs of specific domains: https://github.com/okfn/ckanext-iati And there is documentation on creating custom schemas http://docs.ckan.org/en/latest/forms.html#custom-schemas

      2) RDF export is built in e.g. https://ckan.lincoln.ac.uk/dataset/ltmro-1.rdf See http://docs.ckan.org/en/latest/linked-data-and-rdf.html

      3) There’s been discussion about this on the CKAN mailing list. http://lists.okfn.org/pipermail/ckan-discuss/2011-November/001848.html I think it’s something that academia needs to contribute to the CKAN community and I’m sure it’s not beyond us. You, Lincoln and Bristol all want to add this functionality. We should talk about how we go about it.
