Have Data, Will CKAN

One thing that Orbital is focussed on is the notion that data should remain as raw and accessible as possible throughout the research cycle, with as few steps as possible between the source of the data and its storage. We don’t want our rich sensor data being turned into Excel spreadsheets unnecessarily, and we don’t want to have to manually run reports, extract data and then load the data into something else just to get work done.

When we were building our own platform this was achieved with something we called Dynamic Datasets; a MongoDB cluster with a RESTful API bolted on top which would accept massive amounts of data and re-expose it on demand, including powerful filtering and output options. In CKAN we’re using the DataStore API, powered by ElasticSearch (at the moment, this will change in the future), to do the same thing. The DataStore API was originally intended to provide a searchable view on data such as CSV files as part of the Recline preview, but we’re using it for a slightly different purpose as the sole repository of data without any corresponding originating file. Data comes straight from the source, through any sanitation process which we need to run, and into DataStore.

To test the principle whilst we’re finishing evaluating security for our more confidential research data (where we get to play with big engineering data) we’ve started loading data which is being used by On Course into CKAN. Specifically we’re ingesting our course data and our organisational structure — and we’re doing it all automatically.

Once a day we run a set of queries against our Nucleus institutional data warehouse to gather the required data. This is our ‘sensor’ in the model, representing the source of the data prior to any analysis or further work. The data is then manipulated into ElasticSearch’s bulk update API format, and injected directly into DataStore. Check out some of our resources, maintained automatically by direct interaction between our source data and CKAN’s DataStore:

We can also use this data to run ‘real-time’ queries against the data — something researchers are likely to find useful.For example, we can ask for the first 100 modules taught by the School of Computing with “Games” in their title. This query is always as accurate as the last update of the DataStore API, which means that (as an engineering example) researchers can ask for data such as “every sensor value for machines recorded in the last day where the temperature was over 40”, and the data will be hot off the press every single time, with no requirement to re-export the information.

There are still a few rough edges we’re going to do our best to iron out, but all in all I’m fairly impressed with the ease of getting going.

Choosing CKAN for research data management

The switch to CKAN was an important decision for the Orbital project and I’d like to think that it will help raise the profile of CKAN within the academic community. We’d been keeping an eye on CKAN development from earlier on in the year, but it was the opportunity to talk to Mark Wainwright, OKFN Community Co-ordinator, at the Open Repositories 2012 conference that prompted us to really look at the potential of using CKAN as part of Lincoln’s Research Data Management infrastructure. Mark’s OR2012 poster (PDF) provides an nice overview of what CKAN currently offers.

Before I go into more detail about why we think CKAN is suitable for academia, here are some of the feature highlights that we like:

  • Data entry via web UI, APIs or spreadsheet import
  • versioned metadata
  • configurable user roles and permissions
  • data previewing/visualisation
  • user extensible metadata fields
  • a license picker
  • quality assurance indicator
  • organisations, tags, collections, groups
  • unique IDs and cool URIs
  • comprehensive search features
  • geospacial features
  • social: comments, feeds, notifications, sharing, following, activity streams
  • data visualisation (tables, graphs, maps, images)
  • datastore (‘dynamic data’) + file store + catalogue
  • extensible through over 60 extensions and a rich API for all core features
  • can harvest metadata and is harvestable, too

You can take a tour or demo CKAN to get a better idea of its current features. The demo site is  running the new/next UI design, too, which looks great.

CKAN’s impact

In its five years of development, CKAN has achieved significant impact across the world. Despite web scale open data publishing being a relatively recent initiaitve, CKAN, through the efforts of OKFN, is the defacto standard for the publishing of open data with over 40+ instances running around the world. How do the UK, Dutch, Norweigan and Brazilian governments make their data publicly accessible? The European Commission? They use CKAN.

On the flip side, CKAN has attracted significant interest from developers with 53 code contributors over 5 years and 60+ extensions.

Major CKAN changes since Orbital project began

When we first bid for the JISC MRD programme funding, CKAN was a less attractive offering to us. Our bid focused on an approach we’ve taken on a number of projects, using MongoDB as a datastore over which we built an application that adds/edits/reads data via a set of APIs we would write. Our bid also focused on security and the confidentiality of commercial engineering data. Since starting the Orbital project these concerns have been addressed or are being addressed by CKAN and the requested features we’ve identified through our engagement with researchers have also been integrated into CKAN, such as activity streams and data visualisation. Reading through the CKAN changelog shows just how much work is going into CKAN and with each release it’s developing into a better tool for RDM. Here are some of the headline features, in order of priority, that have turned our attention to CKAN over the course of the Orbital project.

CKAN in an academic environment

We’ve discussed the idea of a Minimum Viable Product for RDM, and consider it to be authentication, data storage, hosting/publishing, licensing, a persistent URI and analytics. These features alone allow an academic to reliably and permanently publish data to support their research findings and help measure its impact. CKAN meets these requirements ‘out of the box’. Other requirements of a tool for managing research data include the following (you can add more in the comment box – these are based on our own discussions with researchers and a quick scan of other JISC MRD projects)

  • Integration with the institutional research environment (e.g. hooks into CRIS system, Institutional Repository, DMPOnline, networked storage)
  • Capturing the research process/context/activity; notation, not just data
  • Controlled access to non-Lincoln staff e.g. research partners
  • Good, comprehensive search tools
  • Version control for data and metadata
  • Customisable, extensible meatadata
  • Adherence to data standards e.g. RDF
  • Multi-level access policies
  • Secure, backed up, scalable file storage for anywhere access to files and file sharing (e.g. Dropbox)
  • Command-line tools and good web UI for deposit/update of data
  • Permanent URIs for citation e.g. DOIs
  • Import/export of common data formats
  • Linking datasets (by project, type, research output, person, etc.)
  • Rights/license management
  • Commercial support/widely used, popular platform (‘community’)

RDM features that are currently lacking in CKAN

During our meeting with OKFN staff last month we identified several areas that need addressing for CKAN to meet our wider requirements for RDM. These are:

  • Security: CKAN is not lacking in security measures, but we need to look at CKAN’s security model more closely (roles, permissions, access, authentication) and also tie it into the university’s Single Sign On environment
  • ‘Projects’ concept: We think that the new ‘organisations‘ feature might work conceptually in the same way as this.
  • Academic terminology + documentation for academic use: We need to review CKAN and write documentation for an academic use case as well as provide a modified language file that ‘translates’ certain terminology into that more appropriate for the academic context.
  • Batch edit/upload controls. Certain batch functions are available on the command line, but out of the box, there’s no way to upload and batch edit multiple files, for example.
  • ownCloud integration: CKAN doesn’t provide the network drive storage that researchers (actually pretty much everyone) relies on to organise their files. Increasingly people are using Dropbox because of the synchonisation and sharing features. These are important to researchers, too, and moving data from such a drive to CKAN will be key to researchers adopting it.
  • EPrints integration (SWORD2): A way to create a record of CKAN data in EPrints, thereby joining research outputs with research data.

It’s these features that we’ll be concentrating on in our development on the Orbital project.

Harry and I are attending the Open Knowledge Festival in Helsinki later this month and will talk more about our choice of CKAN for research data. I’d be interested to hear from anyone working in a university who has looked at CKAN in detail and decided against using it for RDM. It seems odd to me that it has such a low profile in academia (or maybe I’m just clueless??) and I do think that the time has come to embrace CKAN and acknowledge the efforts of OKFN more widely. I know there are people like Peter Murray Rust and Mark MacGillivray, who are actively trying to do this and OKFN’s presence at Dev8D and OR2012 this year demonstrates its eagerness to work more closely with the university sector. Perhaps we’re near a tipping point?

A Bridge To The Skies

Following on from our meeting with Team CKAN, we’ve had to make a few sweeping architectural changes. Here’s what’s happening:

  • We’re totally scrapping Orbital Core and Orbital Manager in their current form, since they mostly replicate functionality which already exists in ownCloud and CKAN.
  • We’re developing a brand new application, codenamed Orbital Bridge, which sits between various systems which make up the Orbital platform (ownCloud, CKAN, and in the future our Staff Directory and Awards Management System).
  • Orbital Bridge acts to orchestrate aspects of the various systems, and provides a high-level concept of research projects and project team members. The relevant aspects of these concepts (such as group membership, folder sharing and security permissions) are then managed in the individual systems.
  • Orbital Bridge will also include options for moving files and data (via the CKAN DataStore API) through conversion tools, for example taking a CSV and loading it directly into a datastore, converting binary files to open standards, or taking a datastore and converting it to something more useful based on given search parameters.
  • CKAN exposes a variety of data using its APIs (also delicious RDF). Orbital Bridge takes this data to boost our Nucleus institutional data store (and our upcoming institutional triple store), and through that our other integrated services.

Information which Orbital Bridge uses to determine things like projects and users will come initially from our University SSO service coupled to manual creation of projects and external users, and in the future (APIs permitting) partially from our Awards Management System for automatic population of project information. Check out our shiny graphic for a visual overview:

Hello CKAN

On Wednesday, we hosted three people from the Open Knowledge Foundation, to discuss the Orbital project and their software, CKAN. It was a very engaging and productive day spent with Peter Murray-Rust (on the Advisory Board of OKFN), Mark Wainwright (community co-ordinator) and Ross Jones (core developer). We asked them at the start of the day to challenge us about our technical work on Orbital so far and I described the day to them as an opportunity to evaluate our work developing the Orbital software so far. We didn’t touch on the other aspects of the Orbital project such as policy development and training for researchers.

To cut to the chase, the Orbital project will be adopting CKAN as the primary platform for further development of the technical infrastrcuture for RDM at Lincoln. This is subject to approval by the Steering Group, but the reasons are compelling in many ways and I am confident that the Steering Group will accept this recommendation. More importantly, the Implementation Plan that was approved by the Steering group and submitted to JISC remains unchanged.

The raw notes from our meeting are available here. Remember these are raw notes written throughout the day, primarily for our own record. They probably mean more to us than they do to you! Thanks to Paul Stainthorp for his fanatical note taking 🙂

Here’s the list of attendees and our agenda:

Present

Peter Murray-Rust (OKFN)
Mark Wainwright (OKFN)
Ross Jones (OKFN)
Joss Winn (University of Lincoln, CERD)
Nick Jackson (University of Lincoln, CERD)
Harry Newton (University of Lincoln, CERD)
Jamie Mahoney (University of Lincoln, CERD)
Alex Bilbie (University of Lincoln, ICT services)
Paul Stainthorp (University of Lincoln, Library)

Agenda

09.30 Introductions
10.00 Orbital introduction and context: Student as Producer, LNCD; Orbital bid and pilot project; Discussion of Orbital approach, the data we’re using, user needs etc.
10.30 CKAN introduction and context
11.00 Technical discussion – Orbital
12.00 LUNCH
12.30 Technical discussion – CKAN
13.30 Discussion – should Orbital adopt CKAN?
14.00 data[.lincoln].ac.uk
15.00 Next steps; Opportunities for collaboration/funding?

What is probably of most interest to people reading this are the pros & cons of the Orbital project adopting CKAN. I’ll provide more context further into the post, but here’s a summary copied from our notes:

Continue reading “Hello CKAN”

ownCloud: An ‘academic dropbox’?

Following up on our post from the MRDHack day, what follows is an evaluation of ownCloud as an institutional alternative to Dropbox. Our DAF survey showed that researchers at Lincoln require better managed storage space than the current 1GB FTP “H: Drive” provided to each staff member. Many of them are using portable drives, USB sticks and cloud-based services such as Dropbox to store and share their research data. Services like Dropbox provide compelling advantages to more traditional storage. Dropbox provides, for free, double the storage on offer to Lincoln researchers at present; it is always backed up, existing on both the local machine and on Amazon’s servers, it offers versioning for files up to 30 days old with the free account and ‘forever’ for paid accounts, it is accessible from almost all devices with Linux, OS X, Windows, iOS and Android clients available. Files can be published to the web or shared privately with other Dropbox users.

We know, however, that researchers using Dropbox are doing so without a clear understanding of the terms and conditions of the service and would ideally like a similar service to be provided internally by the University, where we can retain control of the data and its associated security.

We were first made aware of ownCloud when D’Arcy Norman blogged about his initial trial of it. ownCloud is an open source tool, which provides the same features as Dropbox (and more). With the release of version 4 in May, it appears to be a credible alternative to Dropbox for institutions wishing to provide a modern storage solution for their staff and students. D’Arcy’s initial experience with ownCloud was promising but he found issues with the syncing of files. Our recent tests of ownCloud have found that these problems have now been resolved with a recent update and what follows is an evaluation of ownCloud version 4.0.6, looking at it from three perspectives:

  1. ownCloud as a general purpose storage technology for an academic community
  2. ownCloud as a storage technology for research data
  3. ownCloud as a technology for integration with Orbital

Storage for an academic community?

ownCloud is an AGPLv3 licensed open source project, which started in January 2010. The project is run by a company, also called ownCloud, which provides commercial services and support for its software. The company resides in both Germany and the USA. The development of ownCloud is also open and supported by standard tools for open source projects: a source code repository, bug tracker, IRC channel, mailing list, wiki and forum. There are currently 13 core members of the project and 34 contributing developers. Development of the code is currently very active with changes made several times a day. The ownCloud project was started by the KDE community (it has no dependencies on KDE) and therefore benefits from the involvement of experienced open source developers. The software is written in PHP and can use MySQL, PostgreSQL or SQLite for its database. As of version 4, ownCloud has the following features:

  • Web user interface for file uploads and management of account and other features. Files can also be uploaded from an existing URL.
  • Windows/Mac/Linux/Android/iOS synchronisation clients.
  • WebDav integration for direct access to file storage. ownCloud has its own WebDAV server.
  • Folder and/or file sharing: publish to the web or share with groups or individuals.
  • File versioning.
  • An API for application integration.
  • Previewing for a number of filetypes.
  • Server side file encryption.
  • LDAP integration.
  • Notifications.
  • ownCloud can be installed in a PHP/MySQL environment on both Linux and Windows servers.

ownCloud also provides a number of other applications that can be activated or not, including a calendar, task list, contacts management, a built in text editor, image management, and experimental support for FTP, Google Drive and Dropbox integration. These are all available from the built in ‘app store’ and configurable by administrators. Maximum quotas for each account can be specified on an account-by-account basis, and a maximum file upload size can also be specified if needed.

The roadmap for version 5, due in August 2012, list the following:

  • Inter-ownCloud Sharing
  • Ajax interface
  • Mozilla Sync Integration
  • Improved permissions
  • Mounting of Dropbox and Google Drive
  • Improved version control

Continue reading “ownCloud: An ‘academic dropbox’?”