Orbital training and documentation

I’ve been quiet—too quiet—about the Orbital project recently. While I’ve not been blogging, Joss, Nick and Harry have overseen several fairly important developments:

As Orbital-the-product (coherent set of products, really) develops, my own focus between now and the end of the project (March 2013) will be on Orbital-the-service: training, support, documentation, and implementation of RDM policy at the University of Lincoln. I’ll work closely with the Research & Enterprise department on these aspects.

[Figure: Four level hierarchy of documentation]

As part of this strand of the project (which cuts across workpackages 7, 11, and 12), I want to consider the following:

  1. The current usability of ownCloud, CKAN, EPrints, etc. – what ‘sticking plaster’ help materials do we need to provide right now (if any?).
  2. How the production of documentation fits into the software development release cycle (“change management”?) – particularly so in an agile/iterative environment, and how we ensure we meet our responsibility to ‘leave no feature undocumented’ as well as provide adequate contextual information on RDM. Related: I’m thinking about a four-level hierarchy of documentation (see right): how do the different levels relate to each other (how do we ensure internal consistency?), and how do we ensure all four levels are covered?
  3. [How] should we contribute to an (OKFN-co-ordinated) open research [data] handbook initiative (cf. the Open Data Handbook; Data Journalism Handbook) instead of, or as well as, writing our own operational help guides? Contributing to and re-consuming community-written RDM materials will be more efficient than writing our own guidebook from scratch, but we need to make sure our local documentation is relevant to Lincoln.
  4. I’ve already started collating a list of other people’s RDM help materials (Joss has collected many more) – I’ll publish the list to this blog soon. I’ll be looking to see what we can re-use. There are some very good, openly-licensed training materials available, but I don’t want us to use them uncritically.
  5. How do we use our (still not-yet-accepted) RDM policy as a jumping-off point for training events?
  6. What did we learn from our recent(ish) Data Asset Framework exercise? How can we use researchers’ priorities as identified in the DAF to inform training? Should we re-run the exercise and/or follow it up with more detailed discussions?
  7. It is possible/likely that we will shortly have a new member of staff to work with the Lincoln Repository and the University’s REF submission. What responsibility might that person have for RDM training and support?

Next I need to organise a meeting with the Research & Enterprise department to plan our ‘version 0.1’ training programme, possibly consisting of (i) a discussion of the issues raised in our DAF survey and people’s current RDM practice, (ii) a discussion of the RDM policy, and (iii) presentation of the various VRE tools available (CKAN, ownCloud, EPrints, DataCite, DMPOnline). We’ll probably pilot this on a group of willing PhD students in the School of Engineering.

Orbital at the Open Knowledge Festival #okfest

Harry and I attended the Open Knowledge Festival in Helsinki last week. Harry attended the CKAN sessions, while I was invited to be on a panel discussing ‘Immediate Access to Raw Data from Experiments’, which was part of the Open Research and Education stream of events. None of the panel members gave presentations as such, but you can read my notes and the session was recorded, too. Here’s all 46 minutes of it for your viewing pleasure.

The festival/conference was probably the best conference I’ve ever been to. It was completely sold out with 800 delegates and about 1000 participants in total. It was very international with many participants from outside the EU. It seemed like a genuine effort had been made to ensure that people from Africa, Asia and South America could attend, with some bursaries available. The conference programme, over five days, was largely crowdsourced in the run up to the event, and this made the programme very diverse, reflecting the diversity of interests people have in ‘openness’. It was also reassuring to find that despite the huge enthusiasm for openness in many aspects of public and civil society, people are also keenly aware of the challenges and issues that this raises, too, and ultimately the political ramifications of this endeavour.

The conference also seemed very well funded/sponsored, with support from the Finnish government, among many partners. The event was held at the fantastic Arabia Campus of the Aalto University, School of Art, Design and Architecture. When I visited Helsinki in 2008 for a conference about the design of learning spaces, delegates were bused up to the Arabia campus simply to see what a great place it is!

As well as participating in the above panel, I also got involved in the drafting of the ‘Open Research Data Handbook‘, which is a collaborative exercise in writing a handbook aimed at researchers who work with data. It’s my intention that the Orbital project commits some time to this and ultimately produces a Handbook useful for all researchers and possibly a variant for Lincoln researchers, too. I ensured that the authors of the Handbook are all aware of the DCC’s work as well as the various JISC-funded projects to produce training and guidance for researchers and I suspect that the Handbook will largely be a synthesis of sources which are already available.

Finally, I learned about the Panton Fellowships that the Open Knowledge Foundation have awarded this year, and both Fellows presented on their work. I think this is an excellent initiative from the OKFN to create a strong and direct tie with academia and support further research and action in our community. You can see both presentations from the Panton Fellows here and here.

Have Data, Will CKAN

One thing that Orbital is focussed on is the notion that data should remain as raw and accessible as possible throughout the research cycle, with as few steps as possible between the source of the data and its storage. We don’t want our rich sensor data being turned into Excel spreadsheets unnecessarily, and we don’t want to have to manually run reports, extract data and then load the data into something else just to get work done.

When we were building our own platform this was achieved with something we called Dynamic Datasets: a MongoDB cluster with a RESTful API bolted on top which would accept massive amounts of data and re-expose it on demand, including powerful filtering and output options. In CKAN we’re using the DataStore API, powered by ElasticSearch (at the moment; this will change in the future), to do the same thing. The DataStore API was originally intended to provide a searchable view on data such as CSV files as part of the Recline preview, but we’re using it for a slightly different purpose: as the sole repository of data without any corresponding originating file. Data comes straight from the source, through any sanitisation process which we need to run, and into DataStore.

To test the principle whilst we’re finishing evaluating security for our more confidential research data (where we get to play with big engineering data) we’ve started loading data which is being used by On Course into CKAN. Specifically we’re ingesting our course data and our organisational structure — and we’re doing it all automatically.

Once a day we run a set of queries against our Nucleus institutional data warehouse to gather the required data. This is our ‘sensor’ in the model, representing the source of the data prior to any analysis or further work. The data is then manipulated into ElasticSearch’s bulk update API format, and injected directly into DataStore. Check out some of our resources, maintained automatically by direct interaction between our source data and CKAN’s DataStore:
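To make the "manipulated into ElasticSearch’s bulk update API format" step a little more concrete, here is a minimal sketch of what that transformation might look like. The bulk format itself is newline-delimited JSON (an action line followed by a document line for each record); everything else here is an assumption for illustration: the function name `to_bulk_payload`, the `id` field and the sample course rows are hypothetical and are not taken from our actual import scripts.

```python
import json


def to_bulk_payload(rows, id_field="id"):
    """Render rows as bulk-API text: one action line, then one document line, per row."""
    lines = []
    for row in rows:
        # Action line: index this document under the row's own identifier.
        lines.append(json.dumps({"index": {"_id": str(row[id_field])}}))
        # Document line: the record itself, serialised as a single line of JSON.
        lines.append(json.dumps(row, sort_keys=True))
    # The bulk format expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical rows standing in for the output of the daily warehouse queries.
courses = [
    {"id": 1, "title": "Games Computing", "school": "School of Computing"},
    {"id": 2, "title": "Mechanical Engineering", "school": "School of Engineering"},
]
payload = to_bulk_payload(courses)
```

The resulting `payload` string is what would be sent as the body of a bulk request; the endpoint it is sent to depends on how the DataStore is deployed and is deliberately left out here.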

We can also use this data to run ‘real-time’ queries against the data, something researchers are likely to find useful. For example, we can ask for the first 100 modules taught by the School of Computing with “Games” in their title. This query is always as accurate as the last update of the DataStore API, which means that (as an engineering example) researchers can ask for data such as “every sensor value for machines recorded in the last day where the temperature was over 40”, and the data will be hot off the press every single time, with no requirement to re-export the information.
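As a rough illustration, the two example questions above could be phrased as request bodies in the style of ElasticSearch’s query DSL. This is only a sketch: the field names (`school`, `title`, `temperature`) and the function names are assumptions made for the example, and the exact query syntax accepted will depend on the ElasticSearch version sitting behind the DataStore.

```python
def module_query(school, title_term, limit=100):
    """Build a query body: the first `limit` modules for a school whose title matches a term."""
    return {
        "size": limit,
        "query": {
            "bool": {
                "must": [
                    {"match": {"school": school}},    # assumed field name
                    {"match": {"title": title_term}},  # assumed field name
                ]
            }
        },
    }


def hot_sensor_query(threshold):
    """Build a query body: every reading whose temperature is above a threshold."""
    # "temperature" is an assumed field name for the engineering example.
    return {"query": {"range": {"temperature": {"gt": threshold}}}}


games_modules = module_query("School of Computing", "Games")
hot_readings = hot_sensor_query(40)
```

Because the query is evaluated against whatever is in the DataStore at the time, nothing needs to be re-exported between runs; the same request body simply returns fresher results.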

There are still a few rough edges we’re going to do our best to iron out, but all in all I’m fairly impressed with the ease of getting going.

Choosing CKAN for research data management

The switch to CKAN was an important decision for the Orbital project and I’d like to think that it will help raise the profile of CKAN within the academic community. We’d been keeping an eye on CKAN development from earlier on in the year, but it was the opportunity to talk to Mark Wainwright, OKFN Community Co-ordinator, at the Open Repositories 2012 conference that prompted us to really look at the potential of using CKAN as part of Lincoln’s Research Data Management infrastructure. Mark’s OR2012 poster (PDF) provides a nice overview of what CKAN currently offers.

Before I go into more detail about why we think CKAN is suitable for academia, here are some of the feature highlights that we like:

  • Data entry via web UI, APIs or spreadsheet import
  • versioned metadata
  • configurable user roles and permissions
  • data previewing/visualisation
  • user extensible metadata fields
  • a license picker
  • quality assurance indicator
  • organisations, tags, collections, groups
  • unique IDs and cool URIs
  • comprehensive search features
  • geospatial features
  • social: comments, feeds, notifications, sharing, following, activity streams
  • data visualisation (tables, graphs, maps, images)
  • datastore (‘dynamic data’) + file store + catalogue
  • extensible through over 60 extensions and a rich API for all core features
  • can harvest metadata and is harvestable, too

You can take a tour or demo CKAN to get a better idea of its current features. The demo site is running the new/next UI design, too, which looks great.

CKAN’s impact

In its five years of development, CKAN has achieved significant impact across the world. Despite web-scale open data publishing being a relatively recent initiative, CKAN, through the efforts of OKFN, is the de facto standard for publishing open data, with more than 40 instances running around the world. How do the UK, Dutch, Norwegian and Brazilian governments make their data publicly accessible? The European Commission? They use CKAN.

On the flip side, CKAN has attracted significant interest from developers with 53 code contributors over 5 years and 60+ extensions.

Major CKAN changes since Orbital project began

When we first bid for the JISC MRD programme funding, CKAN was a less attractive offering to us. Our bid focused on an approach we’ve taken on a number of projects, using MongoDB as a datastore over which we built an application that adds/edits/reads data via a set of APIs we would write. Our bid also focused on security and the confidentiality of commercial engineering data. Since starting the Orbital project these concerns have been addressed or are being addressed by CKAN and the requested features we’ve identified through our engagement with researchers have also been integrated into CKAN, such as activity streams and data visualisation. Reading through the CKAN changelog shows just how much work is going into CKAN and with each release it’s developing into a better tool for RDM. Here are some of the headline features, in order of priority, that have turned our attention to CKAN over the course of the Orbital project.

CKAN in an academic environment

We’ve discussed the idea of a Minimum Viable Product for RDM, and consider it to be authentication, data storage, hosting/publishing, licensing, a persistent URI and analytics. These features alone allow an academic to reliably and permanently publish data to support their research findings and help measure its impact. CKAN meets these requirements ‘out of the box’. Other requirements of a tool for managing research data include the following (you can add more in the comment box – these are based on our own discussions with researchers and a quick scan of other JISC MRD projects)

  • Integration with the institutional research environment (e.g. hooks into CRIS system, Institutional Repository, DMPOnline, networked storage)
  • Capturing the research process/context/activity; notation, not just data
  • Controlled access to non-Lincoln staff e.g. research partners
  • Good, comprehensive search tools
  • Version control for data and metadata
  • Customisable, extensible metadata
  • Adherence to data standards e.g. RDF
  • Multi-level access policies
  • Secure, backed up, scalable file storage for anywhere access to files and file sharing (e.g. Dropbox)
  • Command-line tools and good web UI for deposit/update of data
  • Permanent URIs for citation e.g. DOIs
  • Import/export of common data formats
  • Linking datasets (by project, type, research output, person, etc.)
  • Rights/license management
  • Commercial support/widely used, popular platform (‘community’)

RDM features that are currently lacking in CKAN

During our meeting with OKFN staff last month we identified several areas that need addressing for CKAN to meet our wider requirements for RDM. These are:

  • Security: CKAN is not lacking in security measures, but we need to look at CKAN’s security model more closely (roles, permissions, access, authentication) and also tie it into the university’s Single Sign On environment
  • ‘Projects’ concept: We think that the new ‘organisations‘ feature might work conceptually in the same way as this.
  • Academic terminology + documentation for academic use: We need to review CKAN and write documentation for an academic use case as well as provide a modified language file that ‘translates’ certain terminology into that more appropriate for the academic context.
  • Batch edit/upload controls. Certain batch functions are available on the command line, but out of the box, there’s no way to upload and batch edit multiple files, for example.
  • ownCloud integration: CKAN doesn’t provide the network drive storage that researchers (actually pretty much everyone) rely on to organise their files. Increasingly people are using Dropbox because of the synchronisation and sharing features. These are important to researchers, too, and moving data from such a drive to CKAN will be key to researchers adopting it.
  • EPrints integration (SWORD2): A way to create a record of CKAN data in EPrints, thereby joining research outputs with research data.

It’s these features that we’ll be concentrating on in our development on the Orbital project.
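To give a flavour of the EPrints integration item, SWORD2 deposits can carry their metadata as an Atom entry with Dublin Core terms. The sketch below builds such an entry for a CKAN dataset. It is an illustration under assumptions, not our implementation: the function name, the choice of fields and the example URL are all hypothetical, and the metadata EPrints will actually accept still needs to be confirmed.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
DCTERMS = "http://purl.org/dc/terms/"

# Use readable prefixes when the entry is serialised.
ET.register_namespace("atom", ATOM)
ET.register_namespace("dcterms", DCTERMS)


def build_deposit_entry(title, dataset_url, creator):
    """Build an Atom entry describing a dataset, suitable as the body of a metadata deposit."""
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}title" % ATOM).text = title
    # The dataset's URL doubles as the entry identifier in this sketch.
    ET.SubElement(entry, "{%s}id" % ATOM).text = dataset_url
    ET.SubElement(entry, "{%s}creator" % DCTERMS).text = creator
    ET.SubElement(entry, "{%s}type" % DCTERMS).text = "Dataset"
    return ET.tostring(entry, encoding="unicode")


# Hypothetical dataset details, for illustration only.
entry_xml = build_deposit_entry(
    "Sensor readings",
    "https://ckan.example.org/dataset/sensor-readings",
    "A. Researcher",
)
```

The resulting XML would be POSTed to a repository’s SWORD2 collection endpoint; the endpoint address and authentication are deployment-specific and are not shown.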

Harry and I are attending the Open Knowledge Festival in Helsinki later this month and will talk more about our choice of CKAN for research data. I’d be interested to hear from anyone working in a university who has looked at CKAN in detail and decided against using it for RDM. It seems odd to me that it has such a low profile in academia (or maybe I’m just clueless??) and I do think that the time has come to embrace CKAN and acknowledge the efforts of OKFN more widely. I know there are people like Peter Murray-Rust and Mark MacGillivray who are actively trying to do this, and OKFN’s presence at Dev8D and OR2012 this year demonstrates its eagerness to work more closely with the university sector. Perhaps we’re near a tipping point?

A Bridge To The Skies

Following on from our meeting with Team CKAN, we’ve had to make a few sweeping architectural changes. Here’s what’s happening:

  • We’re totally scrapping Orbital Core and Orbital Manager in their current form, since they mostly replicate functionality which already exists in ownCloud and CKAN.
  • We’re developing a brand new application, codenamed Orbital Bridge, which sits between various systems which make up the Orbital platform (ownCloud, CKAN, and in the future our Staff Directory and Awards Management System).
  • Orbital Bridge acts to orchestrate aspects of the various systems, and provides a high-level concept of research projects and project team members. The relevant aspects of these concepts (such as group membership, folder sharing and security permissions) are then managed in the individual systems.
  • Orbital Bridge will also include options for moving files and data (via the CKAN DataStore API) through conversion tools, for example taking a CSV and loading it directly into a datastore, converting binary files to open standards, or taking a datastore and converting it to something more useful based on given search parameters.
  • CKAN exposes a variety of data using its APIs (also delicious RDF). Orbital Bridge takes this data to boost our Nucleus institutional data store (and our upcoming institutional triple store), and through that our other integrated services.
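As an example of the kind of conversion step Orbital Bridge might perform, here is a minimal sketch of turning a CSV file into records ready to be loaded into a datastore. The function name `csv_to_records` and the sample data are hypothetical, and the actual hand-off to the CKAN DataStore API is left out because that part is still being designed.

```python
import csv
import io


def csv_to_records(csv_text):
    """Parse CSV text into (field names, list of row dictionaries)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [dict(row) for row in reader]
    # fieldnames is None for empty input, so fall back to an empty list.
    return reader.fieldnames or [], records


# Hypothetical sensor export, for illustration only.
sample = "sensor,temperature\nA,41.5\nB,38.0\n"
fields, records = csv_to_records(sample)
```

Note that every value comes out of the CSV as a string; deciding which columns should become numbers or dates is a separate step that would need to happen before loading.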

Information which Orbital Bridge uses to determine things like projects and users will come initially from our University SSO service coupled to manual creation of projects and external users, and in the future (APIs permitting) partially from our Awards Management System for automatic population of project information. Check out our shiny graphic for a visual overview: