CKAN for RDM workshop

On 18th February we ran a workshop in London focused on the use of CKAN for research data management. The Orbital project decided to use CKAN last summer and was soon followed by Bristol’s data.bris project, which is using CKAN for its discovery catalogue. Simon Price from Bristol gave a very interesting presentation of their work with CKAN, which you can read about on their project blog.

The #CKAN4RDM workshop was fully booked, with 40 delegates attending – many more than we originally anticipated. It was facilitated by Simon Hodson, the Programme Manager of JISC’s Managing Research Data programme. Following presentations from Lincoln and Bristol on our respective uses of CKAN (ours was a live demo of ‘Orbital Bridge’), we spent the latter part of the morning on a requirements-gathering exercise, where tables of around 8-10 people acted as different types of user, providing ‘stories’ (requirements) for a research data management system. The exercise was introduced in the following few slides.

This was a useful exercise regardless of the software in question. After collating all 70+ stories over lunch, we returned to our user groups, and each table worked with a CKAN expert from the Open Knowledge Foundation to discuss how CKAN currently constrains or supports each requirement, starting a gap analysis to identify the work to be done. The output of this work can be viewed on Google Docs.

Types of users

The ‘researcher’ user group

There was quite a positive buzz about the day, and general feedback suggested that delegates got a lot out of the event. You can read write-ups from the DCC, LSE and the Datapool project at Southampton.

One of the original purposes of the workshop was research for a conference paper that I (Joss) am giving at the IASSIST conference in Cologne in May. The abstract I submitted to the conference was as follows:

This paper offers a full and critical evaluation of the open source CKAN software <http://ckan.org> for use as a Research Data Management (RDM) tool within a university environment. It presents a case study of CKAN’s implementation and use at the University of Lincoln, UK, and highlights its strengths and current weaknesses as an institutional Research Data Management tool. The author draws on his prior experience of implementing a mixed media Digital Asset Management system (DAM), Institutional Repository (IR) and institutional Web Content Management System (CMS), to offer an outline proposal for how CKAN can be used effectively for data analysis, storage and publishing in academia. This will be of interest to researchers, data librarians, and developers, who are responsible for the implementation of institutional RDM infrastructure. This paper is presented as part of the dissemination activities of the JISC-funded Orbital project <https://orbital.blogs.lincoln.ac.uk>.

As well as using last week’s outputs of the CKAN4RDM workshop, I’ll also be working closely with OKF staff to ensure that the evaluation is as thorough, accurate and up-to-date as possible by the time of the conference. It will focus on version 2.0 of CKAN, which is due for release soon.

I’d also like to appeal to other JISC MRD projects to send me any existing requirements documents you have produced during the course of your project. I will use the anonymised data to enrich the requirements we gathered last week. If you have such documents, please email me.

Finally, we have set up a CKAN4RDM mailing list, which anyone is welcome to join to discuss the use of CKAN within academia. One thing is clear to me: the academic community cannot expect OKF and existing CKAN developers to meet all of our requirements for research data management. We need to contribute developer time, resources and effort to the overall CKAN open source project, just as other public sector organisations are doing.


Gluing people together

In December, colleagues in the Web Team (who manage the corporate web site in the Department of Marketing and Communications) approached a few of us about building a tool to allow staff to edit their profile for the new version of the lincoln.ac.uk website. We suggested that much of the work was already done and it just needed gluing together. Yesterday we met with the Web Team again to tell them that our part of the work is pretty much complete. Here’s how it works.

Quick sketch of profile building at Lincoln

This requires a bit of explanation, but let me tell you, it’s the holy grail as far as I’m concerned, and having this in place brings benefits to Orbital and any other new application we might develop. Here’s a clearer rendering.


Building staff profiles

The chart above strips out the stuff around authentication that you see in the bottom right of the whiteboard photo. That’s for another post – something Alex is better placed to write.

Information about staff at the university starts with the HR database. This feeds the Active Directory, which authenticates people against different web services. Last year, Nick and Alex pulled this data into Nucleus, our MongoDB datastore, and with it built a new, slick staff directory. Then they started bolting things on to it, like research outputs from the repository and blog posts from our WordPress/BuddyPress platform. To illustrate what was possible, they started pulling information from my BuddyPress profile, which I could edit anytime I wanted to. It got to the point where I started using my staff directory link in my email signature because it offered the most comprehensive profile of me anywhere on a Lincoln website.
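
To make that aggregation concrete, here is a rough sketch of the kind of document Nucleus might hold for a member of staff once those sources are bolted together. The field names and values are invented for illustration; they are not the actual Nucleus schema:

```php
<?php
// Illustrative sketch only: field names are invented, not the real
// Nucleus schema. A staff document in the MongoDB datastore might
// aggregate several upstream sources into one record:
$staff_profile = array(
    'username'   => 'jbloggs',               // from Active Directory
    'name'       => 'Joe Bloggs',            // from the HR database
    'department' => 'School of Engineering', // from the HR database
    'buddypress' => array(                   // self-edited profile fields
        'bio'                => 'Researching widget dynamics.',
        'research_interests' => array('widgets', 'dynamics'),
    ),
    'repository_outputs' => array(           // from the institutional repository
        'http://eprints.lincoln.ac.uk/id/eprint/1234',
    ),
    'blog_posts' => array(                   // from WordPress/BuddyPress
        'http://example.blogs.lincoln.ac.uk/2012/01/a-post/',
    ),
);
```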

By the time we first met with the Web Team about the possibility of helping them with staff profiles, Alex and Nick had 80% of the work already done. What remained was to create a richer set of required fields in BuddyPress for staff to fill in about themselves, and a scheduled XML dump for the Web Team to wrangle into their new templates on www.lincoln.ac.uk.

So the work is nearly done. The XML file is RDF Linked Data, which means we have a rich aggregation of staff information, with some simple relationships, feeding the staff directory, refreshed every three hours and output as HTML, JSON or RDF/XML.
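
For a flavour of the RDF/XML output, here is a hand-written fragment in the spirit of what such a profile might serialise to. The URIs and the choice of FOAF vocabulary are my own guesses for illustration, not necessarily what Nucleus actually emits:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: URIs and vocabulary are guesses, not
     the actual Nucleus output. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://data.example.lincoln.ac.uk/staff/jbloggs">
    <foaf:name>Joe Bloggs</foaf:name>
    <foaf:workplaceHomepage rdf:resource="http://www.lincoln.ac.uk/engineering/"/>
    <!-- a research output deposited in the institutional repository -->
    <foaf:made rdf:resource="http://eprints.lincoln.ac.uk/id/eprint/1234"/>
  </foaf:Person>
</rdf:RDF>
```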

For the Orbital project, all this glue is invaluable. When staff log in to Orbital (Nick’s working on this part right now), we’ll already know who they are, which department they work in, what research outputs they’ve deposited in the institutional repository, what their research interests are, what projects they’re working on, the research groups they’re members of, their recent awards and grants, and the keywords they’ve chosen to tag their profile with. It’s our intention that, with some simple AI, we’ll be able to make Orbital a space where researchers find themselves in an environment that already knows quite a bit about their work and the context of the research they’re undertaking. Once Orbital starts collecting specific staff data of its own, it can feed that back into Nucleus, too.
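
As a minimal sketch of what that might look like from Orbital’s side, assuming a simple JSON endpoint on Nucleus (the URL and field names below are placeholders, not the real API):

```php
<?php
// Hypothetical sketch: once the sign-in process tells us who the user
// is, pull their aggregated profile from Nucleus to pre-populate their
// Orbital workspace. The endpoint and field names are placeholders.
$username = 'jbloggs'; // supplied by the authentication layer

$url  = 'https://nucleus.example.lincoln.ac.uk/people/'
      . urlencode($username) . '.json';
$json = file_get_contents($url);

if ($json === false) {
    exit('Could not reach the profile service.');
}

$profile = json_decode($json, true);

// Orbital starts from a workspace that already knows the researcher.
echo 'Welcome, ' . $profile['name']
   . ' (' . $profile['department'] . ')' . PHP_EOL;

foreach ($profile['research_interests'] as $interest) {
    echo ' - research interest: ' . $interest . PHP_EOL;
}
```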

This reminds me of our discussion last month with Mansur Darlington of the ERIM/REDm-MED project. Mansur stressed the importance of gathering data about the context of the research itself, emphasising that without context, research data becomes increasingly meaningless over time. Having rich user profiles in Orbital, and recording data about researchers’ activity while they use Orbital, should help provide that context to the research data itself.

Orbital, therefore, becomes an infrastructure not only for storing and managing research data, but also a system for storing and managing data about the research itself.

Building on the ERIM and REDm-MED projects

On January 20th, Dr. Mansur Darlington from the ERIM and REDm-MED projects came to Lincoln to discuss his work in relation to the Orbital project. Mansur has a consultancy role on the Orbital project and will be joining us again later in the year to help us evaluate our progress. It was a very useful and interesting meeting for all of the Orbital team and the Engineering researchers working with us. What became clear to us is that while ERIM offers the Orbital project a great deal of the underlying research and analysis of how Engineers work with data, Orbital can reciprocally feed back observations and issues arising from ERIM’s recommendations, which are theoretically robust but have not yet been tested in implementation. Similarly, with the REDm-MED project, which finishes in May/June, I hope that we can take the outputs of that prototyping work and build on them in the development of Orbital.

Here are Mansur’s slides from the meeting and below that, my notes.

  1. Purpose of the meeting
  2. Introductions: Bev, Annalisa, Bingo, Chunmei, Joss, Stuart, Lee, Mark, Nick, Paul, Mansur. Apologies, Chris Leach.
  3. Engineers: Bingo, Chunmei, Stuart
  4. See slides. ERIM research offers a good spread of Engineering research data. Industry collaboration is vitally important.
  5. MRD in general:

* Need to find out which Research Councils the funding coming into the Engineering School comes from, and in what proportions.
* All institutions have to put together a roadmap for RDM by May 2012 for EPSRC.
* Siemens/Lincoln spend a lot of effort in discovery of existing data to base investigations on.
* No national, dedicated Engineering data archive
* Need to look at API integration with DMPonline (DCC)
* Orbital as tool for managing research projects?
* Ask DCC to visit Lincoln for policy development and training.
* Reporting to DCC is a formal requirement.
* Include costs of MRD in the university overhead when bidding for funds.
* Datasets as an outcome of research projects. More ‘efficient’ to deal with RDM as part of project.
* ‘Market’ for data. Expectation of costs and benefits of MRD

  6. The Nature of Engineering Research Data:

* ERIM (Engineering Research Information Management) covers research activity data as well as research data
* Problems with terminology. Need for definition. Both theoretical and practical/empirical outputs from the project.
* Good slides for terminology and understanding domain
* How does Orbital fit into the VRE puzzle?
* Transparent logging and capture of as much activity data as possible.
* Knowing the context is vital for understanding data. Orbital needs to concentrate on contextual data as much as ‘research data’.
* Orbital supports research lifecycle from bidding to completion?
* ‘Engineering research data’ covers pretty much all types of data.
* Need to identify other types of Engineering users to broaden scope of ‘Engineering data’
* Look outside Engineering for a variety of data types/activity; the approach should be generalisable.
* Data types is one thing; methodologies and the data they produce are another.
* We manage data so that it can be RE-USED (by someone)
* Must not add to bureaucracy of research

Jenkins, build my software!

Orbital is going to be a big bit of software, with lots of things doing lots of other things. A big part of putting together such a large bit of software – alongside our Pivotal Tracker instance – is the regular process of ‘building’ the software from source code into something that can actually be used, testing it and getting it onto our development servers so that we can actually see what it’s doing. As part of Orbital we’re taking a step into what is a relatively unexplored frontier for the development team here at Lincoln – Continuous Integration.

Continuous Integration means that as we develop our software it’s constantly being built, tested and deployed to make sure that it’s behaving as expected. We’re using the popular Jenkins server to manage everything that’s going on as part of this process, as well as provide reports on what’s happened. We’re slowly adding more things to the list of what’s actually happening when the magic starts, but here’s what we’re going to be doing by the end of the project every single time that somebody makes a change to our codebase:

  • Ensure that the source code is available from GitHub.
  • Invoke Phing to do all kinds of additional goodness as part of an automated build (see the sketch of a build file after this list), including:
    • Run unit tests on our code using PHPUnit.
    • Verify that the code adheres to certain style standards (we use the CodeIgniter Style Guide) using PHP Code Sniffer. Specifically, we’re using Thomas Ernest’s implementation of the guide.
    • Run a whole battery of static analysis that looks for messy code structure and duplicated code.
    • Automatically build the technical documentation using DocBlox. This isn’t the end-user documentation, but it does tell us exactly what all our code is supposed to be doing so that we have a reference.
    • Perform token replacement on the resultant codebase. This means that we can keep the code repository clear of all environment and institution specific configuration, since these are replaced as we perform a build.
  • Deploy the built codebase to our development and testing platform so that we can actually use it.
  • Tell us the results of all of the above in a variety of pretty graphs and reports.
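
To make the Phing step less abstract, here is a stripped-down sketch of the kind of build file involved. The target names, paths and tokens are illustrative examples rather than Orbital’s actual configuration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: targets, paths and tokens are examples,
     not Orbital's actual build file. -->
<project name="orbital" default="build">

  <target name="build" depends="test,sniff,docs,configure"/>

  <!-- Run the PHPUnit test suite and fail the build on errors -->
  <target name="test">
    <exec command="phpunit --log-junit build/logs/junit.xml tests"
          checkreturn="true"/>
  </target>

  <!-- Check the code against the CodeIgniter coding standard -->
  <target name="sniff">
    <exec command="phpcs --standard=CodeIgniter application"
          checkreturn="true"/>
  </target>

  <!-- Generate the technical documentation with DocBlox -->
  <target name="docs">
    <exec command="docblox run -d application -t build/docs"/>
  </target>

  <!-- Token replacement: copy the code, swapping placeholder tokens
       for environment-specific values supplied at build time -->
  <target name="configure">
    <copy todir="build/app">
      <fileset dir="application"/>
      <filterchain>
        <replacetokens begintoken="@" endtoken="@">
          <token key="DB_HOST" value="${db.host}"/>
          <token key="BASE_URL" value="${base.url}"/>
        </replacetokens>
      </filterchain>
    </copy>
  </target>

</project>
```

Jenkins then simply invokes Phing on each change to the codebase and collects the logs and reports that each target leaves behind.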


USTLG meeting on research data management

Clare College

Yesterday I was at Clare College, University of Cambridge, for a meeting organised by USTLG, the University Science & Technology Librarians Group. The group—open to any librarians involved with engineering, science or technology in UK universities—meets once or twice a year. The theme of yesterday’s meeting (free to attend, thanks to sponsorship from the IEEE) was data management, with an implied focus on research data.

The meeting consisted of a series of presentations (plus a fantastic lunchtime diversion, below) with plenty of time for networking – there were about 40 people there, all with an interest in research data management – though interestingly, a show of hands suggested very few people were actively engaged in looking after their own institution’s researchers’ data.

As usual, this blog post has been partially reconstructed from the Twitter stream (hashtag #ustlg).

First up was Laura Molloy, substituting for Joy Davidson of the Digital Curation Centre (DCC), presenting a project called the Data Management Skills Support Initiative (DaMSSI), which is looking at the skills needed by the different people involved in the research data curation process [shades of information literacy]. “DaMSSI aims to facilitate the use of tools like Vitae’s Researcher Development Framework (RDF) and the Seven Pillars of Information Literacy model” developed by SCONUL. Key question: how do you assess the effectiveness of research data management training?
