Robotics data now stored in CKAN

I’ve had to delay this post until confirmation of Tom’s project funding came through, but I’m pleased to be able to say that we’ve published our first complete research dataset(s) on CKAN.

Some months ago, Researchers, Tom Duckett and Feras Dayoub, came to us asking if we could host their data to support two publications and an EU grant application they were about to submit. We quickly stuck the data on one of our servers, they knocked up some HTML pages and we advised them on licensing the data so that it could be re-used. It was a temporary solution but we assured them that their root domain name would always act as a proxy to the final resting place of their data and so they started to tell the world about it. I’m told there was much interest in their data on specialist mailing lists and we were invited to submit a paper which discussed the data and the process of its publication. Their consortium bid for EU funding was also successful. Here’s what Tom had to say:

I believe that publishing our datasets for long-term robotic mapping has helped us: 1) to achieve greater awareness of our work (we were among the first groups in the world to study long-term mapping by mobile robots, in research from 2004-present), enabling other researchers worldwide to use our data, 2) to increase citations to our REF-able research papers in this area, and 3) to play our part in successfully applying for a 4-year FP7 IP project in collaboration with 7 other partners, by showing that we already have a track record in hosting such datasets. (STRANDS project – joint PIs at Lincoln: Marc Hanheide and Tom Duckett). One of the requirements of this project will be to publish even larger datasets of robot data, so we look forward to collaborating with Joss and colleagues again in future to address the challenges of hosting and curating “big data” for robotics research.

Prior to switching to CKAN, we were just about to move Tom and Feras’ data across to our own Orbital software, which met their minimal requirements, but having now switched to integrating with CKAN, we’ve moved the datasets to their permanent home at https://ckan.lincoln.ac.uk.

Just as we promised, Tom and Feras are still able to direct people to the original web address we gave them which points to their research pages, but the data itself is now hosted on CKAN. Having seen Tom’s data presented in this way, his colleague Greg published his data in the same way, using our WordPress platform to build a site explaining the data and CKAN as the actual data store.

This all happened before we had our Orbital Bridge publishing workflow in place (a post on that in a couple of weeks) and in the absence of a working Orbital application, I uploaded the data on Tom and Feras’ behalf. I spent quite some time using CKAN and can make the following observations about version 1.7.x, which is what we currently use.

  • Batch uploads: The data was zipped up into four collections of zip files. My task was to duplicate the organisation of the data which made sense to the researchers. This was possible as you can see, but it was tedious uploading each of the 29 zip files, many of which were over 1GB each. There were no problems doing so, it was just tedious and better batch upload/edit operatios in CKAN would make this much easier. Ideally, I’d like to have uploaded the zip files from each of the four collections of data, catalogued them by batch where they shared the same information and then individually edited attributes like the title of each zip file, for example. Having been an Archivist on and off for the last decade, this is one of the main gripes we have with library and archive systems. When dealing with collection of things, we need to be able to operate on them as collections and not have to deal with each object individually. I’ve spoken to CKAN developers about this and there are work-arounds, using scripts and a form extension, but it’s not something CKAN offers to most users with ease. Yet! 🙂
  • Research Groups and projects: The v1.7.x version of CKAN understands the concept of ‘dataset’ e.g. https://ckan.lincoln.ac.uk/en/dataset/ltmro-1 and of that dataset containing discreet resources. e.g. https://ckan.lincoln.ac.uk/en/dataset/ltmro-1/resource/92cbf22b-3293-45a3-b1de-f7782e581fe8 CKAN also understands the concept of ‘groups’ e.g. https://ckan.lincoln.ac.uk/en/group/lincoln-centre-for-autonomous-systems which datasets can be attached to. Groups are simply a label you apply to a dataset. You can add people to a group with specific read/write permissions over the group and you can add datasets to the group, too. CKAN also maintains a history of the actions of that group e.g. https://ckan.lincoln.ac.uk/en/group/history/lincoln-centre-for-autonomous-systems However, currently, CKAN does not (yet) understand ‘projects’, i.e. an organisational concept that is role-based and allows a user to administer other users and work. Groups are not synonymous with projects, but we think that a new feature in CKAN v2.0, due for release in a month or so, will resolve this. As I understand it, CKAN organisations will work like Github organisations and if so, that’s good. On Github, our research group, LNCD, is an ‘organisation’ and within that organisation I can add/remove people, give them roles, create private and public repositories (‘datasets’) and we can be members of more than one organisation, too. e.g. http://github.com/lncd and http://github.com/josswinn There is already a CKAN extension that implements organisations, but we’re waiting for this work to be merged into the core code.
  • Citations: If you look at Tom’s original web pages for their data, they are pretty clear in providing details about how to cite their data. This is so important to academics. CKAN does not offer a way to automatically generate a suggested citation for people who use the data. EPrints, on the other hand, offers the citation details of a research paper right at the top of the publication record e.g. http://eprints.lincoln.ac.uk/6046/ Some work on citations for CKAN has happened – there were conversations a few weeks ago on the IRC channel – but it’s something we need to work on, too. As a temporary solution, I have added the paper citation details as additional fields in the dataset record. CKAN is nice in that it allows you to add adhoc key-value pairs when cataloguing. However, this doesn’t address the citation details for the actual datasets themselves, but rather the publications.

In the near future, our ‘Researcher Dashboard’ application (codenamed ‘Orbital Bridge’) will handle the data deposit workflow from project creation to grabbing a datacite DOI to setting up a CKAN environment, to depositing a record of the data in ePrints for curation and preservation by the university. However, the upload and cataloguing of data will still be done by the researcher using CKAN, with Orbital aggregating information about the project, publications and data into a ‘dashboard’ for the researcher. Something like this  below, which is an actual screenshot of another project that we’re using to test the ‘Researcher Dashboard’. More on this soon…

Example research project overview
Example research project overview

The thigh bone’s connected to the hip bone…

One thing that has been high up the list of considerations during the development of Orbital has been how it integrates with the rest of the University. Whilst a lot of new services tend to exist in relative isolation, making use of scheduled batch imports to keep themselves in step with things like staff lists, Bridge is designed to tie in to the very core of the University’s data platform. The benefits are numerous enough to make it worth the additional development and network overhead, since we’re able to provide a truly continuous experience between previously disparate systems.

Orbital revolves around research projects as its basic unit of data, something which we already had the capacity to store within our Nucleus data model as part of work on the Staff Directory. Whilst there is an awful lot more that Orbital wants to know about research than is relevant to the Directory it made no sense to create yet another list of research projects and introduce a second place to keep things updated. Instead we extended Nucleus’s understanding of a research project to include the new aspects such as linking multiple researchers to a single project, a more complex model for funding and so-on. What this now means is that both Orbital and the Directory share the same data, and when a staff member adds a new project to their research dashboard in Bridge it will appear seamlessly on their Staff Profile.

Since we’re using Nucleus to provide more and more data, as well as sending data in both directions, we took the opportunity to start building a more robust solution for the sending and receiving. What we came up with was a PHP library built on the Guzzle HTTP client framework. Although this is very early in development (your contributions to the code are welcome) it gives us a controllable, standardised platform which we can use to both request data from Nucleus and send data back, taking care of issues such as formatting and encoding. Even better, since the library is ready to go with Composer, we (and anybody else interacting with Nucleus over PHP) can include it in their project with a single line in their configuration.

This brings us back full circle to the Staff Directory, which as of the next version will be making use of this library to communicate with Nucleus. As new solutions are put together which rely on the Nucleus platform this library will be extended further until its our standard way of getting and updating data, adding a layer of abstraction where we no longer care how the data arrives at the application or how it is makes its way back to Nucleus.

The upshot of all this interconnectivity? We can build a brand new application off the back of our research project data very quickly, changing what would have taken weeks or even months into a matter of days.

Choosing CKAN for research data management

The switch to CKAN was an important decision for the Orbital project and I’d like to think that it will help raise the profile of CKAN within the academic community. We’d been keeping an eye on CKAN development from earlier on in the year, but it was the opportunity to talk to Mark Wainwright, OKFN Community Co-ordinator, at the Open Repositories 2012 conference that prompted us to really look at the potential of using CKAN as part of Lincoln’s Research Data Management infrastructure. Mark’s OR2012 poster (PDF) provides an nice overview of what CKAN currently offers.

Before I go into more detail about why we think CKAN is suitable for academia, here are some of the feature highlights that we like:

  • Data entry via web UI, APIs or spreadsheet import
  • versioned metadata
  • configurable user roles and permissions
  • data previewing/visualisation
  • user extensible metadata fields
  • a license picker
  • quality assurance indicator
  • organisations, tags, collections, groups
  • unique IDs and cool URIs
  • comprehensive search features
  • geospacial features
  • social: comments, feeds, notifications, sharing, following, activity streams
  • data visualisation (tables, graphs, maps, images)
  • datastore (‘dynamic data’) + file store + catalogue
  • extensible through over 60 extensions and a rich API for all core features
  • can harvest metadata and is harvestable, too

You can take a tour or demo CKAN to get a better idea of its current features. The demo site is  running the new/next UI design, too, which looks great.

CKAN’s impact

In its five years of development, CKAN has achieved significant impact across the world. Despite web scale open data publishing being a relatively recent initiaitve, CKAN, through the efforts of OKFN, is the defacto standard for the publishing of open data with over 40+ instances running around the world. How do the UK, Dutch, Norweigan and Brazilian governments make their data publicly accessible? The European Commission? They use CKAN.

On the flip side, CKAN has attracted significant interest from developers with 53 code contributors over 5 years and 60+ extensions.

Major CKAN changes since Orbital project began

When we first bid for the JISC MRD programme funding, CKAN was a less attractive offering to us. Our bid focused on an approach we’ve taken on a number of projects, using MongoDB as a datastore over which we built an application that adds/edits/reads data via a set of APIs we would write. Our bid also focused on security and the confidentiality of commercial engineering data. Since starting the Orbital project these concerns have been addressed or are being addressed by CKAN and the requested features we’ve identified through our engagement with researchers have also been integrated into CKAN, such as activity streams and data visualisation. Reading through the CKAN changelog shows just how much work is going into CKAN and with each release it’s developing into a better tool for RDM. Here are some of the headline features, in order of priority, that have turned our attention to CKAN over the course of the Orbital project.

CKAN in an academic environment

We’ve discussed the idea of a Minimum Viable Product for RDM, and consider it to be authentication, data storage, hosting/publishing, licensing, a persistent URI and analytics. These features alone allow an academic to reliably and permanently publish data to support their research findings and help measure its impact. CKAN meets these requirements ‘out of the box’. Other requirements of a tool for managing research data include the following (you can add more in the comment box – these are based on our own discussions with researchers and a quick scan of other JISC MRD projects)

  • Integration with the institutional research environment (e.g. hooks into CRIS system, Institutional Repository, DMPOnline, networked storage)
  • Capturing the research process/context/activity; notation, not just data
  • Controlled access to non-Lincoln staff e.g. research partners
  • Good, comprehensive search tools
  • Version control for data and metadata
  • Customisable, extensible meatadata
  • Adherence to data standards e.g. RDF
  • Multi-level access policies
  • Secure, backed up, scalable file storage for anywhere access to files and file sharing (e.g. Dropbox)
  • Command-line tools and good web UI for deposit/update of data
  • Permanent URIs for citation e.g. DOIs
  • Import/export of common data formats
  • Linking datasets (by project, type, research output, person, etc.)
  • Rights/license management
  • Commercial support/widely used, popular platform (‘community’)

RDM features that are currently lacking in CKAN

During our meeting with OKFN staff last month we identified several areas that need addressing for CKAN to meet our wider requirements for RDM. These are:

  • Security: CKAN is not lacking in security measures, but we need to look at CKAN’s security model more closely (roles, permissions, access, authentication) and also tie it into the university’s Single Sign On environment
  • ‘Projects’ concept: We think that the new ‘organisations‘ feature might work conceptually in the same way as this.
  • Academic terminology + documentation for academic use: We need to review CKAN and write documentation for an academic use case as well as provide a modified language file that ‘translates’ certain terminology into that more appropriate for the academic context.
  • Batch edit/upload controls. Certain batch functions are available on the command line, but out of the box, there’s no way to upload and batch edit multiple files, for example.
  • ownCloud integration: CKAN doesn’t provide the network drive storage that researchers (actually pretty much everyone) relies on to organise their files. Increasingly people are using Dropbox because of the synchonisation and sharing features. These are important to researchers, too, and moving data from such a drive to CKAN will be key to researchers adopting it.
  • EPrints integration (SWORD2): A way to create a record of CKAN data in EPrints, thereby joining research outputs with research data.

It’s these features that we’ll be concentrating on in our development on the Orbital project.

Harry and I are attending the Open Knowledge Festival in Helsinki later this month and will talk more about our choice of CKAN for research data. I’d be interested to hear from anyone working in a university who has looked at CKAN in detail and decided against using it for RDM. It seems odd to me that it has such a low profile in academia (or maybe I’m just clueless??) and I do think that the time has come to embrace CKAN and acknowledge the efforts of OKFN more widely. I know there are people like Peter Murray Rust and Mark MacGillivray, who are actively trying to do this and OKFN’s presence at Dev8D and OR2012 this year demonstrates its eagerness to work more closely with the university sector. Perhaps we’re near a tipping point?

Orbital: Impact and benefits

I’ve been asked to highlight the benefits of running the Orbital project so far. We’re seven months or so into Orbital, which is an 18 month project. In our initial Project Plan, we identified the anticipated impact of the project and so I’ll use that list in this blog post to reflect on the impact and benefits so far. The headings and text in italics are copied from the Project Plan.

Research practices

Researcher’s data management practices will change, supported by technologies that encourage new processes in the administration and dissemination of data.

We’ve had very little impact in this area so far. It’s too early to impact on researchers’ practices when we’re still developing our own knowledge and the infrastructure to support RDM. Changing researchers’ practices takes time. However, there are indications that it will happen. Engagement with our user groups and ad hoc requests for help from researchers who know we’re working on the project has shown us that researchers do want to change their practices. Our recent DAF survey also told us that researchers know that their practices could be improved and where they need support to do so.

Internal auditing

Greater oversight and analysis of research data created by researchers will be possible.

We’ve had no impact here and won’t until the Orbital software is built. We will be working on this in the next version of Orbital and our recent effort at the MRD Hack Day around activity data was a precursor to this work. We are increasingly seeing activity data as a key component of Orbital and this was underlined early on in the project when Mansur Darlington from the ERIM and REDm-MED projects stressed the importance of capturing contextual metadata.

Research governance

Improved methods of auditing research undertaken by the university will be possible, enabling greater cross-disciplinary work.

This relates to the benefit above around capturing activity data and improving ‘business intelligence’. As yet no real impact, but we still anticipate Orbital being a useful tool for reporting and enabling greater cross-disciplinary collaboration though greater transparency. Related to this is the creation of RDM Policy, which we have begun and has resulted in a statement being made for the EPSRC RDM ‘Road Map’.

Integrated services

Research data management will be integrated into existing systems, such as staff profiles, the institutional repository, blogs and calendars. Towards a Virtual Research Environment.

We have had a direct impact on the creation of new staff profiles at the university. Nick Jackson, Lead Developer on Orbital has been working with Alex Bilbie in the ICT Online Services Team to create an aggregated profile for staff. We blogged about it earlier and you can see my example. Profiles of staff are now aggregated from different systems and stored as RDF Linked Data and we intend to pull in activity data from Orbital to further enrich staff profiles. In this way, our work is being recognised and valued by other teams in the university. Furthermore, the university has recently procured a new ‘Awards Management System’, which will provide data to Orbital about funded research projects and we intend to couple Orbital with EPrints using SWORD2.

What is clear from our discussions with users is that they expect Orbital to do more than simply store and publish research data. Without using the term, they are effectively asking for a Virtual Research Environment (VRE). This is something which we did anticipate and have always planned for Orbital to be a tool for both analysing and publish research data. When discussing ‘Research Data Management’, there is a fine line between DMP planning, research project management, team workspaces and public web publishing and while we need to be careful that the scope of Orbital does not creep, we are sensitive to on-going user requirements.

FOI compliance

Will make FOI requests easier to respond to or unnecessary.

We’ve had no impact here so far, nor are we yet in a position to.

Open Data

Will promote and enable the production of public data sets.

As I will discuss in a forthcoming blog post about our first release of Orbital, we have had some impact here and have witnessed the benefits. In summary, a researcher contacted us for help with publishing some data, the result of which was an invitation to write a journal article about the data, offers of collaboration and the strengthening of an EU grant application.

Our workshop on open licensing has also led to a further meeting between myself (Joss Winn, Orbital PM), and the university’s IP Manager. A further follow up meeting is planned to draft guidance for staff on the use of open licenses for source code and data. Furthermore, research staff are being directed to the Orbital team for informal advice on open licensing. In this sense, we are beginning to improve the awareness and understanding of open licenses among researchers.

The innovation cycle

Will embed new technologies and culture change among professional staff at the university and lead to further innovation in our services.

We are having some impact here and have had meetings with central ICT staff about integrating our server farm with cloud services. We are currently developing the Orbital software using Rackspace, but have recently ordered hardware, partly paid for by the project, to establish a private cloud, running on OpenStack, for research and development at the university. In addition to this, our development toolchain has changed and we now have tools and processes in place that we did not have six months ago. These are being adopted outside of the Orbital project by staff within the ICT Online Services Team and other projects we are running. In addition to this, we intend that the changes we make to our own R&D tools and processes, are made available to other researchers and students. Over the summer, we will set up and maintain a university-wide Gitorious source code repository service (similar to Github), where staff and students who write code can form teams, manage their source code, and publish it if they wish. We also intend to run a Jenkins server for similar purposes so that all staff and students can benefit from source code control and the quality assurance processes that we have implemented through Orbital. Orbital is now a driver for a general R&D infrastructure for Academic Computing that project members and wider members of LNCD  are building.

I will write more about this at a later date because, for someone who manages R&D infrastructure projects at the university and wishes to engage staff and students in our work, I am excited to be able to integrate this into academic programmes and the work of other researchers.

I want to also stress that like all of our projects, the benefits, however slight, spill into other aspects of our work. Being a large project, Orbital has allowed us to concentrate on developing our toolchain and development environment across other projects, it’s given us time to learn new skills and share our learning with colleagues. In this way, it has been pivotal in the way we work and the future direction of our work.

Recruitment

Will build capacity for local development of innovative services

Orbital has allowed us to recruit two full-time Developers (Nick Jackson and Harry Newton). We are therefore two staff up and it is my intention to try and keep it that way.

Staff skills

Will improve staff skills and experience

Yes, we are benefitting in this way. The Orbital project team are now the RDM ‘experts’ in the university and despite being novices in this regard, over the course of the project staff working in the Library, ICT, Research and Enterprise Office and Centre for Educational Research and Development, are each developing their understanding of the processes and implications of RDM.

Clearly an 18 month project (at least the way I run them!), allows for staff to learn new tools and skills, experiment with new methods of working and disseminate this learning to other staff. This is one constant that I value highly about our project work. Despite the stop and start nature of project work and that not all of our work eventually makes it into a fully fledged university services, the tools and learning, especially as we engage more with academic programmes, goes beyond the confines of the project and is most satisfying.

I have written more about my interest in how hackers learn and the university as a hackerspace.

Culture change

Will change the research culture of the university by improving the tools available for managing and sharing data.

From the point-of-view of RDM, this is closely related to the first anticipated impact/benefit. We cannot claim to have any real evidence of benefits or impact at this stage on how we manage research data. However, as I’ve noted several times above, our use of the cloud, our advocacy of open licensing, our implementation of new R&D tools and processes, are also part of ‘culture change’ at the university. Furthermore, due to the DAF survey, the Orbital project is now widely known by researchers beyond our initial user group in the School of Engineering, and through our reporting to the university Research, Innovation and Enterprise Committee, staff at all levels are made aware of our work. Gradually, the idea of ‘research data management’ is being understood.

Technology choices

Will influence future choices in technologies (both locally developed and outsourced).

Yes! See above.

HE sector R&D

Contributes to innovative R&D in the HE sector

Yes, I think we are beginning to do this and the benefits so far are around shared learning among developers across different projects. We were instrumental in early discussions about the DevCSI MRD Hack Day and three of us contributed to the two day event. We blog regularly on this site (around 50 posts so far) and share our work with anyone who is interested (see the links in the sidebar).

Public Sector data management

As yet, the Orbital project can only claim to have resulted in one research dataset being published (again, more on this soon as I want to explain it in more detail). However, Orbital has grown out of our work over the last couple of years around managing and re-using institutional data, resulting in data.lincoln.ac.uk. We are also active members of the data.ac.uk initiative and I chaired the data.ac.uk panel at Dev8D this year.

Efficient re/use of resources

Demonstrably re-uses and builds on previous work, both funded and non-funded projects.

Yes, this was an early benefit of the project. We are building on our previous work and what we have learned from it in past projects. Our use of MongoDB, our work on staff profiles, our use of OAuth, and our API-driven approach to development, all build on past projects, funded and un-funded.

Keeping Research Data Safe and the benefits of Orbital

Note of apology: early in December 2011 we attended the launch event of the JISC Managing Research Data programme at the National College for School Leadership in Nottingham. I managed to blog day 1 of the event there and then. Unfortunately my notes on day 2 fell into an abyss. Here they are: late, but unscathed.

Aspire!

The first exercise (on this second day of the programme launch event) was to examine the benefits and metrics checklists provided by the KRDS frameworks project, and to identify the benefits that Orbital will provide & that we can measure. Then to blog a first statement of the benefits we expect Orbital will generate.

KRDS = Keeping Research Data Safe

Notes from Neil Beagrie‘s presentation on the benefits analysis toolkit (which I have already blogged about at the RDMF7 event, but noted here in more detail.)

  • There are two strands to the KRDS toolkit. These tools can be combined for maximum effect (and to reduce wasted effort); tools can also be customised to specific project needs:
    1. The KRDS Benefits Framework (guide + worksheet)
    2. The KRDS/I2S2 value chain and benefit impact tool (guide + impact statement + impact analysis worksheet)
  • Designed for use by wide audience over the full RDM project lifecycle.
  • Conisider the KRDS Benefits Framework ‘triangle’
    • What is outcome? direct/indirect
    • When is it received? near-term/long-term
    • Who benefits? internal/external
  • Tips: quantitative benefits must be measurable (“cashable“) – if not within the project lifecycle then longer-term benchmarking… qualitative benefits could take the form of case studies (working in a team can help to tone down the subjectivity of benefit assessments. Don’t go it alone!)
  • More information at: http://beagrie.com/krds-i2s2.php
  • Previous RMD programme produced benefits report & case studies which can be useful reference points.

Practical workshop

The KRDS benefits and metrics handouts provided here were extremely useful in developing this first statement of benefits for the Orbital project.

Points from the round-table discussion:

  • Checklist v useful brainstorming exercise – not a to do list!
  • Want to do everything and world peace too
  • But how make relevant to project? Target useful examples of top-level things
  • How evidence?
  • Lack of evidence/measurement not a reason not to do it – think of a way of measuring!”
  • Don’t rely on q’aires 🙂
  • Think of benefits from the programme as a whole into which orbital can feed in
  • Practical time & efficiency savings for researcher – i.e. not having to go to london with a USB in pocket
  • Similarities engineering with other applied – e.g. NHS
  • Case studies/user story – iterative method  – as user requirements change (become more mature) – that’s a way of measuring benefit!
  • Set actions for the steering group / RIEC

Benefits of Orbital

This is the list of benefits we came up with. Bear in mind, some of them are benefits specific to an MRD project, such as Orbital, but some of benefits of any large project where the institution has a vested interest. Note that some of these can also be found in the ‘Anticipated Outputs and Outcomes’ section of our Project Plan. As Joss mentions in the post on awareness of open source, not all benefits can be anticipated and there may be outcomes of the project, which are quite tangential to the original objectives. We especially look forward to those!

  • Very mention of Orbital attracting expressions of interest from research staff applying for funding. Researchers have to consider RDM when writing bids. We’re doing their work for them!
  • Knock on effect on other university services: authentication, repository, staff profiles, cloud computing, software development environment and methodology, open source awareness and guidance.
  • Supports the development of RDM plans and policies.
  • MRD programme activity is akin to staff training and development of a community of practice.
  • Combines and improves our understanding about research administration, research methods, research data and research outputs.
  • Changes to researcher practices. Improves RDM practices.
  • Should reduce institutional risk (legal liabilities of commercial contracts)
  • Simplifies collaboration among researchers
  • Produces open source software for re-use
  • Provides rapid access to results and derived data
  • Increases awareness of support among researchers. e.g. Aids grant writing.
  • Produces reliable citations of research data
  • Embeds institutional support and training
  • No recreation of existing data. Better security, greater efficiency.
  • Improved version control and transparency.
  • Improved understanding of research methods.
  • Further thinking about and planning for the sustainability of institution-wide services. Who pays?