CKAN for RDM workshop

On the 18th February, we ran a workshop in London which focused on the use of CKAN for research data management. The Orbital project made the decision to use CKAN last summer and was soon followed by Bristol’s data.bris project, which is using CKAN for its discovery catalogue. Simon Price from Bristol, gave a very interesting presentation of their work with CKAN, which you can read about on their project blog.

The #CKAN4RDM workshop was fully booked with 40 delegates attending – many more than we originally anticipated. It was facilitated by Simon Hodson, the Programme Manager of JISC’s Managing Research Data programme. Following presentations from Lincoln and Bristol on our respective uses of CKAN (ours was a live demo of ‘Orbital Bridge‘), we spent the later part of the morning undertaking a requirements gathering exercise, where tables of around 8-10 people acted as different users, providing ‘stories’ (requirements) for a research data management system. The exercise was introduced in the following few slides.

This was a useful exercise regardless of the software used, but after collating all 70+ stories over lunch, we then returned to our user groups and each table worked with a CKAN expert from the Open Knowledge Foundation to discuss the existing constraints for each requirement and started to develop a gap analysis so as to identify work to be done. The output of this work can be viewed on Google docs.

Types of users
Types of users
The 'researcher' user group
The ‘researcher’ user group

 

There was quite a positive buzz about the day and general feedback suggested that delegates got a lot out of the event. You can read write ups from the DCC, LSE and the Datapool project at Southampton.

One of the original purposes of the workshop was research for a conference paper that I (Joss) am giving at the IASSIST conference in Cologne, in May. The abstract I submitted to the conference was as follows:

This paper offers a full and critical evaluation of the open source CKAN software <http://ckan.org> for use as a Research Data Management (RDM) tool within a university environment. It presents a case study of CKAN’s implementation and use at the University of Lincoln, UK, and highlights its strengths and current weaknesses as an institutional Research Data Management tool. The author draws on his prior experience of implementing a mixed media Digital Asset Management system (DAM), Institutional Repository (IR) and institutional Web Content Management System (CMS), to offer an outline proposal for how CKAN can be used effectively for data analysis, storage and publishing in academia. This will be of interest to researchers, data librarians, and developers, who are responsible for the implementation of institutional RDM infrastructure. This paper is presented as part of the dissemination activities of the JISC-funded Orbital project <https://orbital.blogs.lincoln.ac.uk>.

As well as using last week’s outputs of the CKAN4RDM workshop, I’ll also be working closely with OKF staff to ensure that the evaluation is as thorough, accurate and up-to-date as possible by the time of the conference. It will focus on version 2.0 of CKAN, which is due for release soon.

I’d also like to appeal to other JISC MRD projects to send me any existing requirements documents you have produced during the course of your project. I will use the anonymised data to enrich the requirements we gathered last week. If you have such documents, please email me.

Finally, we have set up a CKAN4RDM mailing list, which anyone is welcome to join to discuss the use of CKAN within academia. One thing is clear to me: the academic community cannot expect OKF and existing CKAN developers to meet all of our requirements for research data management. We need to contribute developer time and other resource and effort to the overall CKAN open source project, just as other public sector organisations are doing.

 

ownCloud: An ‘academic dropbox’?

Following up on our post from the MRDHack day, what follows is an evaluation of ownCloud as an institutional alternative to Dropbox. Our DAF survey showed that researchers at Lincoln require better managed storage space than the current 1GB FTP “H: Drive” provided to each staff member. Many of them are using portable drives, USB sticks and cloud-based services such as Dropbox to store and share their research data. Services like Dropbox provide compelling advantages to more traditional storage. Dropbox provides, for free, double the storage on offer to Lincoln researchers at present; it is always backed up, existing on both the local machine and on Amazon’s servers, it offers versioning for files up to 30 days old with the free account and ‘forever’ for paid accounts, it is accessible from almost all devices with Linux, OS X, Windows, iOS and Android clients available. Files can be published to the web or shared privately with other Dropbox users.

We know, however, that researchers using Dropbox are doing so without a clear understanding of the terms and conditions of the service and would ideally like a similar service to be provided internally by the University, where we can retain control of the data and its associated security.

We were first made aware of ownCloud when D’Arcy Norman blogged about his initial trial of it. ownCloud is an open source tool, which provides the same features as Dropbox (and more). With the release of version 4 in May, it appears to be a credible alternative to Dropbox for institutions wishing to provide a modern storage solution for their staff and students. D’Arcy’s initial experience with ownCloud was promising but he found issues with the syncing of files. Our recent tests of ownCloud have found that these problems have now been resolved with a recent update and what follows is an evaluation of ownCloud version 4.0.6, looking at it from three perspectives:

  1. ownCloud as a general purpose storage technology for an academic community
  2. ownCloud as a storage technology for research data
  3. ownCloud as a technology for integration with Orbital

Storage for an academic community?

ownCloud is an AGPLv3 licensed open source project, which started in January 2010. The project is run by a company, also called ownCloud, which provides commercial services and support for its software. The company resides in both Germany and the USA. The development of ownCloud is also open and supported by standard tools for open source projects: a source code repository, bug tracker, IRC channel, mailing list, wiki and forum. There are currently 13 core members of the project and 34 contributing developers. Development of the code is currently very active with changes made several times a day. The ownCloud project was started by the KDE community (it has no dependencies on KDE) and therefore benefits from the involvement of experienced open source developers. The software is written in PHP and can use MySQL, PostgreSQL or SQLite for its database. As of version 4, ownCloud has the following features:

  • Web user interface for file uploads and management of account and other features. Files can also be uploaded from an existing URL.
  • Windows/Mac/Linux/Android/iOS synchronisation clients.
  • WebDav integration for direct access to file storage. ownCloud has its own WebDAV server.
  • Folder and/or file sharing: publish to the web or share with groups or individuals.
  • File versioning.
  • An API for application integration.
  • Previewing for a number of filetypes.
  • Server side file encryption.
  • LDAP integration.
  • Notifications.
  • ownCloud can be installed in a PHP/MySQL environment on both Linux and Windows servers.

ownCloud also provides a number of other applications that can be activated or not, including a calendar, task list, contacts management, a built in text editor, image management, and experimental support for FTP, Google Drive and Dropbox integration. These are all available from the built in ‘app store’ and configurable by administrators. Maximum quotas for each account can be specified on an account-by-account basis, and a maximum file upload size can also be specified if needed.

The roadmap for version 5, due in August 2012, list the following:

  • Inter-ownCloud Sharing
  • Ajax interface
  • Mozilla Sync Integration
  • Improved permissions
  • Mounting of Dropbox and Google Drive
  • Improved version control

Continue reading “ownCloud: An ‘academic dropbox’?”

Eating Your Own Dog Food: Building a repository with API-driven development

We’re in Edinburgh, at Open Repositories 2012, and will be presenting our paper at 9am tomorrow morning (yes, that’s right, the morning after the conference dinner!). Here’s the paper we’ll be discussing.

As part of its project to develop a new research data management system the University of Lincoln is embracing development practices built around APIs – interfaces to the underlying data and functions of the system which are explicitly designed to make life easy for developers by being machine readable and programmatically accessible.

http://eprints.lincoln.ac.uk/5962/

Eating Your Own Dog Food

View more presentations from Nick Jackson

High-level overview of University Research Information Systems (VRE:RIM)

As part of Orbital, we’ve started to seriously consider how RDM (Research Data Management) fits into the whole range of University systems to support research information. In some contexts (usually research support/administration), we might use the term RIM (“Research Information Management”); in other, academic, contexts, we might talk about a VRE (“Virtual Research Environment”) – but I prefer to think of both as functions of the same set of systems.

Below is a first attempt to model various University research systems and what information might be shared/pass between them. This is a first draft only and will be discussed/developed over time – it’s here for comments!

Screenshot of the VRE RIM diagram

It’s available to view on Lucidchart.

Shared, versioned network drives

I’m at the DevCSI #mrdHackday with Nick, Harry and about 30 other people interested in hacking around research data. One of the user requirements identified among some MRD projects is the need for personal and shared networked workspaces i.e. a desktop drive for dumping, organising and sharing research data.

In our recent survey of researchers at Lincoln, we learned that many academics (myself included!) are using Dropbox as a way to share project files and research data among partners. It has the advantage over the FTP ‘H Drive’ that Lincoln staff are given in that Dropbox offers more storage and folders/files can be shared among people both inside and outside the university. The first couple of GB of storage is free and the pricing is clear when you need more space.

Just as researchers surveyed said they were using Dropbox, they also acknowledged in the survey that this isn’t an ideal situation. It’s being held by a third-party service, it’s runs unreliably on our university desktops, there’s a 30 day version history, but there’s no information about what changes were made and no way to compare versions. Part of the Orbital implementation plan is to provide an alternative to Dropbox and other similar network drives to Lincoln researchers. One that (probably) runs over HTTP, does version control properly, can be accessed through a web interface if necessary, and can be shared securely. The DataFlow project at Oxford has gone down the route of using WebDav for remote file storage and sharing and it’s an area we should investigate, too. There is a WebDav extension that provides versioning, too.

Of all the comments by researchers who responded to our survey, the clearest message which united them was for more, secure, backed up, and flexible storage. Within Orbital, we’ve been thinking about how Git (or a similar versioned source code repository tool) could be used to provide this functionality. Git is a proven and popular repository tool for managing text files, developed for the Linux kernel project and the basis for the popular Github ‘social network’ for developers. Jez Cope from Bath mentioned that there was an open source desktop tool called SparkleShare that provides a folder on your PC, just like Dropbox, Google Drive and Ubuntu One do, and uses Git as its backend. Jez and I have been playing with SparkleShare the last couple of days, having installed the Mac client on our laptops and it shows some promise but also needs some further consideration and effort to meet our immediate requirements for RDM. Jez has written a companion post about this, too.

SparkleShare for RDM

Pros

Multi platform GUI client

Easy to install

Relatively mature, actively maintained open source project

Version control built into backend (Git)

Notifications of changes to folder contents

Cons

Git isn’t built for handling large, binary files

Version control not built into desktop client (shows a high-level history of changes, but no roll-back functionality)

Sharing folders not built into desktop client

Next steps?

If Git isn’t the right choice of backend, SparkleShare can use something else. Whatever the underlying versioned repository technology, SparkleShare currently lacks detailed versioning information and roll-back functionality, which is in the backend repository. Presumably it could be surfaced and further functionality built around it. Likewise, a more convenient way to share repository folders with other people could be added to the client. Currently, you need to share the repository with them outside of the client.

Windows Explorer integration

Most researchers are using Windows as their OS, so it’s worth looking at the integration with Windows Explorer that other tools use. The DATUM project selected Bazaar over Git because they found the integration (TortoiseBZR) with Explorer to be better. I have found the standard Git tools for Windows Explorer to be pretty good, too. Neither provide the transparent functionality of SparkleShare or Dropbox.

Handling big files

The git architecture simply sucks for big objects. It was discussed somewhat durign the early stages, but a lot of it really is pretty fundamental. The fact that all the operations work on a full object, and the delta’s are (on purpose) just a very specific and limited kind of size compression is just very ingrained… Personally, I think the answer is “git is good for lots of small files”. It’s very much what git was designed for, and the fact that it doesn’t work for everything is a trade-off for the things it _does_ work well for.

So says Linus Torvalds, the creator of Git (and the Linux kernel). Git and other source code repository software were not designed to handle big files. However, there are other Git-based and alternative projects that are addressing this. git-annex is a mature well-documented and maintained project that

allows managing files with git, without checking the file contents into git. While that may seem paradoxical, it is useful when dealing with files larger than git can currently easily handle, whether due to limitations in memory, time, or disk space. Even without file content tracking, being able to manage files with git, move files around and delete files with versioned directory trees, and use branches and distributed clones, are all very handy reasons to use git. And annexed files can co-exist in the same git repository with regularly versioned files, which is convenient for maintaining documents, Makefiles, etc that are associated with annexed files but that benefit from full revision control.

git-annex includes a use case on its home page that speaks to the RDM domain:

use case: The Archivist

Bob has many drives to archive his data, most of them kept offline, in a safe place.

With git-annex, Bob has a single directory tree that includes all his files, even if their content is being stored offline. He can reorganize his files using that tree, committing new versions to git, without worry about accidentally deleting anything.

When Bob needs access to some files, git-annex can tell him which drive(s) they’re on, and easily make them available. Indeed, every drive knows what is on every other drive. more about location tracking

Bob thinks long-term, and so he appreciates that git-annex uses a simple repository format. He knows his files will be accessible in the future even if the world has forgotten about git-annex and git. more about future-proofing

Run in a cron job, git-annex adds new files to archival drives at night. It also helps Bob keep track of intentional, and unintentional copies of files, and logs information he can use to decide when it’s time to duplicate the content of old drives. more about backup copies

The git-annex website has a useful page that discusses what it is not and it points to Sharebox as a FUSE filesystem built on top of git-annex. The project doesn’t look as mature as SparkleShare, but it’s good to see work being done on this, as the use case for Sharebox is very close to what I think several RDM projects are looking for. The git-annex website also points to other projects that are worth considering:

git-annex is more than just a workaround for git limitations that might eventually be fixed by efforts like git-bigfiles.

git-bigfiles does not tackle the same use cases that SparkleShare and Sharebox are focused on, but could perhaps provide the backend to such tools.

git-media has the advantage of using git smudge filters rather than git-annex’s pile of symlinks, and it may be a tighter fit for certain situations. It lacks git-annex’s support for widely distributed storage, using only a single backend data store. It also does not support partial checkouts of file contents, like git-annex does.

git-media is also a command-line tool and therefore provides only part of the solution to a ‘Dropbox alternative’ for big files. It doesn’t look like there’s been very much activity on the project in the last couple of years.

Boar implements its own version control system, rather than simply embracing and extending git. And while boar supports distributed clones of a repository, it does not support keeping different files in different clones of the same repository, which git-annex does, and is an important feature for large-scale archiving.

Boar does not use git, but is an alternative “version control and backup for photos, videos and other binary files.” It is not a distributed version control system either, but “does however work well with repositories on mapped network file systems, such as Windows shares and NFS.” The rationale for Boar is worth reading as it addresses many of the problems found in the RDM domain. It’s a well-maintained and well documented project, which, like git-annex, was clearly written to tackle genuine archival problems.

Boar aims to be the perfect way to make sure your most important digital information, like pictures, movies and documents, are stored safely.

  • Boar makes it possible for you to restore any or all of your files from any point in time.
  • Boar makes it easy to maintain verified backups of your data, including file history.
  • Boar imposes no limits on file or repository sizes.
  • Using boar is an effective way to prevent data loss due to human or machine error.

If you are familiar with vcs software such as Subversion, you might think of boar as “version control for large binary files”.

This sounds like an ideal tool for expert users willing to use the command line for managing large research datasets and binary files and it would be worth looking at how much work it would take to write GUI client or Windows Explorer integration as an alternative to Dropbox.

In summary, there are robust command line tools suitable for managing workspaces for research data over a network, but more work is required to build effective, simple graphical clients that can be used by any researcher.