Orbital: Impact and benefits

I’ve been asked to highlight the benefits of running the Orbital project so far. We’re seven months or so into Orbital, which is an 18 month project. In our initial Project Plan, we identified the anticipated impact of the project and so I’ll use that list in this blog post to reflect on the impact and benefits so far. The headings and text in italics are copied from the Project Plan.

Research practices

Researcher’s data management practices will change, supported by technologies that encourage new processes in the administration and dissemination of data.

We’ve had very little impact in this area so far. It’s too early to impact on researchers’ practices when we’re still developing our own knowledge and the infrastructure to support RDM. Changing researchers’ practices takes time. However, there are indications that it will happen. Engagement with our user groups and ad hoc requests for help from researchers who know we’re working on the project has shown us that researchers do want to change their practices. Our recent DAF survey also told us that researchers know that their practices could be improved and where they need support to do so.

Internal auditing

Greater oversight and analysis of research data created by researchers will be possible.

We’ve had no impact here and won’t until the Orbital software is built. We will be working on this in the next version of Orbital and our recent effort at the MRD Hack Day around activity data was a precursor to this work. We are increasingly seeing activity data as a key component of Orbital and this was underlined early on in the project when Mansur Darlington from the ERIM and REDm-MED projects stressed the importance of capturing contextual metadata.

Research governance

Improved methods of auditing research undertaken by the university will be possible, enabling greater cross-disciplinary work.

This relates to the benefit above around capturing activity data and improving ‘business intelligence’. As yet no real impact, but we still anticipate Orbital being a useful tool for reporting and enabling greater cross-disciplinary collaboration though greater transparency. Related to this is the creation of RDM Policy, which we have begun and has resulted in a statement being made for the EPSRC RDM ‘Road Map’.

Integrated services

Research data management will be integrated into existing systems, such as staff profiles, the institutional repository, blogs and calendars. Towards a Virtual Research Environment.

We have had a direct impact on the creation of new staff profiles at the university. Nick Jackson, Lead Developer on Orbital has been working with Alex Bilbie in the ICT Online Services Team to create an aggregated profile for staff. We blogged about it earlier and you can see my example. Profiles of staff are now aggregated from different systems and stored as RDF Linked Data and we intend to pull in activity data from Orbital to further enrich staff profiles. In this way, our work is being recognised and valued by other teams in the university. Furthermore, the university has recently procured a new ‘Awards Management System’, which will provide data to Orbital about funded research projects and we intend to couple Orbital with EPrints using SWORD2.

What is clear from our discussions with users is that they expect Orbital to do more than simply store and publish research data. Without using the term, they are effectively asking for a Virtual Research Environment (VRE). This is something which we did anticipate and have always planned for Orbital to be a tool for both analysing and publish research data. When discussing ‘Research Data Management’, there is a fine line between DMP planning, research project management, team workspaces and public web publishing and while we need to be careful that the scope of Orbital does not creep, we are sensitive to on-going user requirements.

FOI compliance

Will make FOI requests easier to respond to or unnecessary.

We’ve had no impact here so far, nor are we yet in a position to.

Open Data

Will promote and enable the production of public data sets.

As I will discuss in a forthcoming blog post about our first release of Orbital, we have had some impact here and have witnessed the benefits. In summary, a researcher contacted us for help with publishing some data, the result of which was an invitation to write a journal article about the data, offers of collaboration and the strengthening of an EU grant application.

Our workshop on open licensing has also led to a further meeting between myself (Joss Winn, Orbital PM), and the university’s IP Manager. A further follow up meeting is planned to draft guidance for staff on the use of open licenses for source code and data. Furthermore, research staff are being directed to the Orbital team for informal advice on open licensing. In this sense, we are beginning to improve the awareness and understanding of open licenses among researchers.

The innovation cycle

Will embed new technologies and culture change among professional staff at the university and lead to further innovation in our services.

We are having some impact here and have had meetings with central ICT staff about integrating our server farm with cloud services. We are currently developing the Orbital software using Rackspace, but have recently ordered hardware, partly paid for by the project, to establish a private cloud, running on OpenStack, for research and development at the university. In addition to this, our development toolchain has changed and we now have tools and processes in place that we did not have six months ago. These are being adopted outside of the Orbital project by staff within the ICT Online Services Team and other projects we are running. In addition to this, we intend that the changes we make to our own R&D tools and processes, are made available to other researchers and students. Over the summer, we will set up and maintain a university-wide Gitorious source code repository service (similar to Github), where staff and students who write code can form teams, manage their source code, and publish it if they wish. We also intend to run a Jenkins server for similar purposes so that all staff and students can benefit from source code control and the quality assurance processes that we have implemented through Orbital. Orbital is now a driver for a general R&D infrastructure for Academic Computing that project members and wider members of LNCD  are building.

I will write more about this at a later date because, for someone who manages R&D infrastructure projects at the university and wishes to engage staff and students in our work, I am excited to be able to integrate this into academic programmes and the work of other researchers.

I want to also stress that like all of our projects, the benefits, however slight, spill into other aspects of our work. Being a large project, Orbital has allowed us to concentrate on developing our toolchain and development environment across other projects, it’s given us time to learn new skills and share our learning with colleagues. In this way, it has been pivotal in the way we work and the future direction of our work.

Recruitment

Will build capacity for local development of innovative services

Orbital has allowed us to recruit two full-time Developers (Nick Jackson and Harry Newton). We are therefore two staff up and it is my intention to try and keep it that way.

Staff skills

Will improve staff skills and experience

Yes, we are benefitting in this way. The Orbital project team are now the RDM ‘experts’ in the university and despite being novices in this regard, over the course of the project staff working in the Library, ICT, Research and Enterprise Office and Centre for Educational Research and Development, are each developing their understanding of the processes and implications of RDM.

Clearly an 18 month project (at least the way I run them!), allows for staff to learn new tools and skills, experiment with new methods of working and disseminate this learning to other staff. This is one constant that I value highly about our project work. Despite the stop and start nature of project work and that not all of our work eventually makes it into a fully fledged university services, the tools and learning, especially as we engage more with academic programmes, goes beyond the confines of the project and is most satisfying.

I have written more about my interest in how hackers learn and the university as a hackerspace.

Culture change

Will change the research culture of the university by improving the tools available for managing and sharing data.

From the point-of-view of RDM, this is closely related to the first anticipated impact/benefit. We cannot claim to have any real evidence of benefits or impact at this stage on how we manage research data. However, as I’ve noted several times above, our use of the cloud, our advocacy of open licensing, our implementation of new R&D tools and processes, are also part of ‘culture change’ at the university. Furthermore, due to the DAF survey, the Orbital project is now widely known by researchers beyond our initial user group in the School of Engineering, and through our reporting to the university Research, Innovation and Enterprise Committee, staff at all levels are made aware of our work. Gradually, the idea of ‘research data management’ is being understood.

Technology choices

Will influence future choices in technologies (both locally developed and outsourced).

Yes! See above.

HE sector R&D

Contributes to innovative R&D in the HE sector

Yes, I think we are beginning to do this and the benefits so far are around shared learning among developers across different projects. We were instrumental in early discussions about the DevCSI MRD Hack Day and three of us contributed to the two day event. We blog regularly on this site (around 50 posts so far) and share our work with anyone who is interested (see the links in the sidebar).

Public Sector data management

As yet, the Orbital project can only claim to have resulted in one research dataset being published (again, more on this soon as I want to explain it in more detail). However, Orbital has grown out of our work over the last couple of years around managing and re-using institutional data, resulting in data.lincoln.ac.uk. We are also active members of the data.ac.uk initiative and I chaired the data.ac.uk panel at Dev8D this year.

Efficient re/use of resources

Demonstrably re-uses and builds on previous work, both funded and non-funded projects.

Yes, this was an early benefit of the project. We are building on our previous work and what we have learned from it in past projects. Our use of MongoDB, our work on staff profiles, our use of OAuth, and our API-driven approach to development, all build on past projects, funded and un-funded.

Shared, versioned network drives

I’m at the DevCSI #mrdHackday with Nick, Harry and about 30 other people interested in hacking around research data. One of the user requirements identified among some MRD projects is the need for personal and shared networked workspaces i.e. a desktop drive for dumping, organising and sharing research data.

In our recent survey of researchers at Lincoln, we learned that many academics (myself included!) are using Dropbox as a way to share project files and research data among partners. It has the advantage over the FTP ‘H Drive’ that Lincoln staff are given in that Dropbox offers more storage and folders/files can be shared among people both inside and outside the university. The first couple of GB of storage is free and the pricing is clear when you need more space.

Just as researchers surveyed said they were using Dropbox, they also acknowledged in the survey that this isn’t an ideal situation. It’s being held by a third-party service, it’s runs unreliably on our university desktops, there’s a 30 day version history, but there’s no information about what changes were made and no way to compare versions. Part of the Orbital implementation plan is to provide an alternative to Dropbox and other similar network drives to Lincoln researchers. One that (probably) runs over HTTP, does version control properly, can be accessed through a web interface if necessary, and can be shared securely. The DataFlow project at Oxford has gone down the route of using WebDav for remote file storage and sharing and it’s an area we should investigate, too. There is a WebDav extension that provides versioning, too.

Of all the comments by researchers who responded to our survey, the clearest message which united them was for more, secure, backed up, and flexible storage. Within Orbital, we’ve been thinking about how Git (or a similar versioned source code repository tool) could be used to provide this functionality. Git is a proven and popular repository tool for managing text files, developed for the Linux kernel project and the basis for the popular Github ‘social network’ for developers. Jez Cope from Bath mentioned that there was an open source desktop tool called SparkleShare that provides a folder on your PC, just like Dropbox, Google Drive and Ubuntu One do, and uses Git as its backend. Jez and I have been playing with SparkleShare the last couple of days, having installed the Mac client on our laptops and it shows some promise but also needs some further consideration and effort to meet our immediate requirements for RDM. Jez has written a companion post about this, too.

SparkleShare for RDM

Pros

Multi platform GUI client

Easy to install

Relatively mature, actively maintained open source project

Version control built into backend (Git)

Notifications of changes to folder contents

Cons

Git isn’t built for handling large, binary files

Version control not built into desktop client (shows a high-level history of changes, but no roll-back functionality)

Sharing folders not built into desktop client

Next steps?

If Git isn’t the right choice of backend, SparkleShare can use something else. Whatever the underlying versioned repository technology, SparkleShare currently lacks detailed versioning information and roll-back functionality, which is in the backend repository. Presumably it could be surfaced and further functionality built around it. Likewise, a more convenient way to share repository folders with other people could be added to the client. Currently, you need to share the repository with them outside of the client.

Windows Explorer integration

Most researchers are using Windows as their OS, so it’s worth looking at the integration with Windows Explorer that other tools use. The DATUM project selected Bazaar over Git because they found the integration (TortoiseBZR) with Explorer to be better. I have found the standard Git tools for Windows Explorer to be pretty good, too. Neither provide the transparent functionality of SparkleShare or Dropbox.

Handling big files

The git architecture simply sucks for big objects. It was discussed somewhat durign the early stages, but a lot of it really is pretty fundamental. The fact that all the operations work on a full object, and the delta’s are (on purpose) just a very specific and limited kind of size compression is just very ingrained… Personally, I think the answer is “git is good for lots of small files”. It’s very much what git was designed for, and the fact that it doesn’t work for everything is a trade-off for the things it _does_ work well for.

So says Linus Torvalds, the creator of Git (and the Linux kernel). Git and other source code repository software were not designed to handle big files. However, there are other Git-based and alternative projects that are addressing this. git-annex is a mature well-documented and maintained project that

allows managing files with git, without checking the file contents into git. While that may seem paradoxical, it is useful when dealing with files larger than git can currently easily handle, whether due to limitations in memory, time, or disk space. Even without file content tracking, being able to manage files with git, move files around and delete files with versioned directory trees, and use branches and distributed clones, are all very handy reasons to use git. And annexed files can co-exist in the same git repository with regularly versioned files, which is convenient for maintaining documents, Makefiles, etc that are associated with annexed files but that benefit from full revision control.

git-annex includes a use case on its home page that speaks to the RDM domain:

use case: The Archivist

Bob has many drives to archive his data, most of them kept offline, in a safe place.

With git-annex, Bob has a single directory tree that includes all his files, even if their content is being stored offline. He can reorganize his files using that tree, committing new versions to git, without worry about accidentally deleting anything.

When Bob needs access to some files, git-annex can tell him which drive(s) they’re on, and easily make them available. Indeed, every drive knows what is on every other drive. more about location tracking

Bob thinks long-term, and so he appreciates that git-annex uses a simple repository format. He knows his files will be accessible in the future even if the world has forgotten about git-annex and git. more about future-proofing

Run in a cron job, git-annex adds new files to archival drives at night. It also helps Bob keep track of intentional, and unintentional copies of files, and logs information he can use to decide when it’s time to duplicate the content of old drives. more about backup copies

The git-annex website has a useful page that discusses what it is not and it points to Sharebox as a FUSE filesystem built on top of git-annex. The project doesn’t look as mature as SparkleShare, but it’s good to see work being done on this, as the use case for Sharebox is very close to what I think several RDM projects are looking for. The git-annex website also points to other projects that are worth considering:

git-annex is more than just a workaround for git limitations that might eventually be fixed by efforts like git-bigfiles.

git-bigfiles does not tackle the same use cases that SparkleShare and Sharebox are focused on, but could perhaps provide the backend to such tools.

git-media has the advantage of using git smudge filters rather than git-annex’s pile of symlinks, and it may be a tighter fit for certain situations. It lacks git-annex’s support for widely distributed storage, using only a single backend data store. It also does not support partial checkouts of file contents, like git-annex does.

git-media is also a command-line tool and therefore provides only part of the solution to a ‘Dropbox alternative’ for big files. It doesn’t look like there’s been very much activity on the project in the last couple of years.

Boar implements its own version control system, rather than simply embracing and extending git. And while boar supports distributed clones of a repository, it does not support keeping different files in different clones of the same repository, which git-annex does, and is an important feature for large-scale archiving.

Boar does not use git, but is an alternative “version control and backup for photos, videos and other binary files.” It is not a distributed version control system either, but “does however work well with repositories on mapped network file systems, such as Windows shares and NFS.” The rationale for Boar is worth reading as it addresses many of the problems found in the RDM domain. It’s a well-maintained and well documented project, which, like git-annex, was clearly written to tackle genuine archival problems.

Boar aims to be the perfect way to make sure your most important digital information, like pictures, movies and documents, are stored safely.

  • Boar makes it possible for you to restore any or all of your files from any point in time.
  • Boar makes it easy to maintain verified backups of your data, including file history.
  • Boar imposes no limits on file or repository sizes.
  • Using boar is an effective way to prevent data loss due to human or machine error.

If you are familiar with vcs software such as Subversion, you might think of boar as “version control for large binary files”.

This sounds like an ideal tool for expert users willing to use the command line for managing large research datasets and binary files and it would be worth looking at how much work it would take to write GUI client or Windows Explorer integration as an alternative to Dropbox.

In summary, there are robust command line tools suitable for managing workspaces for research data over a network, but more work is required to build effective, simple graphical clients that can be used by any researcher.

EPSRC Research Data Management Roadmap

As part of their Policy Framework on Research Data, the EPSRC have requested that all institutions in receipt of their funding develop a clear roadmap for research data management, which should be implemented by May 1st 2015. At the University of Lincoln, the Orbital project is a vehicle for this work and we expect much of the infrastructure (technical, policy and training) to be identified and piloted by April 2013.

The following is a draft policy statement which indicates our intent. We have also examined the EPSRC’s expectations and considered them in light of guidance offered by the Digital Curation Centre. This document should be understood in the context of the Orbital Project Plan, which sets out our aims and objectives until April 2013 and beyond.

The University of Lincoln Research Data Management Policy (draft)

  1. The University of Lincoln recognises that the curation and sharing of research data is key to its mission to create knowledge. Research data is a key asset. Its correct management brings benefits to the university, its members and the public through greater opportunities for access and re-use.
  2. This policy sets out the university’s expectations for the management of research data across all academic disciplines and in all forms.
  3. The development of this policy aims to satisfy the requirements of researchers, funding and statutory bodies and commercial partners, that research data will be managed to the highest standards throughout the research data lifecycle. The University recognises and supports the UK Research Councils’ mandates for data curation and sharing.
  4. Principal Investigators (PI) are required to consider data creation, management or sharing in the development of their research proposals and grant applications. All new research proposals must include research data management plans that explicitly address data capture, management, integrity, confidentiality, retention, sharing and publication.
  5. Staff can expect to receive training, guidance and support from Research & Enterprise Development and the Library for the development of research data management plans and their implementation.
  6. In accordance with the Research Council’s timeframes for preserving access to research outputs, Principal Investigators should record the existence of research data upon creation and deposit it according to their plan (often within six months of publication of research findings). The University will preserve access for as long as specified by the data management plan (which can be up to ten years after it was last accessed). Any data which is retained elsewhere, for example in an international data service or domain repository, should be registered with the University.
  7. Research metadata will be published for permanent citation alongside conventional outputs e.g. journal articles and conference papers. Open access to research data will be granted under appropriate safeguards according to conditions and timeframes specified by researchers, commercial partners and funding bodies. The University will support this through the provision of a data repository. Conditions for the licensing of research data must be made in consultation with Research & Enterprise Development.
  8. The Library and ICT Services will provide the infrastructure and expertise for long-term curation, preservation and access to research data. This will include mechanisms and services for storage, backup, registration, deposit and retention of research data assets in support of current and future access, during and after completion of research projects.
  9. Costs to meet the specific requirements of data management plans should be included in grant applications, where permitted. The University will develop appropriate plans for meeting the costs of long-term storage, preservation and curation of research data.
  10. This policy will be reviewed jointly by the Research & Enterprise Development and Library on an annual basis. Recommendations for amendment will be submitted to the appropriate management committee. Research & Enterprise Development will monitor compliance with this and other related policies, and the effect of such policies on the operation of the University. This policy should be considered alongside other university policies e.g. research ethics, IP policy, and disciplinary procedures.

The current version of this document is maintained here.

Data Assets Framework survey summary

Over the last month, we’ve been asking academic staff to complete a local version of the DAF survey. Here are the anonymised results. For almost all questions, there was a comments box, which many staff used and proved very useful in understanding researchers’ specific issues.

Click on the image below to download a PDF summary of the survey, which 44 staff completed. This represents about 8% of staff on research or research/teaching contracts.

DAF summary
Click to download the full summary

One thought I have at the moment… 50% of staff are holding up to 100GB of data right now. Among all of the comments, the requirement for better storage is repeated again and again by our researchers. If we were to provide 100GB/person to all 500+ academic staff, that’s over 50TB of online storage required, plus backups. I’m told that the university currently has 24TB of online, centrally managed storage in use, with 13TB of this backed up offsite, and 1 TB ‘archived’. Clearly introducing research data into the mix will significantly increase the amount of storage that is being managed centrally.