I’m at the DevCSI #mrdHackday with Nick, Harry and about 30 other people interested in hacking around research data. One of the user requirements identified among some MRD projects is the need for personal and shared networked workspaces, i.e. a desktop drive for dumping, organising and sharing research data.
In our recent survey of researchers at Lincoln, we learned that many academics (myself included!) are using Dropbox as a way to share project files and research data among partners. Its advantage over the FTP ‘H Drive’ that Lincoln staff are given is that Dropbox offers more storage, and folders and files can be shared with people both inside and outside the university. The first couple of GB of storage is free and the pricing is clear when you need more space.
Although the researchers we surveyed said they were using Dropbox, they also acknowledged that this isn’t an ideal situation. The data is held by a third-party service, the client runs unreliably on our university desktops, and although there’s a 30-day version history, there’s no information about what changes were made and no way to compare versions. Part of the Orbital implementation plan is to provide Lincoln researchers with an alternative to Dropbox and similar network drives: one that (probably) runs over HTTP, does version control properly, can be accessed through a web interface if necessary, and can be shared securely. The DataFlow project at Oxford has gone down the route of using WebDAV for remote file storage and sharing, and it’s an area we should investigate, too. There is also a WebDAV extension that provides versioning.
Of all the comments by researchers who responded to our survey, the clearest message that united them was the need for more storage that is secure, backed up and flexible. Within Orbital, we’ve been thinking about how Git (or a similar versioned source code repository tool) could be used to provide this functionality. Git is a proven and popular repository tool for managing text files, developed for the Linux kernel project and the basis for the popular GitHub ‘social network’ for developers. Jez Cope from Bath mentioned that there was an open source desktop tool called SparkleShare that provides a folder on your PC, just like Dropbox, Google Drive and Ubuntu One do, and uses Git as its backend. Jez and I have been playing with SparkleShare for the last couple of days, having installed the Mac client on our laptops. It shows some promise, but it needs further consideration and effort to meet our immediate requirements for RDM. Jez has written a companion post about this, too.
SparkleShare for RDM
Pros
Multi platform GUI client
Easy to install
Relatively mature, actively maintained open source project
Version control built into backend (Git)
Notifications of changes to folder contents
Cons
Git isn’t built for handling large, binary files
Version control not built into desktop client (shows a high-level history of changes, but no roll-back functionality)
Sharing folders not built into desktop client
Next steps?
If Git isn’t the right choice of backend, SparkleShare can use something else. Whatever the underlying versioned repository technology, the SparkleShare client currently lacks the detailed versioning information and roll-back functionality that the backend repository already provides. Presumably this could be surfaced in the client and further functionality built around it. Likewise, a more convenient way to share repository folders with other people could be added to the client. Currently, you need to share the repository with them outside of the client.
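To give a sense of how little is needed to surface history and roll-back from a Git backend, here is a minimal Python sketch. It is my own illustration and not part of SparkleShare: the function names are hypothetical, and it assumes the `git` command line tool is installed and that the shared folder is an ordinary Git working copy. It relies only on two standard Git commands, `git log` for the history of a single file and `git checkout <commit> -- <path>` to restore an earlier version.

```python
import subprocess
from typing import Dict, List


def history_command(path: str) -> List[str]:
    """Build the git command that lists every commit touching `path`."""
    return ["git", "log", "--follow", "--date=iso",
            "--format=%H|%ad|%s", "--", path]


def restore_command(commit: str, path: str) -> List[str]:
    """Build the git command that restores `path` as it was at `commit`."""
    return ["git", "checkout", commit, "--", path]


def file_history(repo_dir: str, path: str) -> List[Dict[str, str]]:
    """Return the commits that changed `path`, newest first."""
    out = subprocess.run(history_command(path), cwd=repo_dir, check=True,
                         capture_output=True, text=True).stdout
    history = []
    for line in out.splitlines():
        # The subject may itself contain "|", so split at most twice.
        commit, date, subject = line.split("|", 2)
        history.append({"commit": commit, "date": date, "subject": subject})
    return history


def roll_back(repo_dir: str, path: str, commit: str) -> None:
    """Overwrite the working copy of `path` with its content at `commit`."""
    subprocess.run(restore_command(commit, path), cwd=repo_dir, check=True)
```

A graphical client could call something like `file_history()` to populate a "previous versions" list and `roll_back()` when the researcher picks one. The restored file would then be committed as a new change, so nothing in the history is lost.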
Windows Explorer integration
Most researchers are using Windows as their OS, so it’s worth looking at the integration with Windows Explorer that other tools use. The DATUM project selected Bazaar over Git because they found the integration (TortoiseBZR) with Explorer to be better. I have found the standard Git tools for Windows Explorer to be pretty good, too. Neither provides the transparent functionality of SparkleShare or Dropbox.
Handling big files
The git architecture simply sucks for big objects. It was discussed somewhat during the early stages, but a lot of it really is pretty fundamental. The fact that all the operations work on a full object, and the delta’s are (on purpose) just a very specific and limited kind of size compression is just very ingrained… Personally, I think the answer is “git is good for lots of small files”. It’s very much what git was designed for, and the fact that it doesn’t work for everything is a trade-off for the things it _does_ work well for.
So says Linus Torvalds, the creator of Git (and the Linux kernel). Git and other source code repository software were not designed to handle big files. However, there are other Git-based and alternative projects that are addressing this. git-annex is a mature, well-documented and well-maintained project that
allows managing files with git, without checking the file contents into git. While that may seem paradoxical, it is useful when dealing with files larger than git can currently easily handle, whether due to limitations in memory, time, or disk space. Even without file content tracking, being able to manage files with git, move files around and delete files with versioned directory trees, and use branches and distributed clones, are all very handy reasons to use git. And annexed files can co-exist in the same git repository with regularly versioned files, which is convenient for maintaining documents, Makefiles, etc that are associated with annexed files but that benefit from full revision control.
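To make the idea in that description less paradoxical, here is a minimal Python sketch of the general approach: the large file’s content is moved into a separate store, named after a hash of its content, and a small symlink is left in its place for the version control system to track. This is my own simplified illustration, not git-annex’s actual code or on-disk layout. The `annex_add` function, the `store` directory and the file name are all hypothetical, and it assumes a filesystem that supports symlinks.

```python
import hashlib
import os
import shutil
import tempfile


def annex_add(path: str, store_dir: str) -> str:
    """Move `path` into a content-addressed store and leave a symlink behind.

    Returns the key (a SHA-256 hex digest) under which the content is stored.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so large files never need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    key = digest.hexdigest()

    os.makedirs(store_dir, exist_ok=True)
    stored = os.path.join(store_dir, key)
    if os.path.exists(stored):
        os.remove(path)  # identical content is already in the store
    else:
        shutil.move(path, stored)
    # Only this small link would be committed, not the file content.
    os.symlink(os.path.abspath(stored), path)
    return key


def demo() -> dict:
    """Add a dummy 1 KiB file in a throwaway directory and inspect the result."""
    with tempfile.TemporaryDirectory() as tmp:
        big_file = os.path.join(tmp, "scan.tif")
        with open(big_file, "wb") as f:
            f.write(b"\x00" * 1024)
        key = annex_add(big_file, os.path.join(tmp, "store"))
        return {
            "is_symlink": os.path.islink(big_file),
            "key_length": len(key),
            "readable_size": os.path.getsize(big_file),
        }
```

Because the link is tiny, moving, renaming and deleting files stay cheap to version, while the content itself can live on whichever drive or server has room for it.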
git-annex includes a use case on its home page that speaks to the RDM domain:
use case: The Archivist
Bob has many drives to archive his data, most of them kept offline, in a safe place.
With git-annex, Bob has a single directory tree that includes all his files, even if their content is being stored offline. He can reorganize his files using that tree, committing new versions to git, without worry about accidentally deleting anything.
When Bob needs access to some files, git-annex can tell him which drive(s) they’re on, and easily make them available. Indeed, every drive knows what is on every other drive. (more about location tracking)
Bob thinks long-term, and so he appreciates that git-annex uses a simple repository format. He knows his files will be accessible in the future even if the world has forgotten about git-annex and git. (more about future-proofing)
Run in a cron job, git-annex adds new files to archival drives at night. It also helps Bob keep track of intentional, and unintentional copies of files, and logs information he can use to decide when it’s time to duplicate the content of old drives. (more about backup copies)
The git-annex website has a useful page that discusses what it is not, and it points to Sharebox, a FUSE filesystem built on top of git-annex. The project doesn’t look as mature as SparkleShare, but it’s good to see work being done on this, as the use case for Sharebox is very close to what I think several RDM projects are looking for. The git-annex website also points to other projects that are worth considering:
git-annex is more than just a workaround for git limitations that might eventually be fixed by efforts like git-bigfiles.
git-bigfiles does not tackle the same use cases that SparkleShare and Sharebox are focused on, but could perhaps provide the backend to such tools.
git-media has the advantage of using git smudge filters rather than git-annex’s pile of symlinks, and it may be a tighter fit for certain situations. It lacks git-annex’s support for widely distributed storage, using only a single backend data store. It also does not support partial checkouts of file contents, like git-annex does.
git-media is also a command-line tool and therefore provides only part of the solution to a ‘Dropbox alternative’ for big files. It doesn’t look like there’s been very much activity on the project in the last couple of years.
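For anyone unfamiliar with Git’s filters: a “clean” filter transforms a file’s content on its way into the repository, and a “smudge” filter transforms it back on its way out to the working folder. The Python sketch below shows, in a simplified way, how a pair of filters like this can keep big files out of the repository by storing only a short pointer. It is my own illustration of the concept, not git-media’s code: the pointer format, the function signatures and the in-memory `media_store` dictionary (standing in for the single backend data store) are all assumptions.

```python
import hashlib
from typing import Dict

# Hypothetical marker for this sketch; not git-media's real stub format.
POINTER_PREFIX = b"media-pointer:"


def clean(content: bytes, media_store: Dict[str, bytes]) -> bytes:
    """On the way into the repository: store the content, return a small stub."""
    key = hashlib.sha256(content).hexdigest()
    media_store[key] = content
    return POINTER_PREFIX + key.encode("ascii") + b"\n"


def smudge(stub: bytes, media_store: Dict[str, bytes]) -> bytes:
    """On the way out to the working folder: swap the stub for the real content."""
    if not stub.startswith(POINTER_PREFIX):
        return stub  # not a pointer, so pass ordinary files through unchanged
    key = stub[len(POINTER_PREFIX):].strip().decode("ascii")
    return media_store[key]
```

In this sketch the repository only ever sees the stub, a few dozen bytes long, however large the original file is, while the researcher still sees the full file in their folder.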
Boar implements its own version control system, rather than simply embracing and extending git. And while boar supports distributed clones of a repository, it does not support keeping different files in different clones of the same repository, which git-annex does, and is an important feature for large-scale archiving.
Boar does not use git, but is an alternative “version control and backup for photos, videos and other binary files.” It is not a distributed version control system either, but “does however work well with repositories on mapped network file systems, such as Windows shares and NFS.” The rationale for Boar is worth reading as it addresses many of the problems found in the RDM domain. It’s a well-maintained and well documented project, which, like git-annex, was clearly written to tackle genuine archival problems.
Boar aims to be the perfect way to make sure your most important digital information, like pictures, movies and documents, are stored safely.
- Boar makes it possible for you to restore any or all of your files from any point in time.
- Boar makes it easy to maintain verified backups of your data, including file history.
- Boar imposes no limits on file or repository sizes.
- Using boar is an effective way to prevent data loss due to human or machine error.
If you are familiar with vcs software such as Subversion, you might think of boar as “version control for large binary files”.
This sounds like an ideal tool for expert users willing to use the command line for managing large research datasets and binary files, and it would be worth looking at how much work it would take to write a GUI client or Windows Explorer integration as an alternative to Dropbox.
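The “verified backups” point in the Boar description is worth dwelling on, because it is exactly what a plain network drive doesn’t give you. The general technique is to record a checksum for every file and later re-check the files against that record. Here is a minimal Python sketch of that idea. It is not how Boar itself is implemented (I haven’t looked at its internals), and the function and file names are hypothetical.

```python
import hashlib
import os
import tempfile
from typing import Dict, List


def sha256_of(path: str) -> str:
    """Hash a file in 1 MiB chunks so large datasets don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: str) -> Dict[str, str]:
    """Map each file's path (relative to `root`) to its SHA-256 checksum."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def verify(root: str, manifest: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose content changed."""
    current = build_manifest(root)
    return sorted(p for p, d in manifest.items() if current.get(p) != d)


def demo() -> List[str]:
    """Corrupt one file in a throwaway directory and report what verify() finds."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, data in [("photo.raw", b"pixels"), ("notes.txt", b"notes")]:
            with open(os.path.join(tmp, name), "wb") as f:
                f.write(data)
        manifest = build_manifest(tmp)
        with open(os.path.join(tmp, "photo.raw"), "wb") as f:
            f.write(b"pixelz")  # simulate silent corruption
        return verify(tmp, manifest)
```

A check like this, run on a schedule against a backup copy, would flag damaged or missing research data long before anyone tried to open the file.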
In summary, there are robust command line tools suitable for managing workspaces for research data over a network, but more work is required to build effective, simple graphical clients that can be used by any researcher.