Web scale data management

Not research data as such (although it could be the subject of research), but a long and interesting blog post about how Tumblr manages huge amounts of user generated data. It’s interesting not just because of the scale of the task day-to-day, but also because it offers some lessons learned about how to scale up to managing an ingest of several terabytes a day. When we talk about ‘big data’ in the sciences, is it this big? Bigger? How is big science actually managing data on this scale? I really don’t know.

  • 500 million page views a day
  • 15B+ page views month
  • ~20 engineers
  • Peak rate of ~40k requests per second
  • 1+ TB/day into Hadoop cluster
  • Many TB/day into MySQL/HBase/Redis/Memcache
  • Growing at 30% a month
  • ~1000 hardware nodes in production
  • Billions of page visits per month per engineer
  • Posts are about 50GB a day. Follower list updates are about 2.7TB a day.
  • Dashboard runs at a million writes a second, 50K reads a second, and it is growing.

The Business Case for Open Source

Sander van der Waal from OSSWatch, JISC’s open source software advisory service will lead a meeting on March 7th.

The term ‘open source’ is increasingly being used to refer not only to the development of software, but also in other disciplines, such as design, education and even government.

This meeting is an opportunity for attendees to learn exactly what ‘open source’ means and its effect on our understanding of property and the production of knowledge, goods and services.

In the context of the Orbital project, we will also discuss the use and application of open source licenses by universities and consider how open source can contribute to innovation and the development of new business models.

The meeting will be held on March 7th, 9.30-12pm, MB1005. Refreshments will be provided. Staff and students wishing to attend should RSVP Joss Winn.

Orbital and the OAIS reference model

In the Orbital project bid, I wrote,

We intend to re-use and develop some of the underlying tools we have built to provide an institution-wide service for the ingest, description, preservation and dissemination of research data, which is informed by the OAIS reference model.

My first encounter with OAIS was about seven years ago, when I was designing a digital archive for Amnesty International’s image, film and video archives. If you begin to do any design work in the digital archiving domain, you come across the OAIS model very quickly. It is the standard and though somewhat daunting, when you get your teeth into it, you realise that it does an excellent job of describing what any decent digital archive should be doing anyway.

The mistake to make with OAIS is looking at the model and thinking that you have to create a system that is designed in such a way, rather than functions in such a way. The OAIS standard is a tool that allows Archivists, Designers and Developers to share a common language when discussing and planning the implementation of a digital archive and what that archive should do, not how it should be designed.

Here is the high-level OAIS model. (Here is the full, composite model).

OAIS Functional Entities
OAIS Functional Entities

Here is a high-level model of Orbital.

Orbital design
Orbital design

They’re very different because they should be. Remember, the first is a functional model, the second is a logical model of Orbital’s server requirements.

Here’s another model I published recently relating to building staff profiles.

Building staff profiles
Building staff profiles

 

The task ahead of the Orbital Team is to consider how our on-going designs like these two above, relate to the OAIS functional model. For example, the staff profile diagram (which is not simply an abstract model, but a retrospective design document) tells us where some of the OAIS Submission Information Package (SIP) information will be derived from. When a ‘Producer’ (i.e. a Researcher) signs in to Orbital and uploads a dataset, that content and the information we know from their university profile, as well as further information that they provide, constitutes the SIP.

As I was reading around the OAIS standard recently, I came across a nice piece of work done by a collaborative project between Cornell and Göttingen State universities. Their MathArc project was run using the XP agile project management methodology and as part of their development process, they broke down the OAIS reference model into a deck of 33 cards and tackled each one as a specific iteration. You can download the cards, here [.doc].

Although we’re not planning on 100% OAIS ‘compliance’ during this pilot stage of the Orbital project, it is our intention that Orbital is informed by the OAIS standard and these cards provide a useful and provocative set of functional requirements that we have added to our project tracker. Nick’s currently working on authentication and security for Orbital, or rather, card #3 ‘Provide Security Services’:

Protect sensitive information in the system, including authentication, access control, data integrity, data confidentiality, and non-repudiation services.

Cards #1 and #2 (‘Provide O/S Services’ & ‘Provide Network Services’) are requirements that go beyond the Orbital project itself and are largely met by the wider IT infrastructure that we’re working in at Lincoln. However, our work on ‘piloting the cloud‘ also addresses issues relating to the operating environment and networking of our future RDM infrastructure.

Nick and I met a couple of weeks ago to look at the OAIS model in detail and consider it in light of what we’re beginning to implement. As we’d hope, the high-level stuff you see in the diagram at the top of this post is easy to ‘tick off’. You couldn’t really say you were building an infrastructure to manage research data if you weren’t clear on the basic functional entities of OAIS. The lower-level components of the OAIS standard are where the work gets interesting and more demanding, and where the deck of cards is useful. Along the way, we’ll be creating diagrams like those above to show how we’re iteratively addressing the OAIS standard so that by the end of the project, we should have a model of our own that maps reasonably well onto the OAIS composite model.

I’d be very interested to hear from other MRD projects that are looking at the OAIS standard in detail, as well as people from earlier projects that have been through this process before. I know there have been a number of such efforts in the JISC community. A couple of the documents I found useful were Alex Ball’s Briefing Paper and Julie Allinsons’s OAIS as a reference model for repositories. The paper I remembered from my time at Amnesty is Brian Lavoie’s Introductory Guide. If you’re new to OAIS, I’d recommend that you at least read Lavoie’s report before tackling the full OAIS standard document (PDF).

Tracking progress

Did you know that you can watch our user requirements gathering and see how Orbital development is progressing by following our Github and Pivotal Tracker activity? Here are the key links:

Orbital Manager (the front end) (RSS)

Orbital Core (the back end) (RSS)

Pivotal Tracker (RSS)

Updates are also merged in a single stream of activity on Splendid Bacon.

Internally, we watch all of this activity through Campfire, thanks to Hubot and a bit of plumbing. Commits to Github, new stories and other activity on Pivotal Tracker, fire off API notifications which Hubot (‘Zakia’), delivers to Campfire. Here’s what this afternoon’s activity looked like.

Campfire
Watching Orbital progress on Campfire, using Hubot (Zakia)

Using a mixture of friendly APIs, asynchronous messaging and a chat bot provides us with a handy method of keeping track of what’s going on when we can’t all be in the same room.

Gluing people together

In December, colleagues in the Web Team (who manage the corporate web site in the Department of Marketing and Communications) approached a few of us about building a tool to allow staff to edit their profile for the new version of the lincoln.ac.uk website. We suggested that much of the work was already done and it just needed gluing together. Yesterday we met with the Web Team again to tell them that our part of the work is pretty much complete. Here’s how it works.

Quick sketch of profile building at Lincoln
Quick sketch of profile building at Lincoln

This requires a bit of explanation, but let me tell you, it’s the holy grail as far as I’m concerned and having this in place brings benefits to Orbital and any other new application we might develop. Here’s a clearer rendering.

 

Building staff profiles
Building staff profiles

The chart above strips out the stuff around authentication that you see in the bottom right of the whiteboard photo. That’s for another post – something Alex is better placed to write.

Information about staff at the university starts with the HR database. This feeds the Active Directory, which authenticates people against different web services. Last year, Nick and Alex pulled this data into Nucleus, our MongoDB datastore, and with it built a new, slick staff directory. Then they started bolting things on to it, like research outputs from the repository and blog posts from our WordPress/BuddyPress platform. To illustrate what was possible, they started pulling information from my BuddyPress profile, which I could edit anytime I wanted to. It got to the point where I started using my staff directory link in my email signature because it offered the most comprehensive profile of me anywhere on a Lincoln website.

By the time we first met with the Web Team about the possibility of helping them with staff profiles, Alex and Nick had 80% of the work already done. What remained was to create a richer number of required fields in BuddyPress for staff to edit about themselves and a scheduled XML dump for the Web Team to wrangle into their new templates on www.lincoln.ac.uk.

So the work is nearly done. The XML file is RDF Linked Data, which means that we have a rich aggregation of staff information and some simple relationships, feeding the Staff Directory, being refreshed every three hours and then being output either as HTML, JSON or RDF/XML.

For the Orbital project, all this glue is invaluable. When staff login to Orbital (Nick’s working on this part right now), we’ll already know who they are, which department they work in, what research outputs they’ve deposited in the institutional repository, what their research interests are, what projects they’re working on, the research groups they’re members of, their recent awards and grants, and the keywords they’ve chosen to tag their profile with. It’s our intention that with some simple AI, we’ll be able to make Orbital a space where Researchers find themselves in an environment which already knows quite a bit about their work and the context of the research they’re undertaking. Once Orbital starts collecting specific staff data of its own, it can feed that back into Nucleus, too.

This reminds me of our discussion last month with Mansur Darlington of the ERIM/REDm-MED project. Mansur stressed the importance of gathering data about the context of the research itself, emphasising that without context, research data becomes increasingly meaningless over time. Having rich user profiles in Orbital and ensuring that we record data about the Researcher’s activity while using Orbital, should help provide that context to the research data itself.

Orbital, therefore, becomes an infrastructure not only for storing and managing research data, but also a system for storing and managing data about the research itself.