The Importance of Useful Data

During the development of Orbital (specifically the Researcher Dashboard) we’ve been trying, with mixed success, to make it integrate smoothly with various other University systems. Fortunately, a design decision made by some of the LNCD team a couple of years ago means that we’ve got our own institutional data store (codename Nucleus) with which we can almost exclusively interact to get hold of everything we need. Where we’ve been integrating with new systems such as the University’s Awards Management System we’ve taken the approach of hooking the data into Nucleus first, so that it’s available not only to Researcher Dashboard but also to any other system which needs it.

Nucleus has quite a powerful framework for managing, structuring and presenting data in a rigorously managed format. It validates data at various points during entry to make sure that it’s not gibberish, and then at the point of rendering the data is put through another set of functions which ensure it’s presented consistently and in as useful a manner as possible. As a result (using Nucleus, our PHP library, the CWD and our OAuth 2 authorisation server) we can go from a standing start to a fully featured, integrated application in a couple of days. A big part of the reason we can do this is that we make extensive use of dogfooding to ensure that our data is useful.

It saddens me, therefore, that during integration with some other applications both inside and outside the University we are forced to tackle data, often purported to be “machine readable” or “ready for reuse”, which has clearly not been looked at by anybody who actually wants to reuse it. As an example, one source of data provides a date range which is stored internally (as far as I can gather) as two distinct values; there is a “start date” and there is an “end date”. These are provided through the UI as structured inputs (a date picker) which ensures they’re entered (and presumably then stored) in an expected format which can be manipulated as necessary. The API then chooses to express this date range not as a distinct “start date” and “end date”, but instead as a single “dates” string.

You may think that this isn’t such a big problem – after all, how difficult can it be to parse 04/02/2013 - 07/03/2014? In that example it’s actually pretty easy once you’ve decided if you’re using UK or US style dates. The ISO date format can solve this though, giving us 2013-02-04 - 2014-03-07. Sadly, this isn’t what we get. In fact, here are the four (yes, four) distinct ways that “dates” can be represented:

2013-02-04 to 2013-02-04 becomes 4 Feb 2013
2013-02-04 to 2014-03-07 becomes 4 Feb 2013 - 7 Mar 2014
2013-02-04 to 2013-03-07 becomes 4 Feb - 7 Mar 2013
2013-02-04 to 2013-02-07 becomes 4 Feb 2013 - 7 Feb 2013

So, the rule becomes that if the dates are the same you just show the single date, but if the dates are different then you show two dates, unless they are in the same year in which case you only show the year in the final date, unless they are in the same year and the same month, in which case you show two dates. And then you format all the dates with a locale-specific short form of the month name.
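
To give a sense of what the consumer ends up writing, here is a minimal Python sketch of a parser that turns the four shapes above back into a start and end date. It is not the code we use; the names parse_dates, MONTHS and PART are made up for illustration, and it assumes English month abbreviations and a " - " separator exactly as in the examples. A feed in another locale would need a different month table.

```python
import re
from datetime import date

# Assumption: English three-letter month names, as in the examples above.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# One side of the range: day, short month, and an optional year.
PART = re.compile(r"^(\d{1,2}) ([A-Za-z]{3})(?: (\d{4}))?$")


def parse_dates(text):
    """Return (start, end) for one of the four observed "dates" shapes."""
    pieces = [piece.strip() for piece in text.split(" - ")]
    if len(pieces) not in (1, 2):
        raise ValueError(f"unrecognised dates value: {text!r}")

    parsed = []
    for piece in pieces:
        match = PART.match(piece)
        if match is None or match.group(2) not in MONTHS:
            raise ValueError(f"unrecognised date: {piece!r}")
        day, month, year = match.groups()
        parsed.append((int(day), MONTHS[month], int(year) if year else None))

    # A single date is both the start and the end of the range.
    start_day, start_month, start_year = parsed[0]
    end_day, end_month, end_year = parsed[-1]

    if end_year is None:
        raise ValueError(f"final date has no year: {text!r}")
    if start_year is None:
        # "4 Feb - 7 Mar 2013": the start borrows the year from the end.
        start_year = end_year

    return (date(start_year, start_month, start_day),
            date(end_year, end_month, end_day))
```

Calling parse_dates("4 Feb - 7 Mar 2013") gives back the pair of dates 2013-02-04 and 2013-03-07, which is exactly the pair of values the source system presumably had before it flattened them into a string.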

Parsing this is understandably more difficult than it should be. Please, think about how your data will actually be used when building outputs.