A JISC-funded Managing Research Data project

Posts tagged data

With the Orbital project at its end, and plans for a University research information / research data service afoot, I’m reviewing the excellent work carried out by our (now-departed) developers Harry Newton and Nick Jackson – work which linked up CKAN, the Orbital ‘bridge’ application, and the Lincoln Repository (EPrints) using SWORD – described in earlier blog posts here and here.

“One important piece of work that we’re undertaking at the moment in Orbital is the facility to deposit the existence of a dataset, from CKAN and the University’s new Awards Management System (AMS), into our (EPrints) Repository via SWORD – at the same time requesting a DOI for the dataset via theDataCite API. The software at the centre of this operation is what we refer to as Orbital Bridge.”

This deposit workflow is now broadly working as it should – I think only a few tweaks would be necessary now to turn this into a working tool for the University of Lincoln.

Most urgent is the need for the University to sign up with the DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN and hence formally published by the University. This subscription should form part of the new research information service.

The underlying code could be used for other SWORD-enabled deposit from sources of metadata (e.g. the Library’s discovery system, Find it at Lincoln), to the Lincoln Repository as the University’s bibliographic ‘system of record’.

Warning: this is an extremely screenshot-heavy blog post! Click on any one of the screenshots below to view a larger image.

Here’s a step-by-step walkthrough of the entire process of adding a dataset to CKAN, and depositing it as a record in the Lincoln Repository.

  1. Go to the Researcher Dashboard at: https://orbital.lincoln.ac.uk/ and click on “Sign In”.
    Screenshot from the Researcher Dashboard
  2. Enter your staff accountID and password to sign in to the Researcher Dashboard.
    Screenshot from the Researcher Dashboard
  3. Once you have been signed in and returned to the Researcher Dashboard, click on your name (in the top right-hand corner) and then click on “My Projects”.
    Screenshot from the Researcher Dashboard
  4. You will see an overview of your research projects – both funded projects (derived from the AMS), and unfunded projects you have added locally. Click on the name of the project you want to add data to.
    Screenshot from the Researcher Dashboard
  5. You will be taken to a page for that research project. On the right-hand side of this page, under the heading “Options”, click on “Create Research Data Environment”.
    Screenshot from the Researcher DashboardImage7
  6. You will be taken to the University’s CKAN research data platform, where a page/group will have been created which corresponds to your project in the Researcher Dashboard. Sign in to CKAN using your staff accountID (there is currently no single sign-on between the Researcher Dashboard and CKAN) and password and you should be returned to the same page. However you will probably be sent instead to the CKAN home page, in which case you will have to look again for your project under the “Groups” menu.
    Screenshot from CKAN
  7. Toward the top of the project screen in CKAN, click on “Add Dataset” > “New Dataset…”.
    Screenshot from CKAN
  8. Fill in the form with information about the overall dataset, including the following fields:
    • Title
    • URL
    • License (N.B. US spelling!)
    • Description
      Screenshot from CKAN
  9. Then click on “Add Dataset”
    Screenshot from CKAN
  10. If you now click on “Further information” tab on the left-hand menu, you can add the following additional information about the dataset (this is not obvious from the initial dataset form):
    • Author
    • Author email
    • Maintainer
    • Maintainer email
    • Version
    • Summary [of changes]
      Screenshot from CKAN
  11. To attach individual data document(s)—which CKAN refers to as “resources”—to the dataset, scroll down the page and click on “Upload a file” (there are other options) > “Choose file” > “Upload”.
    Screenshot from CKAN
  12. Then fill in the form with the following basic information about the “resource”:
    • Name
    • Description
    • Format
    • Resource Type
    • Datastore enabled (ticked by default)
    • Mimetype
    • Mimetype (Inner)
    • “Extra Fields” (user-defined, or used by Orbital)
      Screenshot from CKAN
  13. To deposit a record for this dataset in the Lincoln Repository, go back to the Orbital Researcher Dashboard at: https://orbital.lincoln.ac.uk/ and navigate to your project. Toward the bottom left of the page you should now see a table containing the dataset(s) you have created in CKAN for this project. Choose which dataset you want to deposit, and hit the “Publish to Lincoln Repository” button.
    Screenshot from the Researcher Dashboard
  14. The Researcher Dashboard will then display a deposit form containing the following fields (some of which should be being autopopulated from CKAN fields but which do not appear to be):
    • Title
    • Description
    • Type of Data
    • Keywords
    • Subjects
    • Divisions
    • Metadata visibility [Show|Hide]
    • People
      Screenshot from the Researcher Dashboard
      “Publishing will publicly announce the existence of your dataset on the Lincoln Repository, as well as start the process of long-term preservation of your data.“Usually you should only publish a dataset either at the end of a research project, or if the data is being cited in a paper. Publishing a dataset will place some restrictions on the changes you can make to the dataset in the future, such as removing your ability to delete the data. It will also generate a DOI, which allows your dataset to be uniquely identified and located using a simple identifier.“Please check the information in this form and make any necessary changes, as this is the information which will be entered into the published record of the dataset.“If you have any questions about this process please contact a member of the research services team for advice or assistance.”
  15. When you hit the “Publish Dataset” button, the dataset record from CKAN will be used to create a record in the Lincoln Repository. The record will be submitted for review by the Repository team, who will then make it live. N.B. for the time being, you will see an error “Validation errors: [doi] is a required string” – this happens because the University does not currently have access to the live DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN. This should form part of the new research information service.
    Screenshot from the Researcher Dashboard
  16. Here’s an example of a record in the Lincoln Repository, created from a CKAN dataset and made live by the Repository team.
    Screenshot from the Lincoln Repository

Problems with the deposit process as it currently stands:

  1. Permissions are not correctly cascaded from a project the Researcher Dashboard to a group in CKAN.
  2. There is currently no single sign-on between the Researcher Dashboard and CKAN.
  3. When CKAN challenges a user to log in to a group, they should be redirected back to the group page after logging in – instead they get sent back to the CKAN home page, in which case they will have to look again for their project under the “Groups” menu.
  4. A minor one – in CKAN “License” (noun) appears in US spelling (should be “Licence”).
  5. In order to add all the information needed to deposit a dataset from CKAN, user has to click  “Further information” tab on the left-hand menu (this is not obvious from the initial dataset form).
  6. Some of the field labels in CKAN are a bit opaque or use technical terms (“Mimetype”) which could do with explanation.
  7. When depositing to EPrints, some of the deposit fields should be being autopopulated from CKAN fields – this does not appear to be happening. The fields affected are:
    • “Description” (could be derived from CKAN dataset/resource Description fields)
    • “Type of Data” (could be derived from CKAN resource Format field)
  8. Repository records created from CKAN have the data “Creator” attached, but not the “Maintainer”.
  9. Repository records created from CKAN don’t have a link back to the CKAN dataset (should go in the EPrints “Official URL” field) – this will be required to provide access to the data.
  10. After deposit, users see the error message “Validation errors: [doi] is a required string” – the University does not currently have access to the live DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN.

For a project which is essentially about storing data, we’ve not actually done that much talking about it. This may seem sensible to some — after all, everybody knows what data is, don’t they?

It turns out that what people define as ‘data’ is a hugely wide ranging topic (you can find a myriad of research on how different people define it), and what we’re trying to do is basically trying to fit mis-shapen data into a one-size-fits-nothing storage system. Allow me to elaborate.

First of all we had to look at what data was currently available to us. Fortunately we have some awesome project partners in the School of Engineering who provided us with some of what they’re researching on, and thus presented the first problem: The data doesn’t exist in any kind of standardised format. We’ve got to content with flat text database formats, weird (often invalid) XML, Excel spreadsheets, CSV files (again often invalid), folders of images or audio files, proprietary binary formats, non-binary flat files which nonetheless need parsing to be made understandable, plain strings of data, and the occasional random file format which even the source of the data can’t explain.

The solution to this problem is fairly simple in principle, yet complex in practice. First of all when it comes to archive storage of files (ie without any pre-processing) Orbital is designed to be file type agnostic — if you give it a random stream of bytes and say it’s a file then a Orbital will duly store the file as provided, with no further work needed. It doesn’t care if your XML file has no DTD and has unclosed tags, since it doesn’t do any work inside the stream. You will later be able to retrieve the file exactly as it was first loaded into the system without any changes or alterations. It’s worth pointing out, however, this does mean that if Orbital is given a corrupt file to store then it will do so blindly without any attempt at validation.

(more…)

This is a proposal for a paper at the Open Repositories 2012 conference in July.

The JISC-funded Orbital project is building on earlier work at the University of Lincoln to develop a state-of-the-art research data management infrastructure, piloted with the first purpose-built School of Engineering in the UK in over 20 years.

Orbital (figure c) differs from traditional database applications in three significant ways:

  1. Orbital Core uses MongoDB, a document-oriented, schema-less, so-called ‘NoSQL’ database. MongoDB offers flexibility in that it is capable of accepting an object representing any kind of data (e.g. tabular data, survey results, images) without the need to develop a schema beforehand. MongoDB also includes useful features which can boost performance and resiliency, namely sharding – slicing data across multiple servers so a request may be processed by multiple servers in parallel – and replication — keeping multiple identical copies of data on different servers in case one of them fails. Orbital is also designed to be able to spread the ‘core’ – the application which does the heavy lifting – and the ‘manager’ – the front-end user interface – across multiple servers without causing stress. In our experience MongoDB, combined with the Sphinx search engine to perform full-text searching, is also extremely fast and allows us to develop simple, attractive APIs which we can expose to user applications.
  2. Orbital Core mediates access to the data via an open source OAuth 2 server we have developed and implemented at Lincoln.  The use of OAuth 2 allows access to the data from multiple authorised systems providing that the owner of the data has given permission, instantly opening the Orbital application to third-party extension. This method establishes the identity, authentication and authorisation of users, providing direct access to individual data sets or portions of data sets (e.g. specific rows/columns) through APIs on Orbital Core.
  3. The design and development of Orbital Core is API-driven, resulting in an application that offers 100% of its functionality through APIs, whether to our own Orbital Manager or a third-party application, each of which are treated equally by Orbital Core (figure c). As far as Orbital Core is concerned there is no functional difference between Orbital Manager (the front-end) and an application that a researcher has developed to meet a specific need; they are subject to the exact same access controls, restrictions, sanity checking and limitations. We have therefore eschewed some of the traditional approaches of building a database application, where access to the database is either provided via a stand-alone application (figure a) or via an API bolted on to the database (figure b). Orbital is also designed to be both stateless, i.e. all of the API functions are RESTful and thus represent a complete transaction with no requirement for session affinity, are not reliant on SQL features like transactions and joins, and have a reduced requirement for referential integrity.

Under this design, the API is the only way to interface with the data and functionality of the system. This API-driven approach offers several benefits:

  • Architecture is better: We are forced to think about data types and methods early on. Consistent behaviour across the application is easier to achieve.
  • Development is easier: Calling a well designed API is simple; error messages become cleanly captured by design; APIs encourage code reuse at both API and application end.
  • Updates become simpler:  We can run two or more API versions concurrently; tweak the API back-end and all front-end applications (‘official’ and 3rd party) benefit at once.
  • The APIs are better: The APIs must include everything we want our application to be able to do. Reliability of the API is now critical which encourages better design of resiliency and error handling; and usability of the API is essential which encourages better documentation.

The challenges of this approach are that every time we want to build user-facing functionality we have to assess our APIs and work out where the functionality belongs as well as ensuring that we have lightweight data transfer and reliable error handling designed into the application. We also have to double up on some areas of development, writing both the respective Core and Manager parts of the system.

Illustrations

Figure a: The only way to interact with this application is to either be a user, or pretend to be one (for example via screen-scraping).
Figure b: The most common form of API, consisting of a ‘second view’ on the data and functionality of an application. This style of API often exposes a limited subset of the application’s functionality.
Figure c: In an API-driven model the API is the only way to interface with the application.

As I’ve been looking closer at various requirements for Orbital, as well as other research data management projects, it’s becoming increasingly apparent that Orbital has taken a different tack when it comes to defining what research data actually is. Whilst not a problem, it does lead to a certain disconnect when talking to people with a different idea about what data means. When it comes to storing data the disconnect is even bigger, caused by people experiencing problems breaking the transit format of the data away from the data itself. In true engineering/computing style, it’s time for an analogy. I’m using sweets because hey, sweets are awesome.

Sugar! Sugar!

Imagine a tube of Smarties (or sugar-coated chocolate beans of choice). When I talk about research data I’m talking about the individual smarties, the individual nuggets of information. You could tip 100 tubes of smarties into a bowl and you’d just end up with a big pile of smarties. You could then go through and sort the smarties by colour, or perform some other type of organisation. Since you’ve got the individual smarties out of their containers it’s a lot easier to see a whole overview and work with them all at once.

 

Taking this approach makes sense to me, because if I want to throw in a couple of bags of Peanut M&Ms I can do without suddenly having a tube saying “Smarties” which contains nuts. I can still sort my pile of sweets into colours, or into types. I can orient them by the little letters on top. I could throw in a handful of jelly beans and a bar of chocolate broken into squares, and then order by sugar content, colour, and number of artificial flavours. The possibilities are quite literally limited only by my tolerance for sugar highs.

(more…)