Come and work with us on research data at Lincoln #jiscmrd

I’m immensely excited that the following Grade 7 developer job at the University of Lincoln (initially for a fixed term of two years) is now open for applications. Please contact me if you’d like to discuss the role. If you don’t know Lincoln, it’s an interesting, historic small city and the University’s waterside Brayford Pool campus is a very nice place to work.

You can download the job description document, and apply online, at:

http://jobs.lincoln.ac.uk/vacancy.aspx?ref=LR4068

Research Information Services Developer

Brayford Library Team

Location: Brayford
Salary: From £30,424 per annum
This post is fixed term for two years
Closing Date: Sunday 30 June 2013
Reference: LR4068

We are seeking to appoint an innovative and enthusiastic software developer, with demonstrable experience and understanding of research in an HE environment.

Based in the Library, and reporting to the Head of Electronic Library Services, this exciting new role will lead on coordinating and developing the University’s services and resources for the researcher community, including support for Open Access publishing and research data management.

You can expect to contribute towards significant institutional change in the way research information and research data is managed, analysed and disseminated at the University of Lincoln.

Working closely with other colleagues within the Library, ICT and the Research Office, you will be responsible for leading the technical design and development of research information services at Lincoln, including research data management, bibliometrics and research intelligence, research dashboarding, and the University’s Institutional Repository.

You must have an excellent understanding of the technologies and programming languages used in developing data-driven web services to support research. You will also have successfully managed projects, have good communication skills, and enjoy working as a member of a team in a busy environment.

You must be able to take initiative, be well organised and have a proven ability to prioritise and meet tight deadlines. A familiarity with the current UK research environment is also essential.

Potential applicants are encouraged to contact Paul Stainthorp for an informal discussion on 01522 886 193 or pstainthorp@lincoln.ac.uk.

Throw down the SWORD

With the Orbital project at its end, and plans for a University research information / research data service afoot, I’m reviewing the excellent work carried out by our (now-departed) developers Harry Newton and Nick Jackson – work which linked up CKAN, the Orbital ‘bridge’ application, and the Lincoln Repository (EPrints) using SWORD – described in earlier blog posts here and here.

“One important piece of work that we’re undertaking at the moment in Orbital is the facility to deposit the existence of a dataset, from CKAN and the University’s new Awards Management System (AMS), into our (EPrints) Repository via SWORD – at the same time requesting a DOI for the dataset via the DataCite API. The software at the centre of this operation is what we refer to as Orbital Bridge.”

This deposit workflow is now broadly working as it should – I think only a few tweaks would be necessary now to turn this into a working tool for the University of Lincoln.

Most urgent is the need for the University to sign up with the DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN and hence formally published by the University. This subscription should form part of the new research information service.

The underlying code could be used for other SWORD-enabled deposits from sources of metadata (e.g. the Library’s discovery system, Find it at Lincoln) to the Lincoln Repository as the University’s bibliographic ‘system of record’.

Warning: this is an extremely screenshot-heavy blog post! Click on any one of the screenshots below to view a larger image.

Here’s a step-by-step walkthrough of the entire process of adding a dataset to CKAN, and depositing it as a record in the Lincoln Repository.

  1. Go to the Researcher Dashboard at: https://orbital.lincoln.ac.uk/ and click on “Sign In”.
    Screenshot from the Researcher Dashboard
  2. Enter your staff account ID and password to sign in to the Researcher Dashboard.
    Screenshot from the Researcher Dashboard
  3. Once you have been signed in and returned to the Researcher Dashboard, click on your name (in the top right-hand corner) and then click on “My Projects”.
    Screenshot from the Researcher Dashboard
  4. You will see an overview of your research projects – both funded projects (derived from the AMS), and unfunded projects you have added locally. Click on the name of the project you want to add data to.
    Screenshot from the Researcher Dashboard
  5. You will be taken to a page for that research project. On the right-hand side of this page, under the heading “Options”, click on “Create Research Data Environment”.
    Screenshot from the Researcher Dashboard
  6. You will be taken to the University’s CKAN research data platform, where a page/group will have been created which corresponds to your project in the Researcher Dashboard. Sign in to CKAN using your staff account ID and password (there is currently no single sign-on between the Researcher Dashboard and CKAN). You should then be returned to the same page; however, you will probably be sent to the CKAN home page instead, in which case you will have to look for your project again under the “Groups” menu.
    Screenshot from CKAN
  7. Toward the top of the project screen in CKAN, click on “Add Dataset” > “New Dataset…”.
    Screenshot from CKAN
  8. Fill in the form with information about the overall dataset, including the following fields:
    • Title
    • URL
    • License (N.B. US spelling!)
    • Description
      Screenshot from CKAN
  9. Then click on “Add Dataset”
    Screenshot from CKAN
  10. If you now click on the “Further information” tab on the left-hand menu, you can add the following additional information about the dataset (this is not obvious from the initial dataset form):
    • Author
    • Author email
    • Maintainer
    • Maintainer email
    • Version
    • Summary [of changes]
      Screenshot from CKAN
  11. To attach individual data document(s)—which CKAN refers to as “resources”—to the dataset, scroll down the page and click on “Upload a file” (there are other options) > “Choose file” > “Upload”.
    Screenshot from CKAN
  12. Then fill in the form with the following basic information about the “resource”:
    • Name
    • Description
    • Format
    • Resource Type
    • Datastore enabled (ticked by default)
    • Mimetype
    • Mimetype (Inner)
    • “Extra Fields” (user-defined, or used by Orbital)
      Screenshot from CKAN
  13. To deposit a record for this dataset in the Lincoln Repository, go back to the Orbital Researcher Dashboard at: https://orbital.lincoln.ac.uk/ and navigate to your project. Toward the bottom left of the page you should now see a table containing the dataset(s) you have created in CKAN for this project. Choose which dataset you want to deposit, and hit the “Publish to Lincoln Repository” button.
    Screenshot from the Researcher Dashboard
  14. The Researcher Dashboard will then display a deposit form containing the following fields (some of which should be autopopulated from CKAN fields, but do not appear to be):
    • Title
    • Description
    • Type of Data
    • Keywords
    • Subjects
    • Divisions
    • Metadata visibility [Show|Hide]
    • People
      Screenshot from the Researcher Dashboard
      “Publishing will publicly announce the existence of your dataset on the Lincoln Repository, as well as start the process of long-term preservation of your data.
      “Usually you should only publish a dataset either at the end of a research project, or if the data is being cited in a paper. Publishing a dataset will place some restrictions on the changes you can make to the dataset in the future, such as removing your ability to delete the data. It will also generate a DOI, which allows your dataset to be uniquely identified and located using a simple identifier.
      “Please check the information in this form and make any necessary changes, as this is the information which will be entered into the published record of the dataset.
      “If you have any questions about this process please contact a member of the research services team for advice or assistance.”
  15. When you hit the “Publish Dataset” button, the dataset record from CKAN will be used to create a record in the Lincoln Repository. The record will be submitted for review by the Repository team, who will then make it live. N.B. for the time being, you will see an error “Validation errors: [doi] is a required string” – this happens because the University does not currently have access to the live DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN. This should form part of the new research information service.
    Screenshot from the Researcher Dashboard
  16. Here’s an example of a record in the Lincoln Repository, created from a CKAN dataset and made live by the Repository team.
    Screenshot from the Lincoln Repository

Problems with the deposit process as it currently stands:

  1. Permissions are not correctly cascaded from a project in the Researcher Dashboard to a group in CKAN.
  2. There is currently no single sign-on between the Researcher Dashboard and CKAN.
  3. When CKAN challenges a user to log in to a group, they should be redirected back to the group page after logging in – instead they get sent back to the CKAN home page, in which case they will have to look again for their project under the “Groups” menu.
  4. A minor one – in CKAN “License” (noun) appears in US spelling (should be “Licence”).
  5. In order to add all the information needed to deposit a dataset from CKAN, the user has to click on the “Further information” tab on the left-hand menu (this is not obvious from the initial dataset form).
  6. Some of the field labels in CKAN are a bit opaque or use technical terms (“Mimetype”) which could do with explanation.
  7. When depositing to EPrints, some of the deposit fields should be autopopulated from CKAN fields – this does not appear to be happening. The fields affected are:
    • “Description” (could be derived from CKAN dataset/resource Description fields)
    • “Type of Data” (could be derived from CKAN resource Format field)
  8. Repository records created from CKAN have the data “Creator” attached, but not the “Maintainer”.
  9. Repository records created from CKAN don’t have a link back to the CKAN dataset (should go in the EPrints “Official URL” field) – this will be required to provide access to the data.
  10. After deposit, users see the error message “Validation errors: [doi] is a required string” – the University does not currently have access to the live DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN.
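
Problems 7 to 9 in the list above all come down to mapping CKAN fields onto deposit fields. The following is only a rough sketch of what such a mapping might look like, not the Orbital code. It assumes the dataset arrives as a CKAN-style dictionary (with `title`, `notes`, `author`, `maintainer`, `name` and `resources` keys); the output key names and the `/dataset/<name>` URL pattern are hypothetical choices made for illustration.

```python
def ckan_to_deposit_fields(dataset, ckan_base_url):
    """Prefill deposit-form fields from a CKAN-style dataset dictionary.

    The output key names are hypothetical; they are not EPrints field names.
    """
    resources = dataset.get("resources", [])

    # Problem 7: use the dataset description, falling back to the first
    # resource description that is not empty.
    description = dataset.get("notes") or next(
        (r["description"] for r in resources if r.get("description")), ""
    )

    # Problem 7: derive "Type of Data" from the distinct resource formats.
    formats = sorted({r["format"] for r in resources if r.get("format")})

    return {
        "title": dataset.get("title", ""),
        "description": description,
        "type_of_data": ", ".join(formats),
        "creator": dataset.get("author", ""),
        # Problem 8: carry the maintainer across as well as the creator.
        "maintainer": dataset.get("maintainer", ""),
        # Problem 9: link back to the CKAN dataset (assumed URL pattern).
        "official_url": f"{ckan_base_url.rstrip('/')}/dataset/{dataset['name']}",
    }
```

A deposit form could call this once per dataset and show the result as editable defaults, so that the researcher still has the final say over what goes into the published record.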

The Researcher Dashboard

At the JISC MRD final programme meeting, I demonstrated the work we’ve done to integrate disparate research information systems at Lincoln and begin to develop a workflow between them for the deposit of datasets. Below are two videos which run through the website and application that was called ‘Orbital Bridge’ and is now referred to as the ‘Researcher Dashboard’.

The screencasts are a bit rough and ready – I’m no voice actor – but they should give you an idea of what we’ve done and the way that this work points forward to further integration and an increased aggregation and re-presentation of research information, of which datasets are a part. Feel free to post questions in the comments box below. Thanks.

The first video (7 mins) discusses the website in general and the ‘My Profile’ section of the Researcher Dashboard.

The second video (13 mins) discusses the ‘My Projects’ section of the Dashboard and gives an overview of the integration and workflow between different research information systems. I haven’t gone into any detail on the actual use of CKAN as we’re using a fairly vanilla 1.7.2 version, with just some additional branding and authentication work. If you’re interested in how CKAN works, I recommend that you try http://demo.ckan.org. Much of our development with CKAN up to now has been through interaction with its APIs to set up groups and users from the Researcher Dashboard and pull information about datasets from CKAN into the Dashboard.
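
To give a flavour of what pulling dataset information from CKAN over its API can look like, here is a minimal sketch. It is not the Researcher Dashboard code. It assumes CKAN’s Action API is reachable at `/api/action/<action>`, that it accepts a JSON body via POST and replies with a `success`/`result` envelope, and that an API key can be passed in the `Authorization` header. The function names and the example base URL are invented for illustration, and nothing is sent over the network.

```python
import json
import urllib.request


def build_action_request(base_url, action, params, api_key=None):
    """Build (but do not send) a POST request for a CKAN-style action."""
    url = f"{base_url.rstrip('/')}/api/action/{action}"
    body = json.dumps(params).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    if api_key:
        # Assumption: the API key travels in the Authorization header.
        request.add_header("Authorization", api_key)
    return request


def unwrap_action_response(raw_bytes):
    """Return the 'result' part of a response, or raise if it failed."""
    payload = json.loads(raw_bytes.decode("utf-8"))
    if not payload.get("success"):
        raise RuntimeError(f"CKAN action failed: {payload.get('error')}")
    return payload["result"]


# Example: prepare a request for one dataset's details.
request = build_action_request(
    "https://ckan.example.org", "package_show", {"id": "example-dataset"}
)
```

Passing `request` to `urllib.request.urlopen` and the response body to `unwrap_action_response` would complete the round trip; keeping the two halves separate makes each one easy to test without a live CKAN instance.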

We’re just about to advertise for a Research Services Developer post (c. £30K/yr), and look forward to that person picking up this work and developing the application and CKAN further. More details on the new role will be posted here in due course.

There are a couple of things that I left out of these videos that I should also mention:

Information about a project in the Researcher Dashboard can be edited from the project page, which provides a place where researchers can add further information about the project that is not being collected by the Awards Management System. It also allows the Researcher to add people to the project and assign them roles so they can edit that information, too.

If I haven’t emphasised it enough already, one of the ‘features’ of the Dashboard is that it’s an application that allows us to collect more information about research at Lincoln and starts to link it all together. For the first time, information about people, projects, funding, research outputs, datasets and metrics is being brought together in a structured way. With this information, we can go on to build other applications (e.g. a database of research expertise) based on information provided by researchers themselves, enhanced by some simple text mining and clever semantic tagging.

Finally, the documentation on the website is managed by a simple ‘content management system’ that we built for Librarians and other staff who support research at Lincoln. All of the training materials are easily accessible to a non-technical user and can be edited on the website or, optionally, managed through GitHub. This way the site’s content can remain up-to-date and accessible to content authors without having to ask developers of the site to add/edit content for them.

Robotics data now stored in CKAN

I’ve had to delay this post until confirmation of Tom’s project funding came through, but I’m pleased to be able to say that we’ve published our first complete research dataset(s) on CKAN.

Some months ago, researchers Tom Duckett and Feras Dayoub came to us asking if we could host their data to support two publications and an EU grant application they were about to submit. We quickly stuck the data on one of our servers, they knocked up some HTML pages and we advised them on licensing the data so that it could be re-used. It was a temporary solution but we assured them that their root domain name would always act as a proxy to the final resting place of their data and so they started to tell the world about it. I’m told there was much interest in their data on specialist mailing lists and we were invited to submit a paper which discussed the data and the process of its publication. Their consortium bid for EU funding was also successful. Here’s what Tom had to say:

I believe that publishing our datasets for long-term robotic mapping has helped us: 1) to achieve greater awareness of our work (we were among the first groups in the world to study long-term mapping by mobile robots, in research from 2004-present), enabling other researchers worldwide to use our data, 2) to increase citations to our REF-able research papers in this area, and 3) to play our part in successfully applying for a 4-year FP7 IP project in collaboration with 7 other partners, by showing that we already have a track record in hosting such datasets. (STRANDS project – joint PIs at Lincoln: Marc Hanheide and Tom Duckett). One of the requirements of this project will be to publish even larger datasets of robot data, so we look forward to collaborating with Joss and colleagues again in future to address the challenges of hosting and curating “big data” for robotics research.

Prior to switching to CKAN, we were just about to move Tom and Feras’ data across to our own Orbital software, which met their minimal requirements, but having now switched to integrating with CKAN, we’ve moved the datasets to their permanent home at https://ckan.lincoln.ac.uk.

Just as we promised, Tom and Feras are still able to direct people to the original web address we gave them which points to their research pages, but the data itself is now hosted on CKAN. Having seen Tom’s data presented in this way, his colleague Greg published his data in the same way, using our WordPress platform to build a site explaining the data and CKAN as the actual data store.

This all happened before we had our Orbital Bridge publishing workflow in place (a post on that in a couple of weeks) and in the absence of a working Orbital application, I uploaded the data on Tom and Feras’ behalf. I spent quite some time using CKAN and can make the following observations about version 1.7.x, which is what we currently use.

  • Batch uploads: The data was zipped up into four collections of zip files. My task was to duplicate the organisation of the data which made sense to the researchers. This was possible as you can see, but it was tedious uploading each of the 29 zip files, many of which were over 1GB each. There were no problems doing so; it was just tedious, and better batch upload/edit operations in CKAN would make this much easier. Ideally, I’d like to have uploaded the zip files from each of the four collections of data, catalogued them by batch where they shared the same information and then individually edited attributes like the title of each zip file, for example. Having been an Archivist on and off for the last decade, this is one of the main gripes we have with library and archive systems. When dealing with collections of things, we need to be able to operate on them as collections and not have to deal with each object individually. I’ve spoken to CKAN developers about this and there are work-arounds, using scripts and a form extension, but it’s not something CKAN offers to most users with ease. Yet! 🙂
  • Research Groups and projects: The v1.7.x version of CKAN understands the concept of ‘dataset’ e.g. https://ckan.lincoln.ac.uk/en/dataset/ltmro-1 and of that dataset containing discrete resources, e.g. https://ckan.lincoln.ac.uk/en/dataset/ltmro-1/resource/92cbf22b-3293-45a3-b1de-f7782e581fe8 CKAN also understands the concept of ‘groups’ e.g. https://ckan.lincoln.ac.uk/en/group/lincoln-centre-for-autonomous-systems which datasets can be attached to. Groups are simply a label you apply to a dataset. You can add people to a group with specific read/write permissions over the group and you can add datasets to the group, too. CKAN also maintains a history of the actions of that group e.g. https://ckan.lincoln.ac.uk/en/group/history/lincoln-centre-for-autonomous-systems However, currently, CKAN does not (yet) understand ‘projects’, i.e. an organisational concept that is role-based and allows a user to administer other users and work. Groups are not synonymous with projects, but we think that a new feature in CKAN v2.0, due for release in a month or so, will resolve this. As I understand it, CKAN organisations will work like GitHub organisations and if so, that’s good. On GitHub, our research group, LNCD, is an ‘organisation’ and within that organisation I can add/remove people, give them roles, create private and public repositories (‘datasets’) and we can be members of more than one organisation, too. e.g. http://github.com/lncd and http://github.com/josswinn There is already a CKAN extension that implements organisations, but we’re waiting for this work to be merged into the core code.
  • Citations: If you look at Tom’s original web pages for their data, they are pretty clear in providing details about how to cite their data. This is so important to academics. CKAN does not offer a way to automatically generate a suggested citation for people who use the data. EPrints, on the other hand, offers the citation details of a research paper right at the top of the publication record e.g. http://eprints.lincoln.ac.uk/6046/ Some work on citations for CKAN has happened – there were conversations a few weeks ago on the IRC channel – but it’s something we need to work on, too. As a temporary solution, I have added the paper citation details as additional fields in the dataset record. CKAN is nice in that it allows you to add ad hoc key-value pairs when cataloguing. However, this doesn’t address the citation details for the actual datasets themselves, but rather the publications.
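
On the citations point, a suggested citation for the dataset itself could in principle be assembled from fields already held in the catalogue record. The sketch below is one possible approach, not something CKAN provides. It assumes a layout loosely modelled on the “Creator (Year): Title. Publisher. Identifier” pattern, and the function name and all example values are placeholders.

```python
def suggest_citation(creators, year, title, publisher, identifier):
    """Assemble a suggested dataset citation from catalogue fields.

    Layout (an assumption): Creator; Creator (Year): Title. Publisher. Identifier
    """
    names = "; ".join(creators)
    # Strip trailing full stops so the separators are not doubled.
    clean_title = title.rstrip(". ")
    clean_publisher = publisher.rstrip(". ")
    return f"{names} ({year}): {clean_title}. {clean_publisher}. {identifier}"


# Placeholder values only; these are not the details of any real dataset.
citation = suggest_citation(
    ["Surname, A.", "Other, B."],
    2013,
    "Example dataset.",
    "Example University",
    "https://ckan.example.org/dataset/example",
)
```

Once a DOI is available for a dataset, it would be the natural value for `identifier`; until then the dataset URL is the best available stand-in.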

In the near future, our ‘Researcher Dashboard’ application (codenamed ‘Orbital Bridge’) will handle the data deposit workflow, from project creation to grabbing a DataCite DOI to setting up a CKAN environment to depositing a record of the data in EPrints for curation and preservation by the university. However, the upload and cataloguing of data will still be done by the researcher using CKAN, with Orbital aggregating information about the project, publications and data into a ‘dashboard’ for the researcher. It will look something like the image below, which is an actual screenshot of another project that we’re using to test the ‘Researcher Dashboard’. More on this soon…

Example research project overview

APIs first!

After testing the new SWORD2 endpoint for our new ePrints 3.3 instance, we found that a significant change was needed for the SWORD library. Minor changes included the endpoint, which became …/id/contents instead of …/sword-app/deposit/inbox, and the structure of the XML changing from <eprint> tags to <entry> tags. The main change was the implementation of how the XML was posted. The SWORD library swordappv2-php-library was forked from the github repository so that an XML string could be posted. This was because our current method posted a string, which the endpoint read as a file rather than metadata. So the dataset had the XML attached to it as a file, with no metadata. We have made additions to the library, changing it to post a string of XML metadata rather than a file. This fixed the problem, giving the dataset metadata once posted rather than attaching it as a separate file.
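
The difference between the two kinds of deposit is easier to see side by side. What follows is a Python sketch of the general idea, not the swordappv2-php-library code and not our fork of it. It assumes that a SWORD2 server treats a POST as metadata when the body is an Atom entry sent with the `application/atom+xml;type=entry` content type, and as a file when it arrives with `Content-Disposition` and `Packaging` headers. The function names and the choice of Dublin Core terms are illustrative only, and nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("atom", ATOM)
ET.register_namespace("dcterms", DCTERMS)


def build_metadata_deposit(title, abstract, creator):
    """Headers and body for a metadata-only deposit (an Atom entry)."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{DCTERMS}}}abstract").text = abstract
    ET.SubElement(entry, f"{{{DCTERMS}}}creator").text = creator
    headers = {"Content-Type": "application/atom+xml;type=entry"}
    return headers, ET.tostring(entry, encoding="utf-8")


def build_file_deposit(filename, payload):
    """Headers and body for a file deposit; the server stores the bytes as a file."""
    headers = {
        "Content-Type": "application/octet-stream",
        "Content-Disposition": f"attachment; filename={filename}",
        "Packaging": "http://purl.org/net/sword/package/Binary",
    }
    return headers, payload


# The same XML ends up as record metadata or as an attached file,
# depending only on how it is posted.
meta_headers, meta_body = build_metadata_deposit("Example", "A summary", "Surname, A.")
file_headers, file_body = build_file_deposit("metadata.xml", meta_body)
```

Posting `meta_body` with `file_headers` would reproduce the symptom described above: a record with an XML file attached and no metadata of its own.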

Now here’s the main problem. The dataset gets posted to ePrints in a deposited state, which ePrints classes as ‘in review’. ePrints requires a minimal set of metadata before a dataset can be ‘in review’, but only if the dataset is created manually within ePrints, not via the API. Over the API, you can post a dataset straight to ‘in review’ without the mandatory set of metadata. Which brings me to the title of this post: APIs first! API-driven development would mean that the APIs are built first, so this kind of situation would be avoided.
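
Until the API enforces the same rules as the web interface, one workaround is for the depositing application to check the metadata itself before posting. The sketch below shows the idea. The list of required fields is purely hypothetical (the actual mandatory set in ePrints is not reproduced here), and the function names are invented for illustration.

```python
# Hypothetical stand-in for the repository's mandatory metadata set.
REQUIRED_FIELDS = ("title", "creators", "abstract")


def missing_fields(metadata, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty, in order."""
    return [name for name in required if not metadata.get(name)]


def ready_for_review(metadata, required=REQUIRED_FIELDS):
    """True only if every required field has a non-empty value."""
    return not missing_fields(metadata, required)


# Example: an incomplete record is caught before any deposit is attempted.
record = {"title": "Example dataset", "creators": [], "abstract": ""}
problems = missing_fields(record)
```

This only papers over the gap on the client side; the point of ‘APIs first’ is that the repository’s API should be the one refusing an incomplete record.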

Another problem we came across during the change was that the account we used for testing deposits no longer existed, because the migration of user accounts had skipped it. That in itself is fine: an attempted deposit should simply receive an ‘unauthorised’ response. Instead, we got an ‘Invalid XML’ response, which was puzzling, as the XML was valid, and nothing we tried made any difference. It was only by chance that we found the solution: we switched to an account we knew existed and the deposit worked as planned. The deposit had failed because the account did not exist, but the wrong error message was being sent back.

So I reiterate: APIs first. Knowing what the response will be, and that the functionality of the application works, comes first; it is the most important aspect of said application.