A JISC-funded Managing Research Data project

Posts tagged EPrints

With the Orbital project at its end, and plans for a University research information / research data service afoot, I’m reviewing the excellent work carried out by our (now-departed) developers Harry Newton and Nick Jackson – work which linked up CKAN, the Orbital ‘bridge’ application, and the Lincoln Repository (EPrints) using SWORD – described in earlier blog posts here and here.

“One important piece of work that we’re undertaking at the moment in Orbital is the facility to deposit the existence of a dataset, from CKAN and the University’s new Awards Management System (AMS), into our (EPrints) Repository via SWORD – at the same time requesting a DOI for the dataset via the DataCite API. The software at the centre of this operation is what we refer to as Orbital Bridge.”

This deposit workflow is now broadly working as it should – I think only a few tweaks would be necessary now to turn this into a working tool for the University of Lincoln.

Most urgent is the need for the University to sign up with the DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN and hence formally published by the University. This subscription should form part of the new research information service.

The underlying code could be reused for other SWORD-enabled deposits, from other sources of metadata (e.g. the Library’s discovery system, Find it at Lincoln) to the Lincoln Repository, which is the University’s bibliographic ‘system of record’.

Warning: this is an extremely screenshot-heavy blog post! Click on any one of the screenshots below to view a larger image.

Here’s a step-by-step walkthrough of the entire process of adding a dataset to CKAN, and depositing it as a record in the Lincoln Repository.

  1. Go to the Researcher Dashboard at: https://orbital.lincoln.ac.uk/ and click on “Sign In”.
    Screenshot from the Researcher Dashboard
  2. Enter your staff account ID and password to sign in to the Researcher Dashboard.
    Screenshot from the Researcher Dashboard
  3. Once you have been signed in and returned to the Researcher Dashboard, click on your name (in the top right-hand corner) and then click on “My Projects”.
    Screenshot from the Researcher Dashboard
  4. You will see an overview of your research projects – both funded projects (derived from the AMS), and unfunded projects you have added locally. Click on the name of the project you want to add data to.
    Screenshot from the Researcher Dashboard
  5. You will be taken to a page for that research project. On the right-hand side of this page, under the heading “Options”, click on “Create Research Data Environment”.
    Screenshot from the Researcher Dashboard
  6. You will be taken to the University’s CKAN research data platform, where a page/group will have been created which corresponds to your project in the Researcher Dashboard. Sign in to CKAN using your staff account ID and password (there is currently no single sign-on between the Researcher Dashboard and CKAN) and you should be returned to the same page. However, you will probably be sent instead to the CKAN home page, in which case you will have to look again for your project under the “Groups” menu.
    Screenshot from CKAN
  7. Toward the top of the project screen in CKAN, click on “Add Dataset” > “New Dataset…”.
    Screenshot from CKAN
  8. Fill in the form with information about the overall dataset, including the following fields:
    • Title
    • URL
    • License (N.B. US spelling!)
    • Description
      Screenshot from CKAN
  9. Then click on “Add Dataset”
    Screenshot from CKAN
  10. If you now click on the “Further information” tab on the left-hand menu, you can add the following additional information about the dataset (this is not obvious from the initial dataset form):
    • Author
    • Author email
    • Maintainer
    • Maintainer email
    • Version
    • Summary [of changes]
      Screenshot from CKAN
  11. To attach individual data document(s)—which CKAN refers to as “resources”—to the dataset, scroll down the page and click on “Upload a file” (there are other options) > “Choose file” > “Upload”.
    Screenshot from CKAN
  12. Then fill in the form with the following basic information about the “resource”:
    • Name
    • Description
    • Format
    • Resource Type
    • Datastore enabled (ticked by default)
    • Mimetype
    • Mimetype (Inner)
    • “Extra Fields” (user-defined, or used by Orbital)
      Screenshot from CKAN
  13. To deposit a record for this dataset in the Lincoln Repository, go back to the Orbital Researcher Dashboard at: https://orbital.lincoln.ac.uk/ and navigate to your project. Toward the bottom left of the page you should now see a table containing the dataset(s) you have created in CKAN for this project. Choose which dataset you want to deposit, and hit the “Publish to Lincoln Repository” button.
    Screenshot from the Researcher Dashboard
  14. The Researcher Dashboard will then display a deposit form containing the following fields (some of which should be autopopulated from CKAN fields, but do not appear to be):
    • Title
    • Description
    • Type of Data
    • Keywords
    • Subjects
    • Divisions
    • Metadata visibility [Show|Hide]
    • People
      Screenshot from the Researcher Dashboard
      “Publishing will publicly announce the existence of your dataset on the Lincoln Repository, as well as start the process of long-term preservation of your data.
      “Usually you should only publish a dataset either at the end of a research project, or if the data is being cited in a paper. Publishing a dataset will place some restrictions on the changes you can make to the dataset in the future, such as removing your ability to delete the data. It will also generate a DOI, which allows your dataset to be uniquely identified and located using a simple identifier.
      “Please check the information in this form and make any necessary changes, as this is the information which will be entered into the published record of the dataset.
      “If you have any questions about this process please contact a member of the research services team for advice or assistance.”
  15. When you hit the “Publish Dataset” button, the dataset record from CKAN will be used to create a record in the Lincoln Repository. The record will be submitted for review by the Repository team, who will then make it live. N.B. for the time being, you will see an error “Validation errors: [doi] is a required string” – this happens because the University does not currently have access to the live DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN. This should form part of the new research information service.
    Screenshot from the Researcher Dashboard
  16. Here’s an example of a record in the Lincoln Repository, created from a CKAN dataset and made live by the Repository team.
    Screenshot from the Lincoln Repository

Problems with the deposit process as it currently stands:

  1. Permissions are not correctly cascaded from a project in the Researcher Dashboard to a group in CKAN.
  2. There is currently no single sign-on between the Researcher Dashboard and CKAN.
  3. When CKAN challenges a user to log in to a group, they should be redirected back to the group page after logging in – instead they get sent back to the CKAN home page, in which case they will have to look again for their project under the “Groups” menu.
  4. A minor one – in CKAN “License” (noun) appears in US spelling (should be “Licence”).
  5. In order to add all the information needed to deposit a dataset from CKAN, the user has to click the “Further information” tab on the left-hand menu (this is not obvious from the initial dataset form).
  6. Some of the field labels in CKAN are a bit opaque or use technical terms (“Mimetype”) which could do with explanation.
  7. When depositing to EPrints, some of the deposit fields should be autopopulated from CKAN fields – this does not appear to be happening. The fields affected are:
    • “Description” (could be derived from CKAN dataset/resource Description fields)
    • “Type of Data” (could be derived from CKAN resource Format field)
  8. Repository records created from CKAN have the data “Creator” attached, but not the “Maintainer”.
  9. Repository records created from CKAN don’t have a link back to the CKAN dataset (should go in the EPrints “Official URL” field) – this will be required to provide access to the data.
  10. After deposit, users see the error message “Validation errors: [doi] is a required string” – the University does not currently have access to the live DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN.
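
Problems 7 to 9 above are all about carrying CKAN fields across to the deposit form. Here is a minimal sketch of what that mapping might look like. It is not the Orbital Bridge code: the input keys (notes, author, maintainer, resources, format) are the ones CKAN normally returns for a dataset, but the output field names, the function name and the /dataset/ link pattern are assumptions made for illustration.

```python
def ckan_to_deposit_form(dataset, ckan_base_url):
    """Pre-fill deposit form fields from a CKAN dataset dictionary.

    The output keys are hypothetical names for the deposit form fields.
    """
    resources = dataset.get("resources", [])

    # Problem 7: use the dataset description, falling back to the first
    # resource that has a description of its own.
    description = dataset.get("notes") or next(
        (r["description"] for r in resources if r.get("description")), ""
    )

    # Problem 7: derive "Type of Data" from the resource formats.
    formats = sorted({r["format"] for r in resources if r.get("format")})

    return {
        "title": dataset.get("title", ""),
        "description": description,
        "type_of_data": ", ".join(formats),
        "creator": dataset.get("author", ""),
        # Problem 8: carry the maintainer across as well as the creator.
        "maintainer": dataset.get("maintainer", ""),
        # Problem 9: link back to the CKAN dataset (assumed URL pattern).
        "official_url": f"{ckan_base_url.rstrip('/')}/dataset/{dataset['name']}",
    }
```

The user would still review and correct these values in the deposit form; the sketch only covers the pre-filling.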

After testing the new SWORD2 endpoint for our new ePrints 3.3 instance, we found that a significant change was needed to the SWORD library. Minor changes included the endpoint, which became …/id/contents instead of …/sword-app/deposit/inbox, and the structure of the XML, which changed from <eprint> tags to <entry> tags. The main change was to how the XML was posted. We forked the SWORD library swordappv2-php-library from its GitHub repository so that an XML string could be posted as metadata. This was necessary because our existing method posted a string which the endpoint read as a file rather than as metadata, so the dataset ended up with the XML attached to it as a file and no metadata of its own. Our additions to the library change it to post a string of XML metadata rather than a file. This fixed the problem: the dataset now has its metadata once posted, rather than a separate attached file.
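
To illustrate the difference between posting metadata and posting a file, here is a rough Python sketch of a metadata-only SWORD2 deposit. It is not our PHP library code. The function names and the choice of Dublin Core terms are assumptions for illustration, the collection URL is left as a parameter, and nothing is actually sent over the network. The important part is the Content-Type header, which tells the endpoint to treat the body as an Atom entry of metadata rather than as an uploaded file.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"


def build_atom_entry(title, abstract, creator):
    """Build an Atom <entry> carrying dataset metadata as an XML string."""
    ET.register_namespace("", ATOM_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}abstract").text = abstract
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}creator").text = creator
    return ET.tostring(entry, encoding="unicode")


def metadata_request(collection_url, entry_xml):
    """Describe the HTTP POST for a metadata-only deposit (not sent here)."""
    return {
        "method": "POST",
        "url": collection_url,
        # Marks the body as an Atom entry (metadata), not a file upload.
        "headers": {"Content-Type": "application/atom+xml;type=entry"},
        "body": entry_xml.encode("utf-8"),
    }
```

The dictionary returned by metadata_request could then be handed to whichever HTTP client is in use, along with the depositing user's credentials.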

Now here’s the main problem. The dataset gets posted to ePrints in a deposited state, which ePrints classes as ‘in review’. ePrints requires a minimal set of metadata before a dataset can be ‘in review’, but only if the dataset is created manually within ePrints, not via the API. Over the API, you can post a dataset straight to ‘in review’ without the mandatory set of metadata. Which brings me to the title of this post: APIs first! API-driven development would mean that the APIs are built first, so this kind of situation would be avoided.
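
Since the API does not enforce the mandatory metadata, one workaround is for the client to check for it before posting. The sketch below shows the idea only. The list of required fields is an assumption for illustration and would need to match whatever the repository actually requires for a record to go into review.

```python
# Assumed list of mandatory fields; the real set is defined by the
# repository's own review workflow.
REQUIRED_FIELDS = ("title", "description", "type_of_data", "creators")


def is_blank(value):
    """Treat None, whitespace-only strings and empty collections as blank."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False


def missing_metadata(record, required=REQUIRED_FIELDS):
    """Return the required field names that are absent or blank."""
    return [name for name in required if is_blank(record.get(name))]
```

A deposit would only be posted when missing_metadata returns an empty list; otherwise the user is sent back to the form.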

Another problem we came across during the change was that the account we had for testing deposits no longer existed, because the migration of user accounts had skipped it. That would have been fine if an attempted deposit had returned an ‘unauthorised’ response. Instead we got an ‘Invalid XML’ response, which was puzzling, as the XML was valid, and everything we tried was to no avail. We found the solution by chance: we switched to an account we knew existed, and the deposit worked as planned. The deposit had failed because the account did not exist, but the wrong error message was being sent back.
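
One cheap way to rule the XML in or out when faced with a message like this is to check it locally before blaming the payload. A minimal sketch, with a function name of my own choosing, and testing only for well-formedness rather than validity against any schema:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

If this returns True and the server still reports ‘Invalid XML’, the cause is likely to lie elsewhere, such as the account used for the deposit.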

So I reiterate: APIs first. Knowing what the response will be, and that the functionality of the application works, is the most important aspect of said application.

The Researcher Dashboard has been expanded to interface with the Lincoln Repository, ePrints. From it, a researcher can deposit their datasets directly to the repository, complete with DOI.

In previous posts, I spoke about how the CKAN and ePrints APIs can interface. We have finally implemented both APIs for use with the Researcher Dashboard and created a useable workflow for depositing datasets from CKAN to ePrints via the dashboard.

The workflow goes as follows:

  1. Hit ‘Publish’
  2. Get latest metadata from CKAN
  3. Prompt user to complete form
  4. Generate DOI
  5. Send metadata to DataCite
  6. Mint DOI
  7. Post SWORD2 to ePrints
  8. Get ePrints ID from response
  9. Add ePrint to SQL database as minimal data
  10. Update dataset in database with ePrint link
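
The ten steps above can be sketched as a single function. This is an outline rather than the dashboard's actual code: every service is passed in as a callable, and all of the names in the services dictionary are hypothetical, chosen only to mirror the step list.

```python
def publish_dataset(dataset_id, services, complete_form):
    """Run the publish workflow; called when the user hits 'Publish' (step 1).

    `services` maps hypothetical step names to callables, and
    `complete_form` shows the form and returns the user's edited metadata.
    """
    metadata = services["fetch_ckan_metadata"](dataset_id)      # step 2
    metadata = complete_form(metadata)                          # step 3
    doi = services["generate_doi"](dataset_id)                  # step 4
    metadata["doi"] = doi
    services["send_datacite_metadata"](doi, metadata)           # step 5
    services["mint_doi"](doi)                                   # step 6
    response = services["sword_deposit"](metadata)              # step 7
    eprint_id = response["eprint_id"]                           # step 8
    services["save_eprint"](eprint_id, metadata["title"], doi)  # step 9
    services["link_dataset"](dataset_id, eprint_id)             # step 10
    return eprint_id
```

Passing the services in like this keeps each step replaceable, so the whole sequence can be exercised with stand-ins before any live DataCite or ePrints account is involved.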

When a researcher views their project, they are presented with a list of datasets lifted from the project environment in CKAN. If they want to deposit one into ePrints, they can select the deposit button and are prompted to finish the dataset metadata. ePrints requires a minimal set of metadata before the dataset can be deposited. A record can be put into a user’s inbox with merely a title, but requires a specific minimal set before depositing.

The DOI is minted as a unique identifier by sending the metadata to DataCite along with the generated DOI. A DOI has to be generated before it can be minted. Again, this is another field that is passed to ePrints via the metadata.
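
Generating a DOI just means constructing the identifier string locally; it only becomes resolvable once it has been minted. A minimal sketch, assuming a placeholder prefix and a random suffix (the real prefix is issued with the DataCite subscription, and the dashboard's actual suffix scheme is not shown here):

```python
import re
import uuid

# Loose plausibility check only: "10.", a numeric registrant code, a slash
# and a non-empty suffix. It is not a full DOI syntax validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def generate_doi(prefix, suffix=None):
    """Build a candidate DOI string from a prefix and an optional suffix."""
    suffix = suffix or uuid.uuid4().hex[:12]
    doi = f"{prefix}/{suffix}"
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a plausible DOI: {doi!r}")
    return doi
```

For example, generate_doi("10.1234", "orbital-42") gives "10.1234/orbital-42", where "10.1234" is a made-up prefix.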

The inclusion of ePrints metadata gives the Researcher Dashboard an all-in-one approach. Otherwise, users would have to go into ePrints and fill in the data there, an annoyance easily avoided by having all the necessary steps taken care of on one site. This completes the toolset, so projects now have a central hub of activity. Data is brought into Orbital via the AMS (Awards Management System) for importing funded projects and CKAN for datasets, and exported to ePrints for deposit in the Lincoln Repository.

The original plan for this workflow was published by Paul Stainthorp. The workflow as it currently stands is as written in this post. It is, however, still in the finishing stages and being polished to make sure the process is solid.

Last summer, we adopted CKAN as our data store/repository/catalogue. At that time, I noted that much had happened in the CKAN project in the few months since the start of the Orbital project in November 2011 that made CKAN a more attractive proposition for managing research data.

Recently, someone on the CKAN mailing list pointed to the graph below, which shows that the interest in CKAN has exploded. In November 2011, interest in CKAN was at just a quarter of its current peak, which is double that of September 2012, when we made the switch to CKAN. Following the European Commission and the UK government, the recent decision by the US government to adopt CKAN for the next version of data.gov will only drive interest in and the development of CKAN even further.

It is an exciting time to be observing and part of this explosion of interest. However, it is worth remembering that the interest in CKAN and data management is still very small compared to interest in other, more generic, content management systems. Publishing structured open data remains a niche interest compared to other open practices on the web, such as blogging. Here’s the graph comparing CKAN to WordPress.

Perhaps a fairer comparison would be that of CKAN with open access repository software, such as ePrints and DSpace.

Of course, the cumulative interest of DSpace and of ePrints over the years is greater than that of CKAN, but right now, there is clearly more interest in CKAN and publishing open data, than there is in open access repository software. The open access movement has matured, while the open data movement is growing rapidly. It will be interesting to follow these trends to measure (in part) the maturity of the open data movement, too.

Further to yesterday’s blog post about linking our CKAN datastore with our EPrints Repository (to allow researchers to deposit permanent, public, citable records of their datasets), here’s a fleshed-out diagram of the proposed dataset deposit workflow process.

At the moment, this assumes a one-time “fire and forget” deposit. At some point, we’re going to have to tackle versioning.

The original diagram is available on Lucidchart. See the table in my previous blog post for details of which data fields are involved in the process (i.e. passed between CKAN, Orbital Bridge, the DataCite API, and EPrints).

This is a proposal and still has to be road-tested. Comments welcome.

Diagram of the dataset deposit process

Stages in the proposed deposit process:

  1. User enters project metadata in AMS
  2. AMS creates project container in CKAN
  3. User creates dataset record in CKAN
  4. Nucleus adds user metadata to CKAN
  5. User deposits data in CKAN
  6. User presses “DEPOSIT DATASET” button in CKAN
  7. Orbital Bridge requests DOI
  8. DataCite API returns DOI
  9. Orbital Bridge adds DOI to dataset record in CKAN
  10. User reviews and approves dataset metadata (making changes if necessary)
  11. Orbital Bridge writes changes back to dataset record in CKAN
  12. Orbital Bridge creates a new EPrints record via SWORD
  13. EPrints confirms existence of new record
  14. Orbital Bridge writes EPrints record URL back to CKAN dataset record
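
Stages 9 and 14 both write a value back to the dataset record in CKAN. One possible shape for that, assuming the values are stored in the dataset's list of key/value “extras” (an assumption; the field table in the previous post is the authority on where each value actually goes), is sketched below. The function name and the key names are hypothetical.

```python
def set_extras(dataset, **fields):
    """Return a copy of a CKAN dataset dict with extras added or updated."""
    # CKAN represents extras as a list of {"key": ..., "value": ...} dicts.
    extras = {e["key"]: e["value"] for e in dataset.get("extras", [])}
    extras.update(fields)
    updated = dict(dataset)
    updated["extras"] = [{"key": k, "value": v} for k, v in extras.items()]
    return updated
```

The updated dictionary would then be sent back to CKAN through its API; the original dictionary is left unchanged so that a failed write can be retried.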