APIs first!

After testing the new SWORD2 endpoint for our new ePrints 3.3 instance, we found that a significant change was needed to the SWORD library. The minor changes were the endpoint, which became …/id/contents instead of …/sword-app/deposit/inbox, and the structure of the XML, which changed from <eprint> tags to <entry> tags. The main change was in how the XML was posted. We forked the SWORD library swordappv2-php-library from its GitHub repository so that an XML string could be posted: with our previous method, the endpoint read what we posted as a file rather than as metadata, so the dataset had the XML attached to it as a file, with no metadata. Our additions to the library change it to post a string of XML metadata rather than a file, which fixed the problem; the dataset now gets its metadata once posted, rather than having it attached as a separate file.
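To illustrate the distinction, here is a minimal sketch (not the forked library itself) of posting an Atom <entry> string to a SWORD2 collection endpoint with cURL; the URL and credentials are placeholders, and the key detail is the Content-Type of application/atom+xml;type=entry, which tells the server the body is entry metadata rather than a file to attach.

Example:

// Minimal sketch, not the forked library: POST an Atom <entry> XML string
// to a SWORD2 collection IRI. URL and credentials are placeholders.
$endpoint = 'https://eprints.example.ac.uk/id/contents';
$entry_xml = '<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example dataset</title>
</entry>';

$ch = curl_init($endpoint);
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $entry_xml,
    CURLOPT_USERPWD        => 'username:password',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => array(
        // The body is Atom entry metadata, not a file deposit.
        'Content-Type: application/atom+xml;type=entry',
        // SWORD2 header: false means the deposit is complete.
        'In-Progress: false',
    ),
));
$response = curl_exec($ch);
curl_close($ch);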

Now here’s the main problem. The dataset gets posted to ePrints in a deposited state, which ePrints classes as ‘in review’. ePrints requires a minimal set of metadata before a dataset can be ‘in review’, but only if the dataset is made manually within ePrints, not via the API. Over the API, you can post a dataset straight to ‘in review’ without the mandatory set of metadata. Which brings me to the title of this post: APIs first! API-driven development would mean that the APIs are built first, so this kind of situation would be avoided.

Another problem we came across during the change was that the test account we had for testing deposits no longer existed, because the migration of user accounts had skipped it. That in itself is fine: an unauthorised response should be received on an attempted deposit. Instead, we got an ‘Invalid XML’ response, which was unusual, as the XML was valid, and everything we tried was to no avail. We found the solution by chance, by switching to an account we knew existed, at which point the deposit worked as planned. What had happened was that the deposit had failed because the account did not exist, but the wrong error message was sent back.

So I reiterate: APIs first. Knowing what the responses are, and that the functionality of the application works over the API first, is the most important aspect of said application.

SWORDs and Citations

The Researcher Dashboard has been expanded to interface with the Lincoln Repository, ePrints. From it, a researcher can deposit their datasets directly to the repository, complete with a DOI.

In previous posts, I spoke about how the CKAN and ePrints APIs can interface. We have finally implemented both APIs for use with the Researcher Dashboard and created a usable workflow for depositing datasets from CKAN to ePrints via the dashboard.

The workflow goes as follows, with a rough code sketch after the list:

  1. Hit ‘Publish’
  2. Get latest metadata from CKAN
  3. Prompt user to complete form
  4. Generate DOI
  5. Send metadata to Datacite
  6. Mint DOI
  7. Post SWORD2 to ePrints
  8. Get ePrints ID from response
  9. Add ePrint to SQL database as minimal data
  10. Update dataset in database with ePrint link
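As a very rough illustration of how those steps hang together in a CodeIgniter controller, here is a sketch; every class and method name below (read_dataset, generate, register_metadata and so on) is a hypothetical placeholder rather than the actual Orbital Bridge code, which is still being polished.

// Hypothetical sketch of the publish workflow; class and method names
// are placeholders, not the real Orbital Bridge code.
class Publish extends CI_Controller
{
    public function dataset($project_id, $dataset_id)
    {
        // 1-3. Get the latest metadata from CKAN and merge in the extra
        //      fields the user supplied on the form.
        $dataset = $this->ckan->read_dataset($dataset_id);
        $dataset->merge($this->input->post());

        // 4-6. Generate a DOI, send the metadata to DataCite, then mint it.
        $doi = $this->doi->generate($dataset);
        $this->datacite->register_metadata($doi, $dataset);
        $this->datacite->mint($doi, $dataset->landing_page());

        // 7-8. Post the SWORD2 deposit and read the new ePrints ID back.
        $response  = $this->sword->create_sword($dataset);
        $eprint_id = $response->eprint_id;

        // 9-10. Record the ePrint in the SQL database and link it to the dataset.
        $this->eprints_model->add($eprint_id, $dataset->title());
        $this->datasets_model->link_eprint($dataset_id, $eprint_id);
    }
}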

When a researcher views their project, they are presented with a list of datasets lifted from the project environment in CKAN. If they want to deposit one into ePrints, they can select the deposit button and are prompted to complete the dataset metadata. ePrints requires a minimal set of metadata before the dataset can be deposited: it can be put into a user’s inbox with merely a title, but requires a specific minimal set before depositing.

The DOI is minted to give the dataset a unique identifier, by sending the metadata to DataCite along with the generated DOI; a DOI has to be generated before it can be minted. Again, this is another field that is input to ePrints via the metadata.
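For reference, here is a minimal sketch of the two DataCite calls, assuming the DataCite Metadata Store (MDS) API with placeholder credentials, prefix, metadata and landing-page URL; this is not the Orbital code itself. The metadata is POSTed as DataCite XML first, then the DOI is minted by PUTting the DOI together with the URL it should resolve to.

// Hedged sketch of minting a DOI through the DataCite MDS API.
// Credentials, DOI prefix, XML and landing-page URL are placeholders.
$doi = '10.5072/example-dataset-1';
$datacite_xml = '...';   // DataCite metadata XML built for the dataset

// 1. Register the metadata.
$ch = curl_init('https://mds.datacite.org/metadata');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $datacite_xml,
    CURLOPT_USERPWD        => 'DATACENTRE:password',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => array('Content-Type: application/xml;charset=UTF-8'),
));
curl_exec($ch);
curl_close($ch);

// 2. Mint the DOI by pointing it at the dataset's landing page.
$ch = curl_init('https://mds.datacite.org/doi/' . $doi);
curl_setopt_array($ch, array(
    CURLOPT_CUSTOMREQUEST  => 'PUT',
    CURLOPT_POSTFIELDS     => "doi=$doi\nurl=https://example.lincoln.ac.uk/datasets/example",
    CURLOPT_USERPWD        => 'DATACENTRE:password',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => array('Content-Type: text/plain;charset=UTF-8'),
));
curl_exec($ch);
curl_close($ch);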

The inclusion of ePrints metadata gives an all-in-one approach to the Researcher Dashboard; otherwise, users would have to go into ePrints and fill in the data there, an annoyance easily avoided by having all the necessary steps taken care of on one site. This completes the toolset, so projects now have a central hub of activity. Data is brought into Orbital via the AMS (Awards Management System) for importing funded projects and via CKAN for datasets, and exported to ePrints for depositing into the Lincoln repository.

The original plan for this workflow was published by Paul Stainthorp. The workflow as it currently stands is as written in this post; it is, however, still in the finishing stages, being polished to make sure the process is solid.

CKAN and ePrints APIs

Each application that Orbital interfaces with, be it CKAN, ePrints or anything else, is abstracted through a ‘bridge_application’ library. Orbital is built predominantly in PHP. Using CKAN as an example, we have a Ckan.php file in the folder ‘bridge_applications’ containing all the functions needed to interface with CKAN. If one of the functions it contains is needed, it is called on the page where the result of the function is used.

If a dataset is read, it can be stored as a variable, as the function returns an object. It can be output to the page in Orbital to show what the dataset contains, or saved to a variable to be used with another function.

Example:

$this->load->library('../bridge_applications/ckan');
$datasets = $this->ckan->read_datasets();

$datasets is set to the result of the ckan function. What it is set to depends on the datasets in CKAN. In this example, it returns:

array(1) {
  [0]=>
  object(Dataset_Object)#362 (6) {
    ["_title":protected]=>
    string(11) "********"
    ["_uri_slug":protected]=>
    string(38) "********"
    ["_creators":protected]=>
    array(1) {
      [0]=>
      string(17) "********"
    }
    ["_subjects":protected]=>
    array(0) {
    }
    ["_date":protected]=>
    int(1358507313)
    ["_keywords":protected]=>
    array(3) {
      [0]=>
      object(stdClass)#95 (6) {
        ["vocabulary_id"]=>
        NULL
        ["display_name"]=>
        string(12) "********"
        ["name"]=>
        string(12) "********"
        ["revision_timestamp"]=>
        string(26) "2013-01-18T11:16:59.137985"
        ["state"]=>
        string(6) "active"
        ["id"]=>
        string(36) "********"
      }
    }
  }
}

*Some results are starred out.

As this example only includes one dataset, the result is an array with the dataset as its only entry.

This is converted to the standard format used in Orbital. This standard format is used so that every application Orbital links to has a standard input for data to be sent to, meaning any application can theoretically talk to any other application through Orbital.
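Judging from the dump above, the standard format amounts to something like the following class; this is a simplified sketch inferred from that output, not the actual Orbital class definition.

// Simplified sketch of the standard dataset format, inferred from the
// var_dump above; not the actual Orbital class definition.
class Dataset_Object
{
    protected $_title;      // string: dataset title
    protected $_uri_slug;   // string: identifier used in URIs
    protected $_creators;   // array of creator name strings
    protected $_subjects;   // array of subject strings
    protected $_date;       // int: UNIX timestamp
    protected $_keywords;   // array of keyword objects from CKAN

    public function __construct(array $data = array())
    {
        // Populate any protected property that matches a supplied key.
        foreach ($data as $key => $value) {
            $property = '_' . $key;
            if (property_exists($this, $property)) {
                $this->$property = $value;
            }
        }
    }

    public function title()    { return $this->_title; }
    public function creators() { return $this->_creators; }
}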

The SWORD library, used for SWORD endpoint data entry into ePrints, takes this standard format as input and converts it to the appropriate format before sending it to the ePrints endpoint. The theory here is the same as before: it is a PHP library for a bridge application. It takes the data and uses the endpoint to create a record via SWORD.

Example:

$this->sword->create_sword($dataset);

The dataset taken from CKAN is fed into the SWORD library and sent to ePrints to create a new ePrint from the dataset. This is done by using SimpleXML to build a SWORD-compliant XML object that can be sent via an HTTP cURL request to the ePrints SWORD endpoint. The result is a new entry in ePrints, created via SWORD from the data retrieved from CKAN.
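As an illustration, a SWORD-compliant Atom entry can be built from the standard dataset object roughly as follows; the field mapping is a guess at the kind of thing the bridge library does, not its actual code.

// Rough illustration: build an Atom <entry> from a dataset with SimpleXML.
// The field mapping is illustrative, not the real bridge library code.
$entry = new SimpleXMLElement('<entry xmlns="http://www.w3.org/2005/Atom"/>');
$entry->addChild('title', htmlspecialchars($dataset->title()));
foreach ($dataset->creators() as $creator) {
    $author = $entry->addChild('author');
    $author->addChild('name', htmlspecialchars($creator));
}

// The resulting XML string is what gets posted to the SWORD2 endpoint.
$xml_string = $entry->asXML();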

The code is hosted on GitHub and can be found here:

https://github.com/lncd/Orbital-Bridge/tree/develop/src/application/bridge_applications

MongoDB and SQL

Initially, the Orbital project was developed around MongoDB. CodeIgniter uses SQL as its built-in session storage database, but MongoDB has been used for everything else. In terms of the creation and modification of projects, this has been no problem in Mongo. However, when a project is deleted it also needs to delete any permissions on the project and any related files, and this is where Mongo gets messy: each collection needs to be individually accessed and the items relating to the project found and deleted, as no collections of data are related in any way, unlike in SQL.

A decision has been made to switch from MongoDB to SQL for the handling of project details, permissions and related files, as the relational database functionality it provides means a more structured set of data which is much easier to work with. Although Mongo works perfectly well, and arguably enables much more functionality to be deployed on the tables later on, the amount of programming would increase and become less structured, as each collection would need checking for items related to projects. In SQL this is no longer a problem: because SQL is relational, if a project is deleted, any related items such as permissions and files are guaranteed to be deleted as well. The permissions a user has for each project, as well as the files associated with a project, also need to be linked to the project collection/database.
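As a sketch of why this is simpler, assuming a schema in which the permissions and files tables carry foreign keys to projects with ON DELETE CASCADE (an assumption about the new schema, not its actual definition), deleting a project in a CodeIgniter model reduces to a single query:

// Sketch only: assumes permissions and files tables reference
// projects(id) with ON DELETE CASCADE; table and column names are guesses.
class Projects_model extends CI_Model
{
    public function delete_project($project_id)
    {
        // With MongoDB, each related collection had to be cleaned up
        // individually; with cascading foreign keys, the related
        // permissions and files rows are removed for us.
        $this->db->delete('projects', array('id' => $project_id));
    }
}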

Mongo will still be used across the rest of the project for the actual research data sets, simply not the metadata.

As Orbital has a modular design, the switch to SQL requires little code to be changed. This is because only the models in Orbital Core connect to the database, so only these files will require changes to interface with SQL.