A JISC-funded Managing Research Data project

Posts tagged workflow

Further to yesterday’s blog post about linking our CKAN datastore with our EPrints Repository (to allow researchers to deposit permanent, public, citable records of their datasets), here’s a fleshed-out diagram of the proposed dataset deposit workflow process.

At the moment, this assumes a one-time “fire and forget” deposit. At some point, we’re going to have to tackle versioning.

The original diagram is available on Lucidchart. See the table in my previous blog post for details of which data fields are involved in the process (i.e. passed between CKAN, Orbital Bridge, the DataCite API, and EPrints).

This is a proposal and still has to be road-tested. Comments welcome.

Diagram of the dataset deposit process

Stages in the proposed deposit process:

  1. User enters project metadata in AMS
  2. AMS creates project container in CKAN
  3. User creates dataset record in CKAN
  4. Nucleus adds user metadata to CKAN
  5. User deposits data in CKAN
  6. User presses “DEPOSIT DATASET” button in CKAN
  7. Orbital Bridge requests DOI
  8. DataCite API returns DOI
  9. Orbital Bridge adds DOI to dataset record in CKAN
  10. User reviews and approves dataset metadata (making changes if necessary)
  11. Orbital Bridge writes changes back to dataset record in CKAN
  12. Orbital Bridge creates a new EPrints record via SWORD
  13. EPrints confirms existence of new record
  14. Orbital Bridge writes EPrints record URL back to CKAN dataset record

As part of Orbital’s development we need to keep what we’re doing on track, and ensure that what is produced is actually what people are after. We’re building the project using agile development methods, which mean that instead of generating a load of documentation and exacting requirements up front and then building software, we generate a basic set of requirements, start developing and then return to look at new or changed requirements at regular intervals.

Keeping tabs on this kind of thing requires a management tool, and in our case we’re using the wonderful Pivotal Tracker, and here’s why.

Pivotal allows us to break down user requirements (gathered through a variety of means, including meetings, surveys, observation and so-on) into discreet bundles called ‘stories’, each of which represents something that a user needs (or wants) to be able to do with the final product. An example may be “project administrators must be able to assign roles to project users”, or “users must be able to manually add a data point”. By creating these stories it starts to become clearer what actually needs to be done.

From there we can start to fully analyse each of these stories, providing them with information such as a ‘score’ of how difficult to achieve each story will be, or including sub-tasks for actual development purposes. Stories can be assigned to various people based on who needs to be involved, and go through a clearly defined workflow of being started, being finished, being delivered in a product version and being approved by the customer.

On top of this management of user stories we can also pack out Pivotal with higher-level package deliverables and deadlines, along with bug reporting and general project chores. Once we’ve got all these things into the Tracker we’re able to re-order them as priorities shift, giving us an instant overview of what’s happening in the current iteration (a 2-week long development cycle) as well as what’s going to be happening in future iterations. At this point, Pivotal Tracker comes into its own with something called ‘emergent planning’.

Emergent planning takes a look at how we’re actually performing in terms of crunching through our list of user stories and dynamically adjusts which stories we’re going to be tackling in upcoming iterations. If we’re doing well we begin to see more points worth of development per iteration, and if we’re slipping then Tracker gives us fewer. Since we’ve told Pivotal what needs to happen before certain deadlines are met (when we ordered stories and tasks), and since Pivotal knows roughly how fast we’re working, it’s easy to see if we’re predicted to hit or miss development milestones.

Want to see what we’re up to? Our Pivotal Tracker project is open for you all to see.

Last week Joss and I had a chat with one of our engineering researchers about the kinds of data he handles in his research. This was an incredibly useful meeting, leading to a whole bunch of notes on data types, requirements and workflows. The one I’m taking a look at today is the flow that data takes from its source, through storage and processing, and into a useful research conclusion.

The existing workflow looks something like that shown above. Source data is manually transferred (often using ‘in-the-clear’ methods) from its point of origin to local storage on a researcher’s machine, where it will reside on the hard disk until it’s used. From there the data is processed (Engineering love using MATLAB, as do a lot of other science disciplines, so that’s the example here) and potentially the results of that analysis are recombined with the local storage for further work. At some point the processing will arrive at conclusions for the data, and from those an output can be drawn.