Thursday, 3 July 2008

DSpace for Data

The Edinburgh DataShare repository has implemented a customised metadata schema for deposit of datasets. Although we do not consider ourselves metadata experts (yet!), we have been diving into the Dublin Core standards documents to try to follow best practice and to select and qualify a set of DC elements that will work for description of a broad range of datasets, in a way that is user-friendly and compatible with DSpace version 1.5.

Here is a table that describes each element in use and how it appears in DSpace. We would appreciate having feedback, as it's not too late to change our minds about some of these things.

One of the obvious fields that we wanted to change from the default DSpace setup is "author". Calling someone the author of a dataset makes it sound like a work of fiction! More common for datasets is the term Principal Investigator, generally used by funders for the lead on a research study. Fortunately, Dublin Core itself has only creators and contributors; "author" is simply the label shown to the depositor in the user interface. So we call creator 'data creator', and we have two kinds of contributors: the depositor (who may or may not be one of the data creators) and the funder. We assist the user by providing a drop-down menu of the most common funders in UK Higher Education, namely the research councils and a few others. Data Publisher is another agent, which may be interpreted as the department or institution of the data creator.

We also used the two DC qualifiers for description: abstract and table of contents. Abstract is used to summarise the content of the dataset, as one would expect. Table of contents we aim to use to document the contents of the files within the dataset. Every dataset should have data and documentation files. These may be in various formats, so we hope that depositors will describe the files that the user will download, along with their purpose. Time will tell if this is too onerous, but we are sympathetic to the user who downloads a zip file containing a large number of cryptic filenames and needs to know where to begin. This is the equivalent of the traditional readme.txt file.

We have both subject keywords (the depositor can add as many free text terms as they wish) and subject classification using JACS (Joint Academic Coding System), the subject classification used by the Higher Education Statistics Agency and others. We followed JORUM and the Depot's decision to use JACS because it reflects contemporary UK Higher Education and makes it fairly straightforward for non-cataloguers to identify an appropriate subject (unlike other classification systems such as DDC (Dewey), the Universal Decimal Classification, or Library of Congress classification).

Our software engineer, George Hamilton, has written a class to convert JACS to the XML format expected by the DSpace controlled vocabulary mechanism, and as this is not yet implemented in the new DSpace version 1.5, he has written a workaround in JSP for the DSpace Manakin client. Please contact us if you would like to use this code.
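For readers who want a feel for what such a conversion involves, here is a minimal sketch in Python. It is not George's class. It assumes the nested `node` / `isComposedBy` layout used by the sample controlled vocabulary files shipped with DSpace, and the function name `jacs_to_vocabulary` and the handful of JACS entries are illustrative only (check codes and labels against the published JACS list).

```python
import xml.etree.ElementTree as ET

# Illustrative (code, label) pairs only; not a complete or verified JACS list.
SAMPLE_JACS = [
    ("F", "Physical Sciences"),
    ("F100", "Chemistry"),
    ("F300", "Physics"),
    ("L", "Social Studies"),
    ("L100", "Economics"),
]


def jacs_to_vocabulary(entries, root_label="JACS"):
    """Build a nested <node>/<isComposedBy> tree from (code, label) pairs.

    Assumption: one-letter codes are subject groups, and longer codes
    belong to the group named by their first letter.
    """
    root = ET.Element("node", {"id": root_label, "label": root_label})
    root_children = ET.SubElement(root, "isComposedBy")

    groups = {}  # group letter -> the <isComposedBy> element holding its subjects
    for code, label in entries:
        if len(code) == 1:
            node = ET.SubElement(root_children, "node", {"id": code, "label": label})
            groups[code] = ET.SubElement(node, "isComposedBy")

    for code, label in entries:
        if len(code) > 1:
            # Fall back to the top level if the group letter was not listed.
            parent = groups.get(code[0], root_children)
            ET.SubElement(parent, "node", {"id": code, "label": label})

    return root


if __name__ == "__main__":
    print(ET.tostring(jacs_to_vocabulary(SAMPLE_JACS), encoding="unicode"))
```

The two-pass loop means the input does not have to list a subject group before its members.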

For the DC coverage field, both the temporal and geographic qualifiers are important for data. We would like to use Time Period in which the data are collected, rather than a static date, but we may change our minds because DSpace does not support a date range. It would in effect be recording two dates rather than a range.
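One way to keep a time period in a single text value, rather than two unrelated dates, is the DCMI Period encoding, which writes the start and end as labelled components of one string. The sketch below is only an illustration of that idea, not something DataShare has implemented; the function name `dcmi_period` is hypothetical, and whether DSpace would do anything useful with the resulting string is an open question.

```python
from datetime import date


def dcmi_period(start, end, name=None):
    """Format two dates as a single DCMI Period string.

    The components (name, start, end, scheme) follow the DCMI Period
    encoding; dates are written in W3C-DTF (ISO 8601) form.
    """
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    parts = []
    if name:
        parts.append(f"name={name}")
    parts.append(f"start={start.isoformat()}")
    parts.append(f"end={end.isoformat()}")
    parts.append("scheme=W3C-DTF")
    return "; ".join(parts) + ";"


if __name__ == "__main__":
    print(dcmi_period(date(2001, 1, 1), date(2003, 12, 31)))
```

The value is still plain text as far as the repository is concerned, so searching or sorting by date would need the string to be parsed again.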

For spatial coverage, we think we have an innovative solution. In order to 'standardise' input, we use a lookup to GeoNames, which is a collaborative, open access database of placenames. The depositor selects a country from a drop-down list and in the next field begins entering placenames, at any granularity. As they type, placenames associated with that country are filled out automatically, so as to avoid typos.

Again, contact us if you would like to use this code. Another solution for geospatial coverage is being used by the ShareGeo repository (formerly GRADE) based at EDINA, in which a map is provided within the user interface for the depositor to select a bounding box to represent their area of data coverage, which gets translated into Ordnance Survey grid reference points (X,Y). Both of these solutions will be demonstrated in the Repository Fringe event coming up at the University of Edinburgh.
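To give a rough idea of how a country-restricted placename lookup against GeoNames can work, here is a minimal Python sketch. It is not the DataShare code. It assumes the public GeoNames JSON search service and its `name_startsWith`, `country`, `maxRows` and `username` parameters; the function names and the account name are hypothetical, and a real deployment would need its own registered GeoNames username.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: the public GeoNames JSON search web service.
GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"


def build_suggest_url(prefix, country_code, username, max_rows=10):
    """Build a query for placenames in one country starting with `prefix`."""
    params = {
        "name_startsWith": prefix,   # what the depositor has typed so far
        "country": country_code,     # ISO country code from the drop-down
        "maxRows": max_rows,
        "username": username,        # GeoNames account name (hypothetical here)
    }
    return GEONAMES_SEARCH + "?" + urlencode(params)


def parse_suggestions(response_text):
    """Pull the placenames out of a GeoNames JSON search response."""
    data = json.loads(response_text)
    return [place["name"] for place in data.get("geonames", [])]


if __name__ == "__main__":
    # Prints the URL only; fetching it requires network access and a valid account.
    print(build_suggest_url("Edin", "GB", "my_account"))
```

In a web form, the URL would be requested each time the depositor types a few more characters, and the parsed names offered as completions.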

We use the "relation" element in a couple of ways. First, to establish links with publications, we use "is referenced by". Second, we use "is version of" to establish if the dataset is held by another repository, or exists in an earlier version within this repository. If the latter, to make that reciprocal, we have a "replaces" field (supersedes) which can only be used after the item is already in the repository. (The original record needs to remain, or at least the metadata for it, in case others are using it and citing the source.)

Finally, we use "source" and "rights" to allow attribution and copyright statements for data derived from other sources. This is separate from the user licence (hopefully Creative Commons) that the depositor will be able to assign to the item being deposited.

Robin Rice


Richard said...

Hi Robin. I've not followed metadata for datasets for a while now: interesting to see you map it to DC. At NDAD, for TNA, we tried to do it, in quite a verbose way, with ISAD(G). I don't know whence the UKDA schema, though that looks a little closer to yours. What other approaches are you aware of? Is there scope/need for convergence?

Robin Rice said...

Hi Richard,

Interesting, yes your application of ISAD(G) looks verbose but thorough. We are hoping that a form of 'self-archiving' will be used, so had hoped that Dublin Core would keep it simple, though we also expect to do assisted deposits on behalf of the data creators.

Other approaches:
You mentioned UKDA; they use the DDI (Data Documentation Initiative), developed for social science quantitative datasets. We have a deliverable called The Data Documentation Initiative (DDI) and Institutional Repositories, written by Luis Martinez Uribe.

Also, Jane Greenberg at UNC and her colleagues have been working on a DC application profile for biosciences data.

There is a DC application profile for Crystallography data done by the eBank project.

And EDINA is leading a project on a Geospatial Application Profile (GAP) for DC at the moment. Also, the Data Audit Framework Development project led by the Digital Curation Centre is using a layered approach which includes a stage of describing "data assets", which is a form of Dublin Core.

I guess to the extent things map to DC there is some convergence. But inevitably each repository will need to suit its own requirements. I hope to learn more at the upcoming Dublin Core conference in September.