Thursday, 31 July 2008

Digital Repository Services for Research Data Management

More notes from the Edinburgh Repository Fringe:

Luis Martinez Uribe discussed the findings of the Oxford Research Data Management Project - Scoping digital repository services for research data management (heard here for the first time!).

He explained the relative complexity of Oxford University's devolved infrastructure and the Oxford Federation of Digital Repositories, which contains a whole range of digital objects and collections.

The overall aim of the project was to scope requirements for repository services to manage and curate research data, using the RIN classifications of observational, experimental, derived, simulation and reference data.

He conducted scoping study interviews in order to establish how researchers use their data - interviewing 37 researchers across disciplines.

His findings looked at 4 key areas:

Funding: he found that funding bids contained no detailed data planning, that funding came from a variety of sources, and that any plans that did exist were fairly generic. Researchers did indicate an awareness of the need to make data available, but no retention plans etc. were in place.

Data collection: Luis reported the varied origins of research data and the variations in format and file size; indeed, much of the data collected was collated from printed sources.

Processing: the majority of research data was stored on departmental servers or desktops, with little or no metadata and poor annotation. Data was shared by email or portable media, with ensuing storage and sharing problems, particularly for big datasets.

Publication: he established that there were few deposits in national archives (because of the time and effort required of researchers), that some data was published on the web, and that there was general agreement on the usefulness of linking data and publications.

One of the main issues that emerged from the exercise was SUPPORT, or rather the lack of it. There was no mention of the role of librarians, and any support that did exist came directly from departmental IT officers. It was also evident that researchers were very much in favour of a 'federated' approach rather than centralised models, with the impositions these could/would entail.

As a result of the scoping study, Luis identified a need for secure and friendly solutions to store and share data, which can only be achieved by concerted thinking at policy level about a sustainable infrastructure(s) to publish and preserve research data. However, to reiterate, he also identified that a major requirement is support, support, support - support for data management plans, data formats, data sharing policies and guidelines, storage and curation etc.

Luis sees the next steps being to consult more with service providers, libraries, and information and computing services departments in order to address and validate requirements. He also plans to implement the DAF (Data Audit Framework) methodology currently in development by HATII/DCC at the University of Glasgow to audit Oxford's data resources in conjunction with DISC-UK DataShare.

Perhaps much of what was said is already in the minds of many practitioners; however, this first public airing of these findings offers a degree of clarity and articulation on the issues surrounding institutional data auditing.

I'm hoping to catch the end of Prof. Dave de Roure's (Southampton University) presentation on 'How repositories can avoid failing like the Grid' - if I don't get round to blogging this, all the main sessions have been recorded and will be available at a comfortable desktop Edinburgh Repository Fringe cinema near you!

Stuart Macdonald
DISC-UK DataShare

Notes from Edinburgh Repository Fringe

Notes from a small island! Or rather the Edinburgh Repository Fringe event (31 July - 1 August, 2008).

Dorothea Salo's keynote speech provided a controversial overview of why IRs are dead and need to be resuscitated!

She claims "we built it but they didn't come!" and that we (as in the repository community) ignored or didn't quite comprehend the world that academics are immersed in - 'their narrow field of battle'. Nor did we address their 'paranoid' (her words, not mine) and legitimate questions: will my academic work be plagiarised? Is this the institution acting as Big Brother? Which is the authoritative version? Will my publisher be happy?

So, as she states, it is not as simple as saying 'yes, let's have open access'. For example, the software platforms that repositories are based on have neither download statistics nor versioning; they won't let you edit your metadata; there are no facilities to digitise analogue material; they can't stream videos, etc. She painfully admits "I helped to kill the IR - it is dead! - so let's mourn the death of IR".

However, she sees the shape of opportunity in the ashes. Repository software made the same bad assumptions as we did: there are workflows that don't work for born-digital materials, protocols that don't do enough, services that could/should be offered but aren't, and a stunning amount of redundant effort aimed at redressing these problems. What we should be doing is putting effort into better software and better services before the web whizzes past us as we try to catch up! Currently the 'Institutional Repository' is not mashable; IRs are deemed ugly and, given the criteria mentioned above, can be regarded as almost unusable.

She thinks that we're missing opportunities - basically the various stakeholders (data curators, metadata librarians, grant administrators, researchers, system administrators) need to work more closely together. She asks us, the repository community, to 'take one step back - then two steps beyond' - beyond the idealism and the 'green OA'. Our experience is now telling us that peer-reviewed research is not all we care about, that useful research products happen long before publication, and as such open access is a by-product, not an end product. She wants us to look beyond the silos of digital resources and do a good job with the 'stuff', not to be too obsessed by where it goes, to be profligate with our 'stuff' - mash it up, expose it, manage it, mainstream it - no matter where it eventually ends up.

She highlighted the fact that self-archiving doesn't have a management component, and that she's 'tired of watching good code fly past', i.e. open utilities that could be used within the repository environment but aren't.

So rather than adhering to an opinion of the IR that 'everybody knows they all fail!', let's re-think, re-innovate, re-invent the IR as a suite of services and solutions.

For example, regarding harvesting - the content is out there - we just have to get our hands on it; let's have more APIs, allow programmers to be more flexible, let's learn from and invest in relations with commercial services and disciplinary repositories. Let's start healing wounds, start making issues surrounding metadata less confusing, ally with other institutional efforts, placate the bruised egos and mend fences.

She concluded by stating that idealism isn't enough - the IR community have to make themselves useful - to administrators, to faculty; there has to be investment in add-on services - it's time to be responsible, responsive and reactive in rethinking our initial assumptions!

More soon!

Stuart Macdonald
DISC-UK DataShare

Friday, 11 July 2008

Research Data: whose problem is it?

As DataShare project manager I've been asked to attend the strand on "Research Data: whose problem is it?" at the JISC Innovation Forum next week.

Preparation for the 3 sessions has begun with some questions posed in blog posts and comments by attendees. I thought some of this might be of interest to readers of this blog.

The parent URL now includes "live blogs" or notes of each session.

Legal and policy issues
Capacity and skills issues
Technical and infrastructure issues

-Robin Rice

Monday, 7 July 2008

My Faves for Sunday, July 06, 2008

The DOE Data Explorer (DDE) locates scientific research data - such as computer simulations, numeric data files, figures and plots, interactive maps, multimedia, and scientific images - generated by DOE-sponsored research in various science disciplines. The DOE Data Explorer includes a database of citations prepared by the Office of Scientific and Technical Information (OSTI). It is intended for students, the public, and researchers who are looking for experimental or observational data outside their normal field of expertise.

[tags: data sharing, science, research data]


Thursday, 3 July 2008

DSpace for Data

The Edinburgh DataShare repository has implemented a customised metadata schema for deposit of datasets. Although we do not consider ourselves metadata experts (yet!), we have been diving into the Dublin Core standards documents to try to follow best practice and to select and qualify a set of DC elements that will work for description of a broad range of datasets, in a way that is user-friendly and compatible with DSpace version 1.5.

Here is a table that describes each element in use and how it appears in DSpace. We would appreciate having feedback, as it's not too late to change our minds about some of these things.

One of the obvious fields that we wanted to change from the default DSpace setup is "author". Calling someone the author of a dataset makes it sound like a work of fiction! More common for datasets is the term Principal Investigator, generally used by funders for the lead on a research study. Fortunately, in Dublin Core there are only creators and contributors; it is the DSpace user interface that translates these as 'author' for the depositor. So we call creator 'data creator' and we have two kinds of contributors: the depositor (who may or may not be one of the data creators), and the funder. We assist the user by providing a drop-down menu of the most common funders in UKHE, namely the research councils and a few others. Data Publisher is another agent, which may be interpreted as the department or institution of the data creator.
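As a rough illustration of the agent fields described above, the mapping from deposit-form labels to Dublin Core elements might be sketched like this in Python. The qualifier names (such as dc.contributor.depositor) and the sample values are assumptions made for the example, not the schema as deployed in Edinburgh DataShare.

```python
# Hypothetical mapping from deposit-form labels to Dublin Core elements.
# The qualifier names are illustrative assumptions, not the real schema.
FORM_TO_DC = {
    "Data Creator": "dc.creator",
    "Depositor": "dc.contributor.depositor",
    "Funder": "dc.contributor.funder",
    "Data Publisher": "dc.publisher",
}


def to_dc_record(form_values):
    """Turn {form label: [values]} into {DC field: [values]}."""
    record = {}
    for label, values in form_values.items():
        field = FORM_TO_DC[label]  # a KeyError flags an unmapped label
        record.setdefault(field, []).extend(values)
    return record


# Made-up sample values, purely for illustration.
example = to_dc_record({
    "Data Creator": ["Smith, A."],
    "Depositor": ["Jones, B."],
    "Funder": ["Economic and Social Research Council"],
})
```

Keeping the labels in one table means the wording shown to depositors can change without touching the underlying DC fields.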

We used the two DC qualifiers for description - abstract and table of contents - as well. Abstract is used to summarise the content of the dataset, as one would expect. Table of contents we aim to use to document the contents of the files within the dataset. Every dataset should have data and documentation files. These may be in various formats, so we hope that depositors will describe the files that the user will download, along with their purpose. Time will tell if this is too onerous, but we are sympathetic to the user who downloads a zip file containing a large number of cryptic filenames, and needs to know where to begin. This is the equivalent of the traditional readme.txt file.

We have both subject keywords (the depositor can add as many free text terms as they wish) and subject classification using JACS (Joint Academic Coding System), the subject classification used by the Higher Education Statistics Agency and others. We followed JORUM and the Depot's decision to use JACS because it reflects contemporary UK Higher Education and makes it fairly straightforward for non-cataloguers to identify an appropriate subject (unlike other classification systems such as DDC (Dewey), the Universal Decimal Classification, or Library of Congress classification).

Our software engineer, George Hamilton, has written a class to convert JACS to the XML format expected by the DSpace controlled vocabulary mechanism, and as this is not yet implemented in the new DSpace version 1.5, he has written a workaround in JSP for the DSpace Manakin client. Please contact us if anyone would like to use this code.

For the DC coverage field, both the temporal and geographic qualifiers are important for data. We would like to use the Time Period in which the data were collected, rather than a static date, but we may change our minds because DSpace does not support a date range. It would in effect be recording two dates rather than a range.
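To give a flavour of what such a conversion involves, here is a minimal Python sketch (not George's class, which we have not reproduced here). It assumes the nested node / isComposedBy layout used by DSpace's controlled vocabulary files, and the three JACS-style entries and the root id and label are illustrative only.

```python
import xml.etree.ElementTree as ET

# Illustrative JACS-style entries: (code, label, parent code or None).
SAMPLE_JACS = [
    ("F", "Physical Sciences", None),
    ("F100", "Chemistry", "F"),
    ("F300", "Physics", "F"),
]


def build_vocabulary(entries, root_id="JACS", root_label="JACS subjects"):
    """Nest entries as <node> elements inside <isComposedBy> containers."""
    root = ET.Element("node", id=root_id, label=root_label)
    nodes = {None: root}
    for code, label, parent in entries:  # parents must precede children
        parent_node = nodes[parent]
        container = parent_node.find("isComposedBy")
        if container is None:
            container = ET.SubElement(parent_node, "isComposedBy")
        nodes[code] = ET.SubElement(container, "node", id=code, label=label)
    return root


vocab = build_vocabulary(SAMPLE_JACS)
xml_text = ET.tostring(vocab, encoding="unicode")
```

The sketch relies on the list being ordered so that each parent code appears before its children; a real converter would need to sort or validate the input first.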
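One possible way to keep the two dates together is to store them as a single 'start/end' string in the style of an ISO 8601 interval. The Python sketch below shows the idea only; the function names are hypothetical and this is not something we have implemented in DSpace.

```python
from datetime import date


def encode_period(start, end):
    """Join two dates into one 'start/end' string (ISO 8601 interval style)."""
    if end < start:
        raise ValueError("end date precedes start date")
    return f"{start.isoformat()}/{end.isoformat()}"


def decode_period(value):
    """Split a 'start/end' string back into a pair of dates."""
    start_text, end_text = value.split("/")
    return date.fromisoformat(start_text), date.fromisoformat(end_text)


# Arbitrary example dates for a data collection period.
period = encode_period(date(2001, 4, 1), date(2003, 9, 30))
```

A single string like this can sit in one metadata field, at the cost of the repository treating it as text rather than as a date.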

For spatial coverage, we think we have an innovative solution. In order to 'standardise' input, we use a lookup to GeoNames, which is a collaborative, open access database of placenames. The depositor selects a country from a drop-down list and in the next field begins entering placenames, at any granularity. As they type, placenames associated with that country are filled out automatically, so as to avoid typos.
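The type-ahead behaviour can be sketched in a few lines of Python. This is only a stand-in to show the idea, not the DataShare code: the small in-memory list takes the place of a live GeoNames lookup, and the function name and sample places are assumptions.

```python
# Hypothetical local sample standing in for a GeoNames lookup; a real
# deployment would query the GeoNames service instead of this list.
PLACES = [
    ("GB", "Edinburgh"),
    ("GB", "Edenbridge"),
    ("GB", "Glasgow"),
    ("US", "Edison"),
]


def suggest(country, typed, places=PLACES, limit=10):
    """Return placenames in the chosen country that start with the typed text."""
    prefix = typed.casefold()
    matches = [
        name
        for code, name in places
        if code == country and name.casefold().startswith(prefix)
    ]
    return sorted(matches)[:limit]


suggestions = suggest("GB", "ed")
```

Restricting the candidates to the selected country first is what keeps the suggestion list short enough to be useful as the depositor types.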

Again, contact us if you would like to use our GeoNames lookup code. Another solution for geospatial coverage is being used by the ShareGeo repository (formerly GRADE) based at EDINA, in which a map is provided within the user interface for the depositor to select a bounding box to represent their area of data coverage, which gets translated into Ordnance Survey grid reference points (X,Y). Both of these solutions will be demonstrated in the Repository Fringe event coming up at the University of Edinburgh.
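For readers unfamiliar with bounding boxes, the following Python sketch shows how two corner points in X,Y grid coordinates could be normalised into a box. It is not ShareGeo's implementation; the class, its method names and the coordinate values are hypothetical, and the translation from map clicks to grid references is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Rectangular area of coverage in X,Y grid coordinates."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    @classmethod
    def from_corners(cls, corner_a, corner_b):
        """Build a box from any two opposite corners, in either order."""
        (ax, ay), (bx, by) = corner_a, corner_b
        return cls(min(ax, bx), min(ay, by), max(ax, bx), max(ay, by))

    def contains(self, x, y):
        """True if the point lies inside or on the edge of the box."""
        return self.min_x <= x <= self.max_x and self.min_y <= y <= self.max_y


# Arbitrary illustrative coordinates.
box = BoundingBox.from_corners((330000, 680000), (320000, 670000))
```

Normalising to minimum and maximum values means the depositor can drag the box in any direction on the map and still produce the same record.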

We use the "relation" element in a couple of ways. First, to establish links with publications, we use "is referenced by". Second, we use "is version of" to establish if the dataset is held by another repository, or exists in an earlier version within this repository. If the latter, to make that reciprocal, we have a "replaces" field (supersedes) which can only be used after the item is already in the repository. (The original record needs to remain, or at least the metadata for it, in case others are using it and citing the source.)

Finally, we use "source" and "rights" to allow attribution and copyright statements for data derived from other sources. This is separate from the user license (hopefully Creative Commons) that the depositor will be able to assign to the item being deposited.

Robin Rice