Friday, 30 January 2009

Data Walkabout: Wellington

The second stop on my data walkabout was New Zealand's capital. I spent the morning with the Information Management team at Statistics New Zealand, learning how they are taking the initiative in documenting and archiving legacy datasets for long-term preservation. I'd heard Euan Cochrane's clever presentation at last year's IASSIST conference, so I knew that Stats NZ is unusual among national statistical agencies in adopting the XML-based DDI (Data Documentation Initiative) standard.

Until now I had only heard of DDI being used as a dissemination tool, as in Nesstar, the software developed by the national data archive community, which allows the user to select cases and variables and do basic online analysis before downloading the entire dataset. So I was surprised to hear that while the team marks up datasets in DDI (version 2) using an XML editor such as Stylus Studio, they don't disseminate them that way. They simply store them for posterity, essentially in a dark archive to which only a handful of people have access.

As for dissemination, survey tables and other aggregate datasets are published on the website. Individual-level microdata can be obtained in three ways: by visiting the secure Data Lab in person, by requesting a "CURF" (Confidentialised Unit Record File), or via remote access through ATOM (Access to Microdata). In every case access is restricted, requests are reviewed individually, and a cost recovery charge applies. Individual data on New Zealanders, it is felt, must be carefully guarded, since the population is so small and people have unique attributes by which they could be identified.

A newly formed team, led by Hamish James, who along with Euan ensured my visit was hospitable and informative, has a mission of maintaining an enduring national statistical resource. The passage of time has shown that a) data are meaningless without metadata, and b) business units are reluctant to part with data, even to an organisational data archive. So the team is working hard to build trust with the statisticians who collect and analyse data, by preserving legacy datasets effectively. Eventually, workflows adopted by statisticians will ensure that newer data are properly documented and cared for from the start, hopefully making the archiving process easier.

The team collaborates with other preservation organisations in the city, Archives New Zealand and the National Library, who all meet regularly to exchange best practice. They use tools such as JHOVE (to produce checksums for checking data integrity), DROID, which provides a PRONOM identifier that gives a full description of the file format, and the National Library of NZ Metadata Harvester, which produces an XML file from which an XSLT stylesheet is produced. A local script then helps to fill in a PREMIS preservation metadata record.
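To make the idea of checksum-based integrity checking concrete, here is a minimal sketch in Python. It is not the team's script and does not use JHOVE or DROID; it only uses the standard hashlib module. The function names, the record keys and the format_id parameter are my own assumptions for illustration, and the record is a plain dictionary rather than a real PREMIS document.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_fixity_record(path, format_id=None, algorithm="sha256"):
    """Build a small dictionary describing a file at archiving time.

    The keys are illustrative only and do not follow the PREMIS schema.
    format_id stands in for an identifier such as the one DROID reports.
    """
    path = Path(path)
    return {
        "file_name": path.name,
        "size_bytes": path.stat().st_size,
        "checksum_algorithm": algorithm,
        "checksum": file_checksum(path, algorithm),
        "format_id": format_id,
    }


def verify_fixity(path, record):
    """Return True if the file still matches the checksum stored in record."""
    current = file_checksum(path, record["checksum_algorithm"])
    return current == record["checksum"]
```

The record would be created once when a dataset enters the archive, and verify_fixity would be run again later; a False result means the stored bytes have changed since the record was made.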

In the afternoon I had the pleasure of meeting with Isabella Cawthorne and Julia Watson from the Ministry of Research, Science and Technology (MoRST) over coffee near the Beehive Parliament building (pictured). Isabella, as a policy-maker for research funding, is concerned about incentivising researchers to manage and share data, to avoid having to fund projects that "reinvent the wheel". She says New Zealand needs coordination to get the best value out of environmental research. Julia is working on the e-Research front: the high-speed Karen network has been set up in New Zealand, but applications and middleware still need to be developed. They both believe BESTGrid is a good "bottom-up" example that could serve as a model for further collaboration and development.

Their ideal scenario for environmental data sharing is a federated approach (rather than a central archive), but with authenticated access, based on levels of quality-assured data. (New Zealand is considering joining the Australian Access Federation, which would offer a Shibboleth-based approach to authentication.) They shared a discussion paper commissioned by MoRST called Environment Data 2.0: building the digital platform for a sustainable future, which sets out this vision.

I found this substantial food for thought: what can policy makers and funders put in place to best encourage data sharing in research?
