DataShare Blog

Last post to this blog

2010-04-01T16:20:00.005+01:00

The DISC-UK DataShare project ended last year, and so this marks the end of this blog. We did write up our self-evaluation for JISC, our funder, in December, which any project followers might be interested to read.

As for follow-up activities, there are many. Oxford is undertaking the Embedding Institutional Data Curation Services in Research (EIDCSR) project, funded by JISC. Southampton also has a new JISC-funded project, Institutional Data Management Blueprint, which aims 'to create a practical and attainable institutional framework for managing research data that facilitates ambitious national and international e-research practice.' Edinburgh produced online guidance for University staff on research data management and continues to host and develop the Edinburgh DataShare repository service developed as part of the project.

All three partners are interested in pursuing institutional policies for research data management. We'll continue to collaborate actively through DISC-UK.

DataShare presentation at DSpace User Group Meeting 2009

2009-11-12T16:31:00.002+00:00

Robin Rice gave a presentation to the DSpace User Group Meeting in Gothenburg, Sweden, on 15 October 2009, entitled "Edinburgh DataShare: Tackling research data in a DSpace institutional repository."

The presentation is saved in the University of Gothenburg repository, GUPEA, along with a video of the live presentation.

2009-10-02T16:44:00.003+01:00

As a follow-up activity to the DataShare and Data Audit Framework projects at the University of Edinburgh, a new set of web pages has been written as guidance for university researchers on research data management, sharing and preservation, published as part of a re-launched University of Edinburgh Information Services website.

The shortcut URL is:
http://www.ed.ac.uk/is/data-management

Similar initiatives have been collected and linked on a page from the Australian National Data Service.

My Faves for Thursday, September 10, 2009

2009-09-11T01:38:00.000+01:00

Data Sharing : Specials : Nature News

Nature data sharing special edition, including details on pre-publication and post-publication data sharing and tools

[tags: Data curation, data sharing, data management]

See the rest of my Faves at Faves

Repository Fringe 2009: Round Table Sessions

2009-08-03T15:07:00.004+01:00

We have an extra treat from Repository Fringe courtesy of Robin Taylor of Edinburgh University Main Library who has kindly allowed us to share his notes on Round Table Sessions A and C:

Round Table A: "Practical impact and experiences of institutional OA mandates for IRs" (Helen Muir, Queen Margaret University)

Opening remarks from Helen Muir indicated that there had been resistance to the mandate:

Perception that it was a time burden.
Researchers resented feeling pressurised.

There are two approaches available:
Carrot approach

Emphasise benefits with Google Analytics.
Ease researchers' concerns by utilising the ePrints 'request a copy' feature.
Use Scopus to find missing articles.

Stick approach
Should there be any penalty for not complying with a mandate? The consensus was no.

Without exception all those present had a mediated deposit process. This might be library staff but might also be department research administrators depositing on behalf of staff. A perceived advantage of using research administrators was that they know the staff and therefore also know who to chase up etc.

Other observations from participants:

There is a difficulty in getting 'final' versions.
Targeting 'star' researchers appears to have a knock-on effect.

Additional thoughts on how to emphasise benefits to the researcher included:

Publications lists for web pages seen as an important incentive.
Statistics to demonstrate increased exposure.
Advocacy appears to work: deposit rates increased amongst those who had had the benefits explained.

Amongst those institutions that have a mandate there was agreement that it could not be seen as the final solution. One such institution reported a 25% compliance rate. Nevertheless there was agreement that mandates were generally a good thing. They raise the profile of the IR and generate debate on the subject. Mandates are considered another useful tool in the battle for deposit.

Round Table C: "Where will repositories be in 5 years' time?" (Ian Stuart, EDINA)

The key points in this session were that:

IRs will become part of a wider research management process.
Research pools are difficult to manage - cross institution, cross discipline.
Identity management and naming authorities are important factors in delivering trusted and consistent repositories.
The hope and/or expectation is that IRs provide the core management of data with agile services using the data in varied ways. There is a need to avoid IR software becoming monolithic.
The expectation is that there will ultimately be many copies of items available from different sources on the internet. Is this in itself a valid form of preservation?
Will current peer review practices change? Could a more informal model work where works are 'reviewed' in the public domain (the internet)?
Who will manage the content? University departments? Do they have sufficient resources to do so?
Will IRs hold content or could they just be 'virtual' repositories pointing to the content held elsewhere? This brings us back to the role of IRs and preservation.

This was a lively and useful session, and it was therefore only possible to capture an idea of the questions and ideas raised.

Repository Fringe 2009: Closing Plenary and Thanks

2009-07-31T15:45:00.010+01:00

Peter Burnhill is introducing our last session of the Fringe with thanks and acknowledgments of organizers and supporters. He has also paid tribute to Rachel Heery who was a friend and colleague of many at this event and who sadly passed away last week. A tribute to her from her colleagues can be found on the UKOLN website.

Finally Peter has handed over to our Closing Plenary by Clifford Lynch, Director of the Coalition for Networked Information (CNI).

Clifford arrived this morning and has been able to attend some very useful Round Table sessions and he's been reading up on the previous sessions so there will be some reflections on what's been said in the last few days in his talk.

Cliff opened by saying that, as some of us will know, he conducted a low profile reconnaissance of the scene here about 10 days ago: one of the things I learned was that Edinburgh has been a hotbed of repository development since the 18th century. If you go to the records office you can get a copy of a document about the building of General Register House called "A proper repository"!

Today I want to talk about four major areas at length:

Repository services
Repository services in the broader ecosystem. I think we are also challenged to understand the scope and ambitions of what we call repositories
Repositories and the life-cycle of what goes in repositories. There are some interesting questions there that we're probably not thinking about quite enough.
And finally I wanted to talk about selling or building support for repositories. It's not really the right heading. Some of the things I want to talk about are support, some are about the purpose of repositories and there are some other thoughts in there.

I want to start by commending Sally and Ben for their talk which, based on the blog, touched on many of the issues I've been thinking about and talking about and seeing as key issues for repositories for some time. I'll have a few things to say which will lead on from their talk but I think there is a great deal of abiding knowledge in that talk and it is a great picture of where we are now and where we might go.

Repository Services
One of the things I found so interesting about the discussions this morning has been the distinction that many people seem to be making between data repositories and what they call repositories of scholarly outputs, e-prints or similar terms. They are perceived, with some justice, to have some different characteristics and behaviors. You could spend a lot creating a much bigger repository than you would ever need for an e-print repository: people just can't write that fast. But when you get other types of data in there the storage is consumed rapidly and you hit real policy issues about curation and the rationing of storage.

I think in the States people often think that a repository can do everything. It's interesting to me to turn it around and talk about e-print repository services, data repository services: essentially thinking of suites of services rather than the underlying implementation, servers, storage etc. You get a very different view of needs when you see it that way. When you look at data we may see repositories that range from generic data to repositories that deal with specific genres of data in particular areas. We already see some of the latter in subject repositories. In molecular biology for instance most of the repositories just take very very specific types of objects in specific formats that are disciplinary in nature. Stressing the emphasis on services here is quite helpful.

Repositories in the Ecosystem
Let me say a bit about repository services as part of a bigger and much more complex ecosystem. There was an International Repositories Workshop that JISC and others sponsored in March in Amsterdam that got into this a little. Repositories connect to things that provide support for e-science, workflow management, high performance storage and computing, etc. What we have to be thoughtful and careful about is the scope of what goes on in a repository. I've seen people say that we should put repositories in the workflow for e-science. I've seen people put them in the context of high volume, high reliability storage. I'm not sure we want our repositories to be that volatile. I'm sure that there is a space for lots of high quality storage facilities to support academia doing IT-enabled work. But is that the space for repositories? Finding the boundaries will be very delicate and very messy.

Also as we put more kinds of things in repositories we have to look at how much of a use or computational environment that repository becomes.

For example, what happens if you want to put video in a repository? If I have a big video file I can deposit it into a conventional repository and I can take it out of the repository and use whatever tools I have to view it. On the other hand I might deposit it into a smart video repository that will fix the format, calculate the bandwidth needed to view (and perhaps adjust the format of delivery according to users' bandwidth). That would be a complex repository tailored to the types of objects deposited.

But what about objects like videogames? Do we embed a videogame environment in the repository or do we leave it to the user? These issues have a lot of implications for the development and ongoing support costs of repositories. As we put data or e-prints into repositories we have to work out how much computational work you can do in the repository. Can you do complex or only basic computational work? Or do you take data out to do computational work on materials? These issues get at how you build interfaces, how semantic they are, how flexible they are.

We are currently very sloppy about replication and the semantics of replication. We have rule-based processes about submissions for instance and the idea of propagating to multiple repositories - this seems sensible and reasonable. We also know replicated geographically distributed storage is a good thing. We have a suspicion that not going it alone on storage is a good thing - it's better to share and collaborate in preservation and curation. How replication for reliability and propagation to multiple collections will work still needs a lot of work, similarly provenance and version tracking. This is an area that requires some proper consideration. There is also an issue of storage and the cloud.

Further there is the issue of software. People seem comfortable with the concept that software has a place of some sort in repositories, but if we think of repositories in the wider information ecosystems, people have been thinking about replication and preservation already. Software designers have their own repositories of codes and versions and it's unclear to me how our repositories relate to these.

The last ecosystem point I want to make - and it was discussed at some length in a round table earlier - is how we connect repositories and especially institutional, but also disciplinary, repositories into the new name environment that is emerging from publishers, Web of Science, journals etc. You have a number of identity management and federation changes that all result in IDs being created. All of these things are at play here. Essentially IRs set you up for the challenge of doing name authority for your local community, but in the context of things like identity management.

One of the fascinating byproduct things we found in a recent workshop was that institutions view your name as being your name according to humans and payroll services. That name may change occasionally. They have not gone as far as saying that names may not be in Roman characters anymore - that what appears in Roman characters could just be an extra name version or transliteration of their name (the original may be Chinese characters for instance). There is some capacity for name change aliases but it hasn't occurred to them that people publish under versions of their name that may not be what is on their paycheck. People are not always consistent and there is little provision for "literary names" in university identity management. We'll regret that soon if not already. We need to think big about names. I am becoming more and more convinced that personal names are a really important part of scholarly infrastructure and we'll see some fascinating convergence of genealogy databases and ID databases into review and scholarly literature. We have the strong component of faculty biographies that are linked to scholarly work already and all of that integrates into giant biographical and factual networks. I use factual to differentiate from mental state and influences. I mean the dull facts like jobs held, papers written, very factual sorts of data. I think there's a really big set of developments here that connect into repositories. It would be very helpful to sit back and think hard about this.

There's one other spanning service that we talked a little about in the grand challenge session. And that's the issue of annotation and annotation of information resources across data and repositories. Herbert Van de Sompel has got funding to work on this lately and Zotero is doing some interesting work here. There will be some interesting findings to tie into scholarly dialogue.

Repositories and the life-cycle of what goes into repositories
Repositories to date have been about getting stuff in, making it accessible. We are not far in enough for stewardship and preservation and management of what is in there to be our major concern but it is important that we don't forget about these. It will be important over time. So for example how much specific disciplinary knowledge do you need to classify a specific scholarly knowledge? The depositor knew and understood these things. If there is a serious issue the user can contact the depositor. Short term that's fine. Very long term the contributor may be dead. You may have to make assessments of whether it's worth keeping the data set in light of subsequent developments. You may have to look at formatting and presentation and linking of data depending on what else is published and occurring. The equation really changes when you look at the near term versus the long term.

I also think it's helpful to move away from the notion of preserving things forever. It seems more helpful to say "my organisation will take responsibility for preserving this for 20 years. After 15 years we will reassess if we keep it; whether the responsibility will transfer to another party; or whether perhaps we discard it at the end of the period". This is a much more structured kind of thing than just saying we'll keep it forever. These more realistic timescales are probably a very good way to structure arguments about preserving items in repositories right now.

Repositories in the bigger economic environment
I note a couple of things. Firstly I have a feeling that we have to get a lot more serious about the idea of "do what you can with bulk ingest now and deal with it later". There are a lot of materials coming at us now that we are not prepared for. We have to do our best to take them in and pass them on. In the current economic climate corporations of long historical status are evaporating and some sort of memory organisation needs to take on their corporate records and archives. At least in the States some government records - especially at local levels - are in peril and we have to think about sudden onslaughts of material not covered by "give us £50 million for this work" and that instead need instant very low cost action.

I'm struck by the question of the extent to which we are trying to market services to a user community rather than promoting collections created by the services. I think we present both sides of that in a confusing way. Are we doing something for our users as the service or is the value in the resultant collection? I think that ties closely to mandates for research data and open access - they favour the service side. But there is also the ever popular critique that "I looked in your repository and there's really nothing interesting there".

There is a thought I'd like to share that at the rate that the economic structure of scholarly communications is changing, and given the kinds of economic pressures at some institutions, repositories at the institutional or disciplinary level may take on far more importance far more quickly than we imagined: they may become the main access route to scholarly information. Given serials budgets and the pressures to channel money into data curation I can see a time when repositories are a primary - not a supplementary - access path to scholarly materials.

I will close with two fringey propositions.

The first I have been thinking about on and off for a couple of years. About how to make the case for repositories at an institutional level. It may be wrong to think of lots of transitional deposits, it may be that departments interact with repositories infrequently but on a big scale. Suppose we set up a programme for a distinguished academic so that we sent someone to collect scholarly materials and move copies of those into the repository in an orderly fashion creating a legacy of material from the subject honouring their work. I think that would be very attractive to scholars and could be seen as a real way to honour individuals and highlight institutional achievement. This may be a better way to gain support for a repository than chasing everyone for every paper.

Secondly IRs need to move from just being for academia and research centres. A public variation, perhaps local repositories or interest based repositories which may be separate to the academy but may connect in some ways, and we need to think about how we connect to those and build integrated access to information that has scholarly impact for the long term. As well as making scholarly materials more available to the wider public.

I throw these ideas out for the future.

It's very striking how much progress has been made with repository deployment. The software still has rough edges, deployment strategy has rough edges but I come away from discussions like this with the feeling that there is very substantial momentum. And we need to use that wisely, especially when we think about where to put investment effort into repositories. I think that trying to do everything will dissipate momentum but there is enough momentum that many things are achievable at this point.

And I think that's where I should finish.

Q & A

After an enthusiastic response from the audience Peter Burnhill reflects on Cliff's talk. This is the second Repository Fringe - they are supposed not to be too formal. Peter asks that when we all blog about this event we should think about what is or is not working at these events. The idea we have is that we try new things and see how things can change etc. So on that note, how do we start the questioning? There are so many threads here to explore - we'll be teasing some of these out at the pub later I think but let's get that started here...

Q (Sheila Cannell): On the economic point, at the moment repositories are very tied up with scholarly economics and the relationships with publishers. How could the economic crisis lead to a sudden change? I can't quite see how this would happen. My fear is that we continue with a number of models not knowing what we should be doing.

A (Cliff): I think it's unlikely to change for all disciplines at once or in the same way. I am struck by just how bad the economic situation is in the States. There have been huge cuts to the library budgets. I live in California and the library and University financial cuts are astounding at the moment. I can't imagine getting to a point in a few years where you might ask the high energy physics folks if they could pay for the physics archive or if they would rather keep the journal subscriptions. They'd obviously prefer both but pushed they'd probably go for the archive. Second tier journals will fall off subscription lists but from time to time they publish important articles and these become invisible if not accessible some other way. I'd like to think we'll see an explosion of overlay journals cherry picking from the content in IRs but I'm pretty sceptical about that at this point. My personal, and rather unpopular, view is that we will back off a lot on peer review in a lot of disciplines. I think the biomedical sciences will hang onto it but I think we have been profligate in using unnecessary manpower and increasingly the systems just don't work properly. We're not being strategic enough about where we do peer review and where we don't. My guess is that you'll see more direct distribution of scholarship through repositories or reconstituted university press organisations rolled into libraries than you've seen in the past.

Q (Ian Stuart): You had a comment on the peer review process. Is it the concept or the method that you think causes the problem?

A: There is a bunch of problems: the volume of literature; the person hours in the process; the tendency to send stuff out for peer review even when it is obviously trash or not good candidate work; there is a culture of materials being submitted, rejected and submitted for the next tier journals so the same piece of work is reviewed multiple times. Can we afford these processes? If people did their fair share of peer review it would be something like 8 weeks per year out of someone's productivity.

Q (Les Carr): I was initially horrified by the idea that you think of the repository as a place to store work at the end of careers - a mausoleum concept crept into my head - but you could equally talk about the other end of careers and the proof of your value when trying to get tenure. Perhaps using repositories as the way to show the value of contribution is what I will take away to think about.

A: I think your choice of the tenure review as a potential intervention point is a really interesting one. At a well run institution you marshal transactional work, you not only look at your publications but you also talk about what you see happening in the future and you could represent that in a repository in a very useful way. You could even move tenure cases into the repository too. It is a really interesting idea.

Comment (Peter Burnhill): Two things you spoke about, archival and appraisal responsibility, are very interesting. And also that it is not just our stuff - necessarily the repository movement has been concerned with academia, with our stuff, but we need to look out to the wider digital world and representation of that for future scholars.

Response (Cliff): I think there is an interesting example here. The Library of Congress put a bunch of pictures on Flickr. What they thought they were doing was seeing if tagging is a useful retrieval mechanism. What was interesting was that they invited people to comment on (as well as tag) these photos. These were images of the history of American life - locomotives, airplanes etc. It turns out that there are people around who will tell you the entire history of a locomotive, and the maintenance manual, and the book they wrote on it, etc. just from seeing an image. When you put up these pictures all that stuff comes into play as surrounding documentation. We don't have a good place to house that right now. This is stuff of schizophrenic scholarly use. Scholars aren't much interested until it is useful for their research somehow. I think there's some really interesting stuff to look at here.

And with that further food for thought Peter closed the questioning there by thanking Cliff for an excellent summary of his excellent keynote.

And finally Peter handed over to Ian Stuart for the results of the Grand Challenge which took place earlier today.
Ian Stuart (EDINA), Balviar Notay (JISC) and Ben O'Steen (Oxford University Library Services) were the judging panel. There were 4 very different submissions.

The question asked for enhancements for a repository for the interest of the researcher. After much discussion we really liked what Patrick McSweeny had done. Basically it was a mash-up of information from the e-print repository and his university website that managed to produce a single information page about a specific researcher which was automatically kept up to date.

Congratulations Patrick!

And with that the 2009 Repository Fringe draws to a close.

Repository Fringe 2009: Afternoon Tutorial (Open Data)

2009-07-31T13:29:00.004+01:00

Fresh from lunch, a brief chat with Clifford Lynch (he's been reading the blog en route to the event) and some overheard checking of cricket scores we're moving onto the afternoon Tutorial.

Jordan Hatcher (ipVP) & Jo Walsh (EDINA): Implementing Open Data

A chat on the work of the Open Knowledge Foundation, including CKAN and Knowledge Forge (led by Jo) and then an in-depth session on the legal side of open data, including the new legal tools available through OpenDataCommons.org, including a database specific copyleft license.

Jo opens with an overview of the Open Knowledge Foundation. Rufus Pollock started the Open Knowledge Foundation and it's entirely volunteer run. The idea is to look at what has been successful in free and open source software communities and take the most successful parts of that and apply it to open data and open knowledge. It's not a campaigning organization but very much a building foundation. It's all about how things can be reused. Various projects exist, some are about publishing data, some are standards type efforts, others are exemplars and pilots. There are four principles that characterise an open approach to knowledge management:

Incremental development - not one big master release.
Development is decentralized.
Development is highly collaborative.
Development is componentized - so data is in modules and packages that can easily be reused.

For example CKAN (Comprehensive Knowledge Archive Network) is very much about a component type model for maintaining knowledge. The volumes of data involved can be a problem though so we're now looking at something called the Open Grid to find ways to deal with major usage of shared open knowledge.

You hear a lot about "Open" but what does that actually mean? People call their data open or free but how do these principles apply to open data? Rather than starting from a licence one of the first moves was to build an open knowledge definition. So these specifically are less restrictive than some Creative Commons licenses.

Jo hands over to Jordan, who is going to talk about how these principles and licenses relate to other projects. Jordan is an IP lawyer as well as being a director of the Open Knowledge Foundation. A project that has been worked on with Edinburgh University is Open Data Commons. Jordan will talk about various aspects of this work.

The origins of Open Data Commons came from Talis. They are part of the semantic web community and open innovation. They knew that open data was important for access to knowledge community (e.g. Open Government). In the sciences data sharing is particularly important as the cost and effort of data is huge (for example space exploration is far too costly to replicate). Talis knew that IP relates to databases so the importance of data + the legal restrictions on data = a need for a legal solution.

So, copyright is really 3 things. It covers a lot of stuff (films, buildings, programs, etc.) although it doesn't cover facts BUT it is an ownership restriction, and licenses are how you unpackage what you own and what restrictions you put on sharing work with others. So a database is a blend of schema, tables and data entry/output sheets. And that sits in database software. Some material will be copyright of the software, some will be content agnostic. Copyright covers data, the database and the database software - each in different ways. Challengingly the database rights apply to both the database AND the database content.

So by Open Data Jordan is talking about applying open data to databases and data in databases.
Open means use, reuse, etc. Sharealike is ok (as per GPL or CC licenses) but when other open licenses were created there wasn't a lot of licensing work around open data. You can kind of think about it like a market adoption curve. Linux and the concept of open source software is mature. CC (Creative Commons) is in the early majority stage. Open data is at the early adopters stage.

The reason data gets its own category is that it is different from software and content both legally and in what you do with it. So, a word about Creative Commons licences:

Not all relevant rights are licensed (the database right in particular) in most CC licences.
CC licences are not consistent across database rights.
CC is not written for databases. For instance attribution can be different for databases.
ShareAlike is unclear when combining one work with another (collective work).
CC doesn't address software as a service.

Creative Commons don't really want you to use their licence for databases and have stated as much on the Science Commons website.

Open Data Content: Legal Tools
The Open Database Licence (ODbL) intends to solve shortcomings of Creative Commons. It was completed quite recently. The licence has gone through 4 comment rounds including various legal experts across the world.

You use the database to produce a subset of data - a produced work. The ODbL works on both the database and the produced work. So out of the galaxy of IP rights it picks recognised licences in the agreement but in an open way. ODbL is a worldwide license (following the example of the GPL). It is a human readable licence - not a huge block of legalese.

Attribution wise ODbL requires copyright notices to stay with the database. Produced work requires a brief notice of the source and that it is licensed under the ODbL - this is effectively consumer protection. You will not discover an interesting produced work only to find that your access to view or work with the original data is blocked (or vice versa).

Share-Alike allows derivative databases, and they are required to be under ODbL, but you can license a derivative differently to the original database.

Database Content Licence (DbCL) aims to licence the content of databases. It's at the content layer and covers individual copyright that might be present.

OpenStreetMap is considering moving from a CC licence to an OpenData type licence.

The Public Domain Dedication & Licence (PDDL) allows you to wholly hand over your copyright. Where this is not possible (some geographical territories) you instead license users to do what they like with your data. PDDL + Community Norms - these are complementary components to let you indicate expected behaviours for the database you have opened to the community. You can't sue someone for plagiarism. Your protection is the academic norms from social pressure and maybe an honour code. Jordan uses The Bluebook as an analogy. This is the standard system of citation - you can't get published in the US if your citation doesn't match up. Not a legal issue but a community norm issue.

Science Commons is based in Boston. They are the science end of Creative Commons. They were addressing science issues in about 2005 (but not very publicly). They came up with a protocol for implementing open access data with maximum flexibility without licencing hang ups. So this standard was developed online. The protocol is not a licence but sets a standard for licences for science data.

Comparing where we're at (this is a Venn diagram but I haven't grabbed a picture I'm afraid): OpenData Commons has ODbL and PDDL. PDDL overlaps with Science Commons. Creative Commons sits within Science Commons.

Closing Thoughts
There has been a suggestion of a rift between Creative Commons and the Open Knowledge Foundation. Both organisations agree that public domain databases are a good idea and that science in particular in the public domain is super. There are some licence details we don't agree on but broadly we're aiming at the same issue.

A little lesson to close on: if you look at the lunar landing video that is currently preserved you find that the quality is terrible despite the fact that high quality images were recorded at the time. NASA had budget issues regarding the cost of data storage and (accidentally) recorded over the original in the 1980s. My point is that you should not let IP get in the way of preservation - sharing can be much safer.

Q & A

Q (Peter Burnhill): Was "Copyleft" a joke when it was thought up, is the concept useful at this point?

A (Jordan): The GPL has some fairly formal structure. We have evolved to version 3. It has a formal structure that includes the legal requirement to share. There is a rich history of jokey attempts to subvert the notion of copyright but Copyleft is fairly well established and understood in the IP world now.

Q: What is the difference in those licenses between derived database and produced work?

A: Produced work might be the contents of the field and not the field names or structure. There is a little cross over as the definition of the database in the licence is fairly broad. So there is some blurriness about whether it's a produced work or a derived work. This has come up in the OpenStreetMap context. This can be quite a technical issue to look at. The licence acts as a constitution - a guiding document on how you should apply things. If a community isn't served by the licence they can also come up with their own guidelines that sit with it.

Q (Robin Rice): Could you talk a little more about the norms and particularly how that could work in a repository environment? When I think of my typical users - researchers - they may be happy with the public domain licence if they don't think about it too much. But if they felt someone could make commercial profit out of that data they would be upset and maybe Copyleft would be better.

A (Jordan): So GPL is not automatically not commercial. A non commercial clause can cause confusion about what counts as commercial. Who thinks that anything done by a charity is non commercial? It's not clear cut. Some view non profit or charity usage as non commercial BUT the law generally sees the type of use not the type of user as key. People have different views. People generally licence under CC licences and it is up to them (not CC) to enforce the licence they choose. Educational use is an interesting issue too. Is a university non commercial? Are CPD courses non commercial?

Comment (Peter Burnhill): In some ways the Tax Man decides what is and is not commercial. So the university is a charity, but their catering arm is a commercial company...

Response (Jordan): Really it is the person who shares their work that decides what is commercial and non commercial. This is a contract and a legal one so you need to know what all parties think the definition of the terms is when they make their agreement. But we (the Open Knowledge Foundation) are not going to do a non commercial licence because that's not our wish.

Follow Up (Robin Rice): So what does sharealike mean for content in these contexts?

A (Jordan): So I share a photo with a sharealike attribution open knowledge type licence. A blend of that photo with another would be a derivative work. Think of it in layers. Layer one is my copyright, layer two is your copyright. Sharealike makes you use the same sort of licence for what you do with my image. It creates a sense of commons. You have to give back to the community if you use the data. Now with the non commercial CC licence no-one can use an item derived from data for profit.

Q (Robin Rice): How do you communicate norms to your users?

A (Jordan): There is a document which provides an example of how you might want to lay out your expectations of how people behave with your database. You can be at the high end (like making a website work in such a way that users have to read terms and conditions) or you can be more casual about what you want. There are ways to make sure you are cited properly for instance - you can have rules or community pressure for that. If people publish on the web it is (semi) possible to detect a use that isn't fairly citing or attributing the originator. BT had a project with open source code in it that they didn't disclose, and this was revealed through naming and shaming by the community online. Social pressure is very useful even without any legal restrictions.

Q (Peter Burnhill): If you do want to put restrictions on use you are into Copyright. If you go towards Copyleft and Open - it's not that commercial use is improper but that it is not attributed. Attribution is a really big issue if you go properly open. Modifying and changing work is scary. Hopefully there is an indication of the original version - data acknowledgement forms really.

A (Jordan): I have had Flickr photos with CC licences used in commercial contexts. I'm not that bothered as it's not my primary profession. Now academics do have a gut reaction to people making money from their work but there is an advantage to stuff being open. It's really useful to stop data being stored in just one place. Why wait around for the government to release a reformatted website of data? Just put it out there so that it can be used. There are distinct benefits for use and reuse. There are economic benefits to making material available in this way: 1 + 1 = 5 in this context sometimes.

Comment (Jo): One of the issues for academics is impact, profile and reuse. Turning scholarly citation into data sharing and collaboration. Real benefits there.

Comment (Peter Burnhill): Statements that travel (or should travel) with data. One is a data disclaimer - public data in the US often includes a document about how to cite this data and also information that the responsibilities of the data producer stop at the original data set. All you do is support what you put out there.

Comment (Jordan): The UK government does this too. It states that data is in the public domain but you can't misattribute this data to be official government data if it has been changed or analysed. Basically they disconnect their authority when you work with the data. Effectively it's a licence. Different licences can be a nightmare when you get past one dataset. Licences may conflict and you might have 100 datasets to deal with. With data in the public domain you know what you can do and how you can treat the data. The problem in the academic domain is that people can end up licensing data they don't have the rights for.

Repository Fringe 2009: Tutorials

2009-07-31T10:04:00.004+01:00

This morning the Repository Fringe breaks into 3 groups: the DCC Network is holding an event, there are Round Table sessions and we'll be blogging from the Tutorials. If you want to see the comments from other sections please take a look at the CoverItLive stream on the Repository Fringe website.

Simon Bains is our chair today and he's introducing our tutorial from Talat and Stephanie from UKOLN.

Talat Chaudri (UKOLN) - Building application profiles in practice: agile development and usability testing (2 hours)

Talat is talking about a specific set of application profiles funded by the JISC, particularly Dublin Core profiles. These were funded but have not been taken up as expected so Talat's post has been created to find out why these profiles are not in more use. For instance:

Practical usability testing (JISC DCAPS: SWAP, GAP, IAP, TBMAP, LMAP, SDAP) - some of these are broad (and possibly unwieldy) and some are extremely specialized profiles.
There is a sense that JISC built these profiles and they should be in use. Talat feels what we don't want to do next here is to build on untested, preconceived ideas of user requirements - we should consult users instead!
Substantiate need for complex metadata structures/terms: convince developers.
All outcomes possible (minor alterations, radical reformulation, status quo). We may learn that what was done isn't perfect. They may be great and have certainly been created by specialists but it may be that changes are required to make the profiles more useful. The profiles may be good but we may need tools for implementation.

Method
We know that different institutions work differently and may implement these things differently so we need to:

Start where we can do most (and feedback is most immediately available). We are starting with SWAP as we have the most experience there but the other profiles are also based on SWAP so it's a good starting point. It's not that it's the most important profile.
Iterative development of methodology (different per resource/DCAP?). We may not get things right the first time. We'll do this like software design: deliver, test, take feedback, make tweaks etc. I don't think it's an approach that's really been taken with metadata before.
Work with what we have (DCAPs as they currently stand, not further hypothesis).
Test interfaces and paper prototyping. We'll be doing some of this today with the Card Sorting method.
Include everybody.

Outcomes

Clearer idea of what APs do for real users (repositories, VREs, VLEs etc). The idea is to take bits of metadata schema from different places and tailor them for a specific purpose.
Methodology for building APs that work (how to make them better in future). We build an application from scratch - what's the best way to do that again? And what is the process for assessing and improving an AP? We need to capture this too.
Toolkit approach to facilitating APs for which there is a demonstrated use. We want to test different parts separately and to see what people really want to do and how they really want to use the APs.
Show the community how to do it! This is part of our role. We can't just throw APs out there, we have to show people how they use these APs and relate it to their work.

What are we Testing and Why?

Metadata elements that are useful (what you need to say and what for). This is probably the most important issue.
Relationships between digital objects (what you need to find next). You want useful connections but you don't need absolutely every single connection. What do users need. This is a jump off point from Google really. Making a key connection to related items or citations etc. might be most useful. There are other connections you could make that may have no value for users.
Structure (what related things have in common e.g. article & teaching slides). What is the same? Effectively this is about entities. They are about sections of metadata. If things have a relationship what is it?
Don't build what you don't need. If users don't want or need it you shouldn't develop it.

Example: "Event Metadata"
What do you want and need from this data?
Event Type: Workshop

Workshop Location
Workshop Type
Workshop Date
Opinion of Workshop {boring; interesting} etc.
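
The "Event Metadata" example above could be written down as a small typed record. This is only a sketch of one possible reading: the class and field names are invented for illustration, and the sample values are placeholders rather than anything from the tutorial.

```python
from dataclasses import dataclass, asdict
from datetime import date
from enum import Enum


class Opinion(Enum):
    # Controlled vocabulary from the example: {boring; interesting}
    BORING = "boring"
    INTERESTING = "interesting"


@dataclass
class WorkshopRecord:
    """One 'Event' record whose Event Type is 'Workshop' (hypothetical model)."""
    location: str
    workshop_type: str
    workshop_date: date
    opinion: Opinion
    event_type: str = "Workshop"


# Placeholder values, purely for illustration.
record = WorkshopRecord(
    location="Edinburgh",
    workshop_type="Tutorial",
    workshop_date=date(2009, 7, 31),
    opinion=Opinion.INTERESTING,
)
print(asdict(record))
```

Restricting the opinion to an enumerated set is the design choice worth noticing: it is the difference between a free-text field and a controlled term.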

So... we have a practical task now.

Task: Free-listing, card sorting and user testing a data model
We split into groups and, as this was hands on, I'll just tell you the highlights of the task:

Each group selected a type of thing to build a data model for. Our group picked "Room Type" - basically a room as a resource that requires a data model.
Then we individually listed all the properties of that resource type for a few minutes.
Then we pooled all our ideas and wrote the agreed terms on post it notes - one for each quality.
Then we sorted those post its into groups/clusters. In our case this was tricky. Where did location fit? Where did price fit? What were our top level categories? We were thinking of the model as relating to a faceted browse of a website...
Then we created usage scenarios for our data model. This immediately raised a few qualities we hadn't thought of in our model.

Next the groups each switched a member so that an outsider to the data model could run through one of the scenarios and spot the problems. We immediately found that our outsider was looking at the discovery of a room for booking from completely different priorities/qualities. This meant our top level qualities probably needed a rethink. But of course this was only one user. We'd need to test the same model with lots of users to know what works and what doesn't. Additionally we were thinking of our model as relating to websites but was the model the website structure or the data structure? We weren't sure...
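
The scenario step can also be mimicked in a few lines of code. The sketch below assumes a made-up card sort result for a "Room" resource (the cluster and property names are hypothetical, not the ones our group actually wrote down) and checks which properties a usage scenario needs that the model does not yet provide.

```python
# Hypothetical card-sort result: top level cluster -> properties on the post-its.
room_model = {
    "Location": {"building", "floor"},
    "Capacity": {"seats"},
    "Facilities": {"projector", "wifi"},
}


def missing_properties(model, scenario_needs):
    """Return the properties a usage scenario needs that no cluster provides."""
    known = set().union(*model.values())
    return sorted(set(scenario_needs) - known)


# A scenario written by an "outsider": find a room for 20 people, with a
# projector, within a budget. "price" was never put on a card.
scenario = ["seats", "projector", "price"]
gaps = missing_properties(room_model, scenario)
print(gaps)  # ['price']
```

Running many scenarios from many users through the same check is the code equivalent of "test early, test often" with cards.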

Final Comments from Talat

User level view of things should be simplified.
Developer will see structural complexity. You need to know why it's useful to use a data structure.
Good documentation makes happy developers. And happy users. It's important to be clear about what you can use this model for. Showing and telling the models really helps.
Test early, test often. Card is cheaper than code! People need to know what it's for! Things can't be perfect for everyone but you cannot foresee things in theory. And metadata needs to be a flexible living structure.

This was a really interesting hands on session and raised lots of contentious metadata issues. It also highlighted how useful Card Sorting could be for website and interface design - simple though a bunch of post its are, they are super fast for paper prototyping and seeing where different elements should and should not sit on a page.

After this session we have a little coffee break (and such edible perks as micro eclairs!) and then back for the next tutorial:

Richard Jones, Symplectic: “AtomPub, SWORD and how they fit the repository: going beyond deposit”

This will be about how to create tools that work over many repositories. The relationships between dynamic and static systems. A bit about SWORD, AtomPub and how they are used. Richard will also be talking about Repository Object Models and their equivalences across platforms as well as Repository workflows and their equivalences across platforms.

How do we provide deposit tools that are usable and simple for depositors? And, from the institutional perspective, what is the bigger picture and how is data deposited and reused?

In this tutorial we'll look at how the static meets and clashes with the dynamic. Scholarly publications are not completely static but repositories expect static content. There is a notion of the golden record of a piece of research. There is no sense of version management other than through metadata notes (nothing like Subversion which is often used for version control in developments).

Richard is showing us an architectural overview of the solution provided by Symplectic. It is a matter of combining lots of flexible repository tools and interfaces, provided by Symplectic, that sit on top of the repository that the university runs.

Now we're having a nice demo of Richard depositing materials as an academic (on the left hand screen) and viewing them in DSpace as a repository manager (on the right hand screen). Interestingly, Richard's demo uses a single user account for an institution rather than each academic needing an account for the repository, an approach that is increasingly common. Each academic has a UUN for the university system so the Symplectic part of the system just slots into their whole academic login/homepage. There is a real time relationship between the DSpace repository and the Symplectic tool in the academic's homepage/webpage management space.

What is SWORD?
The key thing Richard would point out is that SWORD has an error handling specification. It's really for create only and package deposit. There is a requirement to retrieve a URI and that can be something that allows you to access things more flexibly (through AtomPub say).

"AtomPub is an application-level protocol for publishing and editing web resources"

It is a RESTful protocol - no need for a grand web service framework, just HTTP requests.
The spec for AtomPub allows more flexible create, replace, modify and delete.
All communications are in Atom format - designed for web type content not binary type content.
HTTP POST in AtomPub is designed to allow an Atom feed document to be published at a specific end point.
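
To make the points above concrete, here is a minimal sketch of how the AtomPub operations map onto plain HTTP methods, plus a helper that builds a bare Atom entry with the standard library. The function name, the operation labels and the sample values in the usage note are illustrative assumptions; only the Atom namespace and the HTTP verbs come from the Atom and AtomPub specifications.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# AtomPub needs nothing beyond ordinary HTTP methods.
ATOMPUB_METHODS = {
    "create": "POST",     # POST an entry to a collection URI
    "retrieve": "GET",    # GET a collection or a single entry
    "replace": "PUT",     # PUT a new representation to the entry's edit URI
    "delete": "DELETE",   # DELETE the entry's edit URI
}


def build_entry(title, author, entry_id, updated):
    """Serialise a minimal Atom entry (title, id, updated, author) to a string."""
    ET.register_namespace("", ATOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    return ET.tostring(entry, encoding="unicode")
```

Calling build_entry("An example paper", "A. Researcher", "urn:example:1", "2009-07-31T10:00:00Z") returns the XML string that would form the body of a create request.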

An Atom item for a deposit is administrative data and item level information alongside a reference to file content.

So Publications does a POST, with SWORD headers, of an Atom feed and file content to the repository. The repository then returns an Atom feed and Publications provides a feed.
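
A rough sketch of what the deposit request might look like from the client side. It only builds the request object and sends nothing over the network. The header names are the ones usually associated with the SWORD 1.x profile (Content-Disposition, Content-MD5, X-On-Behalf-Of) and should be treated as assumptions to check against the specification; the function name, URL and file name are hypothetical.

```python
import hashlib
import urllib.request


def build_deposit_request(collection_uri, package_bytes, filename, on_behalf_of=None):
    """Build (but do not send) an HTTP POST that deposits a package."""
    headers = {
        "Content-Type": "application/zip",
        # Tells the repository what to call the deposited file.
        "Content-Disposition": f"filename={filename}",
        # Checksum so the server can detect a corrupted upload (assumed hex form).
        "Content-MD5": hashlib.md5(package_bytes).hexdigest(),
    }
    if on_behalf_of:
        # Mediated deposit: one system account depositing for an academic.
        headers["X-On-Behalf-Of"] = on_behalf_of
    return urllib.request.Request(
        collection_uri, data=package_bytes, headers=headers, method="POST"
    )


request = build_deposit_request(
    "http://repository.example.org/sword/deposit/collection-1",  # placeholder URL
    b"placeholder package bytes",
    "deposit.zip",
    on_behalf_of="academic1",
)
print(request.get_method(), request.full_url)
```

The on_behalf_of parameter mirrors the single institutional account seen in the demo: the tool logs in once and names the academic in a header.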

It's interesting to get into how repositories describe their deposits. Items (DSpace) are eprints (EPrints) are objects (Fedora) are a feed in Atom. All the API calls need to respect the different naming used here. There is no real equivalent of bundle or document for Fedora or Atom but they can be added to bitstream/entries. This may be a ubiquitous structure.

Repositories aren't just structures but also workflows (Richard is referring back to his "Machu Picchu" session yesterday). Terms for each part of the process are named differently. The workflow is flexible until the archive stage when you are expected to reach a static stage.
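
The naming equivalences can be captured in a small lookup table. The record and container rows follow the talk as noted above; the EPrints "file" and Fedora "datastream" entries in the last row are assumptions added to complete the picture, and the row labels and function name are invented for this sketch.

```python
# Rough equivalences between platforms; None marks "no real equivalent".
TERMS = {
    "record": {"DSpace": "item", "EPrints": "eprint", "Fedora": "object", "Atom": "feed"},
    "container": {"DSpace": "bundle", "EPrints": "document", "Fedora": None, "Atom": None},
    # EPrints "file" and Fedora "datastream" are assumptions, not from the talk.
    "file": {"DSpace": "bitstream", "EPrints": "file", "Fedora": "datastream", "Atom": "entry"},
}


def translate(term, source, target):
    """Translate a platform specific term, e.g. a DSpace 'item', to another platform."""
    for row in TERMS.values():
        if row.get(source) == term:
            return row.get(target)
    raise KeyError(f"{term!r} is not a known {source} term")


print(translate("item", "DSpace", "EPrints"))   # eprint
print(translate("bundle", "DSpace", "Fedora"))  # None
```

A tool that talks to several repositories would consult a table like this before every API call, which is what "respecting the different naming" amounts to in practice.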

So if you want a dynamic system sitting on a static system how do you get the feedback you need? There are different stages at which you can ask about changes - at the archive stage there is no ability to modify. You have to create a clone copy to work on. Richard will now demo this. For instance, a preprint was deposited and you want to update it with the print or post print version. You can revoke the right to publish on something already in the repository and that kicks it back to the workflow, which means you can then replace the copy, change it, etc. Academics can add and remove files in Publications as they like. New items are created in the repository. In the archive there is a versioning system that allows references forward and backward in the repository. So you replace and update clones; items never leave the repository. There are some wrinkles - especially to do with level of change. You can have minor changes that do not trigger a review process of the deposited item, or major changes that trigger a re-review of the item. This is a matter of configuration. However the gist is that you create linked duplicates - marked up in the metadata. You can choose to delete items or directly edit metadata but the automatic process creates multiple clones. EPrints has a version control structure already and ideally that is what you would want in repositories in general.
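
The clone-and-link idea can be sketched in a few lines. This is a toy in-memory model, not how DSpace or the Symplectic tool actually stores things: the class names, field names and file names are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ArchivedItem:
    item_id: int
    files: List[str]
    replaces: Optional[int] = None      # backward reference to the older version
    replaced_by: Optional[int] = None   # forward reference to the newer version


class Archive:
    """Archived items are never modified; updates create linked clones."""

    def __init__(self):
        self.items: Dict[int, ArchivedItem] = {}
        self._next_id = 1

    def deposit(self, files):
        item = ArchivedItem(self._next_id, list(files))
        self.items[item.item_id] = item
        self._next_id += 1
        return item

    def replace(self, item_id, new_files):
        """Clone the item with new files and link old and new in both directions."""
        old = self.items[item_id]
        clone = self.deposit(new_files)
        clone.replaces = old.item_id
        old.replaced_by = clone.item_id
        return clone


archive = Archive()
preprint = archive.deposit(["preprint.pdf"])
postprint = archive.replace(preprint.item_id, ["postprint.pdf"])
print(len(archive.items))  # 2: the old item never leaves the repository
```

The only thing that changes on the old item is its link metadata, which matches the "linked duplicates marked up in the metadata" description.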

Additional note: the academic workspace includes a publishers policy area but as many organizations want no barriers to deposit this is often ticked by default or set to be a question mark to indicate that items can be deposited OR that it is not known whether or not items can be deposited.

Deduplication: we're allowing an item to have multiple parents so that there is an automatic process for many depositors putting the same item into the same repository.

Fire and forget deposit is not sufficient to interact with the full ingest life-cycle of an average repository system. SWORD and AtomPub offer most of what you need for a deposit life-cycle environment. Atom can be used effectively to describe content going in and out of repositories. Repositories have analogous content models and ingest processes. De-duplication of items is very much non-trivial.
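
A toy illustration of the "multiple parents" idea. It assumes, very naively, that two deposits are the same item when their titles match after normalising case and whitespace; real matching is much harder than this. All names here are hypothetical.

```python
class Repository:
    """Stores one copy of each item and records every depositor as a parent."""

    def __init__(self):
        self.items = {}  # match key -> {"title": str, "parents": set of depositors}

    @staticmethod
    def _match_key(title):
        # Naive assumption: same normalised title means same item.
        return " ".join(title.lower().split())

    def deposit(self, title, depositor):
        key = self._match_key(title)
        item = self.items.setdefault(key, {"title": title, "parents": set()})
        item["parents"].add(depositor)
        return item


repo = Repository()
repo.deposit("Tackling Research Data", "author_a")
item = repo.deposit("tackling  research data", "author_b")
print(len(repo.items), sorted(item["parents"]))
```

Two co-authors depositing the same paper end up attached to a single stored item rather than creating two records.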

Q & A

Q: How does this work for depositing for multiple repositories?

A: It does not support multiple simultaneous deposits yet. You can pick between different ones to deposit to though. We want everything else to work first. We hope to have it working for all three systems mentioned (DSpace, Fedora, EPrints) soon. After that we want to look to deposit in multiple places. We need to crack policy and process issues there though as different repositories have different agreements etc.

Q: The license in Publications, where is it pulled from? Are there different licenses for different collections?

A: It's not pulled exactly yet. We're moving to a choice of licenses or combination of licenses. Different licenses for different collections isn't possible yet. We provide a collection mapper to put content into a given collection at the moment. There is also a more complex tool to map collections to who may wish to deposit in them. It is very complex to set up and local specific so it's not part of the standard Symplectic offering.

Richard concluded by welcoming comments and questions over lunch. Off to grab some food for me too...

Repository Fringe 2009: Show and Tell

2009-07-30T13:59:00.005+01:00

After a very nice lunch the Repository Fringe split into two groups. The DataShare meeting took place in another room in the Informatics Forum (and I hope that a wee post summarizing that session will be possible) whilst I stuck with Ballroom A for the Show and Tell sessions.

Morag Watson, Digital Library Development Manager, Edinburgh University Library

Morag will be talking about Open Journal Systems (OJS).

We bid for the JISC Repository Enhancement strand as Corpus:

We partnered with St Andrews.
The project was to do a review of software.
User surveys were also planned to be part of the work.
The project would look at OJS and commercial software like ScholarOne and Bepress.

We wanted to look at the role of libraries as a publisher. Edinburgh University Library doesn't publish journals any more. JISC didn't fund the bid so this project has gone ahead under our own resources. We've only really looked at OJS. It is an open source system:

OJS is a Public Knowledge Project product.
It is widely used.
c. 2000 journals already registered but many have not published many issues.
Open access publishing platform.
Locally installed and locally controlled - this is a big benefit for us in terms of what you can do with the journal and how you control it.

The Journal of Legal Access is the first journal in 30 years from Harvard University (as publisher) and it is built on OJS.

Software is licenced under the GNU General Public Licence.

There is lots of support and documentation
Very active support forums
Developer forum
Video tutorials
OJS in an hour (although it's 178 pages!)
Active and involved worldwide community.

So what does OJS do?

A journal management and publishing system - from author submission to editorial control and formatting.
Policies, processes etc. can all be set by the journal editors.
Full indexing for searching inside the system and also indexed by Google.
Can add keywords and geotags.
Very low cost to access.
Lots of journals in Africa use OJS - it is a helpfully low bandwidth system.
You can have reader registration. And you set up journal subscription in the system.

You click through the screen and when you're finished you've got a journal. Very easy and very menu driven. And it's easy to drag and drop stuff into the system. You do not need to be technical to use it.

It has a very flexible design (via CSS) for single and multiple journals. You can have one design for all journals or different designs for each journal you publish. You can add audio and video etc. One of the journals we're working with wants to add these types of elements in fact.

We've chosen a flexible design approach but some publishers on OJS reuse templates. Common templates can be reused to save effort. But there is an issue with users and academics wanting/not wanting a common design. If you do very customised designs you may need multiple instances of OJS.

Edinburgh is running a one year pilot implementation (from March 2009). Currently 2 journals: Critical African Studies (launched in May 2009) and Concept (launching October 2009). Critical African Studies was previously in print but there had been no issues for a while so the online version revives it.

We have branded OJS as part of the Edinburgh University Library website. Editors pick logos, abstracts etc. that then show up on the page.

The journal itself fits with what had already been publicised. It is mostly formatted through templates and CSS. We support the tools and training and set up but not the day to day running of the service.

Morag demos the system as a user who has a Journal Manager role. There are lots of roles that can be set up but you can turn off a number of these roles to make things simpler. You can have it set up for one person across the whole journal. The system emails users whenever items for review come in. All items are held in the system so if something changes or someone leaves all that data is safely stored. You can also set up future issues in advance quite easily. A lot of nice management features.

Pretty straightforward to use. There is also the possibility of RSS as part of the system. Also you don't have to stick to strict publishing schedules. RSS means people don't miss the next article even if publication schedules are changeable.

Implementation Issues

Software setup - we are trying out one installation for all the journals.
Workflows for journals are understood by academics but managing them from an electronic system is different and we now have a nice template for this.
Interface configuration.
Who owns the data? The journal? The academic? The library? The institution? It's a tricky balancing issue.

What next?

More journals. We haven't been publicising this but lots of enquiries have been coming in at this stage all the same.
Also looking at Institutional Journals (e.g. Current research in Edinburgh via repository),
Integration with the Institutional Repository - some work in Australia looks at this and we'll be taking a look at this shortly.
Planning/costing for live services.
SDLC - other organisations in Scotland may want to offer this...

Q & A

Q (Richard Jones): On repository integration: can you only put the content in the repository or do you have to duplicate the data in both systems?

A (Morag): We need a way to publish data from those outside our institution as well and that won't go into our institutional repositories so right now it is a matter of us wanting to back up what we want.

Q (Jo Walsh): You mentioned costings for a live service - does OJS take payments?

A (Morag): OJS hasn't charged us for anything yet. You can subscribe but not pay for journals. And no one we've been working with wants payments. As far as I know OJS doesn't mandate that and we're not interested in doing that.

Q: What formats can you publish in? HTML only? Or PDFs, docs etc.?

A: HTML and PDF are supported. You can submit in any format but have to publish as either PDF or webpage.

Q (Les): Is this for lo-fi basic things or could there be some Nature competitors using OJS?

A: Here we are interested in the lo-fi stuff right now. But there are some bigger journals in the US that probably use a lot more functionality. Most of our academics want to switch off a lot of the additional features. We've only worked with 3 departments so far.

Les asks the crowd a question: How many folk have their own publications in the university? Only two people: OUP (which wouldn't use OJS) and UKOLN (who publish an international journal with OJS).

Morag: How much of the review process you want is important to how you set this stuff up.

Les: We find dissemination is the thing many of our academics want but I think we should look at this stuff as a complementary service.

Hugh Glaser and Ian Millard: RKB, sameAs and dotAC

Hugh starts by telling us: "This is all very exciting to me. It's a community I don't usually engage with and it's all very exciting - there's so much data - I want to eat it all!"

Hugh's work wasn't originally funded by JISC but this work is now being supported by the Rapid Innovation fund.

Linked Data - Tim Berners-Lee says it's the Semantic Web done right and the Web done right. This is linking open data and the idea is that you can name things. Once that's done you can use and work with them regardless of what those things are.

Use URIs as names for things
HTTP URIs so that people can look up those names
When you look up the URI, provide useful information
Include links to other URIs for further discovery.
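
The four principles above can be illustrated with a handful of triples held in memory. The example.org URIs and the literal values are placeholders; the two property URIs are the real FOAF and Dublin Core Terms identifiers. The function names are invented for this sketch.

```python
# Each triple is (subject, predicate, object). Names are HTTP URIs.
TRIPLES = [
    ("http://example.org/id/person/1", "http://xmlns.com/foaf/0.1/name", "A. Researcher"),
    ("http://example.org/id/paper/42", "http://purl.org/dc/terms/title", "An example paper"),
    ("http://example.org/id/paper/42", "http://purl.org/dc/terms/creator", "http://example.org/id/person/1"),
]


def describe(uri):
    """'Looking up' a URI: return the useful information held about it."""
    return [(p, o) for s, p, o in TRIPLES if s == uri]


def links(uri):
    """Objects that are themselves URIs, i.e. where to go next for discovery."""
    return [o for _, o in describe(uri) if o.startswith("http://")]


print(describe("http://example.org/id/paper/42"))
print(links("http://example.org/id/paper/42"))
```

Following links() from the paper leads to the person, whose own description can then be looked up in the same way.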

So RDF:

Hugh shows us LOD datasets on the Web mapped in March 2009 - this is a great diagram to illustrate 4.5 billion triples and 180 million data links and it includes DBPedia (the linked data version of Wikipedia) and a big cluster of material on publications from work that Southampton has been grabbing from metadata for various projects etc.

Hugh is showing an amazingly complex data architecture image of Southampton's system which links faceted browsing of multiple knowledge bases, ontology mapping etc. Sometimes the system goes out to find connecting data.

Hugh is also showing a series of source links and knowledge sources - it's a wide range and he warns us: "don't just look for what you expect to find". To follow up we have a hugely complex web of materials.

The interface to the system is available to view on the web at RKBexplorer.com - connections appear in this system from the semantics of the linked data. So you can track the connections. The interface uses loads of background services to deliver the service. Some people publish usable stuff. Others might want to use our RESTful outputs. You can use these URIs as part of applications or webpages etc. to pull together lots of information.

A postgraduate in chemistry who had done Google Gadgets before made a few gadgets with the RKB service. We are also thinking about doing something similar with the iPhone so you can, for instance, search around and get a sense of who you meet when you are at conferences.

The system ties different URIs together. The geonames issue mentioned this morning ties in here.
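
One way to picture "tying URIs together" is as equivalence classes: each assertion that two URIs name the same thing merges their groups. The sketch below uses a small union-find structure; this is an assumed illustration and not a description of how the sameAs service is actually built. The class name, method names and URIs are placeholders.

```python
class CoReference:
    """Groups URIs that have been asserted to refer to the same thing."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        # Register unseen URIs as their own group, then walk up to the root.
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            self._parent[uri] = self._parent[self._parent[uri]]  # shorten the path
            uri = self._parent[uri]
        return uri

    def same_as(self, a, b):
        """Record that URIs a and b name the same thing."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def equivalents(self, uri):
        """All known URIs in the same group as uri, sorted."""
        root = self._find(uri)
        return sorted(u for u in list(self._parent) if self._find(u) == root)


store = CoReference()
store.same_as("http://eprints.example.org/person/7", "http://dblp.example.org/author/x")
store.same_as("http://dblp.example.org/author/x", "http://other.example.org/people/x")
print(store.equivalents("http://other.example.org/people/x"))
```

Because the groups are kept separately from the data, each publisher can keep its own URIs, and a consumer can decide which equivalence assertions to trust.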

Concluding remarks

We gather data from all the e-prints today and from other systems tomorrow
You should worry about your identifiers - can you reliably match your publications to a consistent author id?

dotAC.info takes the previous work forward with a very UK focus. We'll do the geographic side of it, better co-reference, connect to live project databases etc. It will change the face of research in the UK and how it's developing.

Q & A
Q: Is there a REST interface /API and what's the URL for it?

A: The demos page is the best place to go but I'm currently working on better documentation and will probably put that on SourceForge. It's mostly quite simple but then you might want to get it back in different formats. We want to link in the linked data world but we want to be able to pull stuff out of that world too.

Q (Robin Taylor, UoE): What do you mean by ontologies?

A: I mean structured descriptive data standards like Dublin Core. We can cope with multiple ontologies but using a standard one can be a huge benefit to getting information out of systems and making the best use of linked data. It's nice if it's more powerful than Dublin Core but standard ontologies are the key thing.

Hugh: We are very very keen to get input on the dotAC project. Please drop us an email.
Q (Les): What benefit can we get back from your project?

A: One benefit is quite direct: we will be able to link data in your repository to project data and other data on the web around the world. An interesting thing is: what's the vision of the interesting stories or uses of this data you could make? It's not about the journal articles - there are lots of other possibilities and emergent properties. This is a new world and you need ideas and visualizations to see where it's going next. This is why we're pushing the service side of things.

Q: Where you have countries harvesting repositories into one place: if there isn't a consistent naming strategy you have a problem of duplication. Your service could help with this I think.

A: Yup we can link things and aggregate them. We aren't talking about this as a naming authority though. You can have your URIs and manage those. Other people do their own thing. The service then decides whether to believe your relationship to those URIs or someone else's. This adds some reliability to search.
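
As a rough illustration of how a service might bundle URIs that are claimed to refer to the same thing, without acting as a naming authority, here is a minimal Python sketch. The `CoreferenceStore` class, the example.org-style URIs and the union-find approach are all assumptions made for illustration; the talk did not describe how the Southampton service is actually implemented.

```python
class CoreferenceStore:
    """Groups URIs asserted to identify the same thing (hypothetical helper)."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        # Follow parent links to the representative URI of the bundle.
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            self._parent[uri] = self._parent[self._parent[uri]]  # path halving
            uri = self._parent[uri]
        return uri

    def assert_same(self, uri_a, uri_b):
        # Record a claim that two URIs identify the same entity.
        root_a, root_b = self._find(uri_a), self._find(uri_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, uri_a, uri_b):
        # True when both URIs ended up in the same bundle.
        return self._find(uri_a) == self._find(uri_b)


# Made-up URIs: each repository keeps and manages its own identifiers.
store = CoreferenceStore()
store.assert_same("http://repo-a.example/person/42", "http://repo-b.example/author/jsmith")
store.assert_same("http://repo-b.example/author/jsmith", "http://repo-c.example/id/9")

print(store.same("http://repo-a.example/person/42", "http://repo-c.example/id/9"))
print(store.same("http://repo-a.example/person/42", "http://repo-c.example/id/10"))
```

Each repository keeps its own URIs; only the bundle of equivalence claims lives in the service, which is where a real system could also decide whose claims to trust.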

Gavin Henrick, Enovation: Demonstration of some UI customization for a Repository, including implementation of AJAX keyword taxonomy.

Enovation is a company that supports multiple types of repositories for various organizations so we do things like developing the UI.

So for Trinity College Dublin for instance Enovation took DSpace and added:

Two-way integration to student research system.
Custom e-theses workflow from submission to printing.
Hiding empty collections.
Enhanced advanced search.
Extended integration of the TAPIR code to cater for TCD licensing and restrictions.

SDCC (South Dublin County Council) Learning Circle:

Site was a CMS with document management plug-in.
Wanted a more strictly controlled repository with set metadata requirements and simple processes.
Specific design and user interface changes.
Streamlined submission process.
Much improved browsing experience.

Some of the solutions delivered include:

Make adding items (deposit) simpler with a tree based category selection, a simpler submit page encompassing all aspects on one page.
Standardize the author names - implement a find people interface connected to central user system.
Standardize keywords - implement a taxonomy driven lookup with pre-checked keywords.
Make adding news/event information easier - embedding an HTML editor for news areas and we added news and events blogs to the site.

So we're now seeing the simple deposit page. Gavin indicates that they added an AJAX lookup for keywords and this is something others are asking us for.

There is a slightly different dynamic to this use of a repository but DSpace is being used effectively for what they want. It's not a full CMS but the WYSIWYG news editor makes a big difference to usability and flexibility. None of the IR managers we've dealt with can write HTML but the interface forces you to do that. So the standard WYSIWYG approach we've added here works far better.

SDCC came with exact specifications so we tailored DSpace to meet that. The homepage is lively and bright to look at and looks pretty unlike DSpace - and required quite a lot of development. This site isn't available publicly but it is a really useful system for the 50,000 staff to use internally. Giving numbers of items in each section makes navigation quicker and easier.

Various projects Enovation are currently working on (but can't unfortunately be named at this stage) include:

A Digital Learning Object Repository which will include:

Recommend a resource (positive system only - things will be more recommended than others).
Add a review/comment.
Subject based communities of practice (using Mahara).

Fedora Preservation Layer

Use of eSciDoc front end.
Customized services.

We looked for interfaces and found eSciDoc from Germany which is really good.

Social Studies "Dark Archive"

Preserve original image formats internally.
Generate and publish web formats.
Export of system to open access repository.
Link to map server for rendering.

This is an interesting example. There is a very serious archive that is internal but also a public facing version that people can deposit into with a one way connection between the two.

Open Access

Harvester for the nationally funded project.
Linking seven universities in Ireland.

And that's all I wanted to say. We work on making things easier so that people actually feel able and comfortable with depositing materials.

Q & A

Q: What is the take home message here? Do all these repository systems have problems that repository developers should look at?

A: Well, as was said this morning, when IRs were thought of they were thought of by developers for admins and managers NOT for researchers and you really need them to suit researchers. This is a marketing tool in many ways so it has to be good and simple and fit with the wider web interfaces much better.

Q: One question from 2 angles. With my DSpace hat on: is this an XML UI or a customized UI? And are you going to get code back into DSpace?

A: Yes, we are in talks to do this. This is a customized UI. Some parts can be easily added. Things like the people finder would be an obvious thing to add to more repositories and would be easy to break out.

Q (Les): You make things better for researchers but the researchers don't commission this sort of work - it's the managers or the librarians...

A: Researchers can be core to the UI needs: for Trinity College Dublin the visionary management let it be user led. The Learning Circle site was also user driven.

Fred Howell, TextSensor:

Using personal publications lists for batch repository deposit using SWORDs (and some more peaceful approaches) - the EM Loader Project

The idea was a project to connect publicationslist.org to the Depot, a nationwide repository being run by EDINA.
Researchers care about their own web page but they don't want to fill in metadata forms.
So we started from the point of seeing how we could make a personal home page really easy. And once that's been done maybe we can connect that to the repository.

Publicationslist.org allows you to maintain personal publications lists of material in any format. There are about 11,000 researchers worldwide. You can then embed the data into your own webpage from publicationslist.org.

For this project we combined that solution with a way to deposit the identified items in repositories. We added a deposit button to the interface and a page element to let you know the status of items submitted to repositories.

There is an automated process that detects your publications and whether or not they have been deposited, and lets you do a simple SWORD deposit: pressing the button deposits the item.

Initially we thought that SWORD would solve all problems. But as Richard mentioned earlier it is tough to change anything once deposited with SWORD so we had to do some additional API work. The status of the items is checked and, based on timestamps, the system will redeposit an item as needed.

APIs required:

Initial single item deposit - SWORD (with json-bibtex metadata).
Send updates/new version of metadata.
Check status of deposited items.
Search for authors in repositories.
Make the process simple and automated.
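
The timestamp check described above could look something like the following Python sketch. The function name `deposit_action`, its three return values and the example dates are hypothetical; the talk only said that item status is checked and that timestamps decide whether an item is redeposited.

```python
from datetime import datetime, timezone


def deposit_action(local_modified, deposited_at):
    """Decide what to do with one publication (hypothetical rule).

    local_modified: when the record last changed on the publications list.
    deposited_at:   when it was last sent to the repository, or None if never.
    """
    if deposited_at is None:
        return "deposit"        # never sent: initial single item deposit
    if local_modified > deposited_at:
        return "redeposit"      # edited since the last deposit: send an update
    return "up-to-date"         # repository copy is at least as new


# Made-up timestamps for illustration.
june = datetime(2009, 6, 1, tzinfo=timezone.utc)
july = datetime(2009, 7, 1, tzinfo=timezone.utc)

print(deposit_action(june, None))
print(deposit_action(july, june))
print(deposit_action(june, july))
```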

You can look at the live demo on the test version of publicationslist.org.

One of the things we have built in is a PubMed search that lets you find your papers. Once you have a publications list you click deposit - and this creates an instant account on the depot. You then click to send the articles to the repository. We haven't typed anything in but you have deposited your papers etc. So part one of the process is great. But then you've reached the limits of SWORD. It would be great to build updating etc. into future iterations of SWORD.

We felt that academics and researchers don't stay in one institution, so what happens when they move to a new one? So we move to a model of "here are my publications, with an Atom feed (or similar)", and this feeds all the relevant repositories.

PULL is better than PUSH because it:

Removes the need for researchers to do anything extra to deposit.
Is much simpler for publications list providers to support.
And it's so much easier to hit multiple repositories at once.

So from SWORD to PLOUGHshares. Researchers pick their website and let their repository do the key work.
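
To make the PULL model concrete, here is a minimal Python sketch of a repository-side harvester reading a researcher's Atom feed and picking out entries it has not seen before. The feed content, the `urn:example:` identifiers and the `new_entries` function are invented for illustration; this is not the EM Loader code, just one way the idea could work.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A made-up publications feed, as a researcher's website might expose it.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example publications</title>
  <entry>
    <id>urn:example:paper-1</id>
    <title>First paper</title>
    <updated>2009-06-01T00:00:00Z</updated>
  </entry>
  <entry>
    <id>urn:example:paper-2</id>
    <title>Second paper</title>
    <updated>2009-07-01T00:00:00Z</updated>
  </entry>
</feed>"""


def new_entries(feed_xml, seen_ids):
    """Return (id, title) pairs for feed entries the repository has not harvested yet."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        if entry_id not in seen_ids:
            fresh.append((entry_id, entry.findtext(ATOM + "title")))
    return fresh


# The repository already holds paper-1, so only paper-2 is picked up.
already_harvested = {"urn:example:paper-1"}
print(new_entries(FEED, already_harvested))
```

Because the repository does the polling, several repositories can read the same feed and the researcher does nothing beyond keeping the list up to date.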

Q & A

Q (Jim Downing, University of Cambridge): Have you had people move institutions yet? I can see far more issues being raised than just where the files are.

A: We have people who have moved institutions on publications list but this EMLoader service has not been tried as a live service yet.

Les: It is really exciting to see these problems being looked at and tackled!

Comment (Peter Burnhill): I wanted to point out that Fred and I met on the internet. A Blogger in Ohio reviewed publicationslist.org and the depot and suggested they should be connected and that's what actually kicked off this whole process. By weird coincidence both Publicationslist and EDINA were in the same city anyway. We were mated on the internet by a blogger in Ohio!

Q: If something changes something at the web end it changes in the repository but does it work in the other direction?

A: No, we have only done unidirectional but it would be good to do the two way thing.

Daniel Hook, Symplectic: “AtomPub, SWORD and how they fit the repository: going beyond deposit”

One of our main motivations is that we wanted a way for academics to publish homepages with minimal effort. There is real variance in enthusiasm for this but increasingly there is a need for accountability.

So what we have done is automatic aggregation of data from key data sources to automatically generate lists of publications. There is a balance to be found between all data sources and useful ones. We also let people search Google Books to find publications. The idea is to minimise manual data input for these webpages.

Ben and Sally talked this morning about disproportional feedback and this is a great thing to have for academics. We need to make pages so much more useful than what they are used to. We need citation and usage statistics and we're working with Thomson and Scopus to get at this sort of data. There is also the matter of research strategy. There is so much bibliographic data that can be used to show the bibliographic strengths of the institution automatically.

So this is a demo showing lots of different items - books, chapters, conferences, journals etc. - which can be approved or not depending on whether that aggregation has found your work. The idea is any sort of RAE/REF item can be grabbed.

So the system searches for various terms - you can set up default search terms, check settings etc. so that your persona can be used to find your work essentially (although in this demo it is different settings for each repository/data source). Additionally categories are key in archives so you can restrict your search to expected categories. We talked about URIs and IDs so you can use your archive identifier as one of your settings.

The search looks for publications regularly and alerts you if/when items are found so that you can approve (or not) of them. Approved items may include multiple copies - perhaps a manual source and automatic Web of Science data. You can add a manual link to override any data errors.

You can also see if other authors have approved an article. And this means we have a co-author list for the service and can see who has worked with whom.
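
As a sketch of how a co-author list can fall out of approvals, the Python below links every pair of authors who have approved the same publication. The data structure, the author names and the `coauthor_map` function are hypothetical and are not taken from the Symplectic system.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_map(approvals):
    """Build author -> set of co-authors from publication approvals.

    approvals: dict mapping a publication id to the set of authors who approved it.
    """
    links = defaultdict(set)
    for authors in approvals.values():
        # Every pair of approving authors on one publication has worked together.
        for first, second in combinations(sorted(authors), 2):
            links[first].add(second)
            links[second].add(first)
    return dict(links)


# Made-up approvals for illustration.
approvals = {
    "paper-1": {"Adams", "Baker"},
    "paper-2": {"Baker", "Clark", "Davis"},
    "paper-3": {"Evans"},  # single approver: no co-author link yet
}

network = coauthor_map(approvals)
print(sorted(network["Baker"]))
```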

The pluses of this system are:

Easy reference list of your work when it comes to filling in grant applications etc.
It does feed a repository but searching is quite different as you can have a wider range of material represented here.
You can find articles by text found in organisational papers so if you are looking for possible collaborators you have a good starting point to finding out more and making contact. It's a useful tool to browse the organisation as well as helping you find collaborators.

On the pilot system you can see various tabs including one for full text and publisher policy. It uses the SHERPA RoMEO system to check the copyright of a given article, and this tab is colour coded to indicate if there is an item and what the licence is. The licence is cached in the repository with the item. From an academic perspective this is all about easy deposit.

There are a few wrinkles: you can deposit articles from other authors. Physics is a bad subject for repositories as researchers have often already deposited in arXiv.org.

Finally Daniel shows us a diagram of the organisation connecting collaborators together. There is much potential to see the organisational research strategy. Visualisations will be going in the next version of the system.

Q (Les Carr): Yes there is a lot of research intelligence bound up in repositories that we need to harvest and look at. How many of you are working on this sort of service - do you see yours connecting to this service or as stand alone services?

A (Daniel): Richard will talk more about Symplectic tomorrow at his tutorial. He will be talking about Sword and Atom and how those map onto Symplectic and other digital repositories connect to content. And how to deal with the layers of processes in front of repositories.

Daniel's presentation concluded the Show and Tell Sessions for the day so we adjourned to the Atrium for a drinks reception. The winning Pecha Kucha was also announced to be Julian Cheal of UKOLN and he gamely modeled his prize and glittery cowboy hat (the sticker you can just see shows his name and talk title - we had to drop our golden nuggets in our favoured hat!).

And with that it was off in our various directions for the evening...

Repository Fringe 2009 : Pecha Kucha

2009-07-30T11:41:00.007+01:00

What is Pecha Kucha? A formal presentation style, which gives rise to some very inspiring talks. In essence, you have 20 slides in your talk (not 19, not 21), and each slide is displayed for 20 seconds. The whole presentation therefore lasts for 6 minutes and 40 seconds.

This morning's sessions are in three bundles of three speakers each and we'll be voting on them (using golden nuggets and cowboy hats to indicate our preference!).

Group A

James Toon, University of Edinburgh: ERIScotland

ERIS is: Enhancing Repository Infrastructure Scotland
IRIS Scotland was the predecessor to this project and it looked to see if central rather than local repositories would be more useful. The project was considered a success and we wanted to build on the momentum
Grant Funding Call 12/08 - Strand A5 - Repository Enhancement
IRIS took a top down view of the issue, ERIS is taking a bottom up view of the issue working with Research pools
Research pools have issues too and we need to get a better understanding of their needs.
The community developing repositories haven't engaged as much as they should have done with research users and that leads to problems
They are working and talking to researchers and repository managers together.
We need to find a unified way to curate and encode repositories on a national and global scale.
We also want to make recommendations and suggestions for training and policy around repositories
Not only the functionality but also the tech is being considered as part of this work.
We also want to deliver enhancements already proposed - these will be full implementations not just demonstrators.
BUT we are not developing just because we can. The development teams are working closely with the user engagement work
Most of the goals are based on work already started
We don't know what the research pools will actually want. We know local and aggregation factors will both be important.
We only have a vague idea of what is needed at this stage but we have to be more confident about this if we are making high level recommendations.
Planning, making a business case and policy are all crucial to having solid long term impact.
One big assessment of user needs and demands summarizes most of the work of ERIS.
We think this work is achievable but we have to be realistic about long term preservation and access.
We need to unify communities, be trusted and build a sustainable repository.

Les Carr, Southampton: "Repository Challenges"

When you set up a repository you must fit many many targets and multiple agendas.
There is value in the ability to share freely but there is a catch...
We have to adapt to the web as researchers. We are not used to doing our science in public.
Preservation and saving our material is a huge problem. From a data point of view it's an enormous bogeyman problem but we have to do what we can in a realistic way.
e-Learning - the ability for repositories to not only preserve articles but also data and materials for e-learning. There are all sorts of ways of researching that are expensive and very non textual and the only account is journal articles and notes.
Business is part of academia and funding and business cases are important.
Repositories must be efficient and effective. It's for the repository to provide these services.
We have set up repositories to be like a box of lego - so you can put data together as lots of modular components
We need to Pimp our research ride.
The cloud: there is so much power and capability both in the cloud and on our computers. Fitting into day to day working practices and activities is crucial.
A repository is like gears on a bike. It mediates between the components of the research world
We don't need to be chemically enhanced to do well but repositories DO need to be helped to be as useful as they promise to be.
No technology on its own is the answer. It is the combination and the blend of people with technology that makes for so much more powerful a future.

Guy McGarva, EDINA: ShareGeo - Discovering and Sharing Geospatial Data

This is an overview of ShareGeo, a deposit tool that forms part of Digimap.
Digimap provides access to licensed geo data sources.
A lot of data exists out there and it's hard to find and use - especially if derived from licensed data.
The repository uses DSpace and allows stats and other functionality to track use and connect to services.
The data in ShareGeo can be either open access data or that derived or tied to licensed data.
Example data includes land use, grids, research generated metadata etc.
We are trying to formalize the process of sharing and reuse.
ShareGeo is based upon the work of the Grade project which found a need for this type of specialist sharing of licensed data.
We currently have fairly high numbers of logins, users and downloads but little upload of data.
We use a map based search and spatial queries of data sets are enabled making the finding of data easier for users.
The footprints of datasets are shown on maps, which is useful in itself and quite powerful when combined with spatial queries.
ShareGeo takes a single zipped file for deposit of material, which means one or many files can make up a deposit. We have a size limit of 1 GB at the moment but this is just to keep the service manageable.
The automatic ingest maps the data for ShareGeo and geospatial metadata is automatically added to the item in ShareGeo.
Licenses are part of the deposit process so that you always know what type of data and usage you are dealing with.
Issues regarding take up include the difficulties in the closed nature of the site. There are also commercial sites that provide data sharing facilities.
Future improvements include looking at sourcing and adding more open data, creating a sister open access version of ShareGeo etc.
The main issue for us right now is how to get more data deposited and how to build up our community of users.
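
To give a feel for how dataset footprints and spatial queries fit together, here is a small Python sketch that treats each footprint as a bounding box and returns the datasets whose box overlaps a search area. The `BBox` type, the dataset names and the coordinates are all made up for illustration, and a plain in-memory search stands in for whatever ShareGeo does in its database.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """A rectangular footprint in longitude/latitude degrees (hypothetical type)."""
    west: float
    south: float
    east: float
    north: float


def intersects(a, b):
    # Two boxes overlap when they overlap on both the east-west and north-south axes.
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)


def search(footprints, query):
    """Return the names of datasets whose footprint overlaps the query box."""
    return sorted(name for name, box in footprints.items() if intersects(box, query))


# Made-up footprints for illustration.
footprints = {
    "edinburgh-grid": BBox(-3.3, 55.9, -3.1, 56.0),
    "glasgow-landuse": BBox(-4.4, 55.8, -4.1, 55.9),
    "scotland-landuse": BBox(-7.7, 54.6, -0.7, 60.9),
}

around_edinburgh = BBox(-3.5, 55.8, -3.0, 56.1)
print(search(footprints, around_edinburgh))
```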

Q&A for Group A

Q (Balvier, JISC): How are you doing the aggregation for ERIS?

A (James Toon): The NLS leads the aggregation. Very standard aggregation in use right now...

Follow up (Balvier, JISC): Have you spoken to Paul Walk at UKOLN as they are working on something in this area.

A (James Toon): We've chatted. Right now normalizing data is really what we want to be able to do to provide a good API.

Q: Regarding ShareGeo: How hard was it to get PostgreSQL to do the Geo searching.

A: Not too bad but we do it in quite a basic way from lists. We're looking at a more geospatial extension that might allow a more sophisticated solution.

Follow up comment from the floor: You might look at LocalSOLR.

Group B

Richard Jones, Symplectic: Symplectic Repository Tools

Richard will be talking about some of the repository tools that his company Symplectic produce.
Richard works on repository integration tools to go with the repository systems they make
The aim is how do we provide a deposit tool to make things easy and efficient.
We're starting with an image of Researcher publication lists which form part of their repositories.
And a full text tab - you can see what file you have uploaded, permissions and the ROMEO publisher policies.
Publications pull in data from lots of sources. They connect the repository with SWORD and AtomPub as the method.
Why not just SWORD? Well it's only designed for creating not updating/removing/changing items.
AtomPub exists in a RESTful environment so extra functionality can be added in.
Some real complications though. Repositories are designed to be static but Symplectic is a more dynamic environment in a constant state of flux.
Repository workflows are a complication - there are three stages really: working copy; review; archiving.
So if you blend static and dynamic repository systems what do you get? A really complex slide - but Richard assures us that we can find out more in the tutorial tomorrow!
Benefits for the researcher - you can update and correct your data if/as needed which can also mean better metadata creation.
Where next? Lots of bells and whistles with repository tools linked to publication tools as well as helping to make standards grow.
Richard will be talking more about this tomorrow.
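
The difference between deposit-only SWORD and the fuller AtomPub picture can be sketched in terms of HTTP methods: AtomPub creates a member with POST to a collection, and edits or removes it with PUT or DELETE on the member's own URI. The Python below only builds the requests and sends nothing; the URLs, the placeholder entry body and the function names are hypothetical and are not Symplectic's API.

```python
from urllib.request import Request

ATOM_ENTRY = "application/atom+xml;type=entry"


def create_item(collection_url, entry_xml):
    # POST to the collection URI creates a new member (the step a SWORD deposit covers).
    return Request(collection_url, data=entry_xml.encode("utf-8"),
                   headers={"Content-Type": ATOM_ENTRY}, method="POST")


def update_item(member_url, entry_xml):
    # PUT to the member URI replaces the existing entry.
    return Request(member_url, data=entry_xml.encode("utf-8"),
                   headers={"Content-Type": ATOM_ENTRY}, method="PUT")


def delete_item(member_url):
    # DELETE on the member URI removes the entry.
    return Request(member_url, method="DELETE")


# Hypothetical endpoint and placeholder entry; nothing is sent over the network.
entry = '<entry xmlns="http://www.w3.org/2005/Atom"><title>Example</title></entry>'
requests = [
    create_item("https://repo.example/collection", entry),
    update_item("https://repo.example/collection/item-1", entry),
    delete_item("https://repo.example/collection/item-1"),
]
for req in requests:
    print(req.get_method(), req.full_url)
```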

Julian Cheal, UKOLN: “Repository Deposit Using Adobe Air”

What is it that we're trying to capture?
Academics write things in their notebooks and that material is not easily absorbed into repositories.
You could make researchers work on computers...
There's a quote from Bruce Chatwin: "Losing my passport was the least of my worries. Losing my notebook was a disaster" - data is important to researchers and they need to know it's safely archived.
But we need to make it more straightforward for depositing materials.
Adobe Air is a runtime environment and combines the world of the web with the desktop. It's cross platform. It's a rich internet application.
So who's made an AIR? Various twitter clients, The BBC and various advertising companies for a start.
Academics want stats and relationships to funders etc.
Julian has thus made a prototype that looks - deliberately - very much like Flickr uploader.
The finished product is able to drag and drop, easy to use, and pretty to look at.
It's a small application - Julian is showing us all the files involved and it's only a few.
Academics want easy repositories so drag and drop functionality on their desktop is perfect. It uses an SQLite database so it can synchronize offline data as soon as your machine is re-connected to the internet.
The application talks to SWORD, looks up ROMEO, the Names project etc. to catch automatic metadata.
Screen shots indicate that you drag and drop, add metadata as you want. You can add lots or less metadata as appropriate. Auto-complete makes this easy.
JISC has offered to have a deposit event to combine all the deposit apps. This will take place in October.
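
The offline behaviour mentioned above, keeping deposits in a local SQLite database until the machine is back online, might be sketched like this in Python. The table layout, the function names and the `send` callback are assumptions made for illustration; Julian's tool is an Adobe Air application and this is not its code.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) the local queue of deposits waiting to be sent."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending ("
        "id INTEGER PRIMARY KEY, filename TEXT NOT NULL, title TEXT)"
    )
    return conn


def queue_deposit(conn, filename, title):
    # Called when a file is dropped on the app, whether or not we are online.
    conn.execute("INSERT INTO pending (filename, title) VALUES (?, ?)", (filename, title))
    conn.commit()


def pending_count(conn):
    return conn.execute("SELECT COUNT(*) FROM pending").fetchone()[0]


def sync(conn, send):
    """Try to send every queued item; items whose send fails stay in the queue."""
    sent = 0
    rows = conn.execute("SELECT id, filename, title FROM pending ORDER BY id").fetchall()
    for row_id, filename, title in rows:
        if send(filename, title):  # send is a hypothetical callback returning True on success
            conn.execute("DELETE FROM pending WHERE id = ?", (row_id,))
            sent += 1
    conn.commit()
    return sent


queue = open_queue()
queue_deposit(queue, "notes.pdf", "Field notes")
queue_deposit(queue, "talk.ppt", "Conference talk")

offline_sent = sync(queue, lambda filename, title: False)  # offline: nothing goes out
still_waiting = pending_count(queue)
online_sent = sync(queue, lambda filename, title: True)    # back online: queue drains
print(offline_sent, still_waiting, online_sent, pending_count(queue))
```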

Hannah Payne & Antony Corfield, The Welsh Repository Network: "The Welsh Repository Network: A tasty bit on the side!"

URNIP is the JISC repository enhancement project for the WRN.
Wales has a diverse HE landscape - institutions of very varied size with very different needs.
We have face to face and video conferences with institutions and we're doing site visits.
Each summer there is a library and IT development event and we are using this to communicate with our project partners
We share support calls and share work via Google code.
We are working with 4 partner institutions on deposit to see whether deposit increases with changes to the deposit process. But we are only at the pilot stage at the moment.
We will be reporting the best models, policies and possibilities as part of this project.
e-Theses and dissertations: the National Library of Wales already collects all paper copies but we want to see if they can be a hub for electronic deposit too.
The e-Thesis project will connect preservation and metadata functionality.
Auto-complete is a hit so we are looking at the work of a previous project Deposit Plait which looked at harvesting and checking data via web services.
Users wanted import and export of metadata to link to other educational databases and services.
Embedded players and multimedia deposit were the highest user priority. Holograms, art works and film are all key research outputs for various institutions in Wales.
We are not a standalone project but fit into the wider repository landscape. We want a cross-project forum so we can establish a set of services and support across repositories in the UK.
Diolch yn fawr am gwrando! (Thank you for listening).

Q & A for Group B

Q (Les Carr): The Scottish ERIS project comes from shared research; is the Welsh one more based on library collaboration?

A (Hannah): Yes. ERIS is very research focused. We are perhaps a step back looking at collaboration and development at this point.

Q (Hugh Glaser): For the Adobe Air application, where do you get data for auto-complete functionality?

A (Julian): I use SWORD APIs where possible but some data you have to grab and work around as not everyone has a suitable API.

Q: You don't send data to the National Centre for Text Mining for instance to find keywords?

A (Julian): If they have an API that can be used then I'd be very happy to tie that in to the tool.

Group C

Joyce Lewis, Southampton: "Marketing and Repositories - Tell me a Story"

Joyce is talking about the importance of stories and how repositories can help us to tell stories.
"People don't care about cold facts. They care about pictures and stories" - Nancye Green.
Back when Joyce started at the University they did news releases and the broadcast media weren't really targeted and only a few releases were picked up by the print media. Once published the story was also lost.
The university environment has changed now. Lots of universities, lots of research and lots of enthusiasm to show.
Quote being shown here is along the lines of the fact that universities do a poor job of telling investors what they get for their money.
At the moment Joyce tells the stories about the university through text with links and a picture but there is SO much more on the web that could be used. We don't want people to get tangled up in lots of unlinked resources on the web though.
Impact is key to the RAE and REF and this has to be thought about when we think about what we record and promote and how.
Project called Tell Tale is about telling the story of research. It's not funded yet but fingers crossed...
The project would catalogue adaptations to a repository necessary to capture the research story.
It would involve enhancing the content by putting it together into a story.
We want to create narratives automatically with story templates and narrative generation software that links around to items.
Then what is left is to demonstrate success through these stories and the usage of the content they talk about.
The bottom line is that we want to tell a better story.

William Nixon & Gordan Allan, Glasgow: "Enrich - Research System and Repository Integration"

William and Gordan are talking about the JISC funded project Enrich.
The project aims to bring disconnected research elements together.
Research systems are miles from repository systems...
What is a research project? It's an idea. It may or may not have funding or licenses or artifacts associated with it.
There is a sense of research alchemy. And some research MUST publish, others may not have to.
The research lifecycle includes a short burst of publishing but a lot of unpublished work.
The University of Glasgow's Research System tracks funding and licensing and we've started connecting that to the repository.
Repositories need to relate better with the research systems. Records can, when set up this way, now be pushed out via RSS and Twitter for instance. Enlighten is a service whose use has been growing more and more. They are at about 40% full text right now but a requirement to deposit material at the university should get the repository nearer to 100%.
In the old days repositories and research systems were separate silos.
Junction boxes are the future - we're about services. We are turning the repository into a junction box to other resources. For example data from repositories is used to generate publications lists on staff pages at the university. To add your publications you have to deposit them.
Most searches are not native - people come in via Google and other search engines.
We have freely available global open access.
But key to success are good relationships, easy clear systems and processes and the university policies really help us to be successful.
Enrich will bring together lots more data to tie into the hybrid repository/research Enlighten service.

Jo Walsh, EDINA: "Geoparsing text"

Jo wants to introduce herself to this community as she is just starting to move into a new role with EDINA and to engage with the repositories community.
The Geoparser, developed in this very building, is based on a grammar based named entity recognition technique that allows geo tags to be added to text automatically.
The recognition links to the Gazetteer service. They work well together and the more places you find, the more accurate the look up will be. Text context creates an idea of geo context.
GeoCrossWalk has been around for a while and it has a service status now. It uses Ordnance Survey to identify places. It's an enormous and useful service but it has been limited to Digimap users and licensed users. This confuses users.
This year we will also be expanding the service with an open access Gazetteer using geonames.org and the same type of system as the licensed version. The results will be variable BUT geonames has a wiki style ability to edit so errors can be identified and fixed.
Geoparser webservice will be a simple RESTful API for document placename extraction and markup. You can use OS data OR the OpenDate Gazetteer.
Jo is looking for a sense of user requirements and how this tool fits in specifically with repository needs.
Linking items across the repository seems to be one useful case.
You might use techniques to bootstrap geographic metadata for archives of textual components.
Spatially searching archives and nearby related material would be another use case.
Please contact Jo with comments, feedback and use cases.
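
As a very rough picture of what bootstrapping geographic metadata from text means, the Python below looks up place names from a tiny gazetteer in a piece of text and returns their coordinates. This is plain whole-word string matching, swapped in for simplicity; it is not the grammar based named entity recognition the Geoparser uses, and the gazetteer entries (with approximate coordinates) and the `geotag` function are invented for illustration.

```python
import re

# A toy gazetteer: place name -> approximate (latitude, longitude). Illustrative only.
GAZETTEER = {
    "Edinburgh": (55.9533, -3.1883),
    "Glasgow": (55.8642, -4.2518),
    "Cardiff": (51.4816, -3.1791),
}


def geotag(text, gazetteer=GAZETTEER):
    """Return {place name: coordinates} for gazetteer places mentioned in the text."""
    found = {}
    for name, coords in gazetteer.items():
        # Whole-word match so that a name is not picked up inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found[name] = coords
    return found


abstract = "The survey covered Edinburgh and Glasgow but not Aberdeen."
print(sorted(geotag(abstract)))
```

A real geoparser also has to disambiguate between places that share a name, which is where the surrounding text context mentioned above comes in.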

Q & A for Group C

Q: What kind of licence do you pick for the open access geo stuff?

A (Jo): Actually it's from other sources and inherited.

Q: How do you feel about where institutional repositories are going?

A (William and Gordan): We feel more like an 8 year overnight success right about now. We've visited every department in the university. Everyone asks how, not why, to deposit these days; they used to ask why they should. We've seeped into the research process. We've been very well supported by our Vice Principal for research. We've really started to realise the potential of all the data we have been gathering. And we are in a post RAE, pre REF place so we're looking at how to repurpose the repository to suit that change best.

Repository Fringe 2009: Welcome and Opening Keynote

2009-07-30T09:40:00.006+01:00

Today and tomorrow this blog will be covering 2009: Beyond the Repository Fringe and we're just about to get started with an introduction from Simon Bains, Head of Digital Library at Edinburgh University.

Simon is introducing us to our lovely venue for the day - The Informatics Forum - and the fact that we are due to have a fire alarm this morning so there may be a very short gap in blogging. Simon is also introducing our official Welcome from Sheila Cannell, Head of Edinburgh University Library Services.

Sheila is warmly welcoming us with a magnificent image of the Udderbelly in Bristo Square which is one of the most visible temporary Edinburgh Festival Fringe highlights in the area this month. Sheila urges us to think about giant purple cows as a way to think outside the box and lays out some goals for the next few days from her role as a manager with a huge interest in repositories for increasing open access.

There are huge changes around us and the current financial climate will act as a catalyst for changes to the methods of scholarly communication. We've been talking about these ideas for years but the edge of the financial crisis may be the trigger for real change.

We have traditional scholarly communications in journals and we have open access. Will finances change that balance? Have we normalized the processes of repos in our day-to-day practice? How is the funding and staffing set up - are resources permanent or short term? Do we need to normalize their place in the institution?

There is a three fold role for repositories:

curation
promotion
marketing

But does that lead to difficulties if we don't have a clear message about what our repositories are about?

You will talk about multiple repositories and depositing to multiple repositories, but how do we deposit to them in a simple, unified way? It's good for preservation to have many copies, but how many times are we having to deposit material?

We need to think and understand whether what we do with repositories is important to the researcher and whether they still see the traditional scholarly communications as the main aim.

And disciplinary difference: biologists work and think differently to physicists or mathematicians or social sciences in terms of scholarly communication and in terms of scholarly practice.

All of the topics Sheila has talked about would warrant a paper and she hopes that some of them will be progressed through discussion and consideration in the course of the next few days.

Sheila closes by hoping that the Repository Fringe 2009 is a success and that everyone has a chance to see and enjoy Edinburgh as part of their visit.

Simon Bains returns to introduce our keynote speakers Ben O'Steen and Sally Rumsey.

Opening Keynote:
Ben O’Steen and Sally Rumsey (Oxford) – “A sneak preview at the A-list stars of future repositories: blockbuster technical developments and the cultural drivers behind them”

Sally opens by explaining that she and Ben will be handing back and forth with Sally looking at the more library view of repositories whilst Ben will be talking about the more technical whizzy end of affairs.

Sir Thomas Bodley set up the library in Oxford and Sally is taking us through the history of the library including a lovely quote from Francis Bacon that the Bodley "is an ark to save knowledge". We are also looking at search, 1620 style: a paper list.

The original library building fast ran out of space and the Radcliffe Camera, the Radcliffe Science Library and the New Bodleian Library were all built. By 1914 the library received a million items a year. It continues to grow and grow and Sally shows us a preview of the storage facility in Swindon which will be helping the Bodley deal with the volume of material by 2010.

There is a usage agreement for the library - you must not kindle any fire or flame for instance - based on the traditional one that users must still sign today. And the sign on the library states that it is a "republic for lettered men". This is an interesting phrase for thinking about repositories. Are we building the digital equivalent of the Bodley? The growth curve for repositories so far is encouraging but we cannot grow such resources overnight.

Realizations as a catalyst for change - the realization can be as important as change itself.
Repositories can be treated as a concept and Sally uses the term in its plural on purpose. The single repository as a thing is on its way out. The repository as a box is no more. They may even be invisible as they are built into other services. One factor that's moving us away from the standalone repository is integration with other technical systems as well as changes in soft academic systems. Repository staff can also act as catalysts for networking and collaborative working especially in the light of the REF (Research Excellence Framework).

As we move forward we are beginning to achieve Clifford Lynch's idea of repositories as a set of services.

Over to Ben who starts by saying that the internet is the most successful repository in the world. We are separating the service and the storage and that's what should be occurring. They should be distributed across a number of nodes. There should be multiple ways to search and access content. Any service or storage can disappear or be added or upgraded without affecting the other systems unduly. If you lose your index of a repository you can be lost; the internet's approach of multiple access points fixes this. You want to make your repositories like the web.

"The future is here. it's just not evenly distributing yet"

- William Gibson, NPR Talk of the Nation, 1999.

Ben is explaining that he knew the quote but it took a while to find the citation. He eventually found it on someone else's website and connections and this is an example of how people look for information (find out more here: http://bit.ly/89AtD). The citation was found using Google to find where the quote was from. The web is the world. The web is usage. Following the trail of usage through searching and contacts and then through to a recording. But NPR have changed their website since the citation was originally found and they changed the metadata trail for this quote. The new URL has a single ID to that single recording now.

People search for things. It's incidental that they can find the documents not the things. Search is about full text.

There are some issues. All things have names of some sort. But there are things we need to fix. Dates and events don't always work on the web. We can do that though; this is the challenge for repositories. We can provide documents that directly relate to a thing. We can provide URLs. This will help people find things in repositories better than they currently are able to. The key is knowing HOW the document relates to a thing - critique, reference etc.

A second realization - we've been giving ourselves names on the web for a while. Ben is showing his various web IDs (on Twitter, Facebook, etc). How about we do this more evenly so that there are pages for projects and researchers that link to existing names.

Power comes from the power of relating names. This is unbelievably powerful.

And there has been a social sea change in how people appear on the web. The question has moved from "do you have a profile on.." to "are you on.." - people are themselves online, not some random profile but all linked to the real person. And where are we going with this...

Linked data and http names means names for things and connections between things.

Library of Congress are publishing their authority lists as linked data in RDF. That's a great way of making LCSH more useful on the web (See: http://id.loc.gov). Yahoo and Google index RDF embedded in HTML pages (as RDFa) and that's hugely useful for linking and connecting and making search more useful and items more useful and visible. You need clear ideas about what you are doing as a repository.
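
To make "names for things and connections between things" a little more concrete, here is a minimal Python sketch of linked statements held as subject/predicate/object triples. It is only an illustration and not RDF proper: every URI is a made-up example.org name, and the predicate labels ("creator", "about", "critiques") are assumptions chosen to echo the point above about knowing HOW a document relates to a thing.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate obj")

# All URIs below are made-up example.org names, used only for illustration.
PERSON = "http://example.org/people/ada"
PAPER = "http://example.org/papers/42"
REVIEW = "http://example.org/reviews/7"
TOPIC = "http://example.org/topics/repositories"

triples = [
    Triple(PAPER, "creator", PERSON),
    Triple(PAPER, "about", TOPIC),
    Triple(REVIEW, "critiques", PAPER),
    Triple(REVIEW, "about", TOPIC),
]

def related(thing, triples):
    """Return (predicate, subject) pairs for every statement pointing at `thing`."""
    return [(t.predicate, t.subject) for t in triples if t.obj == thing]

# Which documents relate to the topic, and how?
for predicate, doc in related(TOPIC, triples):
    print(doc, predicate, TOPIC)
```

Because each thing has one stable name, any other service can add its own statements about the same paper or person without having to hold a copy of it.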

Back to Sally: In some areas policies are in place and well developed and they should drive everything. OpenDOAR provides a tool to help with this, but the Preserv project at Southampton found a real gap here. DISC UK DataShare are also looking at issues of repository management (they have a special session this afternoon at RepoFringe2009).

There has been a huge expansion of items and types of items that you expect to see in repositories since 2000. Originally they were for refereed published literature. But now a much much wider range of materials are being handled (eg JORUM). Some of the most successful repositories have been single subject repositories. Academics like them and use them and continue to deposit in them. We need to be able to deposit in multiple places at once and policy must drive that. We need that to get buy in from a lot of data creators.

Back to Ben: We have reinvented too many wheels already. Don't fight it, work with it. Use the standards that people are using now. De facto standards are important so don't feel tied to the ISO standards if they are not what is actually in use. Already out there are:

Transfer

Files - http
Lists - atom, rss

Create update etc

HTTP POST, etc.

Names

URIs - these are already used by people connecting to Wikipedia for instance.

Lookups

DNS resolvers
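
As a small sketch of why using what is already in use pays off, the Python below reads a list of deposits from an Atom feed using nothing but the standard library. The feed content and the example.org addresses are invented for illustration; a real repository would serve something similar over plain HTTP.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Invented sample feed; a real repository would serve something similar over HTTP.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Recent deposits</title>
  <entry>
    <title>Phonetics dataset</title>
    <id>http://example.org/items/1</id>
    <link href="http://example.org/items/1"/>
  </entry>
  <entry>
    <title>Survey tables</title>
    <id>http://example.org/items/2</id>
    <link href="http://example.org/items/2"/>
  </entry>
</feed>"""

def list_entries(feed_xml):
    """Return (title, link) pairs for each entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title")
        link = entry.find(ATOM + "link").get("href")
        entries.append((title, link))
    return entries

for title, link in list_entries(SAMPLE_FEED):
    print(title, "->", link)
```

Any feed reader could consume the same list, which is the "instant community" a widely used format brings with it.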

Using what is in use means instant communities. They have techniques and tried and tested software to access our materials already. You don't have to write things from scratch. You can experiment quickly and usefully. How do these fit into researchers workflow? If they don't you need to ditch it and move on. Tools and techniques may not be perfect but might do the job.

If you genuinely are doing something new you need a community to assist you; if no one else is interested then should you be doing it that way? There are some projects who go ahead on their own when good alternatives are already available and in use.

We don't have de facto standards for:

Real time event notifications through the browser.
Simultaneous collaborative document editing (Google Wave may have some relevance here).
Data qualified and ranked by evidence. Search engines do very poorly at this: how can they know and use what YOU trust?

So, audience participation time: Ben asks us to name some repositories:

flickr
youtube
kfupm

And adds a long list including:

slideshare
facebook
google docs

People use all these things. They are useful and links between different versions of the same documents are useful for qualifying all connected items.

There are no common standard or APIs for these repositories but they all contain a set of things. And these are useful things.

If you want to get stuff from these repositories into yours you will not get a SIP. You won't get a focused package of what you want. There are mechanisms being developed but nothing so far and many repositories don't have these abilities. What's now? What's current? Use that!

Realisation: object transfer is still in a divergent state. For the moment we just have to cope with lots of containers and folders. No negotiation for the format of a SIP: you deal with what you are given. And sometimes you have to harvest what you can (e.g. Pubmed - you have to grab and keep your copy).

But we can cope because there is a Normal Archival Process already in place (it's for physical objects but the digital issues are similar):

Accept delivery of boxes of stuff and record roughly what was received. Things get permanent IDs now. That means you have an audit trail and provenance for all items.
Triage the contents within a stable environment: deal with fragile things first, things that will deteriorate; sort out issues that arise with rights holders, depositors - this is always a dialogue; some things may stay in the box for a LONG time and just be on a shelf waiting but as long as they have been triaged the box can sit until it needs handling
Identify actions that need to be taken to ensure future access.
Characterize and catalogue the contents using relevant tools. For instance Oxford have an ephemera collection that just doesn't work with MARC so a new schema was needed. This action will sometimes be called for.
Update archival records so that people can find the content (if they are allowed to).
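
The first two steps of that process translate fairly directly into code. The Python below is a minimal sketch under stated assumptions, not Oxford's system: the Accession class, its field names and the use of a random UUID as the permanent ID are all hypothetical choices made for illustration.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Accession:
    description: str                      # rough note of what was received
    fragile: bool = False                 # needs attention before it deteriorates
    # Assumption: a random UUID stands in for whatever permanent ID scheme is used.
    identifier: str = field(default_factory=lambda: uuid.uuid4().hex)
    audit_trail: list = field(default_factory=list)

    def log(self, action):
        """Append an action to this box's audit trail."""
        self.audit_trail.append(action)

def accept_delivery(description, fragile=False):
    """Record roughly what arrived and give it a permanent ID straight away."""
    box = Accession(description, fragile)
    box.log("received")
    return box

def triage(boxes):
    """Order boxes so fragile ones are handled first; the rest can wait."""
    ordered = sorted(boxes, key=lambda b: not b.fragile)
    for box in ordered:
        box.log("triaged")
    return ordered

dvds = accept_delivery("box of DVD-Rs")
floppies = accept_delivery("ageing floppy disks", fragile=True)
for box in triage([dvds, floppies]):
    print(box.identifier, box.description, box.audit_trail)
```

The ID is assigned at the door, before anyone knows what is in the box, so the audit trail covers the item from the moment it arrives.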

So we already do this. This is our accession process.

The media may be different in the digital deposit/accession version of events but the process need not be and/or can evolve as necessary.

Not all storage is the same:

The absolute biggest benefit to any repository is to separate out the concerns of storage and services. It will make your life so much easier.
Oxford have a "bitbucket" - a huge safe storage machine where things can sit until they can be dealt with.
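
Here is one way that separation of storage and services might look in miniature. This Python sketch is an assumption-laden illustration rather than a description of Oxford's setup: the BitBucket and SearchIndex classes are hypothetical, storage is just an in-memory dictionary, and the "service" is a crude word index.

```python
class BitBucket:
    """Storage concern: keeps bytes under a name and nothing else."""

    def __init__(self):
        self._objects = {}

    def put(self, name, data):
        self._objects[name] = data

    def get(self, name):
        return self._objects[name]

    def names(self):
        return list(self._objects)


class SearchIndex:
    """Service concern: a disposable index that can be rebuilt from storage."""

    def __init__(self, store):
        self.store = store
        self.words = {}
        self.rebuild()

    def rebuild(self):
        """Throw the index away and recreate it from whatever storage holds."""
        self.words = {}
        for name in self.store.names():
            for word in self.store.get(name).decode("utf-8").lower().split():
                self.words.setdefault(word, set()).add(name)

    def search(self, word):
        return sorted(self.words.get(word.lower(), set()))


store = BitBucket()
store.put("item-1", b"Phonetics audio notes")
store.put("item-2", b"Survey notes")
index = SearchIndex(store)
print(index.search("notes"))
```

Because the index holds nothing that storage does not, losing or replacing it costs a rebuild and nothing more; the content itself is untouched.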

Hardware, software, people and storage will come and go. Your content is constant. And we need to respect what scholars deposit because of that.

Back to Sally now for experience of scholars and repositories:

"When it's one click deposit I'll do it"

A diagram of what researchers should do in the deposit and publication process explains the confusion and barrier to deposit. So we need a way to make things clear and easy:

Deposit by stealth and through other easy solutions
Multiple repository deposit regime (MuRDer!)
Answer related problems that worry people such as the issue of multiple versions
Automation, automation, automation

Nature publishing have recently offered to deposit items into repositories and maybe even Institutional Repositories (IRs). Will other publishers follow suit?

Copyright: wouldn't it be nice if things were uniform between publishers? Some are becoming more open but it's a very long way from consistent.

A recent BL report highlighted that "restriction threatened to lock away digital content in a way we would never countenance for printed material." Researchers are used to a more open way of working with print and with other (non repository) items online.

Legal deposit as a parallel to repository mandates and their role for archiving and access. In 1610 Bodley did a deal with the Stationers and Newspaper makers company to be able to request a copy of anything published. If copies ran/sold out the Bodley could be used to find/replicate an item. Strong parallels with repositories and deposit in them. Perhaps we should be aiming for universal scope, independence and size (as mentioned in a 1910 Bodley document) with our repositories?

Preservation aims towards preserving access. Assured secure storage and permanent access needs to be well managed. And aided by intra-library agreements and funding moves.

Shared and distributed expertise. Example being mentioned here is the LC putting collections on Flickr - the metadata isn't going to be perfect but you get some metadata created quickly and some will be good.

Recent RIN report: Creating Catalogues: Bibliographic records in a networked world looked towards making material available and findable in repositories but will it really come true? It would be great if it did.

And back to Ben: we are looking at/for a disproportionate feedback loop:

The perception that a small effort leads to a very great benefit.
This leads to the idea that more little efforts have bigger results.

Ben is showing a duck hunt screen capture. High scores are technically trivial but psychologically important in gaming. Are usage stats for scholarly items any less useful in this way? Reusage stats (trackbacks, tweets and references) are incredibly important. Vanity stats can really drive deposit. Another screen capture of the new Ghostbusters game is on screen now - 6 buttons to hit for huge feedback. How many boxes and buttons do we have on deposit forms? We need much better feedback if we actually want people to deposit their work.

Back to Sally: peer review is super important. Knowing how lab books, data etc fit in is also important. A journal article is just a summary of research and things could change considerably if other outputs become available or take over.

There are new forms of dissemination and publishing - a semantically marked up article which has been commented and colour coded and linked back to other items is being shown. The meaning of the word is highlighted and links out, the article links to figures, data and other items so that they are all available as part of the article. It's not an article it's a huge resource.

Aren't more people going to want to do this? Once authors find out it's possible they will want to do it.

Open Access
We start with the example of Sally's ancestors. They marched on land that needed a permit to access: though the landowner underused the space, he blocked access, so there was a mass trespass to prove the point. Some were imprisoned for their actions but the march resulted in a law change - the introduction of the Right to Roam (updated in 2000). What many people had failed to realize was that the land had been free to access before and should be again.

There is a parallel here to Open Access. The change in legislation that the trespassers got should be a positive indicator.

There is a perception that Free isn't good and we have to change that. There are so many complex open access options for authors to deal with. Unfortunately they will probably hover around for some time but hopefully things will get more manageable.

And finally a preview...

We think repositories are really moving. It's going to be a long, slow, incremental change.

But we are still waiting on

Easy multiple deposit.
Collaboration between publishers and IRs etc.
Simplifying of everything.

Final word from Ben: Print on Demand is going to be big. People will take what's useful to them and mix it up a bit. What does a book mean when it's £2 and you create it in minutes? You can have a printing machine available to open up access and mixing options.

You can print off a set of articles into a book on a library's book printer
Your colleagues' comments, tweets and reviews are interleaved with the text
Your colleagues were found from your Professional networks
YOU can do all this already!

You can create a bookmark list of plates from 18th century books online which you believe to be the work of one anonymous artist - this list is research in itself

Permanent books, temporary magazines? Is this true? How about facsimiles etc?

We've been talking about preservation and access so some demos to close the presentation:

Ben shows us a book printed on demand, another from a facsimile, and a traditionally published one. They all look the same.
We can't preserve access to all research. Research on a computer game has to be emulated. You can't preserve the actual researched activity any other way at the moment.
We don't have all the media we want yet. People continue to create new media - Ben shows a video (included on this blog somewhere: http://blog.karagos.com/) BUT you can pan around the video: this is a new form of video. This could be a way of broadcasting. This could be an archive of a choreographed piece. Research is not just text. It can be all sorts of formats.
Ben takes a picture on his phone. And he's got a £25 mobile printer that prints out stickers. You can send them from your phone and have a printed image from a wallet sized printer in a few seconds.

Laptops aren't what you carry all the time. But your phone is and the mobile printer lets you grab a paper print of a map or similar on the move. There's lots of this stuff out there already. What people say they want is NOT what they actually want. Print on Demand is going to be good and useful for research. It's what people will be doing soon!

And on that Ben and Sally conclude their session. And Simon speaks for us all in saying that that was tremendously interesting!

Q & A

Q (Ian Stuart): People have identities. Personal and Professional identities are separate for many at the moment. Can people mix identities in this sort of linked landscape?

A (Ben): The power of linking is huge. If you link identities that is a double edged sword. It's useful but can get you in trouble too. Some use consistent nicknames online for their personal presences but some are just getting better at managing their online presence in general. Some universities are leading the way to get a professional presence online but we need to encourage that and link to materials and qualify work appropriately.

Q (Les Carr): We're almost 10 years since the first meeting of the Open Archives Initiative. We are now even less able to define what a repository is. And yet our institutions are set up to deliver applications that are well defined and look like databases. How do we deliver services that are relevant and business critical but also open and flexible?

A (Ben): You don't have to have all the data about an item in the same place as the items, you just need names and connections. You can keep some domain separation but it can be political.

A (Sally): Sometimes you need to just do something, you can't get it perfect, you have to demonstrate what is possible. Demonstrating what is possible may be what's needed to move forwards.

Preview of Forthcoming Attractions: Repository Fringe 2009

2009-07-29T15:05:00.005+01:00

My name is Nicola Osborne and I am the Social Media Officer for Edina. For the next two days I will be using the DataShare blog to cover the Repository Fringe 2009 event which is being held at the Informatics Forum of the University of Edinburgh on Thursday 30th and Friday 31st July 2009.

I'll be sitting in on most of the sessions and posting up notes and summaries here as well as Tweeting the highlights of the Fringe. If you are interested in joining in from your own desk you will be able to see streaming video of some sessions, view the Tweet stream and take part by commenting, looking at images, taking part in polls, etc. via the Repository Fringe 2009 website and our CoverItLive stream. Or just keep an eye on this blog all day Thursday and Friday.

If you have any questions or comments please leave a comment below or email me (nicola.osborne@ed.ac.uk). If you are attending Repository Fringe then please say hello and let me know how I'm doing with the live blogging. If you are blogging, tweeting or uploading films or images please use one of the hashtags #RepoFringe09 or #RF09.

Pictures from IASSIST DataShare workshop

2009-07-23T17:21:00.009+01:00

http://www.fsd.uta.fi/iassist2009/workshops.html

My Faves for Wednesday, July 15, 2009

2009-07-16T01:38:00.001+01:00

Event: The Data Imperative: Libraries and Research Data | nostuff.org

Blog post from Chris Keene on the "The Data Imperative: Libraries and Research Data" event held at Oxford in June.

Interesting summaries and comments on speakers - Paul Jeffries, Luis Martinez, Sally Rumsey, Alma Swan, Simon Hodson, Martin Lewis.

[tags: research data, data management, Oxford, blogs, data librarians, data repositories]

See the rest of my Faves at Faves

My Faves for Tuesday, June 16, 2009

2009-06-17T01:38:00.001+01:00

Migration to Intermediate XML for Electronic Data.

MIXED is a project of DANS, Data Archiving and Networked Services. MIXED is to contribute to digital preservation, by dealing with the problem of file formats. Over time, file formats become obsolete. When that happens, the information in such file types is no longer accessible. MIXED follows the strategy of converting files to XML as soon as possible, preferably when data is ingested into the archive. MIXED also converts these XML files to formats of choice by the archive user.
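
To give a feel for the convert-to-XML-at-ingest strategy, here is a small Python sketch that turns CSV text into a plain XML tree. It is purely illustrative: the csv_to_xml function and the table/row/cell element names are assumptions made up for this example and are not the MIXED schema or tooling.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(csv_text):
    """Convert CSV text into a simple XML string: <table><row><cell name=...>."""
    reader = csv.DictReader(io.StringIO(csv_text))
    table = ET.Element("table")
    for record in reader:
        row = ET.SubElement(table, "row")
        for column, value in record.items():
            # Keep the column heading with every value so each cell is self-describing.
            cell = ET.SubElement(row, "cell", name=column)
            cell.text = value
    return ET.tostring(table, encoding="unicode")

print(csv_to_xml("site,count\nTampere,21\n"))
```

The point of an intermediate format like this is that it can be read without the original software, and converted onwards to whatever format a later user asks for.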

[tags: Data Curation, Data Preservation]


DataShare final deliverables

2009-06-09T18:10:00.004+01:00

The official project period has ended, though follow-up activity continues. It may be worth noting some recent deliverables:

May, 2009: DISC-UK members along with Ann Green (Yale) and Gail Steinhart (Cornell) gave a half-day training workshop - Data Requirements and Digital Repositories - to twenty-one data professionals at the IASSIST/IFDO 2009 conference in Tampere, Finland, based on the recently published Guide (see below).

May, 2009: Policy-making for Research Data in Repositories: A Guide is available for download (Adobe PDF). The guide is intended to be used as a decision-making and planning tool for institutions with digital repositories in existence or in development that are considering adding research data to their digital collections.

May, 2009: DataShare Final Report now available in JISC Repository. Executive Summary also available separately [PDF].

May, 2009: Two papers were contributed to the most recent edition of IASSIST Quarterly: the project manager summarised the work of the DataShare Project; Luis Martinez-Uribe reported on the requirements gathering exercise on researchers' needs at Oxford.

April, 2009: The project manager gave an invited paper - Lessons Learned from the DISC-UK DataShare and Data Audit Framework Implementation Projects - at the Digital Curation Practice, Promise and Prospects (DigCCurr) conference in Chapel Hill, North Carolina, 1-3 April.

Many thanks to everyone who got in touch with us during the project. It's been real!

My Faves for Wednesday, April 01, 2009

2009-04-02T01:38:00.000+01:00

Curating Scientific Web Services and Workflow (EDUCAUSE Review) | EDUCAUSE

Interesting article about the curation of workflow and the embedded data pipelines - implications for the intractable research data/instrumentation/algorithm conundrum

[tags: research data, curation]


My Faves for Tuesday, March 17, 2009

2009-03-18T00:38:00.000+00:00

UK Research Data Service (UKRDS) International Conference

A succinct summary of the outcome of the UKRDS (UK Research Data Service feasibility study) meeting on 26 February on Neil Beagrie's blog, with links to the executive summary report and presentations from the international set of speakers.

The full event was also blogged extensively by Chris Rusbridge at http://digitalcuration.blogspot.com/2009/02/ukrds-conference-1.html, (see also continuation posts 2, 3, and 4).

Andy Powell's blog, http://efoundations.typepad.com/efoundations/2009/03/a-national-research-data-service-for-the-uk.html, summarises his critical live twittering of the event and includes a number of comments by others.

[tags: website, report, service, data management]


My Faves for Tuesday, March 10, 2009

2009-03-11T00:38:00.001+00:00

The Guardian's Data Store

The Guardian have compiled a Data Store whose aim

“is to make important data more accessible to people.”

It consists of a set of links to data and statistics pertaining to a range of contemporary subjects including: Migration, Education, Health (UK and beyond), Military, Politics, Unemployment, Finance.

Within each data page there’s a link to a Google spreadsheet where you can see, download and manipulate the data.

The accompanying Datablog provides an avenue through which such statistics and data can be discussed.

[tags: data sharing]


Research data into Fedora at Oxford

2009-03-06T10:45:00.015+00:00

The JISC funded DISC-UK DataShare project in Oxford has brought together several units within the collegiate University: the Oxford University Library Services, the Nuffield College Data Library, the Oxford University Computing Services and the Oxford e-Research Centre.

This post looks into some of the work carried out by my colleagues in the Library to explore ways to manage research data into Fedora. These efforts are recounted in the blog of Ben O'Steen, Oxford Research Archive Software Engineer.

Some months ago Ben already provided an exceptional account of the challenges encountered when ingesting a research dataset into FEDORA. He described how he dealt with the modelling and storing of a phonetics dataset given to him on a DVD-R, containing around 600 audio files organized in a hierarchical structure.

In a more recent post Ben talks again about storing, curating and presenting research data. This time he focuses on tabular data and highlights the importance of capturing the implicit information (columns data types, table interlinks), keeping the original dataset as well as maintaining a version of the data in a well-understood format with a description of the tables in a machine readable way.
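
As a sketch of what capturing that implicit information might involve, the Python below guesses a type for each column of a CSV table and writes the result out as JSON alongside the untouched original. It is an assumption for illustration, not Ben's method: the infer_type and describe_table functions, the three type labels and the sample data are all invented here.

```python
import csv
import io
import json

def infer_type(values):
    """Guess the narrowest type that fits every value in a column."""
    for caster, label in ((int, "integer"), (float, "number")):
        try:
            for value in values:
                caster(value)
            return label
        except ValueError:
            continue
    return "string"

def describe_table(csv_text):
    """Return a machine-readable {column name: type label} description."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = rows[0].keys() if rows else []
    return {name: infer_type([row[name] for row in rows]) for name in columns}

# Invented sample table standing in for a researcher's spreadsheet export.
sample = "speaker,duration_s,takes\nA01,1.5,3\nB02,2,4\n"
print(json.dumps(describe_table(sample), indent=2))
```

A guessed description like this is only a starting point; the researcher who made the table is still the best source for what each column really means.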

This post also identifies a gap in institutional and departmental IT support for those researchers needing to store tables of data and suggests HBase as the type of basic service that could be provided to avoid the free-form tabular datasets as well as to educate researchers.

All this work has been taking place in parallel to the scoping study I have been conducting in the last 15 months to scope the requirements for services to manage and curate data. This project is, like DataShare, finishing at the end of March but there will certainly be more data management and curation related activities in the University of Oxford.

A Repository is not a Bookshelf!

2009-02-25T09:36:00.001+00:00

JISC Start Up & Enhancement Projects Training Event: Embedding Repositories, University of Lincoln, 10th February 2009

I attended this informative and stimulating event at Lincoln University on 10th February. The programme of presentations over the course of the day offered both practical strategies and food for thought as to how the embedding of repositories in our various institutions might be achieved. A brief account follows.

Julian Beckton’s presentation of the Lincoln Repository of Learning Materials (LIROLEM), highlighted the importance of ease of use, specifically through appropriate key wording/tagging of records. He acknowledged the necessity of persuading academic colleagues of the benefits and value of repositories by means, for example, of departmental ‘champions’. Institutions also needed to ensure that they maintained a high profile for their repositories.

UKOLN’s Stephanie Taylor spoke about the need formally to establish repositories both within mainstream scholarly communication and institutional policies.

Sally Rumsey, Project Manager of the Oxford University Research Archive, also highlighted the importance of the visibility and accessibility of repositories, advocacy to ensure their use in the first place and good statistics gathering as to how they are being used thereafter.

Lucy Keating, E-repositories Project Officer at the University of Newcastle, led an enthusiastic and inspirational afternoon session. She advocated a single access point for all research-related information, such as the My Impact Research Information Service currently being developed at Newcastle. She also emphasised the importance of forming links with the Research Excellence Framework, highlighting the institutional value of repositories and persuading academics that their research outputs are of much greater use in a repository than on their PCs! We learned of a ‘carrot’ at one institution whereby the annual research report is generated by its repository; if one’s research is not in it, it is, quite simply, not reported!

SHERPA’s European Development Officer, Mary Robinson, looked at the IR on the international stage and we learned that there are currently 1330 repositories in 1013 countries, most of which are in Europe. She introduced us to the DRIVER Project which aims to facilitate and support worldwide repository development. While Mary echoed the earlier themes of strong advocacy and visibility, she also drew our attention to SHERPA’s guide on how not to do it!

The key messages I took away from this event and mulled over at the end of the day on the long journey North were the importance of embedding repositories within scholarly communication, the need to ensure institutional support in making them part of everyday academic practice, the requirement for strong advocacy in demonstrating their benefits and maintaining their visibility and, absolutely essential, making them easy to discover and use. In this last respect a strong image which I took from the day was that contained in Lucy’s statement that “a repository is not a bookshelf!”

Anne Donnelly
DataShare Project Officer

Data Walkabout 7: Melbourne

2009-02-19T13:51:00.014+00:00

My last Data Walkabout stop, Melbourne, coincided with both the Australian Open and a 44C/110F heat wave (but preceded the terrible bush fires in Victoria). Sam Searle, Data Management Coordinator for Monash University Library, was my highly organised host (pictured). She not only arranged a sell-out seminar for me at Monash, but also a lift to Clayton campus with Peter Mathews (Monash University Library Planning Executive) and another back with Gaby Bright (eResearch Communication, VERSI) in time for a full afternoon of meetings at the University of Melbourne. (Considering that train tracks were buckling from the heat, I was very grateful for the escorts!)

The seminar (slides & podcast) led to a lot of thoughtful questions: how to determine data quality and value, how far should institutional data policies go, would we be doing more data audits at Edinburgh, are there services for data documentation, what licensing should be used for data access, how much is data downloaded or re-used, and how could the 'new role' of data librarian (in reference to Alma Swan's report) work with liaison librarians to deliver data management services across the university?

Afterwards I was invited for a sandwich lunch (indoors, thank goodness!) with colleagues from the Library, the eResearch Centre at Monash and ANDS - Monash being the lead partner on the Australian National Data Service. While we lunched, Sam gave a presentation on her role and the Library's activities in data management. As a coordinator, she provides the Library's interface with other university services and contact librarians (akin to liaison librarians). Her work revolves around five themes, which are borrowed from ANDS: 1) Communications, advocacy and outreach, 2) Policy and planning, with oversight by the Research Data Management Subcommittee and Advisory Group, 3) Data management in practice: working with early adopters and the eResearch Centre, 4) Skills and expertise - for early career researchers and postgrads, but also for contact librarians, and 5) Leadership and collaboration. She took inspiration from Martin Lewis' Library Data Pyramid (presented at the keynote at the 2008 DCC conference, reproduced above), but what impresses me is that the library at Monash is active in all areas in the diagram.

Then I heard updates from around the table, first from Paul Bonnington, recently departed from the University of Auckland to lead Monash eResearch Centre. Then Anthony Beitz, Technical Manager of the Centre, filled me in on a number of innovations. LaRDS is a Large Research Data Store of 1.3 petabytes, which researchers can access from a desktop via Novell or NFS (network file system). Applications for collaboration include Sakai and Confluence (enterprise wiki). The ARCHER set of eResearch tools is customised to the needs of crystallographers, but is designed to be generic for different points on the scientific workflow - from data capture from scientific instruments, to managing and analysing data, and on to collaboration. These are open source and available to be adopted. Again, I heard the merits of Mediaflux, developed by a Melbourne-based company, as an XML-based digital asset management system to store & view still and video images.

The Centre provides other solutions for data management including cloud computing. (In cloud computing, users pay to move data in or out of the cloud, but pay nothing to analyse it.) The Library's institutional repository could still provide the means of publishing data: for example the Fedora repository may hold a metadata record and a permanent identifier, linking to the data in the cloud (Amazon or an equivalent). This would help address issues such as university branding. A similar method is envisaged for linking to data in LaRDS.
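The linking arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Monash's Fedora repository is actually configured: the class names, the identifier and the storage URLs are all made up for the example. The point is that the repository holds the record and the permanent identifier, while the data itself can sit (and later move) somewhere else.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record held in the institutional repository."""
    identifier: str     # permanent identifier (illustrative value, not a real handle)
    title: str
    publisher: str      # the university, so its branding stays on the record
    data_location: str  # where the data actually lives (cloud, LaRDS, ...)


class Repository:
    """Holds records keyed by permanent identifier; the data stays elsewhere."""

    def __init__(self):
        self._records = {}

    def publish(self, record):
        self._records[record.identifier] = record

    def resolve(self, identifier):
        """Return the current storage location behind a permanent identifier."""
        return self._records[identifier].data_location

    def relocate(self, identifier, new_location):
        """Move the data without changing the identifier that people cite."""
        old = self._records[identifier]
        self._records[identifier] = replace(old, data_location=new_location)


repo = Repository()
repo.publish(DatasetRecord(
    identifier="hdl:0000/example-1",          # assumed identifier scheme
    title="Example crystallography dataset",
    publisher="Example University",
    data_location="https://cloud.example.org/bucket/data.zip",
))
repo.relocate("hdl:0000/example-1", "nfs://store.example.org/data.zip")
```

Because citations point at the identifier rather than the storage location, the data could move between a cloud provider and a local store without breaking any links.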

Then David Groenewegen updated me on ANDS. These are early days but they are testing out their ideas in real situations - particularly through the crystallographers' TARDIS project. They are still building up a team - branching out from Monash University and the Australian National University (ANU) to have staff in every Australian state. He explained the ORCA registry, middleware that generates web pages (for Google to index, say) about datasets, names, subject area, and institution - generated automatically with hyperlinks and permanent identifiers. I asked about the issues of a name authority: People Australia, from the National Library of Australia, assigns a unique ID to authors and individuals as subjects. Since some authors do not appear in monographs but only in serials, ANU has developed a workaround for identifying names of people - some pages still have to be added by hand.
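To make the idea of automatically generated, indexable pages concrete, here is a minimal Python sketch. It is not ORCA's code or data model: the field names, the sample record and the identifier URL are assumptions chosen for illustration. It simply turns one registry record into a plain HTML page with a hyperlinked permanent identifier.

```python
from html import escape

# Fields shown on each page; the keys are assumed, not taken from ORCA.
FIELDS = (
    ("creator", "Name"),
    ("subject", "Subject area"),
    ("institution", "Institution"),
)


def render_dataset_page(record):
    """Render one registry record as a small, crawlable HTML page."""
    title = escape(record["title"])
    link = escape(record["identifier"])
    items = "\n".join(
        f"  <li>{label}: {escape(record[key])}</li>" for key, label in FIELDS
    )
    return (
        f"<html><head><title>{title}</title></head>\n"
        f"<body>\n<h1>{title}</h1>\n<ul>\n{items}\n</ul>\n"
        f'<p>Permanent identifier: <a href="{link}">{link}</a></p>\n'
        f"</body></html>"
    )


page = render_dataset_page({
    "title": "Example crystal structures",
    "creator": "A. Researcher",
    "subject": "Chemistry & Physics",
    "institution": "Example University",
    "identifier": "https://id.example.org/dataset/1",  # made-up identifier
})
```

Escaping every field before it goes into the page matters here because the records come from many institutions and cannot be assumed to be clean.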

A challenge ANDS faces currently is how to work within disciplines, as well as institutions. Collaborations take place globally, so where there are existing disciplinary-based data sharing mechanisms, ANDS intends to adapt to those interfaces. In working with institutions, the main challenge is building capacity. Universities have signed up to the Australian Code for the Responsible Conduct of Research, but there's not necessarily sufficient infrastructure in place. ANDS' sister project, ARCS, is one answer, and has funds to build a nationwide 'data fabric'. ANDS is considering providing institutions with a 'repository in a box' via SRB/iRODS. Seeding the Commons continues to be their motto - now they just have to give it a go.

Later on at the University of Melbourne, Simon Porter, Information Manager (Research) from the eScholarship Research Centre, demonstrated the Find An Expert system, which contains contact details, projects, and publications of all academic staff. He is working with the Library and the Research Office to streamline the flow of research information into the repository, OPAC, and the web directory. Simon strongly believes staff should not have to enter information that already exists elsewhere. This ethos, combined with an opt-out policy, means the system is information-rich without staff ever even seeing their own web pages. Simon has an engaging way of explaining his work, such as this paper for a forthcoming Australian Educause conference, A ‘Facebook’ for Research.

Donna McRostie, Director, Information Management, invited me to a discussion with the Discipline Librarians group meeting in the late afternoon, and Jenny Ellis (Director, Scholarly Information) kindly escorted me across campus and out of the scorching heat to find the room. The VERSI team (Victorian eResearch Strategic Initiative), whose meeting kept getting pushed back later and later in the day, treated me to drinks after 5 instead, for further data discussion. I'm afraid I didn't take notes, but many thanks to Gaby, Simon, Ann Borda, A.B.M. Russel and Lyle Winton for an interesting and fun evening! Also to Ross Wilkinson, Executive Director of ANDS, for meeting me for a coffee the next - my last - day, and to Helen Hayes, Knowledge Transfer Director, for lunch. It was great to see Helen again; I had last known her as the Vice Principal of Knowledge Management and Librarian to the University of Edinburgh. It is, as they say, a small world after all.

JISC Developer Happiness Days

2009-02-16T11:45:00.007+00:00

I attended the JISC Developer Happiness Days in London this week, an event organised by JISC to bring together developers and users of education software to exchange ideas and learn some new technologies. The very first Lightning Talk I attended on Tuesday was Paper Prototyping, a user-centred approach to graphical user interface design using sketches, post-its, paper and scissors. The approach is simple and common-sensical, and it appealed to me because of my personal belief that an ingredient of successful software projects is a high level of interaction between users and developers/designers. It provoked a lot of discussion among the developers I spoke to afterwards about whether it would be useful in their projects, and it got me thinking about DataShare and other JISC-funded projects I have been involved with as a developer in the past year.

The Edinburgh DataShare repository to date has struggled to attract users, a situation common to many institutional repositories, it would seem. However, to my mind, with repositories the most important users are the repository manager and community-level administrators (interaction by ordinary end users, who submit items, is brief and probably not so important). Admins are not only site/community administrators but are usually heavily involved in the ingest process (depositing orphaned items, for example), so their usage tends to span the whole application. They are the users that will really suffer when the usability of a system is poor. For this reason they should be central to the design process.

At the event, I also heard the view that stakeholders, not end users, are the key to successful projects - keep the managers happy and everyone is happy. Certainly, the stakeholders should be involved in defining what the system should do, but if they won't be using the system their input on how the system is implemented is not so useful. On the Tuesday there were 'UberUser' sessions, where students, lecturers, researchers and administrators could talk about existing application problems (that developers could potentially work on for the Hackathon competition). Combined with paper prototyping this seems a much more sensible approach.

On the other hand I heard the following view expressed at one of the lightning talks: "Don't ask users what they want, ask them what problem they would like solved." In other words, leave the implementation to developers/designers and people who understand web interface usability methodologies. Furthermore, at the repository community meeting on Thursday the question was asked why current repositories are simply digital versions of a library (i.e. not Web 2.0): "We did what the librarians asked," was the response.

As I am not a usability expert, and don't have particularly strong opinions about how GUIs should look and behave, I would personally feel uncomfortable with this approach. In any case, if the librarians that attended the Repository Fringe in Edinburgh last year are typical it would seem that if librarians and repository developers got together for a few paper prototyping sessions now the repository world would look a lot different.

Data Walkabout 6: Brisbane, University of Queensland

2009-02-08T16:47:00.006+00:00

From the city centre, it is a pleasant and fast ferry ride up the Brisbane River to the University of Queensland. This next Data Walkabout stop gave me the chance to chat with the dynamic Belinda Weaver (at yet another outdoor campus cafe). Although she's on secondment and not currently working on the institutional repository, my impression is that she's accomplished so much already she could be allowed to take a break.

I inquired about the institutional survey which she initiated and Margaret Henty expanded to other universities, Investigating Data Management Practices in Australian Universities. The outcomes provide a baseline of evidence at each participating institution but, like so much else in Australia, they don't stop there, but take action to foster change.

The status quo for data management amongst researchers was - perhaps depressingly - found to be much the same as that in the UK (through SToRE, DAF and other surveys): often a junior researcher is put in charge, there is no standard practice, there aren't rewards for doing things well, and few consequences for doing it poorly. Often the problem arises only after something goes wrong and data are lost.

Problematically, universities have not seen data management as a responsibility nor something for which they need to provide services. As a direct result of this survey University of Queensland has put data management/loss into their overall risk strategy. Belinda believes a risk management approach is a powerful way to influence institutional senior management to support proper data management.

As we were sipping our coffee, Christiaan Kortekaas was walking by, and Belinda waved him over. Christiaan is the inventor of the Fez open source interface to the Fedora repository software for the University of Queensland Library, which is a competent rival to proprietary solutions. It's quite flexible, and can offer different metadata schemas (e.g. MODS, Dublin Core, etc.) and a variety of classification schemes.

As for libraries, Belinda saw multiple roles (her secondment replacements are pursuing this now). Although data management support is often in no one's job description, it is commonly repository managers who fill this void - perhaps due to the rallying encouragement of APSR, the Australian Partnership for Sustainable Repositories. Specifically, librarians could provide support by describing data structures (metadata & documentation); providing training and templates for data management; writing data rescue case studies; and preparing exit plans for data producers leaving the university. Whereas research offices tend to focus on new grants and fostering collaboration, and IT services on servers and cost recovery, libraries are in a unique position to help researchers find relevant tools and technology (Web 2.0, etc.) to enhance their research - just as they help them find the published literature. She even thinks that librarians should be based with faculty rather than all in the library building itself, so they can be part of the team. This has worked well, for example in the hospital, where librarians work alongside clinical researchers.

Belinda emphasised that researchers are not necessarily aware that librarians 'know stuff' about tools and technologies, so advocacy is needed. Her publicity poster urges staff and students to "join the growing number of UQ academics and researchers" who are preserving their digital research material with UQ eSpace. Smiling faces of people provide an 'imagine' scenario about materials they can deposit and the implicit benefits of doing so, making sharing research output seem the most natural thing in the world.

Here's a wee gem from Belinda: because repositories are a new service, people don't realise the huge potential. If researchers think they don't need libraries, then adding value to the research chain is vital. So: libraries should be re-purposing themselves around repositories.

Data Walkabout 5: Brisbane, QUT

2009-02-05T17:54:00.016+00:00

I was very pleased that Paula Callan, e-Research Access Coordinator at Queensland University of Technology (QUT) was available to meet me next (pictured with me). I have met Paula twice before: at Edinburgh on a study visit of her own and at the OAI5 conference at CERN in 2007, so I know she is switched on to both repositories and data management issues. QUT EPrints has been running for five years, and over 1500 academics are regular self-depositors. Paula was responsible for much of this success, though QUT had a huge advantage that Professor Tom Cochrane, an Open Access advocate, pushed through the institutional repository and supported it from the top as Deputy Vice-Chancellor (Technology, Information and Learning Support).

Each time I meet Paula she fills me in on Australian developments, such as the now-completed Australian Research Repositories Online to the World (ARROW) project which coordinated the efforts of Australian institutional repositories; the Australian ResearCH Enabling enviRonment (ARCHER) and its open source toolset; Australian Research Collaboration Service (ARCS) which features a data storage service to provide a national 'data fabric'; Online Research Collections Australia (ORCA) - an online registry of Australian research collections, and the Australian Code for the Responsible Conduct of Research. This last one is key to institutional responsibility for research data management because compliance is mandatory to receive research funding. It states that institutions are not only responsible for providing "safe and secure" storage facilities for data, but that there must be a policy on retention, ownership, and access to that data at an institutional level.

Another driver for Australian institutions is the Australian Research Council Funding Agreement which, for certain research grants, requires that research outputs - including both data and publications - be lodged in an institutional or disciplinary repository within 6 months of completion.

I learned more from Carolyn Young, Associate Director, Library Services (Information Resources) who kindly made time to speak with me before my talk to Library and eResearch support staff on DataShare and the Data Audit Framework projects. She and Joe Young, Manager of High Performance Computing, developed the Research Support Plan to comply with the government policies above, e.g. how to implement them, and how to enhance support services for research and data management.

Included in the plan are: a research data management policy; templates for funded research data management plans; a training programme for researchers; an organisational model that utilises staff efficiently for new services; and a data store for all departments (not just the heavy data users). They're looking inwards, i.e. at the OAK Law Project that has data expertise to contribute, and outwards, for example at Monash University Library's Research Support Plan. They plan to contribute to the ORCA data registry and help to seed the ANDS Data Commons. Paula also introduced me to Joe's colleague Lance Divine, who told me about technology they use to help research projects visualise and manage their data, such as Mediaflux and Plone.

Speaking of the OAK Law Project (Open Access to Knowledge) I met with Kylie Pappalardo, who explained the cutting edge work they do with open data and the Australian Creative Commons. We had an interesting discussion about whether open data should be licensed via the Creative Commons attribution-only license (OAK Law's opinion), or dedicated to the public domain only to avoid attribution stacking and other barriers to re-use (a view held by John Wilbanks of Science Commons). Since coming back to Britain I see that Rufus Pollock from the UK-based Open Knowledge Foundation has weighed in on this debate.

Kylie sent me away with a copy of their Practical Data Management: A Legal and Policy Guide, a useful resource although it is based on Australian law and practice. For example, in Australia the quintessential 'telephone book case' was settled in favour of the data collector, so data can have copyright (because effort matters, just like creativity - unlike the decision in the USA). But of course Britain has the Database Directive outside copyright law, so here too there is potential for IPR in data, though this has not been tested much in court.

See Data Walkabout (1) for further context about this post.