DataShare Blog: July 2009

Friday, 31 July 2009

Repository Fringe 2009: Closing Plenary and Thanks

Peter Burnhill is introducing our last session of the Fringe with thanks and acknowledgments of organizers and supporters. He has also paid tribute to Rachel Heery who was a friend and colleague of many at this event and who sadly passed away last week. A tribute to her from her colleagues can be found on the UKOLN website.

Finally Peter has handed over to our Closing Plenary by Clifford Lynch, Director of the Coalition for Networked Information (CNI).

Clifford arrived this morning and has been able to attend some very useful Round Table sessions and he's been reading up on the previous sessions so there will be some reflections on what's been said in the last few days in his talk.

Cliff opened by saying that, as some of us will know, he conducted a low profile reconnaissance of the scene here about 10 days ago: one of the things I learned was that Edinburgh has been a hotbed of repository development since the 18th century.

If you go to the records office you can get a copy of a document about the building of General Register House called "A proper repository"!

Today I want to talk about four major areas at length:

Repository services
Repository services in the broader ecosystem. I think we are also challenged to understand the scope and ambitions of what we call repositories
Repositories and the life-cycle of what goes in repositories. There are some interesting questions there that we're probably not thinking about quite enough.
And finally I wanted to talk about selling or building support for repositories. It's not really the right heading. Some of the things I want to talk about are support, some are about the purpose of repositories and there are some other thoughts in there.

I want to start by commending Sally and Ben for their talk which, based on the blog, touched on many of the issues I've been thinking about and talking about and seeing as key issues for repositories for some time. I'll have a few things to say which will lead on from their talk but I think there is a great deal of abiding knowledge in that talk and it is a great picture of where we are now and where we might go.

Repository Services
One of the things I found so interesting about the discussions this morning has been the distinction that many people seem to be making between data repositories and what they call repositories of scholarly outputs, e-prints or similar terms. They are perceived, with some justice, to have some different characteristics and behaviors. You could spend a lot building a far bigger repository than you would ever need for e-prints: people just can't write that fast. But when you get other types of data in there the storage is consumed rapidly and you hit real policy issues about curation and the rationing of storage.

I think in the States people often think that a repository can do everything. It's interesting to me to turn it around and talk about e-print repository services, data repository services: essentially thinking of suites of services rather than the underlying implementation, servers, storage etc. You get a very different view of needs when you see it that way. When you look at data we may see repositories that range from generic data to repositories that deal with specific genres of data in particular areas. We already see some of the latter in subject repositories. In molecular biology for instance most of the repositories just take very very specific types of objects in specific formats that are disciplinary in nature. Stressing the emphasis on services here is quite helpful.

Repositories in the Ecosystem
Let me say a bit about repository services as part of a bigger and much more complex ecosystem. There was an International Repositories Workshop that JISC and others sponsored in March in Amsterdam that got into this a little. Repositories connect to things that provide support for e-science, workflow management, high performance storage and computing, etc. What we have to be thoughtful and careful about is the scope of what goes on in a repository. I've seen people say that we should put repositories in the workflow for e-science. I've seen people put them in the context of high volume, high reliability storage. I'm not sure we want our repositories to be that volatile. I'm sure that there is a space for lots of high quality storage facilities to support academia doing IT-enabled work. But is that the space for repositories? Finding the boundaries will be very delicate and very messy.

Also as we put more kinds of things in repositories we have to look at how much of a use or computational environment that repository becomes.

For example what happens if you want to put video in a repository. If I have a big video file I can deposit it into a conventional repository and I can take it out of the repository and use whatever tools I have to view it. On the other hand I might deposit it into a smart video repository that will fix the format, calculate the bandwidth needed to view (and perhaps adjust the format of delivery according to users' bandwidth). That would be a complex repository tailored to the types of objects deposited.

But what about objects like videogames? Do we embed a videogame environment in the repository or do we leave it to the user? These issues have a lot of implications for the development and ongoing support costs of repositories. As we put data or e-prints into repositories we have to work out how much computational work you can do in the repository. Can you do complex or only basic computational work? Or do you take data out to do computational work on materials? These issues get at how you build interfaces, how semantic they are, how flexible they are.

We are currently very sloppy about replication and the semantics of replication. We have rule-based processes about submissions for instance and the idea of propagating to multiple repositories - this seems sensible and reasonable. We also know replicated geographically distributed storage is a good thing. We have a suspicion that not going it alone on storage is a good thing - it's better to share and collaborate in preservation and curation. How replication for reliability and propagation to multiple collections will work still needs a lot of work, similarly provenance and version tracking. This is an area that requires some proper consideration. There is also an issue of storage and the cloud.

Further there is the issue of software. People seem comfortable with the concept that software has a place of some sort in repositories, but if we think of repositories in the wider information ecosystems, people have been thinking about replication and preservation already. Software designers have their own repositories of codes and versions and it's unclear to me how our repositories relate to these.

The last ecosystem point I want to make - and it was discussed at some length in a round table earlier - is how we connect repositories and especially institutional, but also disciplinary, repositories into the new name environment that is emerging from publishers, Web of Science, journals etc. You have a number of identity management and federation changes that all result in IDs being created. All of these things are at play here. Essentially IRs set you up for the challenge of doing name authority for your local community, but in the context of things like identity management.

One of the fascinating byproducts of a recent workshop was finding that institutions view your name as being your name according to humans and payroll services. That name may change occasionally. They have not gone as far as saying that names may not be in roman characters anymore - that what appears in roman characters could just be an extra name version or transliteration of their name (the original may be in Chinese characters for instance). There is some capacity for name change aliases but it hasn't occurred to them that people publish under versions of their name that may not be what is on their paycheck. People are not always consistent and there is little provision for "literary names" in university identity management. We'll regret that soon if not already. We need to think big about names. I am becoming more and more convinced that personal names are a really important part of scholarly infrastructure and we'll see some fascinating convergence of genealogy databases and ID databases into review and scholarly literature. We have the strong component of faculty biographies that are linked to scholarly work already and all of that integrates into giant biographical and factual networks. I use factual to differentiate from mental state and influences. I mean the dull facts like jobs held, papers written, very factual sorts of data. I think there's a really big set of developments here that connect into repositories. It would be very helpful to sit back and think hard about this.
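To make the name authority point a little more concrete, here is a minimal sketch (mine, not something shown in the talk) of a local record that keeps the payroll name alongside transliterations and "literary names". The class names, the `local_id` format and the example person are all hypothetical, and a real service would need far more careful matching than lower-casing and whitespace folding.

```python
from dataclasses import dataclass, field


@dataclass
class PersonRecord:
    """One person as a repository might know them (hypothetical shape)."""
    local_id: str                       # assumed institutional identifier
    payroll_name: str                   # the name HR / payroll holds
    name_variants: set[str] = field(default_factory=set)  # pen names, transliterations

    def all_names(self) -> set[str]:
        return {self.payroll_name} | self.name_variants


def normalise(name: str) -> str:
    """Fold case and collapse whitespace so trivial differences still match."""
    return " ".join(name.casefold().split())


class NameAuthority:
    """Maps every known form of a name to the local IDs that use it."""

    def __init__(self) -> None:
        self._index: dict[str, set[str]] = {}

    def register(self, person: PersonRecord) -> None:
        for name in person.all_names():
            self._index.setdefault(normalise(name), set()).add(person.local_id)

    def lookup(self, name: str) -> set[str]:
        # A set is returned because two people can share a published name.
        return set(self._index.get(normalise(name), set()))


authority = NameAuthority()
authority.register(
    PersonRecord("staff-0001", "Wei Zhang", {"张伟", "W. Zhang"})
)
print(authority.lookup("张伟"))        # matches the original-script form
print(authority.lookup(" w. zhang "))  # matches the published form
```

The design choice worth noting is that the payroll name is just one variant among several, rather than the key everything else hangs off.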

There's one other spanning service that we talked a little about in the grand challenge session. And that's the issue of annotation and annotation of information resources across data and repositories. Herbert Van de Sompel has got funding to work on this lately and Zotero is doing some interesting work here. There will be some interesting findings to tie into scholarly dialogue.

Repositories and the life-cycle of what goes into repositories
Repositories to date have been about getting stuff in, making it accessible. We are not far in enough for stewardship and preservation and management of what is in there to be our major concern but it is important that we don't forget about these. It will be important over time. So for example how much specific disciplinary knowledge do you need to classify a specific piece of scholarly knowledge? The depositor knew and understood these things. If there is a serious issue the user can contact the depositor. Short term that's fine. Very long term the contributor may be dead. You may have to make assessments of whether it's worth keeping the data set in light of subsequent developments. You may have to look at formatting and presentation and linking of data depending on what else is published and occurring. The equation really changes when you look at the near term versus the long term.

I also think it's helpful to move away from the notion of preserving things forever. It seems more helpful to say "my organisation will take responsibility for preserving this for 20 years. After 15 years we will reassess if we keep it; whether the responsibility will transfer to another party; or whether perhaps we discard it at the end of the period". This is a much more structured kind of thing than just saying we'll keep it forever. These more realistic timescales are probably a very good way to structure arguments about preserving items in repositories right now.
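As an illustration only, the kind of bounded commitment described above could be written down as a small record with explicit dates. This is a sketch under my own assumptions: the class name, the status labels and the example acceptance date are hypothetical, and the 20 and 15 year figures are simply the ones quoted in the talk.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PreservationCommitment:
    """A time-limited promise to preserve one deposited item (hypothetical)."""
    accepted: date
    commitment_years: int = 20      # "responsibility for preserving this for 20 years"
    review_after_years: int = 15    # "after 15 years we will reassess"

    def _plus_years(self, years: int) -> date:
        try:
            return self.accepted.replace(year=self.accepted.year + years)
        except ValueError:
            # 29 February has no counterpart in a non-leap year; use the 28th.
            return self.accepted.replace(year=self.accepted.year + years, day=28)

    @property
    def review_due(self) -> date:
        return self._plus_years(self.review_after_years)

    @property
    def commitment_ends(self) -> date:
        return self._plus_years(self.commitment_years)

    def status(self, today: date) -> str:
        if today < self.review_due:
            return "committed"
        if today < self.commitment_ends:
            return "under review"   # keep, transfer to another party, or plan to discard
        return "commitment ended"


item = PreservationCommitment(accepted=date(2009, 7, 31))
print(item.review_due, item.commitment_ends, item.status(date(2025, 1, 1)))
```

Writing the dates down like this makes the reassessment a scheduled event rather than something that only happens when storage runs out.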

Repositories in the bigger economic environment
I note a couple of things. Firstly I have a feeling that we have to get a lot more serious about the idea of "do what you can with bulk ingest now and deal with it later". There are a lot of materials coming at us now that we are not prepared for. We have to do our best to take them in and pass them on. In the current economic climate corporations of long historical status are evaporating and some sort of memory organisation needs to take on their corporate records and archives. At least in the States some government records - especially at local levels - are in peril and we have to think about sudden onslaughts of material that are not covered by "give us £50 million for this work" and that instead need instant, very low cost action.

I'm struck by the question of the extent to which we are trying to market services to a user community rather than promoting collections created by the services. I think we present both sides of that in a confusing way. Are we doing something for our users as the service or is the value in the resultant collection? I think that ties closely to mandates for research data and open access - they favour the service side. But there is also the ever popular critique that "I looked in your repository and there's really nothing interesting there".

There is a thought I'd like to share: at the rate that the economic structure of scholarly communications is changing, and given the kinds of economic pressures at some institutions, repositories at the institutional or disciplinary level may take on far more importance far more quickly than we imagined: they may become the main access route to scholarly information. Given serials budgets and the pressures to channel money into data curation I can see a time when repositories are a primary - not a supplementary - access path to scholarly materials.

I will close with two fringey propositions.

The first I have been thinking about on and off for a couple of years. About how to make the case for repositories at an institutional level. It may be wrong to think of lots of transitional deposits, it may be that departments interact with repositories infrequently but on a big scale. Suppose we set up a programme for a distinguished academic so that we sent someone to collect scholarly materials and move copies of those into the repository in an orderly fashion creating a legacy of material from the subject honouring their work. I think that would be very attractive to scholars and could be seen as a real way to honour individuals and highlight institutional achievement. This may be a better way to gain support for a repository than chasing everyone for every paper.

Secondly IRs need to move from just being for academia and research centres. There may be a public variation, perhaps local repositories or interest based repositories, which may be separate from the academy but may connect in some ways. We need to think about how we connect to those and build integrated access to information that has scholarly impact for the long term, as well as making scholarly materials more available to the wider public.

I throw these ideas out for the future.

It's very striking how much progress has been made with repository deployment. The software still has rough edges, deployment strategy has rough edges but I come away from discussions like this with the feeling that there is very substantial momentum. And we need to use that wisely, especially when we think about where to put investment effort into repositories. I think that trying to do everything will dissipate momentum but there is enough momentum that many things are achievable at this point.

And I think that's where I should finish.

Q & A

After an enthusiastic response from the audience Peter Burnhill reflects on Cliff's talk. This is the second Repository Fringe - they are supposed not to be too formal. Peter asks that when we all blog about this event we should think about what is or is not working at these events. The idea we have is that we try new things and see how things can change etc. So on that note, how do we start the questioning? There are so many threads here to explore - we'll be teasing some of these out at the pub later I think but let's get that started here...

Q (Sheila Cannell): On the economic point, at the moment repositories are very tied up with scholarly economics and the relationships with publishers. How could the economic crisis lead to a sudden change? I can't quite see how this would happen. My fear is that we continue with a number of models not knowing what we should be doing.

A (Cliff): I think it's unlikely to change for all disciplines at once or in the same way. I am struck by just how bad the economic situation is in the States. There have been huge cuts to the library budgets. I live in California and the library and University financial cuts are astounding at the moment. I can imagine getting to a point in a few years where you might ask the high energy physics folks if they could pay for the physics archive or if they would rather keep the journal subscriptions. They'd obviously prefer both but pushed they'd probably go for the archive. Second tier journals will fall off subscriptions lists but from time to time they publish important articles and these become invisible if not accessible some other way. I'd like to think we'll see an explosion of overlay journals cherry picking from the content in IRs but I'm pretty sceptical about that at this point. My personal, and rather unpopular, view is that we will back off a lot on peer review in a lot of disciplines. I think the biomedical sciences will hang onto it but I think we have been profligate in using unnecessary manpower and increasingly the systems just don't work properly. We're not being strategic enough about where we do peer review and where we don't. My guess is that you'll see more direct distribution of scholarship through repositories or reconstituted university press organisations rolled into libraries than you've seen in the past.

Q (Ian Stuart): You had a comment on the peer review process. Is it the concept or the method that you think causes the problem?

A: There is a bunch of problems: the volume of literature; the person hours in the process; the tendency to send stuff out for peer review even when it is obviously trash or not good candidate work; there is a culture of materials being submitted, rejected and submitted to the next tier of journals so the same piece of work is reviewed multiple times. Can we afford these processes? If people did their fair share of peer review it would be something like 8 weeks per year out of someone's productivity.

Q (Les Carr): I was initially horrified by the idea that you think of the repository as a place to store work at the end of careers - a mausoleum concept crept into my head - but you could equally talk about the other end of careers and the proof of your value when trying to get tenure. Perhaps using repositories as the way to show the value of contribution is what I will take away to think about.

A: I think your choice of the tenure review as a potential intervention point is a really interesting one. At a well run institution you marshal transactional work, you not only look at your publications but you also talk about what you see happening in the future and you could represent that in a repository in a very useful way. You could even move tenure cases into the repository too. It is a really interesting idea.

Comment (Peter Burnhill): Two things you spoke about, archival and appraisal responsibility, are very interesting. And also that it is not just our stuff - necessarily the repository movement has been concerned with academia, with our stuff, but we need to look out to the wider digital world and representation of that for future scholars.

Response (Cliff): I think there is an interesting example here. The Library of Congress put a bunch of pictures on Flickr. What they thought they were doing was seeing if tagging is a useful retrieval mechanism. What was interesting was that they invited people to comment on (as well as tag) these photos. These were images of the history of American life - locomotives, airplanes etc. It turns out that there are people around who will tell you the entire history of a locomotive, and the maintenance manual, and the book they wrote on it, etc. just from seeing an image. When you put up these pictures all that stuff comes into play as surrounding documentation. We don't have a good place to house that right now. This is stuff of schizophrenic scholarly use. Scholars aren't much interested until it is useful for their research somehow. I think there's some really interesting stuff to look at here.

And with that further food for thought Peter closed the questioning there by thanking Cliff for an excellent summary of his excellent keynote.

And finally Peter handed over to Ian Stuart for the results of the Grand Challenge which took place earlier today.

Ian Stuart (EDINA), Balviar Notay (JISC) and Ben O'Steen (Oxford University Library Services) were the judging panel. There were 4 very different submissions.

The question asked for enhancements for a repository for the interest of the researcher. After much discussion we really liked what Patrick McSweeny had done. Basically it was a mash-up of information from the e-print repository and his university website and managed to produce a single information page about a specific researcher which was automatically kept up to date.

Congratulations Patrick!

And with that the 2009 Repository Fringe draws to a close.

Repository Fringe 2009: Afternoon Tutorial (Open Data)

Fresh from lunch, a brief chat with Clifford Lynch (he's been reading the blog en route to the event) and some overheard checking of cricket scores we're moving onto the afternoon Tutorial.

Jordan Hatcher (ipVP) & Jo Walsh (EDINA): Implementing Open Data

A chat on the work of the Open Knowledge Foundation, including CKAN and Knowledge Forge (led by Jo) and then an in-depth session on the legal side of open data, including the new legal tools available through OpenDataCommons.org, including a database specific copyleft license.

Jo opens with an overview of the Open Knowledge Foundation. Rufus Pollock started the Open Knowledge Foundation and it's entirely volunteer run. The idea is to look at what has been successful in free and open source software communities and take the most successful parts of that and apply it to open data and open knowledge. It's not a campaigning organization but very much a building foundation. It's all about how things can be reused. Various projects exist, some are about publishing data, some are standards type efforts, others are exemplars and pilots. There are four principles that characterise an open approach to knowledge management:

Incremental development - not one big master release.
Development is decentralized.
Development is highly collaborative.
Development is componentized - so data is in modules and packages that can easily be reused.

For example CKAN (Comprehensive Knowledge Archive Network) is very much about a component type model for maintaining knowledge. The volumes of data involved can be a problem though so we're now looking at something called the Open Grid to find ways to deal with major usage of shared open knowledge.

You hear a lot about "Open" but what does that actually mean? People call their data open or free but how do these principles apply to open data? Rather than starting from a licence one of the first moves was to build an open knowledge definition. So these specifically are less restrictive than some Creative Commons licenses.

Jo hands over to Jordan, who is going to talk about how these principles and licenses relate to other projects. Jordan is an IP lawyer as well as being a director of the Open Knowledge Foundation. A project that has been worked on with Edinburgh University is Open Data Commons. Jordan will talk about various aspects of this work.

The origins of Open Data Commons came from Talis. They are part of the semantic web community and open innovation. They knew that open data was important for access to knowledge community (e.g. Open Government). In the sciences data sharing is particularly important as the cost and effort of data is huge (for example space exploration is far too costly to replicate). Talis knew that IP relates to databases so the importance of data + the legal restrictions on data = a need for a legal solution.

So, copyright is really 3 things. It covers a lot of stuff (films, buildings, programs, etc.) although it doesn't cover facts BUT it is an ownership restriction and licenses are how you unpackage what you own and what restrictions you put on sharing work with others. So a database is a blend of schema, tables and data entry/output sheets. And that sits in database software. Some material will be copyright of the software, some will be content agnostic. Copyright covers data, the database and the database software - each in different ways. Challengingly the database rights apply to both the database AND the database content.

So by Open Data Jordan is talking about applying open data to databases and data in databases.
Open means use, reuse, etc. Sharealike is ok (as per GPL or CC licenses) but when other open licenses were created there wasn't a lot of licencing work around open data. You can kind of think about it like a market adoption curve. Linux and the concept of open source software is mature. CC (Creative Commons) is in the early majority stage. Open data is at the early adopters stage.

The reason data gets its own category is that it is different from software and content both legally and in what you do with it. So, a word about Creative Commons licences:

Not all relevant rights are licenced (the database right in particular) in most CC licences.
CC licences are not consistent across database rights.
CC is not written for databases. For instance attribution can be different for databases.
ShareAlike is unclear when combining one work with another (collective work).
CC doesn't address software as a service.

Creative Commons don't really want you to use their licence for databases and have stated as much on the Science Commons website.

Open Data Content: Legal Tools
The Open Database Licence (ODbL) intends to solve these shortcomings of Creative Commons. It was completed quite recently. The licence has gone through 4 comment rounds including various legal experts across the world.

You use the database to produce a subset of data - a produced work. The ODbL works on both the database and the produced work. So out of the galaxy of IP rights it picks recognised licences in the agreement but in an open way. ODbL is a worldwide license (following the example of the GPL). It is a human readable licence - not a huge block of legalese.

Attribution wise ODbL requires copyright notices to stay with the database. Produced work requires a brief notice of the source and that it is licenced under the ODbL - this is effectively consumer protection. You will not discover an interesting produced work only to find that your access to view or work with the original data is blocked (or vice versa).

Share-Alike allows derivative databases but they are required to be under ODbL but you can licence a derivative differently to the original database.

Database Content Licence (DbCL) aims to licence the content of databases. It's at the content layer and covers individual copyright that might be present.

OpenStreetMap is considering moving from a CC licence to an OpenData type licence.

The Public Domain Dedication & Licence (PDDL) allows you to wholly hand over your copyright. Where this is not possible (some geographical territories) you instead license users to do what they like with your data. PDDL + Community Norms - these are complementary components to let you indicate expected behaviours for the database you have opened to the community. You can't sue someone for plagiarism. Your protection is the academic norms from social pressure and maybe an honour code. Jordan uses The Bluebook as an analogy. This is the standard system of citation - you can't get published in the US if your citation doesn't match up. Not a legal issue but a community norm issue.

Science Commons is based in Boston. They are the science end of Creative Commons. They were addressing science issues in about 2005 (but not very publicly). They came up with a protocol for implementing open access data with maximum flexibility without licencing hang ups. So this standard was developed online. The protocol is not a licence but sets a standard for licences for science data.

Comparing where we're at (this is a Venn diagram but I haven't grabbed a picture I'm afraid): OpenData Commons has ODbL and PDDL. PDDL overlaps with Science Commons. Creative Commons sits within Science Commons.

Closing Thoughts
There has been a suggestion of a rift between Creative Commons and the Open Knowledge Foundation. Both organisations agree that public domain databases are a good idea and that science in particular in the public domain is super. There are some licence details we don't agree on but broadly we're aiming at the same issue.

A little lesson to close on: if you look at the lunar landing video that is currently preserved you find that the quality is terrible despite the fact that high quality images were recorded at the time. NASA had had budget issues regarding the cost of storing data and (accidentally) recorded over the original in the 1980s. My point is that you should not let IP get in the way of preservation - sharing can be much safer.

Q & A

Q (Peter Burnhill): Was "Copyleft" a joke when it was thought up, is the concept useful at this point?

A (Jordan): The GPL has some fairly formal structure. We have evolved to version 3. It has a formal structure that includes the legal requirement to share. There is a rich history of jokey attempts to subvert the notion of copyright but Copyleft is fairly well established and understood in the IP world now.

Q: What is the difference in those licenses between derived database and produced work?

A: Produced work might be the contents of the field and not the field names or structure. There is a little cross over as the definition of the database in the licence is fairly broad. So there is some blurriness about whether it's a produced work or a derived work. This has come up in the OpenStreetMap context. This can be quite a technical issue to look at. The licence acts as a constitution - a guiding document on how you should apply things. If a community isn't served by the licence they can also come up with their own guidelines that sit with it.

Q (Robin Rice): Could you talk a little more about the norms and particularly how that could work in a repository environment? When I think of my typical users - researchers - they may be happy with the public domain licence if they don't think about it too much. But if they felt someone could make commercial profit out of that data they would be upset and maybe Copyleft would be better.

A (Jordan): So GPL is not automatically not commercial. A non commercial clause can cause confusion about what counts as commercial. Who thinks that anything done by a charity is non commercial? It's not clear cut. Some view non profit or charity usage as non commercial BUT the law generally sees the type of use not the type of user as key. People have different views. People generally licence under CC licences and it is up to them (not CC) to enforce the licence they choose. Educational use is an interesting issue too. Is a university non commercial? Are CPD courses non commercial?

Comment (Peter Burnhill): In some ways the Tax Man decides what is and is not commercial. So the university is a charity, but their catering arm is a commercial company...

Response (Jordan): Really it is the person who shares their work that decides what is commercial and non commercial. This is a contract and a legal one so you need to know what all parties think the definition of the terms is when they make their agreement. But we (the Open Knowledge Foundation) are not going to do a non commercial licence because that's not our wish.

Follow Up (Robin Rice): So what does sharealike mean for content in these contexts?

A (Jordan): So I share a photo with a sharealike attribution open knowledge type licence. A blend of that photo with another would be a derivative work. Think of it in layers. Layer one is my copyright, layer two is your copyright. Sharealike makes you use the same sort of licence for what you do with my image. It creates a sense of commons. You have to give back to the community if you use the data. Now with the non commercial CC licence no-one can use an item derived from data for profit.

Q (Robin Rice): How do you communicate norms to your users?

A (Jordan): There is a document which provides an example of how you might want to lay out your expectations of how people behave with your database. You can be at the high end (like making a website work in such a way that users have to read terms and conditions) or you can be more casual about what you want. There are ways to make sure you are cited properly for instance - you can have rules or community pressure for that. If people publish on the web it is (semi) possible to detect a use that isn't fairly citing or attributing the originator. BT had a project with open source code in it which they didn't disclose, and this came out through naming and shaming in the online community. Social pressure is very useful even without any legal restrictions.

Q (Peter Burnhill): If you do want to put restrictions on use you are into Copyright. If you go towards Copyleft and Open - it's not that commercial use is improper but that it is not attributed. Attribution is a really big issue if you go properly open. Modifying and changing work is scary. Hopefully there is an indication of the original version - data acknowledgement forms really.

A (Jordan): I have had Flickr photos with CC licences used in commercial contexts. I'm not that bothered as it's not my primary profession. Now academics do have a gut reaction to people making money from their work but there is an advantage to stuff being open. It's really useful to stop data being stored in just one place. Why wait around for the government to release a reformatted website of data? Just put it out there so that it can be used. There are distinct benefits for use and reuse. There are economic benefits to making material available in this way: 1 + 1 = 5 in this context sometimes.

Comment (Jo): One of the issues for academics is impact, profile and reuse. Turning scholarly citation into data sharing and collaboration. Real benefits there.

Comment (Peter Burnhill): Statements that travel (or should travel) with data. One is a data disclaimer - public data in the US often includes a document about how to cite this data and also information that the responsibilities of the data producer stop at the original data set. All you do is support what you put out there.

Comment (Jordan): The UK government does this too. It states that data is in the public domain but you can't misattribute this data to be official government data if it has been changed or analysed. Basically they disconnect their authority when you work with the data. Effectively it's a licence. Different licences can be a nightmare when you get past one dataset. Licences may conflict and you might have 100 datasets to deal with. With data in the public domain you know what you can do and how you can treat the data. The problem in the academic domain is that people can end up licensing data they don't have the rights for.

Repository Fringe 2009: Tutorials

This morning the Repository Fringe breaks into 3 groups: the DCC Network is holding an event, there are Round Table sessions and we'll be blogging from the Tutorials. If you want to see the comments from other sections please take a look at the CoverItLive stream on the Repository Fringe website.

Simon Bains is our chair today and he's introducing our tutorial from Talat and Stephanie from UKOLN.

Talat Chaudri (UKOLN) - Building application profiles in practice: agile development and usability testing (2 hours)

Talat is talking about a specific set of application profiles funded by the JISC, particularly Dublin Core profiles. These were funded but have not been taken up as expected so Talat's post has been created to find out why these profiles are not in more use. For instance:

Practical usability testing (JISC DCAPS: SWAP, GAP, IAP, TBMAP, LMAP, SDAP) - some of these are broad (and possibly unwieldy) and some are extremely specialized profiles.
There is a sense that JISC built these profiles and they should be in use. Talat feels what we don't want to do next here is to build on untested, preconceived ideas of user requirements - we should consult users instead!
Substantiate need for complex metadata structures/terms: convince developers.
All outcomes possible (minor alterations, radical reformulation, status quo). We may learn that what was done isn't perfect. They may be great and have certainly been created by specialists but it may be that changes are required to make the profiles more useful. The profiles may be good but we may need tools for implementation.

Method
We know that different institutions work differently and may implement these things differently so we need to:

Start where we can do most (and feedback is most immediately available). We are starting with SWAP as we have the most experience there but the other profiles are also based on SWAP so it's a good starting point. It's not that it's the most important profile.
Iterative development of methodology (different per resource/DCAP?). We may not get things right the first time. We'll do this like software design: deliver, test, take feedback, make tweaks etc. I don't think it's an approach that's really been taken with metadata before.
Work with what we have (DCAPs as they currently stand, not further hypothesis).
Test interfaces and paper prototyping. We'll be doing some of this today with the Card Sorting method.
Include everybody.

Outcomes

Clearer idea of what APs do for real users (repositories, VREs, VLEs etc). The idea is to take bits of metadata schema from different places and tailor them for a specific purpose.
Methodology for building APs that work (how to make them better in future). We build an application from scratch - what's the best way to do that again. And what is the process for assessing and improving an AP? We need to capture this too.
Toolkit approach to facilitating APs for which there is a demonstrated use. We want to test different parts separately and to see what people really want to do and how they really want to use the APs.
Show the community how to do it! This is part of our role. We can't just throw APs out there, we have to show people how they use these APs and relate it to their work.
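To make the "take bits of metadata schema from different places" idea a little more concrete, here is a minimal Python sketch of an application profile as a set of terms borrowed from two vocabularies, plus a check of a record against it. The Dublin Core Terms and FOAF namespace URIs are real, but the choice of terms, the `EXAMPLE_PROFILE` name and the `check_record` function are assumptions made up for this illustration. It is not SWAP or any of the other JISC profiles.

```python
from dataclasses import dataclass

# Real namespace URIs; which terms are borrowed below is purely illustrative.
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"


@dataclass(frozen=True)
class Term:
    uri: str        # full URI of the borrowed property
    required: bool  # whether this hypothetical profile makes it mandatory


# Hypothetical profile: terms picked from two existing vocabularies.
EXAMPLE_PROFILE = {
    "title": Term(DCTERMS + "title", required=True),
    "creator": Term(DCTERMS + "creator", required=True),
    "abstract": Term(DCTERMS + "abstract", required=False),
    "homepage": Term(FOAF + "homepage", required=False),
}


def check_record(record, profile):
    """Return a sorted list of problems found when checking record against profile."""
    problems = []
    for name, term in profile.items():
        if term.required and not record.get(name):
            problems.append(f"missing required field: {name}")
    for name in record:
        if name not in profile:
            problems.append(f"field not in profile: {name}")
    return sorted(problems)
```

A check like this is the sort of thing that could be run against real records during user testing to see which terms people actually fill in and which they invent for themselves.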

What are we Testing and Why?

Metadata elements that are useful (what you need to say and what for). This is probably the most important issue.
Relationships between digital objects (what you need to find next). You want useful connections but you don't need absolutely every single connection. What do users need. This is a jump off point from Google really. Making a key connection to related items or citations etc. might be most useful. There are other connections you could make that may have no value for users.
Structure (what related things have in common e.g. article & teaching slides). What is the same? Effectively this is about entities. They are about sections of metadata. If things have a relationship what is it?
Don't build what you don't need. If users don't want or need it you shouldn't develop it.

Example: "Event Metadata"
What do you want and need from this data?
Event Type: Workshop

Workshop Location
Workshop Type
Workshop Date
Opinion of Workshop {boring; interesting} etc.
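As a rough sketch, the event metadata above could be written down as a small Python data model. The field names follow the list above; the class names, the `to_metadata` method and the decision to make the opinion optional are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Opinion(Enum):
    BORING = "boring"
    INTERESTING = "interesting"


@dataclass
class Workshop:
    location: str
    workshop_type: str
    workshop_date: date
    opinion: Optional[Opinion] = None  # assumed optional: not everyone will give one

    def to_metadata(self):
        """Flatten the workshop into label/value pairs like those listed above."""
        record = {
            "Event Type": "Workshop",
            "Workshop Location": self.location,
            "Workshop Type": self.workshop_type,
            "Workshop Date": self.workshop_date.isoformat(),
        }
        if self.opinion is not None:
            record["Opinion of Workshop"] = self.opinion.value
        return record
```

Restricting the opinion to a fixed set of values mirrors the {boring; interesting} list: only record what someone will actually want to ask about.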

So... we have a practical task now.

Task: Free-listing, card sorting and user testing a data model
We split into groups and, as this was hands on, I'll just tell you the highlights of the task:

Each group selected a type of thing to build a data model for. Our group picked "Room Type" - basically a room as a resource that requires a data model.
Then we individually listed all the properties of that resource type for a few minutes.
Then we pooled all our ideas and wrote the agreed terms on post it notes - one for each quality.
Then we sorted those post its into groups/clusters. In our case this was tricky. Where did location fit? Where did price fit? What were our top level categories? We were thinking of the model as relating to a faceted browse of a website...
Then we created usage scenarios for our data model. This immediately raised a few qualities we hadn't thought of in our model.

Next the groups each switched a member so that an outsider to the data model could run through one of the scenarios and spot the problems. We immediately found that our outsider was looking at the discovery of a room for booking from completely different priorities/qualities. This meant our top level qualities probably needed a rethink. But of course this was only one user. We'd need to test the same model with lots of users to know what works and what doesn't. Additionally we were thinking of our model as relating to websites but was the model the website structure or the data structure? We weren't sure...
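The scenario testing step can also be mimicked in a few lines of Python: record the clusters from the card sort, then check which qualities a scenario needs that the model lacks and how many top level categories a user has to visit. This is only a sketch. The cluster names and most of the qualities below are invented for illustration (only location and price come from our actual post its).

```python
# Hypothetical outcome of a card sort for the "Room" resource.
CLUSTERS = {
    "Practicalities": ["location", "price"],
    "Space": ["capacity", "layout"],
    "Equipment": ["projector", "wifi"],
}


def review_scenario(clusters, needed):
    """Return (qualities missing from the model, categories the user must visit)."""
    where = {prop: cat for cat, props in clusters.items() for prop in props}
    missing = sorted(p for p in needed if p not in where)
    categories = sorted({where[p] for p in needed if p in where})
    return missing, categories


# An outsider looking for a cheap, accessible room for 30 people:
missing, categories = review_scenario(CLUSTERS, ["price", "capacity", "accessibility"])
```

A scenario that reports missing qualities, or that touches every top level category, is a hint that the clusters need another look - though as noted above, one user is not enough to decide.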

Final Comments from Talat

User level view of things should be simplified.
Developer will see structural complexity. You need to know why it's useful to use a data structure.
Good documentation makes happy developers. And happy users. It's important to be clear about what you can use this model for. Showing and telling the models really helps.
Test early, test often. Card is cheaper than code! People need to know what it's for! Things can't be perfect for everyone but you cannot foresee things in theory. And metadata needs to be a flexible living structure.

This was a really interesting hands on session and raised lots of contentious metadata issues. It also highlighted how useful Card Sorting could be for website and interface design - simple though a bunch of post its are, they are super fast for paper prototyping and seeing where different elements should and should not sit on a page.

After this session we have a little coffee break (and such edible perks as micro eclairs!) and then back for the next tutorial:

Richard Jones, Symplectic: “AtomPub, SWORD and how they fit the repository: going beyond deposit”

This will be about how to create tools that work over many repositories. The relationships between dynamic and static systems. A bit about SWORD, AtomPub and how they are used. Richard will also be talking about Repository Object Models and their equivalences across platforms as well as Repository workflows and their equivalences across platforms.

How do we provide deposit tools that are usable and simple for depositors. And, from the institutional perspective, what the bigger picture is and how data is deposited and reused.

In this tutorial we'll look at how the static meets and clashes with the dynamic. Scholarly publications are not completely static but repositories expect static content. There is a notion of the golden record of a piece of research. There is no sense of version management other than through metadata notes (nothing like Subversion which is often used for version control in developments).

Richard is showing us an Architectural overview of solution provided by Symplectic. It is a matter of combining lots of flexible repository tools and interfaces provided by Symplectic that sits on the repository that the university runs.

Now we're having a nice demo of Richard depositing materials as an academic (on the left hand screen) and viewing them in DSpace as a repository manager (on the right hand screen). Interestingly, Richard's demo uses a single user account for the institution rather than each academic needing an account for the repository, an approach that is increasingly common. Each academic has a uun for the university system so the Symplectic part of the system just slots into their whole academic login/homepage. There is a real time relationship between the DSpace repository and the Symplectic tool in the academic's homepage/webpage management space.

What is SWORD?
The key thing Richard would point out is that SWORD has an error handling specification. It's really for create only and package deposit. There is a requirement to retrieve a URI and that can be something that allows you to access things more flexibly (through AtomPub say).

"AtomPub is an application-level protocol for publishing and editing web resources"

It is a RESTful protocol - no need for a grand web service framework and HTTP requests and...
The spec for AtomPub allows more flexible create, replace, modify and delete.
All communications are in Atom format - designed for web type content not binary type content.
HTTP POST in AtomPub is designed to allow an Atom feed document to be published at a specific end point.

An Atom item for a deposit is administrative data and item level information alongside a reference to file content.

So Publications does a POST, with SWORD headers, of an Atom feed plus the file content to the repository. The repository then returns an Atom feed and Publications provides a feed.
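For anyone who wants to picture the deposit request, here is a minimal Python sketch that builds (but does not send) an HTTP POST carrying an Atom entry with a reference to the file content. It is not Symplectic's code. The URLs and function names are invented, and the `X-On-Behalf-Of` header is my understanding of the mediated deposit header in the SWORD 1.x profile, so treat it as an assumption and check it against the spec.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("atom", ATOM_NS)


def build_atom_entry(title, author, file_url):
    """Serialise a minimal Atom entry that points at the file content."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    # Out-of-line content: the entry references the file rather than embedding it.
    ET.SubElement(entry, f"{{{ATOM_NS}}}content",
                  {"type": "application/pdf", "src": file_url})
    return ET.tostring(entry, encoding="utf-8")


def build_deposit_request(collection_url, body, on_behalf_of):
    """Prepare the POST to a (hypothetical) deposit end point without sending it."""
    return urllib.request.Request(
        collection_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/atom+xml;type=entry",
            "X-On-Behalf-Of": on_behalf_of,  # assumed SWORD mediated-deposit header
        },
    )
```

Sending the request with `urllib.request.urlopen` would, in a real repository, return an Atom document describing the new deposit, including the URI that later edits can use.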

It's interesting to get into how repositories describe their deposits. items (DSpace) are eprints (EPrints) are objects (Fedora) are a feed in Atom. All the API calls need to respect the different naming used here. No real equivalent of bundle or document for Fedora or Atom but they can be added to bitstream/entries. This may be a ubiquitous structure.
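A tool that works across platforms needs some kind of lookup table for these names. A minimal Python sketch follows. The record level and the bundle/document row come from Richard's slide as described above; the EPrints "file" and Fedora "datastream" entries on the file row are my own additions from memory and should be treated as assumptions, as should the `TERMS` and `translate` names.

```python
# One row per level of the object model; None means "no real equivalent".
TERMS = {
    "record": {"DSpace": "item", "EPrints": "eprint", "Fedora": "object", "Atom": "feed"},
    "grouping": {"DSpace": "bundle", "EPrints": "document", "Fedora": None, "Atom": None},
    # EPrints/Fedora names on this row are assumptions, not from the talk.
    "file": {"DSpace": "bitstream", "EPrints": "file", "Fedora": "datastream", "Atom": "entry"},
}


def translate(term, source, target):
    """Map a platform-specific term to its equivalent on another platform."""
    for names in TERMS.values():
        if names.get(source) == term:
            return names.get(target)
    raise KeyError(f"{term!r} is not a known {source} term")
```

Returning None for a missing equivalent makes the gap explicit, so the calling code has to decide how to fold a bundle or document into the bitstream/entry level.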

Repositories aren't just structured but also workflows (Richard is referring back to his "Machu Picchu" session yesterday). Terms for each part of the process are named differently. The work-flow is flexible until the archive stage when you are expected to reach a static stage.

So if you want a dynamic system sitting on a static system how do you get the feedback you need? There are different stages at which you can ask about changes - at the archive stage there is no ability to modify. You have to create a clone copy to work on. Richard will now demo this. For instance if a preprint was deposited and you want to update it with the print or post print version. You can revoke the right to publish on something already in the repository and that kicks it back to the workflow which means you can then replace the copy, change it, etc. Academics can add and remove files in Publications as they like. New items are created in the repository. In the archive there is a versioning system that allows references forward and backward in the repository. So you replace and update clones, items never leave the repository. There are some wrinkles - especially to do with level of change. You can have minor changes that do not trigger a review process of the deposited item, or major changes that trigger a re-review of the item. This is a matter of configuration. However the gist is that you create linked duplicates - marked up in the metadata. You can choose to delete items or directly edit metadata but the automatic process creates multiple clones. EPrints has a version control structure already and ideally that is what you would want in repositories in general.
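The clone approach can be sketched in Python as follows. This is a toy model, not how DSpace or Symplectic store things: the `Archive` and `Item` classes and their field names are invented for illustration. The point is that an update never touches the archived item; it creates a new item linked forward and backward to the old one, and only a major change flags it for review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    id: int
    files: list
    replaces: Optional[int] = None     # backward reference to the earlier version
    replaced_by: Optional[int] = None  # forward reference to the later version
    needs_review: bool = False


class Archive:
    def __init__(self):
        self.items = {}
        self._next_id = 1

    def _new_item(self, files, **extra):
        item = Item(id=self._next_id, files=list(files), **extra)
        self.items[item.id] = item
        self._next_id += 1
        return item

    def deposit(self, files):
        return self._new_item(files)

    def update(self, item_id, files, major=False):
        """Create a linked clone; the original item never leaves the archive."""
        old = self.items[item_id]
        clone = self._new_item(files, replaces=old.id, needs_review=major)
        old.replaced_by = clone.id
        return clone
```

Whether a change counts as major is a matter of configuration in the real system; here it is simply a flag passed by the caller.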

Additional note: the academic workspace includes a publishers policy area but as many organizations want no barriers to deposit this is often ticked by default or set to be a question mark to indicate that items can be deposited OR that it is not known whether or not items can be deposited.

Deduplication: we're allowing an item to have multiple parents so that there is an automatic process for many depositors putting the same item into the same repository.
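A very naive Python sketch of the "multiple parents" idea: each distinct item is stored once and every depositor who submits it is recorded against it. Matching on a normalised title is an assumption chosen to keep the example short, and real matching is much harder than this. The `DedupStore` and `fingerprint` names are invented.

```python
import re


def fingerprint(title):
    """Crude matching key: lower-case the title and collapse punctuation and spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


class DedupStore:
    def __init__(self):
        self.parents = {}  # fingerprint -> set of depositors

    def deposit(self, title, depositor):
        """Record the depositor as a parent; return True only if the item is new."""
        key = fingerprint(title)
        is_new = key not in self.parents
        self.parents.setdefault(key, set()).add(depositor)
        return is_new
```

Two co-authors depositing the same paper end up as two parents of one stored item rather than as two separate items.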

Fire and forget deposit is not sufficient to interact with the full ingest life-cycle of an average repository system. SWORD and AtomPub offer most of what you need for a deposit life-cycle environment. Atom can be used effectively to describe content going in and out of repositories. Repositories have analogous content models and ingest processes. De-duplication of items is very much non-trivial.

Q & A

Q: How does this work for depositing for multiple repositories?

A: It does not support multiple simultaneous deposits yet. You can pick between different ones to deposit to though. We want everything else to work first. We hope to have it working for all three systems mentioned (DSpace, Fedora, EPrints) soon. After that we want to look to deposit in multiple places. We need to crack policy and process issues there though as different repositories have different agreements etc.

Q: The license in Publications, where is it pulled from? Are there different licenses for different collections?

A: It's not pulled exactly yet. We're moving to a choice of licenses or combination of licenses. Different licenses for different collections isn't possible yet. We provide a collection mapper to put content into a given collection at the moment. There is also a more complex tool to map collections to who may wish to deposit in them. It is very complex to set up and local specific so it's not part of the standard Symplectic offering.

Richard concluded by welcoming comments and questions over lunch. Off to grab some food for me too...

Thursday, 30 July 2009

Repository Fringe 2009: Show and Tell

After a very nice lunch the Repository Fringe split into two groups. The DataShare meeting took place in another room in the Informatics Forum (and I hope that a wee post summarizing that session will be possible) whilst I stuck with Ballroom A for the Show and Tell sessions.

Morag Watson, Digital Library Development Manager, Edinburgh University Library

Morag will be talking about Open Journal Software (OJS).

We bid under the JISC Repository Enhancement strand as Corpus:

We partnered with St Andrews.
The project was to do a review of software.
User surveys were also planned to be part of the work.
The project would look at OJS and commercial softwares like ScholarOne and Bepress.

We wanted to look at the role of libraries as publishers. Edinburgh University Library doesn't publish journals any more. JISC didn't fund the bid so this project has gone ahead under our own resources. We've only really looked at OJS. It is an open source system:

OJS is a Public Knowledge Project system.
It is widely used.
c. 2000 journals already registered but many have not published many issues.
Open access publishing platform.
Locally installed and locally controlled - this is a big benefit for us in terms of what you can do with the journal and how you control it.

The Journal of Legal Access is the first journal in 30 years from Harvard University (as publisher) and it is built on OJS.

Software is licenced under a GNU public licence.

There is lots of support and documentation
Very active support forums
Developer forum
Video tutorials
OJS in an hour (although it's 178 pages!)
Active and involved worldwide community.

So what does OJS do?

A journal management and publishing system - from author submission to editorial control and formatting.
Policies, processes etc. can all be set by the journal editors.
Full indexing for searching inside the system and also indexed by Google.
Can add keywords and geotags.
Very low cost to access.
Lots of journals in Africa use OJS - it is a helpfully low bandwidth system.
You can have reader registration. And you set up journal subscription in the system.

You click through the screen and when you're finished you've got a journal. Very easy and very menu driven. And it's easy to drag and drop stuff into the system. You do not need to be technical to use it.

It has a very flexible design (via CSS) for single and multiple journals. You can have one design for all journals or different designs for each journal you publish. You can add audio and video etc. One of the journals we're working with wants to add these types of elements in fact.

We've chosen a flexible design approach but some publishers on OJS reuse templates. Common templates can be reused to save effort. But there is an issue with users and academics wanting/not wanting a common design. If you do very customised designs you may need multiple instances of OJS.

Edinburgh is running a one year pilot implementation (from March 2009). Currently 2 journals: Critical African Studies (launched in May 2009) and Concept (launching October 2009). Critical African Studies was previously in print but there had been no issues for a while so the online version revives it.

We have branded OJS as part of the Edinburgh University Library website. Editors pick logos, abstracts etc. that then show up on the page.

The journal itself fits with what had already been publicised. Mostly formatted through templates and CSS to format. We support the tools and training and set up but not the day to day running of the service.

Morag demos the system as a user who has a Journal Manager role. There are lots of roles that can be set up but you can turn off a number of these roles to make things simpler. You can have it set up for one person across the whole journal. The system emails users whenever items for review come in. All items are held in the system so if something changes or someone leaves all that data is safely stored. You can also set up future issues in advance quite easily. A lot of nice management features.

Pretty straightforward to use. There is also the possibility of RSS as part of the system. Also you don't have to stick to strict publishing schedules. RSS means people don't miss the next article even if publication schedules are changeable.

Implementation Issues

Software setup - we are trying out one installation for all the journals.
Workflows for journals are understood by academics but managing them from an electronic system is different and we now have a nice template for this.
Interface configuration.
Who owns the data? The journal? The academic? The library? The institution? It's a tricky balancing issue.

What next?

More journals. We haven't been publicising this but lots of enquiries have been coming in at this stage all the same.
Also looking at Institutional Journals (e.g. Current research in Edinburgh via repository),
Integration with the Institutional Repository - some work in Australia looks at this and we'll be taking a look at this shortly.
Planning/costing for live services.
SDLC - other organisations in Scotland may want to offer this...

Q & A

Q (Richard Jones): On repository integration: can you only put the content in the repository or do you have to duplicate the data in both systems?

A (Morag): We need a way to publish data from those outside our institution as well and that won't go into our institutional repositories so right now it is a matter of us wanting to back up what we want.

Q (Jo Walsh): You mentioned costings for a live service - does OJS take payments?

A (Morag): OJS hasn't charged us for anything yet. You can subscribe but not pay for journals. And no one we've been working with wants payments. As far as I know OJS doesn't mandate that and we're not interested in doing that.

Q: What formats can you publish in? HTML only? Or PDFs, docs etc.?

A: HTML and PDF are supported. You can submit in any format but have to publish as either PDF or webpage.

Q (Les): Is this for lo-fi basic things or could there be some Nature competitors using OJS?

A: Here we are interested in the lo-fi stuff right now. But there are some bigger journals in the US that probably use a lot more functionality. Most of our academics want to switch off a lot of the additional features. We've only worked with 3 departments so far.

Les asks the crowd a question: How many folk have their own publications in the university? Only two people: OUP (which wouldn't use OJS) and UKOLN (who publish an international journal with OJS).

Morag: How much of the review process you want is important to how you set this stuff up.

Les: We find dissemination is the thing many of our academics want but I think we should look at this stuff as a complementary service.

Hugh Glaser and Ian Millard: RKB, sameAs and dotAC

Hugh starts by telling us: "This is all very exciting to me. It's a community I don't usually engage with and it's all very exciting - there's so much data - I want to eat it all!"

Hugh's work wasn't originally funded by JISC but this work is now being supported by the Rapid Innovation fund.

Linked Data - Tim Berners-Lee says it's the Semantic Web done right and the Web done right. This is linking open data and the idea is that you can name things. Once that's done you can use and work with them regardless of what those things are.

Use URIs as names for things
HTTP URIs so that people can look up those names
When you look up the URI, provide useful information
Include links to other URIs for further discovery.

So RDF:

Hugh shows us LOD datasets on the Web mapped in March 2009 - this is a great diagram to illustrate 4.5 billion triples and 180 million data links and it includes DBPedia (the linked data version of Wikipedia) and a big cluster of material on publications from work that Southampton has been grabbing from metadata for various projects etc.
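For readers new to this, a triple is just a (subject, predicate, object) statement, and a data link is a triple whose object is another URI you can look up. A tiny Python sketch is below. The example.org URIs and the data are invented; only the Dublin Core and FOAF property URIs are real. It is plain tuples rather than a proper RDF library, purely to show the shape.

```python
# Three made-up statements about a paper and its author.
TRIPLES = [
    ("http://example.org/id/paper1", "http://purl.org/dc/terms/creator",
     "http://example.org/id/person1"),
    ("http://example.org/id/paper1", "http://purl.org/dc/terms/title",
     "An example paper"),
    ("http://example.org/id/person1", "http://xmlns.com/foaf/0.1/name",
     "A. Researcher"),
]


def describe(uri):
    """'Look up' a URI: return everything stated with it as the subject."""
    return [(p, o) for s, p, o in TRIPLES if s == uri]


def linked_uris(uri):
    """Objects that are themselves URIs, i.e. links to follow for further discovery."""
    return [o for _, o in describe(uri) if o.startswith("http://")]
```

Looking up the paper gives its title and a link to the author; following that link gives the author's name. Multiply that by billions and you have the diagram Hugh showed.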

Hugh is showing an amazingly complex data architecture image of Southampton's system which links faceted browsing of multiple knowledge bases, ontology mapping etc. Sometimes the system goes out to find connecting data.

Hugh is also showing a series of source links and knowledge sources - it's a wide range and he warns us: "don't just look for what you expect to find". To follow up we have a hugely complex web of materials.

The interface to the system is available to view on the web at RKBexplorer.com - connections appear in this system from the semantics of the linked data. So you can track the connections. The interface uses loads of background services to deliver the service. Some people publish usable stuff. Others might want to use our RESTful outputs. You can use these URIs as part of applications or webpages etc. to pull together lots of information.

A postgraduate in chemistry who had done Google Gadgets before made a few gadgets with the RKB service. We are also thinking about doing something similar with the iPhone so you can, for instance, search around and get a sense of who you meet when you are at conferences.

The system ties different URIs together. The geonames issue mentioned this morning ties in here.
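One simple way to picture "tying URIs together" is as bundles of URIs that are asserted to mean the same thing. The Python sketch below uses a union-find structure for this. To be clear, this is my own illustration and not a description of how the sameAs service is built; the class name and the short example identifiers are invented.

```python
class SameAs:
    """Keeps bundles of URIs that have been asserted to refer to the same thing."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            # Path halving keeps later lookups short.
            self._parent[uri] = self._parent[self._parent[uri]]
            uri = self._parent[uri]
        return uri

    def assert_same(self, a, b):
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def bundle(self, uri):
        """All URIs currently believed to be equivalent to uri, sorted."""
        root = self._find(uri)
        return sorted(u for u in list(self._parent) if self._find(u) == root)
```

Asserting that A is the same as B and B is the same as C puts all three in one bundle, so a lookup on any of them can pull in data published under the others.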

Concluding remarks

We gather data from all the e-prints today and from other systems tomorrow
You should worry about your identifiers - can you reliably match your publications to a consistent author id?

dotAC.info takes the previous work forward with a very UK focus. We'll do the geographic side of it, better co-reference, connect to live project databases etc. It will change the face of research in the UK and how it's developing.

Q & A
Q: Is there a REST interface /API and what's the URL for it?

A: The demos page is the best place to go but I'm currently working on better documentation and will probably put that on SourceForge. It's mostly quite simple but then you might want to get it back in different formats. We want to link in the linked data world but we want to be able to pull stuff out of that world too.

Q (Robin Taylor, UoE): What do you mean by ontologies?

A: I mean structured descriptive data standards like Dublin Core. We can cope with multiple ontologies but using a standard one can be a huge benefit to getting information out of systems and making the best use of linked data. It's nice if it's more powerful than Dublin Core but standard ontologies are the key thing.

Hugh: We are very very keen to get input on the dotAC project. Please drop us an email.
Q (Les): What benefit can we get back from your project?

A: One benefit is quite direct: we will be able to link data in your repository to project data and other data on the web around the world. An interesting thing is: what's the vision of the interesting stories or uses of this data you could make? It's not about the journal articles - there are lots of other possibilities and emergent properties. This is a new world and you need ideas and visualizations to see where it's going next. This is why we're pushing the service side of things:

Q: Where you have countries harvesting repositories into one place: if there isn't a consistent naming strategy you have a problem of duplication. Your service could help with this I think.

A: Yup we can link things and aggregate them. We aren't talking about this as a naming authority though. You can have your URIs and manage those. Other people do their own thing. The service then decides whether to believe your relationship to those URIs or someone else's. This adds some reliability to search.

Gavin Henrick, Enovation: Demonstration of some UI customization for a Repository, including implementation of ajax keyword taxonomy.

Enovation is a company that supports multiple types of repositories for various organizations so we do things like developing the UI.

So for Trinity College Dublin for instance Enovation took DSpace and added:

Two-way integration to student research system.
Custom e-theses workflow from submission to printing.
Hiding empty collections.
Enhanced advanced search.
Extended integration of the TAPIR code to cater for TCD licensing and restrictions.

SDCC (South Dublin County Council) Learning Circle:

Site was a CMS with document management plug-in.
Wanted a more strictly controlled repository with set meta data requirements and simple processes.
Specific design and user interface changes.
Streamlined submission process.
Much improved browsing experience.

Some of the solutions delivered include:

Make adding items (deposit) simpler with a tree based category selection, a simpler submit page encompassing all aspects on one page.
Standardize the author names - implement a find people interface connected to central user system.
Standardize keywords - implement a taxonomy driven lookup with pre-checked keywords.
Make adding news/event information easier - embedding an HTML editor for news areas and we added news and events blogs to the site.

So we're now seeing the simple deposit page. Gavin indicates that they added an AJAX lookup for keywords and this is something others are asking us for.
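The server side of a keyword lookup like this can be quite small. Here is a minimal Python sketch of a prefix search over a controlled vocabulary, which is the sort of thing an AJAX call could hit as the depositor types. It is not Enovation's implementation; the class name, method names and example keywords are all invented.

```python
import bisect


class KeywordTaxonomy:
    def __init__(self, terms):
        # Match case-insensitively but return the approved spelling.
        self._display = {t.lower(): t for t in terms}
        self._keys = sorted(self._display)

    def suggest(self, prefix, limit=5):
        """Return up to limit approved keywords starting with prefix."""
        prefix = prefix.lower()
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:]:
            if not key.startswith(prefix) or len(matches) >= limit:
                break
            matches.append(self._display[key])
        return matches

    def is_valid(self, keyword):
        """Check a submitted keyword against the taxonomy."""
        return keyword.lower() in self._display
```

Because suggestions only ever come from the taxonomy, depositors pick an approved keyword instead of typing a new variant, which is what keeps the keywords standard.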

There is a slightly different dynamic to this use of a repository but DSpace is being used effectively for what they want. It's not a full CMS but the WYSIWYG news editor makes a big difference to usability and flexibility. None of the IR managers we've dealt with can write HTML but the interface forces you to do that. So the standard WYSIWYG approach we've added here works far better.

SDCC came with exact specifications so we tailored DSpace to meet that. The homepage is lively and bright to look at and looks pretty unlike DSpace - and required quite a lot of development. This site isn't available publicly but it is a really useful system for the 50,000 staff to use internally. Giving numbers of items in each section makes navigation quicker and easier.

Various projects Enovation are currently working on (but can't unfortunately be named at this stage) include:

A Digital Learning Object Repository which will include:

Recommend a resource (positive system only - things will be more recommended than others).
Add a review/comment.
Subject based communities of practice (using Mahara).

Fedora Preservation Layer

Use of eSciDoc front end.
Customized services.

We looked for interfaces and found eSciDoc from Germany which is really good.

Social Studies "Dark Archive"

Preserve original image formats internally.
Generate and publish web formats.
Export of system to open access repository.
Link to map server for rendering.

This is an interesting example. There is a very serious archive that is internal but also a public facing version that people can deposit into with a one way connection between the two.

Open Access

Harvester for the National funded project.
Linking seven universities in Ireland.

And that's all I wanted to say. We work on making things easier so that people actually feel able and comfortable with depositing materials.

Q & A

Q: What is the take home message here? Do all these repository systems have problems that repository developers should look at?

A: Well, as was said this morning, when IRs were thought of they were thought of by developers for admins and managers NOT for researchers and you really need them to suit researchers. This is a marketing tool in many ways so it has to be good and simple and fit with the wider web interfaces much better.

Q: One question from 2 angles. With my DSpace hat on: is this an XML UI or a customized UI? And are you going to get code back into DSpace?

A: Yes, we are in talks to do this. This is a customized UI. Some parts can be easily added. Things like the people finder would be an obvious thing to add to more repositories and would be easy to break out.

Q (Les): You make things better for researchers but the researchers don't commission this sort of work - it's the managers or the librarians...

A: Researchers can be core to the UI needs: for Trinity College Dublin the visionary management let it be user led. The Learning Circle site was also user driven.

Fred Howell, TextSensor:

Using personal publications lists for batch repository deposit using SWORDs (and some more peaceful approaches) - the EM Loader Project

The idea was a project to connect publicationslist.org to the Depot, a nationwide repository being run by EDINA.
Researchers care about their own web page but they don't want to fill in metadata forms.
So we started from the point of seeing how we could make a personal home page really easy. And once that's been done maybe we can connect that to the repository.

Publicationslist.org allows you to maintain personal publications lists of material in any format. There are about 11,000 researchers worldwide. You can then embed the data into your own webpage from publicationslist.org.

For this project we extended that nice solution so that the items identified can be deposited to repositories. We added a deposit button to the interface and a page element to let you know the status of items submitted to repositories.

There is an automated process that detects your publications and whether or not they have been deposited. It lets you do a simple SWORD deposit: pressing the button deposits the item.

Initially we thought that SWORD would solve all problems. But as Richard mentioned earlier it is tough to change anything once deposited with SWORD so we had to do some additional API work. So the status of the items is checked and, based on timestamps, the system will redeposit an item as needed.

APIs required:

Initial single item deposit - SWORD (with json-bibtex metadata).
Send updates/new version of metadata.
Check status of deposited items.
Search for authors in repositories.
Make the process simple and automated.
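
As a rough illustration of the timestamp-based redeposit described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `FakeRepository` is an in-memory stand-in rather than a real SWORD client, and the names `Publication` and `sync` are hypothetical. It covers deposit, update and status checking only; author search is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Publication:
    pub_id: str
    title: str
    modified: datetime  # last edit on the publications-list side


@dataclass
class FakeRepository:
    """In-memory stand-in for a repository; NOT a real SWORD client."""
    deposited: dict = field(default_factory=dict)  # pub_id -> time of last deposit

    def deposit(self, pub: Publication, now: datetime) -> None:
        # Initial single item deposit (done via SWORD in the talk).
        self.deposited[pub.pub_id] = now

    def update(self, pub: Publication, now: datetime) -> None:
        # Send a new version of the metadata (the additional API work).
        self.deposited[pub.pub_id] = now

    def status(self, pub_id: str):
        # Time of the last deposit, or None if the item was never deposited.
        return self.deposited.get(pub_id)


def sync(pubs, repo: FakeRepository, now: datetime) -> list:
    """Deposit new items and redeposit items edited since their last deposit."""
    actions = []
    for pub in pubs:
        last = repo.status(pub.pub_id)
        if last is None:
            repo.deposit(pub, now)
            actions.append((pub.pub_id, "deposit"))
        elif pub.modified > last:
            repo.update(pub, now)
            actions.append((pub.pub_id, "update"))
    return actions
```

Running `sync` repeatedly is harmless: an item that has not been edited since its last deposit produces no action.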

You can look at the live demo on the test version of publicationslist.org.

One of the things we have built in is a PubMed search that lets you find your papers. Once you have a publications list you click deposit - and this creates an instant account on the Depot. You then click to send the articles to the repository. We haven't typed anything in but you have deposited your papers etc. So part one of the process is great. But then you've reached the limits of SWORD. It would be great to build updating etc. into future iterations of SWORD.

We also felt that academics and researchers don't stay in one institution, so what happens when they move to new academic institutions? So we move to a model of "here are my publications with an Atom feed (or similar)" and this feeds all the relevant repositories.

PULL is better than PUSH because it:

Removes the need for researchers to do anything extra to deposit.
Is much simpler for publications list providers to support.
And it's so much easier to hit multiple repositories at once.
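
To make the pull model concrete, here is a minimal Python sketch of a repository reading a researcher's Atom feed and picking out entries updated since its last harvest. The feed content, the `urn:example:` identifiers and the function names are invented for illustration; this is not the EM Loader code.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed a publications list might expose.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example researcher publications</title>
  <entry>
    <id>urn:example:pub:1</id>
    <title>First paper</title>
    <updated>2009-07-01T12:00:00Z</updated>
  </entry>
  <entry>
    <id>urn:example:pub:2</id>
    <title>Second paper</title>
    <updated>2009-07-29T08:30:00Z</updated>
  </entry>
</feed>"""


def parse_updated(text: str) -> datetime:
    # Atom timestamps are RFC 3339; only the common trailing "Z" form is handled here.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def new_entries(feed_xml: str, last_harvest: datetime) -> list:
    """Return (id, title) for every entry updated after the last harvest."""
    root = ET.fromstring(feed_xml)
    found = []
    for entry in root.findall(ATOM + "entry"):
        updated = parse_updated(entry.findtext(ATOM + "updated"))
        if updated > last_harvest:
            found.append((entry.findtext(ATOM + "id"), entry.findtext(ATOM + "title")))
    return found
```

Each repository keeps its own `last_harvest` time, so several repositories can read the same feed without the researcher doing anything extra.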

So from SWORD to PLOUGHshares. Researchers pick their website and let their repository do the key work.

Q & A

Q (Jim Downing, Uni of Cambridge): Have you had people move institutions yet? I can see far more issues being raised than just where the files are.

A: We have people who have moved institutions on publicationslist.org but this EMLoader service has not been tried as a live service yet.

Les: It is really exciting to see these problems being looked at and tackled!

Comment (Peter Burnhill): I wanted to point out that Fred and I met on the internet. A blogger in Ohio reviewed publicationslist.org and the Depot and suggested they should be connected and that's what actually kicked off this whole process. By weird coincidence both publicationslist.org and EDINA were in the same city anyway. We were mated on the internet by a blogger in Ohio!

Q: If someone changes something at the web end it changes in the repository, but does it work in the other direction?

A: No, we have only done unidirectional but it would be good to do the two way thing.

Daniel Hook, Symplectic: “AtomPub, SWORD and how they fit the repository: going beyond deposit”

One of our main motivations is that we wanted a way for academics to publish homepages with minimal effort. There is real variance in enthusiasm for this but increasingly there is a need for accountability.

So what we have done is automatic aggregation of data from key data sources to automatically generate lists of publications. There is a balance to be found between all data sources and useful ones. We also let people search Google Books to find publications. The idea is to minimise manual data input for these webpages.

Ben and Sally talked this morning about disproportional feedback and this is a great thing to have for academics. We need to make pages so much more useful than what they are used to. We need citation and usage statistics and we're working with Thomson and Scopus to get at this sort of data. There is also the matter of research strategy. There is so much bibliographic data that can be used to show the bibliographic strengths of the institution automatically.

So this is a demo showing lots of different items - books, chapters, conferences, journals etc. - which can be approved or not depending on whether that aggregation has found your work. The idea is any sort of RAE/REF item can be grabbed.

So the system searches for various terms - you can set up default search terms, check settings etc. so that your persona can be used to find your work essentially (although in this demo it is different settings for each repository/data source). Additionally categories are key in archives so you can restrict your search to expected categories. We talked about URIs and IDs so you can use your archive identifier as one of your settings.

The search looks for publications regularly and alerts you if/when items are found so that you can approve (or not) of them. Approved items may include multiple copies - perhaps a manual source and an automatic Web of Science record. You can add a manual link to override any data errors.

You can also see if other authors have approved an article. And this means we have a co-author list for the service and can see who has worked with whom.
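
One way to picture how approvals turn into a co-author list is the small Python sketch below. It assumes, purely for illustration, that the system can produce a mapping from a publication id to the set of authors who approved it; the function name and the data shape are hypothetical and not Symplectic's implementation.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_map(approvals: dict) -> dict:
    """Map each author to the set of people they share an approved item with.

    `approvals` maps a publication id to the set of authors who approved it.
    """
    graph = defaultdict(set)
    for authors in approvals.values():
        # Every pair of approvers on the same item counts as a collaboration.
        for a, b in combinations(sorted(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)
```

Authors whose items nobody else has approved simply do not appear in the result.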

The pluses of this system are:

Easy reference list of your work when it comes to filling in grant applications etc.
It does feed a repository but searching is quite different as you can have a wider range of material represented here.
You can find articles by text found in organisational papers so if you are looking for possible collaborators you have a good starting point to finding out more and making contact. It's a useful tool to browse the organisation as well as helping you find collaborators.

On the pilot system you can see various tabs including one for full text and publisher policy. This uses the SHERPA/RoMEO system to check the copyright of a given article, and the tab is colour coded to indicate if there is an item and what the licence is. The licence is cached in the repository with the item. From an academic perspective this is all about easy deposit.

There are a few wrinkles: you can deposit articles from other authors. Physics is a bad subject for repositories as researchers have often already deposited in arXiv.org.

Finally Daniel shows us a diagram of the organisation connecting collaborators together. There is much potential to see the organisational research strategy. Visualisations will be going in the next version of the system.

Q (Les Carr): Yes there is a lot of research intelligence bound up in repositories that we need to harvest and look at. How many of you are working on this sort of service - do you see yours connecting to this service or as stand alone services?

A (Daniel): Richard will talk more about Symplectic tomorrow at his tutorial. He will be talking about SWORD and Atom, how those map onto Symplectic, and how other digital repositories connect to content. And how to deal with the layers of processes in front of repositories.

Daniel's presentation concluded the Show and Tell Sessions for the day so we adjourned to the Atrium for a drinks reception. The winning Pecha Kucha was also announced to be Julian Cheal of UKOLN and he gamely modeled his prize and glittery cowboy hat (the sticker you can just see shows his name and talk title - we had to drop our golden nuggets in our favoured hat!).

And with that it was off in our various directions for the evening...

Repository Fringe 2009 : Pecha Kucha

What is Pecha Kucha? A formal presentation style, which gives rise to some very inspiring talks. In essence, you have 20 slides in your talk (not 19, not 21), and each slide is displayed for 20 seconds. The whole presentation therefore lasts for 6 minutes and 40 seconds.

This morning's sessions are in three bundles of three speakers each and we'll be voting on them (using golden nuggets and cowboy hats to indicate our preference!).

Group A

James Toon, University of Edinburgh: ERIScotland

ERIS is: Enhancing Repository Infrastructure Scotland
IRIS Scotland was the predecessor to this project and it looked to see if central rather than local repositories would be more useful. The project was considered a success and we wanted to build on the momentum
Grant Funding Call 12/08 - Strand A5 - Repository Enhancement
IRIS took a top down view of the issue, ERIS is taking a bottom up view of the issue working with Research pools
Research pools have issues too and we need to get a better understanding of their needs.
The community developing repositories haven't engaged as much as they should have done with research users and that leads to problems
They are working and talking to researchers and repository managers together.
We need to find a unified way to curate and encode repositories on a national and global scale.
We also want to make recommendations and suggestions for training and policy around repositories
Not only the functionality but also the tech is being considered as part of this work.
We also want to deliver enhancements already proposed - these will be full implementations not just demonstrators.
BUT we are not developing just because we can. The development teams are working closely with the user engagement work
Most of the goals are based on work already started
We don't know what the research pools will actually want. We know local and aggregation factors will both be important.
We only have a vague idea of what is needed at this stage but we have to be more confident about this if we are making high level recommendations.
Planning, making a business case and policy are all crucial to having solid long term impact.
One big assessment of user needs and demands summarizes most of the work of ERIS.
We think this work is achievable but we have to be realistic about long term preservation and access.
We need to unify communities, be trusted and build a sustainable repository.

Les Carr, Southampton: "Repository Challenges"

When you set up a repository you must fit many many targets and multiple agendas.
There is value in the ability to share freely but there is a catch...
We have to adapt to the web as researchers. We are not used to doing our science in public.
Preservation and saving our material is a huge problem. From a data point of view it's an enormous bogeyman problem but we have to do what we can in a realistic way.
e-Learning - the ability for repositories to not only preserve articles but also data and materials for e-learning. There are all sorts of ways of researching that are expensive and very non textual and the only account is journal articles and notes.
Business is part of academia and funding and business cases are important.
Repositories must be efficient and effective. It's for the repository to provide these services.
We have set up repositories to be like a box of lego - so you can put data together as lots of modular components
We need to Pimp our research ride.
The cloud: there is so much power and capability both in the cloud and on our computers. Fitting into day to day working practices and activities is crucial.
A repository is like gears on a bike. It mediates between the components of the research world
We don't need to be chemically enhanced to do well but repositories DO need to be helped to be as useful as they promise to be.
No technology on its own is the answer. It is the combination and the blend of people with technology that makes for so much more powerful a future.

Guy McGarva, EDINA: ShareGeo - Discovering and Sharing Geospatial Data

This is an overview of ShareGeo, a deposit tool that forms part of Digimap.
Digimap provides access to licensed geo data sources.
A lot of data exists out there and it's hard to find and use - especially if derived from licensed data.
The repository uses DSpace and allows stats and other functionality to track use and connect to services.
The data in ShareGeo can be either open access data or that derived or tied to licensed data.
Example data includes land use, grids, research generated metadata etc.
We are trying to formalize the process of sharing and reuse.
ShareGeo is based upon the work of the Grade project which found a need for this type of specialist sharing of licensed data.
We currently have fairly high numbers of logins, users and downloads but little upload of data.
We use a map based search, and spatial queries of data sets are enabled, making the finding of data easier for users.
The footprints of datasets are shown on maps, which is useful on its own and quite powerful when combined with spatial queries.
ShareGeo takes a single zipped file for deposit of material which means one or many files can make up a deposit. We have a size limit of 1GB at the moment but this is just to keep the service manageable.
The automatic ingest maps the data for ShareGeo and geospatial metadata is automatically added to the item in ShareGeo.
Licenses are part of the deposit process so that you always know what type of data and usage you are dealing with.
Issues regarding take up include the difficulties in the closed nature of the site. There are also commercial sites that provide data sharing facilities.
Future improvements include looking at sourcing and adding more open data, creating a sister open access version of ShareGeo etc.
The main issue for us right now is how to get more data deposited and how to build up our community of users.
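
The combination of dataset footprints and spatial queries mentioned above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: footprints are plain longitude/latitude bounding boxes, the dataset names and coordinates are invented, and nothing here reflects how ShareGeo itself is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """A dataset footprint as a longitude/latitude bounding box."""
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap unless one lies entirely to one side of the other.
        # Boxes crossing the antimeridian are not handled in this sketch.
        return not (
            self.east < other.west
            or other.east < self.west
            or self.north < other.south
            or other.north < self.south
        )


def search(footprints: dict, query: BBox) -> list:
    """Return the names of datasets whose footprint overlaps the query box."""
    return sorted(name for name, box in footprints.items() if box.intersects(query))


# Hypothetical datasets with rough, made-up footprints.
FOOTPRINTS = {
    "edinburgh_land_use": BBox(-3.4, 55.85, -3.0, 56.0),
    "wales_grid": BBox(-5.4, 51.3, -2.6, 53.5),
}
```

A query box drawn on the map around Edinburgh, for example `BBox(-3.3, 55.9, -3.1, 55.97)`, would match only the first dataset.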

Q&A for Group A

Q (Balvier, JISC): How are you doing the aggregation for ERIS?

A (James Toon): The NLS leads the aggregation. Very standard aggregation in use right now...

Follow up (Balvier, JISC): Have you spoken to Paul Walk at UKOLN as they are working on something in this area.

A (James Toon): We've chatted. Right now normalizing data is really what we want to be able to do to provide a good API.

Q: Regarding ShareGeo: How hard was it to get PostgreSQL to do the Geo searching?

A: Not too bad but we do it in quite a basic way from lists. We're looking at a more geospatial extension that might allow a more sophisticated solution.

Follow up comment from the floor: You might look at LocalSOLR.

Group B

Richard Jones, Symplectic: Symplectic Repository Tools

Richard will be talking about some of the repository tools that his company Symplectic produce.
Richard works on repository integration tools to go with the repository systems they make
The aim is how do we provide a deposit tool to make things easy and efficient.
We're starting with an image of Researcher publication lists which form part of their repositories.
And a full text tab - you can see what file you have uploaded, permissions and the ROMEO publisher policies.
Publications pull in data from lots of sources. They connect the repository with SWORD and AtomPub as the method.
Why not just SWORD? Well it's only designed for creating, not updating/removing/changing items.
AtomPub exists in a RESTful environment so extra functionality can be added in.
Some real complications though. Repositories are designed to be static but Symplectic is a more dynamic environment in a constant state of flux.
Repository workflows are a complication - there are three stages really: working copy; review; archiving.
So if you blend static and dynamic repository systems what do you get? A really complex slide - but Richard assures us that we can find out more in the tutorial tomorrow!
Benefits for the researcher - you can update and correct your data if/as needed which can also mean better metadata creation.
Where next? Lots of bells and whistles with repository tools linked to publication tools as well as helping to make standards grow.
Richard will be talking more about this tomorrow.

Julian Cheal, UKOLN: “Repository Deposit Using Adobe Air”

What is it that we're trying to capture?
Academics write things in their notebooks, and that's not easily absorbed into repositories.
You could make researchers work on computers...
There's a quote from Bruce Chatwin: "Losing my passport was the least of my worries. Losing my notebook was a disaster" - data is important to researchers and they need to know it's safely archived.
But we need to make it more straightforward for depositing materials.
Adobe Air is a runtime environment and combines the world of the web with the desktop. It's cross platform. It's a rich internet application.
So who's made an AIR application? Various Twitter clients, the BBC and various advertising companies for a start.
Academics want stats and relationships to funders etc.
Julian has thus made a prototype that looks - deliberately - very much like Flickr uploader.
The finished product is able to drag and drop, easy to use, and pretty to look at.
It's a small application - Julian is showing us all the files involved and it's only a few.
Academics want easy repositories so drag and drop functionality on their desktop is perfect. It uses an SQLite database so it can synchronize offline data as soon as your machine is re-connected to the internet.
The application talks to SWORD, looks up ROMEO, the Names project etc. to catch automatic metadata.
Screen shots indicate that you drag and drop, add metadata as you want. You can add lots or less metadata as appropriate. Auto-complete makes this easy.
JISC has offered to have a deposit event to combine all the deposit apps. This will take place in October.
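
The offline behaviour described above (queue deposits locally in SQLite, then send them when the machine is back online) might look roughly like this. The sketch is in Python rather than AIR's own languages, and the table layout, function names and `send` callback are all assumptions for illustration; a real client would make a SWORD deposit where `send` is called.

```python
import sqlite3


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local queue of deposits waiting to be sent."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " filename TEXT NOT NULL,"
        " title TEXT NOT NULL,"
        " synced INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def queue_deposit(conn: sqlite3.Connection, filename: str, title: str) -> None:
    # Called on drag and drop; works with or without a network connection.
    with conn:
        conn.execute(
            "INSERT INTO pending (filename, title) VALUES (?, ?)", (filename, title)
        )


def sync_pending(conn: sqlite3.Connection, send, online: bool) -> int:
    """Send every unsynced item if online; return how many were sent."""
    if not online:
        return 0
    rows = conn.execute(
        "SELECT id, filename, title FROM pending WHERE synced = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, filename, title in rows:
        send(filename, title)  # hypothetical hook for the actual deposit
        with conn:
            conn.execute("UPDATE pending SET synced = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

Marking each row as synced only after `send` returns means an item that fails part-way stays in the queue for the next attempt.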

Hannah Payne & Antony Corfield, The Welsh Repository Network: "The Welsh Repository Network: A tasty bit on the side!"

URNIP is the JISC repository enhancement project for the WRN.
Wales has a diverse HE landscape - institutions of very varied size with very different needs.
We have face to face and video conferences with institutions and we're doing site visits.
Each summer there is a library and IT development event and we are using this to communicate with our project partners
We share support calls and share work via Google code.
We are working with 4 partner institutions on deposit to see whether deposit increases with changes to the deposit process. But we are only at the pilot stage at the moment.
We will be reporting the best models, policies and possibilities as part of this project.
e-Theses and dissertations: the National Library of Wales already collects all paper copies but we want to see if they can be a hub for electronic deposit too.
The e-Thesis project will connect preservation and metadata functionality.
Auto-complete is a hit so we are looking at the work of a previous project Deposit Plait which looked at harvesting and checking data via web services.
Users wanted import and export of metadata to link to other educational databases and services.
Embedded players and multimedia deposit were the highest user priority. Holograms, art works and film are all key research outputs for various institutions in Wales.
We are not a standalone project but fit into the wider repository landscape. We want a cross-project forum so we can establish a set of services and support across repositories in the UK.
Diolch yn fawr am gwrando! (Thank you for listening).

Q & A for Group B

Q (Les Carr): The Scottish ERIS project comes from shared research; is the Welsh one more based on library collaboration?

A (Hannah): Yes. ERIS is very research focused. We are perhaps a step back looking at collaboration and development at this point.

Q (Hugh Glaser): For the Adobe AIR application, where do you get data for auto-complete functionality?

A (Julian): I use SWORD APIs where possible but some data you have to grab and work around as not everyone has a suitable API.

Q: You don't send data to the national centre for text mining for instance to find keywords?

A (Julian): If they have an API that can be used then I'd be very happy to tie that in to the tool.

Group C

Joyce Lewis, Southampton: "Marketing and Repositories - Tell me a Story"

Joyce is talking about the importance of stories and how repositories can help us to tell stories.
"People don't care about cold facts. They care about pictures and stories" - Nancye Green.
Back when Joyce started at the University they did news releases and the broadcast media weren't really targeted and only a few releases were picked up by the print media. Once published the story was also lost.
The university environment has changed now. Lots of universities, lots of research and lots of enthusiasm to show.
Quote being shown here is along the lines of the fact that universities do a poor job of telling investors what they get for their money.
At the moment Joyce tells the stories about the university through text with links and a picture but there is SO much more on the web that could be used. We don't want people to get tangled up in lots of unlinked resources on the web though.
Impact is key to the RAE and REF and this has to be thought about when we think about what we record and promote and how.
Project called Tell Tale is about telling the story of research. It's not funded yet but fingers crossed...
The project would catalogue adaptations to a repository necessary to capture the research story.
It would involve enhancing the content by putting it together into a story.
We want to create narratives automatically with story templates and narrative generation software that links around to items.
Then what is left is to demonstrate success through these stories and the usage of the content they talk about.
The bottom line is that we want to tell a better story.

William Nixon & Gordan Allan, Glasgow: "Enrich - Research System and Repository Integration"

William and Gordan are talking about the JISC funded project Enrich.
The project aims to bring disconnected research elements together.
Research systems are miles from repository systems...
What is a research project? It's an idea. It may or may not have funding or licenses or artifacts associated with it.
There is a sense of research alchemy. And some research MUST publish, others may not have to.
The research lifecycle includes a short burst of publishing but a lot of unpublished work.
The University of Glasgow's Research System tracks funding and licensing and we've started connecting that to the repository.
Repositories need to relate better with the research systems. Records can, when set up this way, now be pushed out via RSS and Twitter for instance. Enlighten is a service whose use has been growing more and more. They are at about 40% full text right now but a requirement to deposit material at the university should get the repository nearer to 100%.
In the old days repositories and research systems were separate silos.
Junction boxes are the future - we're about services. We are turning the repository into a junction box to other resources. For example data from repositories is used to generate publications lists on staff pages at the university. To add your publications you have to deposit them.
Most searches are not native - people come in via Google and other search engines.
We have freely available global open access.
But key to success are good relationships, easy clear systems and processes and the university policies really help us to be successful.
Enrich will bring together lots more data to tie into the hybrid repository/research Enlighten service.

Jo Walsh, EDINA: "Geoparsing text"

Jo wants to introduce herself to this community as she is just starting to move into a new role with EDINA and to engage with the repositories community.
The Geoparser - developed in this very building - is based on a grammar based named entity recognition technique that allows geo tags to be added to text automatically.
The recognition links to the Gazetteer service. They work well together and the more places you find, the more accurate the look up will be. Text context creates an idea of geo context.
GeoCrossWalk has been around for a while and it has a service status now. It uses Ordnance Survey to identify places. It's an enormous and useful service but it has been limited to Digimap users and licensed users. This confuses users.
This year we will also be expanding the service with an open access Gazetteer using geonames.org and the same type of system as the licensed version. The results will be variable BUT geonames has a wiki style ability to edit so errors can be identified and fixed.
The Geoparser webservice will be a simple RESTful API for document placename extraction and markup. You can use OS data OR the open data Gazetteer.
Jo is looking for a sense of user requirements and how this tool fits in specifically with repository needs.
Linking items across the repository seems to be one useful case.
You might use techniques to bootstrap geographic metadata for archives of textual components.
Spatially searching archives and nearby related material would be another use case.
Please contact Jo with comments, feedback and use cases.
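
For readers new to the idea, here is a deliberately simplified Python sketch of placename extraction against a gazetteer. It swaps the grammar based named entity recognition described above for plain string matching, and the tiny gazetteer with approximate coordinates is invented for illustration, so treat it as a toy rather than a picture of how the Geoparser works.

```python
import re

# Toy gazetteer: place name -> (latitude, longitude), approximate values only.
GAZETTEER = {
    "Edinburgh": (55.95, -3.19),
    "Glasgow": (55.86, -4.25),
    "Cardiff": (51.48, -3.18),
}


def geotag(text: str, gazetteer: dict = GAZETTEER) -> list:
    """Return (name, lat, lon, offset) for each gazetteer name found in text."""
    hits = []
    for name, (lat, lon) in gazetteer.items():
        # Whole-word match only, so "Glasgow" does not match inside longer words.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            hits.append((name, lat, lon, match.start()))
    # Report places in the order they appear in the document.
    return sorted(hits, key=lambda hit: hit[3])
```

The offsets make it possible to mark up the original text, and the coordinates could seed geographic metadata for a repository item.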

Q & A for Group C

Q: What kind of licence do you pick for the open access geo stuff?

A (Jo): Actually it's from other sources and inherited.

Q: How do you feel about where institutional repositories are going?

A (William and Gordan): We feel more like an 8 year overnight success right about now. We've visited every department in the university. Everyone asks How not Why deposit these days. They used to ask why they should. We've seeped into the research process. We've been very supported by our Vice Principal for research. We've really started to realise the potential of all the data we have been gathering. And we are in a post RAE, pre REF place so we're looking at how to repurpose the repository to suit that change best.
