Thursday, 1 April 2010

Last post to this blog

The DISC-UK DataShare project ended last year, and so this marks the end of this blog. We did write up our self-evaluation for JISC, our funder, in December, which any project followers might be interested to read.

As for followup activities, there are many. Oxford is undertaking the Embedding Institutional Data Curation Services in Research (EIDCSR) project, funded by JISC. Southampton also has a new JISC-funded project, Institutional Data Management Blueprint, which aims 'to create a practical and attainable institutional framework for managing research data that facilitates ambitious national and international e-research practice.' Edinburgh produced online guidance for University staff on research data management and continues to host and develop the Edinburgh DataShare repository service developed as part of the project.

All three partners are interested in pursuing institutional policies for research data management. We'll continue to collaborate actively through DISC-UK.

Thursday, 12 November 2009

DataShare presentation at DSpace User Group Meeting 2009

Robin Rice gave a presentation to the DSpace User Group Meeting in Gothenburg, Sweden, on 15 October 2009, entitled "Edinburgh DataShare: Tackling research data in a DSpace institutional repository."

The presentation is saved in the University of Gothenburg repository, GUPEA, along with a video of the live presentation.

Friday, 2 October 2009

As a follow-up activity to the DataShare and Data Audit Framework projects at the University of Edinburgh, a new set of web pages has been written as guidance for university researchers on research data management, sharing and preservation, published as part of a re-launched University of Edinburgh Information Services website.

The shortcut URL is:
http://www.ed.ac.uk/is/data-management

Similar initiatives have been collected and linked on a page from the Australian National Data Service.

Friday, 11 September 2009

My Faves for Thursday, September 10, 2009

Nature data sharing special edition, including details on pre-publication and post-publication data sharing and tools

[tags: Data curation, data sharing, data management]


Monday, 3 August 2009

Repository Fringe 2009: Round Table Sessions

We have an extra treat from Repository Fringe courtesy of Robin Taylor of Edinburgh University Main Library, who has kindly allowed us to share his notes on Round Table Sessions A and C:

Round Table A: "Practical impact and experiences of institutional OA mandates for IRs" (Helen Muir, Queen Margaret University)

Opening remarks from Helen Muir indicated that there had been resistance to the mandate:
  • Perception that it was a time burden.
  • Researchers resented feeling pressurised.
There are two approaches available:
Carrot approach
  • Emphasise benefits with Google Analytics.
  • Ease researchers' concerns by utilising the ePrints 'request a copy' feature.
  • Using Scopus to find missing articles.

Stick approach
Should there be any penalty for not complying with a mandate? The consensus was no.

Without exception all those present had a mediated deposit process. This might be library staff but might also be department research administrators depositing on behalf of staff. A perceived advantage of using research administrators was that they know the staff and therefore also know who to chase up etc.

Other observations from participants:
  • There is a difficulty in getting 'final' versions.
  • Targeting 'star' researchers appears to have a knock-on effect.

Additional thoughts on how to emphasise benefits to the researcher included:
  • Publications lists for web pages seen as an important incentive.
  • Statistics to demonstrate increased exposure.
  • Advocacy appears to work. Deposit rates increased amongst those who had had the benefits explained.

Amongst those institutions that have a mandate there was agreement that it could not be seen as the final solution. One such institution reported a 25% compliance rate. Nevertheless there was agreement that mandates were generally a good thing: they raise the profile of the IR and generate debate on the subject. Mandates are considered another useful tool in the battle for deposit.


Round Table C: "Where will repositories be in 5 years' time?" (Ian Stuart, EDINA)

The key points in this session were that:
  • IRs will become part of a wider research management process.
  • Research pools are difficult to manage - cross institution, cross discipline.
  • Identity management and naming authorities are important factors in delivering trusted and consistent repositories.
  • The hope and/or expectation is that IRs provide the core management of data with agile services using the data in varied ways. There is a need to avoid IR software becoming monolithic.
  • The expectation is that there will ultimately be many copies of items available from different sources on the internet. Is this in itself a valid form of preservation?
  • Will current peer review practices change? Could a more informal model work where works are 'reviewed' in the public domain (the internet)?
  • Who will manage the content? University departments? Do they have sufficient resources to do so?
  • Will IRs hold content or could they just be 'virtual' repositories pointing to the content held elsewhere? This brings us back to the role of IRs and preservation.
This was a lively and useful session, so it was only possible to capture a flavour of the questions and ideas raised.

Friday, 31 July 2009

Repository Fringe 2009: Closing Plenary and Thanks

Peter Burnhill is introducing our last session of the Fringe with thanks and acknowledgments of organizers and supporters. He has also paid tribute to Rachel Heery who was a friend and colleague of many at this event and who sadly passed away last week. A tribute to her from her colleagues can be found on the UKOLN website.

Finally Peter has handed over to our Closing Plenary by Clifford Lynch, Director of the Coalition for Networked Information (CNI).

Clifford arrived this morning and has been able to attend some very useful Round Table sessions and he's been reading up on the previous sessions so there will be some reflections on what's been said in the last few days in his talk.

Cliff opened by saying that, as some of us will know, he conducted a low-profile reconnaissance of the scene here about 10 days ago: one of the things I learned was that Edinburgh has been a hotbed of repository development since the 18th century. If you go to the records office you can get a copy of a document about the building of General Register House called "A proper repository"!

Today I want to talk about four major areas at length:
  • Repository services
  • Repository services in the broader ecosystem. I think we are also challenged to understand the scope and ambitions of what we call repositories.
  • Repositories and the life-cycle of what goes in repositories. There are some interesting questions there that we're probably not thinking about quite enough.
  • And finally I wanted to talk about selling or building support for repositories. It's not really the right heading. Some of the things I want to talk about are support, some are about the purpose of repositories and there are some other thoughts in there.
I want to start by commending Sally and Ben for their talk which, based on the blog, touched on many of the issues I've been thinking about and talking about and seeing as key issues for repositories for some time. I'll have a few things to say which will lead on from their talk but I think there is a great deal of abiding knowledge in that talk and it is a great picture of where we are now and where we might go.

Repository Services
One of the things I found so interesting about the discussions this morning has been the distinction that many people seem to be making between data repositories and what they call repositories of scholarly outputs, e-prints or similar terms. They are perceived, with some justice, to have some different characteristics and behaviors. You could spend a lot building far more storage than an e-print repository would ever need: people just can't write that fast. But when you get other types of data in there the storage is consumed rapidly and you hit real policy issues about curation and the rationing of storage.

I think in the States people often think that a repository can do everything. It's interesting to me to turn it around and talk about e-print repository services, data repository services: essentially thinking of suites of services rather than the underlying implementation, servers, storage etc. You get a very different view of needs when you see it that way. When you look at data we may see repositories that range from generic data to repositories that deal with specific genres of data in particular areas. We already see some of the latter in subject repositories. In molecular biology for instance most of the repositories just take very very specific types of objects in specific formats that are disciplinary in nature. Stressing the emphasis on services here is quite helpful.

Repositories in the Ecosystem
Let me say a bit about repository services as part of a bigger and much more complex ecosystem. There was an International Repositories Workshop that JISC and others sponsored in March in Amsterdam that got into this a little. Repositories connect to things that provide support for e-science, workflow management, high performance storage and computing, etc. What we have to be thoughtful and careful about is the scope of what goes on in a repository. I've seen people say that we should put repositories in the workflow for e-science. I've seen them put in the context of high volume, high reliability storage. I'm not sure we want our repositories to be that volatile. I'm sure that there is a space for lots of high quality storage facilities to support academia doing IT-enabled work. But is that the space for repositories? Finding the boundaries will be very delicate and very messy.

Also as we put more kinds of things in repositories we have to look at how much of a use or computational environment that repository becomes.

For example, what happens if you want to put video in a repository? If I have a big video file I can deposit it into a conventional repository and I can take it out of the repository and use whatever tools I have to view it. On the other hand I might deposit it into a smart video repository that will fix the format, calculate the bandwidth needed to view (and perhaps adjust the format of delivery according to users' bandwidth). That would be a complex repository tailored to the types of objects deposited.

But what about objects like videogames? Do we embed a videogame environment in the repository or do we leave it to the user? These issues have a lot of implications for the development and ongoing support costs of repositories. As we put data or e-prints into repositories we have to work out how much computational work you can do in the repository. Can you do complex or only basic computational work? Or do you take data out to do computational work on materials? These issues get at how you build interfaces, how semantic they are, how flexible they are.

We are currently very sloppy about replication and the semantics of replication. We have rule-based processes about submissions for instance and the idea of propagating to multiple repositories - this seems sensible and reasonable. We also know replicated geographically distributed storage is a good thing. We have a suspicion that not going it alone on storage is a good thing - it's better to share and collaborate in preservation and curation. How replication for reliability and propagation to multiple collections will work still needs a lot of work, as do provenance and version tracking. This is an area that requires some proper consideration. There is also an issue of storage and the cloud.

Further there is the issue of software. People seem comfortable with the concept that software has a place of some sort in repositories, but if we think of repositories in the wider information ecosystems, people have been thinking about replication and preservation already. Software designers have their own repositories of code and versions and it's unclear to me how our repositories relate to these.

The last ecosystem point I want to make - and it was discussed at some length in a round table earlier - is how we connect repositories and especially institutional, but also disciplinary, repositories into the new name environment that is emerging from publishers, Web of Science, journals etc. You have a number of identity management and federation changes that all result in IDs being created. All of these things are at play here. Essentially IRs set you up for the challenge of doing name authority for your local community, but in the context of things like identity management.

One of the fascinating byproducts of a recent workshop was finding that institutions view your name as being your name according to humans and payroll services. That name may change occasionally. They have not gone as far as saying that names may not be in Roman characters anymore - that what appears in Roman characters could just be an extra name version or transliteration of their name (the original may be in Chinese characters for instance). There is some capacity for name change aliases but it hasn't occurred to them that people publish under versions of their name that may not be what is on their paycheck. People are not always consistent and there is little provision for "literary names" in university identity management. We'll regret that soon if not already. We need to think big about names. I am becoming more and more convinced that personal names are a really important part of scholarly infrastructure and we'll see some fascinating convergence of genealogy databases and ID databases into review and scholarly literature. We have the strong component of faculty biographies that are linked to scholarly work already and all of that integrates into giant biographical and factual networks. I use factual to differentiate from mental state and influences. I mean the dull facts like jobs held, papers written, very factual sorts of data. I think there's a really big set of developments here that connect into repositories. It would be very helpful to sit back and think hard about this.

There's one other spanning service that we talked a little about in the grand challenge session, and that's the issue of annotation, and annotation of information resources across data and repositories. Herbert Van de Sompel has got funding to work on this lately and Zotero is doing some interesting work here. There will be some interesting findings to tie into scholarly dialogue.


Repositories and the life-cycle of what goes into repositories
Repositories to date have been about getting stuff in and making it accessible. We are not far enough in for stewardship, preservation and management of what is in there to be our major concern, but it is important that we don't forget about these. They will be important over time. So for example how much specific disciplinary knowledge do you need to classify a specific piece of scholarly knowledge? The depositor knew and understood these things. If there is a serious issue the user can contact the depositor. Short term that's fine. Very long term the contributor may be dead. You may have to make assessments of whether it's worth keeping the data set in light of subsequent developments. You may have to look at formatting and presentation and linking of data depending on what else is published and occurring. The equation really changes when you look at the near term versus the long term.

I also think it's helpful to move away from the notion of preserving things forever. It seems more helpful to say "my organisation will take responsibility for preserving this for 20 years. After 15 years we will reassess if we keep it; whether the responsibility will transfer to another party; or whether perhaps we discard it at the end of the period". This is a much more structured kind of thing than just saying we'll keep it forever. These more realistic timescales are probably a very good way to structure arguments about preserving items in repositories right now.
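As a rough illustration only, the time-limited commitment described above could be recorded alongside each deposited item. This is a minimal sketch in Python, not anything presented in the talk: the class name `RetentionCommitment`, its fields and the status labels are all hypothetical, and the 20 and 15 year figures are simply the ones quoted above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetentionCommitment:
    """Hypothetical record of a time-limited preservation promise."""
    accepted: date                 # date the item was taken into the repository
    commitment_years: int = 20     # how long we promise to preserve it
    review_after_years: int = 15   # when we reassess keep / transfer / discard

    def _plus_years(self, years: int) -> date:
        # Move forward by whole years; 29 February falls back to 28 February
        # when the target year is not a leap year.
        try:
            return self.accepted.replace(year=self.accepted.year + years)
        except ValueError:
            return self.accepted.replace(year=self.accepted.year + years, day=28)

    @property
    def review_date(self) -> date:
        return self._plus_years(self.review_after_years)

    @property
    def end_date(self) -> date:
        return self._plus_years(self.commitment_years)

    def status(self, today: date) -> str:
        """Return which phase of the commitment applies on the given day."""
        if today < self.review_date:
            return "preserve"
        if today < self.end_date:
            return "reassess"
        return "commitment ended"


commitment = RetentionCommitment(accepted=date(2009, 7, 31))
print(commitment.review_date, commitment.end_date)
print(commitment.status(date(2025, 1, 1)))
```

The point of the sketch is only that the promise has explicit dates attached, so the reassessment (keep, transfer or discard) becomes a scheduled decision rather than an open-ended "forever".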

Repositories in the bigger economic environment
I note a couple of things. Firstly I have a feeling that we have to get a lot more serious about the idea of "do what you can with bulk ingest now and deal with it later". There are a lot of materials coming at us now that we are not prepared for. We have to do our best to take them in and pass them on. In the current economic climate corporations of long historical status are evaporating and some sort of memory organisation needs to take on their corporate records and archives. At least in the States some government records - especially at local levels - are in peril and we have to think about sudden onslaughts of material not covered by "give us £50 million for this work" and that instead need instant very low cost action.

I'm struck by the question of the extent to which we are trying to market services to a user community rather than promoting collections created by the services. I think we present both sides of that in a confusing way. Are we doing something for our users as the service or is the value in the resultant collection? I think that ties closely to mandates for research data and open access - they favour the service side. But there is also the ever popular critique that "I looked in your repository and there's really nothing interesting there".

There is a thought I'd like to share: at the rate that the economic structure of scholarly communications is changing, and given the kinds of economic pressures at some institutions, repositories at the institutional or disciplinary level may take on far more importance far more quickly than we imagined: they may become the main access route to scholarly information. Given serials budgets and the pressures to channel money into data curation I can see a time when repositories are a primary - not a supplementary - access path to scholarly materials.

I will close with two fringey propositions.

The first, which I have been thinking about on and off for a couple of years, is about how to make the case for repositories at an institutional level. It may be wrong to think of lots of transitional deposits; it may be that departments interact with repositories infrequently but on a big scale. Suppose we set up a programme for a distinguished academic so that we sent someone to collect scholarly materials and move copies of those into the repository in an orderly fashion, creating a legacy of material from the subject honouring their work. I think that would be very attractive to scholars and could be seen as a real way to honour individuals and highlight institutional achievement. This may be a better way to gain support for a repository than chasing everyone for every paper.

Secondly, IRs need to move from just being for academia and research centres. There may be a public variation, perhaps local repositories or interest-based repositories, which may be separate from the academy but may connect to it in some ways, and we need to think about how we connect to those and build integrated access to information that has scholarly impact for the long term, as well as making scholarly materials more available to the wider public.

I throw these ideas out for the future.

It's very striking how much progress has been made with repository deployment. The software still has rough edges, deployment strategy has rough edges, but I come away from discussions like this with the feeling that there is very substantial momentum. And we need to use that wisely, especially when we think about where to put investment effort into repositories. I think that trying to do everything will dissipate momentum, but there is enough momentum that many things are achievable at this point.

And I think that's where I should finish.


Q & A

After an enthusiastic response from the audience Peter Burnhill reflects on Cliff's talk. This is the second repository fringe - they are supposed not to be too formal. Peter asks that when we all blog about this event we should think about what is or is not working at these events. The idea we have is that we try new things and see how things can change etc. So on that note, how do we start the questioning? There are so many threads here to explore - we'll be teasing some of these out at the pub later I think but let's get that started here...

Q (Sheila Cannell): On the economic point, at the moment repositories are very tied up with scholarly economics and the relationships with publishers. How could the economic crisis lead to a sudden change? I can't quite see how this would happen. My fear is that we continue with a number of models not knowing what we should be doing.

A (Cliff): I think it's unlikely to change for all disciplines at once or in the same way. I am struck by just how bad the economic situation is in the States. There have been huge cuts to the library budgets. I live in California and the library and University financial cuts are astounding at the moment. I can imagine getting to a point in a few years where you might ask the high energy physics folks if they could pay for the physics archive or if they would rather keep the journal subscriptions. They'd obviously prefer both but pushed they'd probably go for the archive. Second tier journals will fall off subscriptions lists but from time to time they publish important articles and these become invisible if not accessible some other way. I'd like to think we'll see an explosion of overlay journals cherry picking from the content in IRs but I'm pretty sceptical about that at this point. My personal, and rather unpopular, view is that we will back off a lot on peer review in a lot of disciplines. I think the biomedical sciences will hang onto it but I think we have been profligate in using unnecessary manpower and increasingly the systems just don't work properly. We're not being strategic enough about where we do peer review and where we don't. My guess is that you'll see more direct distribution of scholarship through repositories or reconstituted university press organisations rolled into libraries than you've seen in the past.


Q (Ian Stuart): You had a comment on the peer review process. Is it the concept or the method that you think causes the problem?

A: There is a bunch of problems: the volume of literature; the person hours in the process; the tendency to send stuff out for peer review even when it is obviously trash or not good candidate work; there is a culture of materials being submitted, rejected and submitted for the next tier journals so the same piece of work is reviewed multiple times. Can we afford these processes? If people did their fair share of peer review it would be something like 8 weeks per year out of someone's productivity.


Q (Les Carr): I was initially horrified by the idea that you think of the repository as a place to store work at the end of careers - a mausoleum concept crept into my head - but you could equally talk about the other end of careers and the proof of your value when trying to get tenure. Perhaps using repositories as the way to show the value of contribution is what I will take away to think about.

A: I think your choice of the tenure review as a potential intervention point is a really interesting one. At a well run institution you marshal transactional work, you not only look at your publications but you also talk about what you see happening in the future and you could represent that in a repository in a very useful way. You could even move tenure cases into the repository too. It is a really interesting idea.


Comment (Peter Burnhill): Two things you spoke about, archival and appraisal responsibility, are very interesting. And also that it is not just our stuff - necessarily the repository movement has been concerned with academia, with our stuff, but we need to look out to the wider digital world and representation of that for future scholars.

Response (Cliff): I think there is an interesting example here. The Library of Congress put a bunch of pictures on Flickr. What they thought they were doing was seeing if tagging is a useful retrieval mechanism. What was interesting was that they invited people to comment on (as well as tag) these photos. These were images of the history of American life - locomotives, airplanes etc. It turns out that there are people around who will tell you the entire history of a locomotive, and the maintenance manual, and the book they wrote on it, etc. just from seeing an image. When you put up these pictures all that stuff comes into play as surrounding documentation. We don't have a good place to house that right now. This is stuff of schizophrenic scholarly use. Scholars aren't much interested until it is useful for their research somehow. I think there's some really interesting stuff to look at here.

And with that further food for thought Peter closed the questioning there by thanking Cliff for an excellent summary of his excellent keynote.

And finally Peter handed over to Ian Stuart for the results of the Grand Challenge which took place earlier today.
Ian Stuart (EDINA), Balviar Notay (JISC) and Ben O'Steen (Oxford University Library Services) were the judging panel. There were 4 very different submissions.

The question asked for enhancements to a repository in the interest of the researcher. After much discussion we really liked what Patrick McSweeny had done. Basically it was a mash-up of information from the e-print repository and his university website that managed to produce a single information page about a specific researcher which was automatically kept up to date.
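To give a flavour of what such a mash-up involves, here is a minimal sketch in Python. It is not Patrick's implementation, which was not described in detail: the function name `build_profile`, the record fields and the sample data are all hypothetical stand-ins for whatever the e-print repository and the university website actually expose.

```python
def build_profile(researcher_id, staff_directory, repository_records):
    """Combine a staff directory entry with repository records for one person.

    Both inputs are plain Python structures standing in for data that a real
    mash-up would fetch from the university website and the repository.
    """
    person = staff_directory[researcher_id]
    # Keep only the records that list this researcher as an author.
    papers = [r for r in repository_records if researcher_id in r["author_ids"]]
    # Newest publications first.
    papers.sort(key=lambda r: r["year"], reverse=True)
    return {
        "name": person["name"],
        "department": person["department"],
        "publications": [f'{r["title"]} ({r["year"]})' for r in papers],
    }


# Hypothetical sample data for illustration only.
staff_directory = {
    "abc1": {"name": "A. Researcher", "department": "Chemistry"},
}
repository_records = [
    {"title": "First paper", "year": 2007, "author_ids": ["abc1", "xyz9"]},
    {"title": "Unrelated paper", "year": 2008, "author_ids": ["xyz9"]},
    {"title": "Second paper", "year": 2009, "author_ids": ["abc1"]},
]

profile = build_profile("abc1", staff_directory, repository_records)
print(profile["name"], profile["publications"])
```

Because the page is rebuilt from the two sources each time, it stays up to date without the researcher maintaining a separate publications list.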


Congratulations Patrick!


And with that the 2009 Repository Fringe draws to a close.

Repository Fringe 2009: Afternoon Tutorial (Open Data)

Fresh from lunch, a brief chat with Clifford Lynch (he's been reading the blog en route to the event) and some overheard checking of cricket scores, we're moving on to the afternoon Tutorial.

Jordan Hatcher (ipVP) & Jo Walsh (EDINA): Implementing Open Data

A chat on the work of the Open Knowledge Foundation, including CKAN and Knowledge Forge (led by Jo) and then an in-depth session on the legal side of open data, covering the new legal tools available through OpenDataCommons.org, including a database-specific copyleft license.

Jo opens with an overview of the Open Knowledge Foundation. Rufus Pollock started the Open Knowledge Foundation and it's entirely volunteer run. The idea is to look at what has been successful in free and open source software communities and take the most successful parts of that and apply it to open data and open knowledge. It's not a campaigning organization but very much a building foundation. It's all about how things can be reused. Various projects exist, some are about publishing data, some are standards type efforts, others are exemplars and pilots. There are four principles that characterise an open approach to knowledge management:
  1. Incremental development - not one big master release.
  2. Development is decentralized.
  3. Development is highly collaborative.
  4. Development is componentized - so data is in modules and packages that can easily be reused.
For example CKAN (Comprehensive Knowledge Archive Network) is very much about a component type model for maintaining knowledge. The volumes of data involved can be a problem though so we're now looking at something called the Open Grid to find ways to deal with major usage of shared open knowledge.

You hear a lot about "Open" but what does that actually mean? People call their data open or free but how do these principles apply to open data? Rather than starting from a licence, one of the first moves was to build an open knowledge definition. So these are specifically less restrictive than some Creative Commons licenses.

Jo hands over to Jordan, who is going to talk about how these principles and licenses relate to other projects. Jordan is an IP lawyer as well as being a director of the Open Knowledge Foundation. A project that has been worked on with Edinburgh University is Open Data Commons. Jordan will talk about various aspects of this work.

The origins of Open Data Commons came from Talis. They are part of the semantic web community and open innovation. They knew that open data was important for the access to knowledge community (e.g. Open Government). In the sciences data sharing is particularly important as the cost and effort of collecting data is huge (for example space exploration is far too costly to replicate). Talis knew that IP relates to databases so the importance of data + the legal restrictions on data = a need for a legal solution.

So, copyright is really 3 things. It covers a lot of stuff (films, buildings, programs, etc.), although it doesn't cover facts; it is an ownership restriction; and licences are how you unpackage what you own and what restrictions you put on sharing work with others. A database is a blend of schema, tables and data entry/output sheets, and that sits in database software. Some material will be copyright of the software, some will be content agnostic. Copyright covers data, the database and the database software - each in different ways. Challengingly, the database rights apply to both the database AND the database content.

So by Open Data Jordan is talking about applying open data to databases and data in databases.
Open means use, reuse, etc. Sharealike is ok (as per GPL or CC licenses) but when other open licenses were created there wasn't a lot of licensing work around open data. You can kind of think about it like a market adoption curve. Linux and the concept of open source software is mature. CC (Creative Commons) is in the early majority stage. Open data is at the early adopters stage.

The reason data gets its own category is that it is different from software and content both legally and in what you do with it. So, a word about Creative Commons licences:
  • Not all relevant rights are licensed (the database right in particular) in most CC licences.
  • CC licences are not consistent across database rights.
  • CC is not written for databases. For instance attribution can be different for databases.
  • ShareAlike is unclear when combining one work with another (collective work).
  • CC doesn't address software as a service.
Creative Commons don't really want you to use their licence for databases and have stated as much on the Science Commons website.

Open Data Content: Legal Tools
The Open Database Licence (ODbL) intends to solve the shortcomings of Creative Commons. It was completed quite recently. The licence has gone through 4 comment rounds including various legal experts across the world.

You use the database to produce a subset of data - a produced work. The ODbL works on both the database and the produced work. So out of the galaxy of IP rights it picks recognised licences in the agreement but in an open way. ODbL is a worldwide license (following the example of the GPL). It is a human readable licence - not a huge block of legalese.

Attribution wise ODbL requires copyright notices to stay with the database. Produced work requires a brief notice of the source and that it is licensed under the ODbL - this is effectively consumer protection. You will not discover an interesting produced work only to find that your access to view or work with the original data is blocked (or vice versa).

Share-Alike allows derivative databases, but they are required to be under ODbL; you can, however, license a produced work differently from the original database.

Database Content Licence (DbCL) aims to licence the content of databases. It's at the content layer and covers individual copyright that might be present.

OpenStreetMap is considering moving from a CC licence to an Open Data type licence.

The Public Domain Dedication & Licence (PDDL) allows you to wholly hand over your copyright. Where this is not possible (some geographical territories) you instead license users to do what they like with your data. PDDL + Community Norms - these are complementary components to let you indicate expected behaviours for the database you have opened to the community. You can't sue someone for plagiarism. Your protection is academic norms, enforced by social pressure and maybe an honour code. Jordan uses The Bluebook as an analogy. This is the standard system of citation - you can't get published in the US if your citation doesn't match up. Not a legal issue but a community norm issue.

Science Commons is based in Boston. They are the science end of Creative Commons. They were addressing science issues in about 2005 (but not very publicly). They came up with a protocol for implementing open access data with maximum flexibility and without licensing hang-ups. So this standard was developed online. The protocol is not a licence but sets a standard for licences for science data.

Comparing where we're at (this is a Venn diagram but I haven't grabbed a picture I'm afraid): Open Data Commons has ODbL and PDDL. PDDL overlaps with Science Commons. Creative Commons sits within Science Commons.

Closing Thoughts
There has been a suggestion of a rift between Creative Commons and the Open Knowledge Foundation. Both organisations agree that public domain databases are a good idea and that science in particular in the public domain is super. There are some licence details we don't agree on but broadly we're aiming at the same issue.

A little lesson to close on: if you look at the lunar landing video that is currently preserved you find that the quality is terrible despite the fact that high quality images were recorded at the time. NASA had budget issues regarding the cost of data storage and (accidentally) recorded over the original in the 1980s. My point is that you should not let IP get in the way of preservation - sharing can be much safer.


Q & A

Q (Peter Burnhill): Was "Copyleft" a joke when it was thought up, is the concept useful at this point?

A (Jordan): The GPL has some fairly formal structure. We have evolved to version 3. It has a formal structure that includes the legal requirement to share. There is a rich history of jokey attempts to subvert the notion of copyright but Copyleft is fairly well established and understood in the IP world now.


Q: What is the difference in those licenses between derived database and produced work?

A: Produced work might be the contents of the field and not the field names or structure. There is a little cross over as the definition of the database in the licence is fairly broad. So there is some blurriness about whether it's a produced work or a derived work. This has come up in the OpenStreetMap context. This can be quite a technical issue to look at. The licence acts as a constitution - a guiding document on how you should apply things. If a community isn't served by the licence they can also come up with their own guidelines that sit with it.

Q (Robin Rice): Could you talk a little more about the norms and particularly how that could work in a repository environment? When I think of my typical users - researchers - they may be happy with the public domain licence if they don't think about it too much. But if they felt someone could make commercial profit out of that data they would be upset and maybe Copyleft would be better.

A (Jordan): So the GPL is not automatically non-commercial. A non-commercial clause can cause confusion about what counts as commercial. Who thinks that anything done by a charity is non-commercial? It's not clear cut. Some view non-profit or charity usage as non-commercial BUT the law generally sees the type of use, not the type of user, as key. People have different views. People generally licence under CC licences and it is up to them (not CC) to enforce the licence they choose. Educational use is an interesting issue too. Is a university non-commercial? Are CPD courses non-commercial?

Comment (Peter Burnhill): In some ways the Tax Man decides what is and is not commercial. So the university is a charity, but their catering arm is a commercial company...

Response (Jordan): Really it is the person who shares their work that decides what is commercial and non commercial. This is a contract and a legal one so you need to know what all parties think the definition of the terms is when they make their agreement. But we (the Open Knowledge Foundation) are not going to do a non commercial licence because that's not our wish.

Follow Up (Robin Rice): So what does sharealike mean for content in these contexts?

A (Jordan): So I share a photo with a sharealike attribution open knowledge type licence. A blend of that photo with another would be a derivative work. Think of it in layers. Layer one is my copyright, layer two is your copyright. Sharealike makes you use the same sort of licence for what you do with my image. It creates a sense of a commons. You have to give back to the community if you use the data. Now with the non-commercial CC licence no-one can use an item derived from data for profit.

Q (Robin Rice): How do you communicate norms to your users?

A (Jordan): There is a document which provides an example of how you might want to lay out your expectations of how people should behave with your database. You can be at the high end (like making a website work in such a way that users have to read terms and conditions) or you can be more casual about what you want. There are ways to make sure you are cited properly for instance - you can have rules or community pressure for that. If people publish on the web it is (semi) possible to detect a use that isn't fairly citing or attributing the originator. BT had a project with open source code in it which they didn't disclose, and this came out through naming and shaming by the community online. Social pressure is very useful even without any legal restrictions.

Q (Peter Burnhill): If you do want to put restrictions on use you are into Copyright. If you go towards Copyleft and Open - it's not that commercial use is improper but that it is not attributed. Attribution is a really big issue if you go properly open. Modifying and changing work is scary. Hopefully there is an indication of the original version - data acknowledgement forms really.

A (Jordan): I have had Flickr photos with CC licences used in commercial contexts. I'm not that bothered as it's not my primary profession. Now academics do have a gut reaction to people making money from their work but there is an advantage to stuff being open. It's really useful to stop data being stored in just one place. Why wait around for the government to release a reformatted website of data? Just put it out there so that it can spread. There are distinct benefits for use and reuse. There are economic benefits to making material available in this way: 1 + 1 = 5 in this context sometimes.

Comment (Jo): One of the issues for academics is impact, profile and reuse. Turning scholarly citation into data sharing and collaboration. Real benefits there.

Comment (Peter Burnhill): Statements that travel (or should travel) with data. One is a data disclaimer - public data in the US often includes a document about how to cite this data and also a statement that the responsibilities of the data producer stop at the original data set. All you do is support what you put out there.

Comment (Jordan): The UK government does this too. It states that data is in the public domain but you can't misattribute this data to be official government data if it has been changed or analysed. Basically they disconnect their authority when you work with the data. Effectively it's a licence. Different licences can be a nightmare when you get past one dataset. Licences may conflict and you might have 100 datasets to deal with. With data in the public domain you know what you can do and how you can treat the data. The problem in the academic domain is that people can end up licensing data they don't have the rights for.