Friday, 31 July 2009

Repository Fringe 2009: Afternoon Tutorial (Open Data)

Fresh from lunch, a brief chat with Clifford Lynch (he's been reading the blog en route to the event) and some overheard checking of cricket scores, we're moving on to the afternoon Tutorial.

Jordan Hatcher (ipVP) & Jo Walsh (EDINA): Implementing Open Data

A chat on the work of the Open Knowledge Foundation, including CKAN and Knowledge Forge (led by Jo), and then an in-depth session on the legal side of open data, covering the new legal tools available through OpenDataCommons.org, among them a database-specific copyleft licence.

Jo opens with an overview of the Open Knowledge Foundation. Rufus Pollock started the Open Knowledge Foundation and it's entirely volunteer run. The idea is to look at what has been successful in free and open source software communities, take the most successful parts of that and apply them to open data and open knowledge. It's not a campaigning organization but very much a building foundation. It's all about how things can be reused. Various projects exist: some are about publishing data, some are standards type efforts, others are exemplars and pilots. There are four principles that characterise an open approach to knowledge management:
  1. Incremental development - not one big master release.
  2. Development is decentralized.
  3. Development is highly collaborative.
  4. Development is componentized - so data is in modules and packages that can easily be reused.
For example, CKAN (Comprehensive Knowledge Archive Network) is very much about a component type model for maintaining knowledge. The volumes of data involved can be a problem, though, so we're now looking at something called the Open Grid to find ways to deal with major usage of shared open knowledge.

You hear a lot about "Open" but what does that actually mean? People call their data open or free, but how do these principles apply to open data? Rather than starting from a licence, one of the first moves was to build an Open Knowledge Definition. The terms of that definition are specifically less restrictive than some Creative Commons licences.

Jo hands over to Jordan, who is going to talk about how these principles and licences relate to other projects. Jordan is an IP lawyer as well as being a director of the Open Knowledge Foundation. A project that has been worked on with Edinburgh University is Open Data Commons. Jordan will talk about various aspects of this work.

The origins of Open Data Commons came from Talis. They are part of the semantic web community and open innovation. They knew that open data was important for the access to knowledge community (e.g. Open Government). In the sciences data sharing is particularly important as the cost and effort of producing data is huge (for example, space exploration is far too costly to replicate). Talis knew that IP relates to databases, so the importance of data + the legal restrictions on data = a need for a legal solution.

So, copyright is really 3 things. It covers a lot of stuff (films, buildings, programs, etc.), although it doesn't cover facts, BUT it is an ownership restriction, and licences are how you unpackage what you own and what restrictions you put on sharing work with others. So a database is a blend of schema, tables and data entry/output sheets. And that sits in database software. Some material will be copyright of the software, some will be content agnostic. Copyright covers data, the database and the database software - each in different ways. Challengingly, the database rights apply to both the database AND the database content.

So by Open Data Jordan is talking about applying open data to databases and data in databases.
Open means use, reuse, etc. Sharealike is ok (as per GPL or CC licences) but when other open licences were created there wasn't a lot of licensing work around open data. You can kind of think about it like a market adoption curve. Linux and the concept of open source software is mature. CC (Creative Commons) is in the early majority stage. Open data is at the early adopters stage.

The reason data gets its own category is that it is different from software and content both legally and in what you do with it. So, a word about Creative Commons licences:
  • Not all relevant rights are licensed (the database right in particular) in most CC licences.
  • CC licences are not consistent across database rights.
  • CC is not written for databases. For instance attribution can be different for databases.
  • ShareAlike is unclear when combining one work with another (collective work).
  • CC doesn't address software as a service.
Creative Commons don't really want you to use their licence for databases and have stated as much on the Science Commons website.

Open Data Content: Legal Tools
The Open Database Licence (ODbL) intends to solve the shortcomings of Creative Commons. It was completed quite recently. The licence has gone through 4 comment rounds involving various legal experts across the world.

You use the database to produce a subset of data - a produced work. The ODbL works on both the database and the produced work. So out of the galaxy of IP rights it picks recognised licences in the agreement but in an open way. ODbL is a worldwide licence (following the example of the GPL). It is a human readable licence - not a huge block of legalese.

Attribution wise, ODbL requires copyright notices to stay with the database. A produced work requires a brief notice of the source and that it is licensed under the ODbL - this is effectively consumer protection. You will not discover an interesting produced work only to find that your access to view or work with the original data is blocked (or vice versa).

Share-Alike allows derivative databases, but they are required to be under the ODbL, although you can license a derivative differently to the original database.

The Database Contents Licence (DbCL) aims to license the content of databases. It's at the content layer and covers individual copyright that might be present.

OpenStreetMap is considering moving from a CC licence to an Open Data type licence.

The Public Domain Dedication & Licence (PDDL) allows you to wholly hand over your copyright. Where this is not possible (some geographical territories) you instead license users to do what they like with your data. PDDL + Community Norms - these are complementary components to let you indicate expected behaviours for the database you have opened to the community. You can't sue someone for plagiarism. Your protection is academic norms, backed by social pressure and maybe an honour code. Jordan uses The Bluebook as an analogy. This is the standard system of citation - you can't get published in the US if your citation doesn't match up. Not a legal issue but a community norm issue.

Science Commons is based in Boston. They are the science end of Creative Commons. They were addressing science issues in about 2005 (but not very publicly). They came up with a protocol for implementing open access data with maximum flexibility and without licensing hang-ups. So this standard was developed online. The protocol is not a licence but sets a standard for licences for science data.

Comparing where we're at (this is a Venn diagram but I haven't grabbed a picture I'm afraid): Open Data Commons has ODbL and PDDL. PDDL overlaps with Science Commons. Creative Commons sits within Science Commons.

Closing Thoughts
There has been a suggestion of a rift between Creative Commons and the Open Knowledge Foundation. Both organisations agree that public domain databases are a good idea and that science in particular in the public domain is super. There are some licence details we don't agree on but broadly we're aiming at the same issue.

A little lesson to close on: if you look at the lunar landing video that is currently preserved you find that the quality is terrible despite the fact that high quality images were recorded at the time. NASA had had budget issues regarding the cost of data storage and (accidentally) recorded over the original in the 1980s. My point is that you should not let IP get in the way of preservation - sharing can be much safer.


Q & A

Q (Peter Burnhill): Was "Copyleft" a joke when it was thought up? Is the concept useful at this point?

A (Jordan): The GPL has some fairly formal structure. We have evolved to version 3. It has a formal structure that includes the legal requirement to share. There is a rich history of jokey attempts to subvert the notion of copyright but Copyleft is fairly well established and understood in the IP world now.


Q: What is the difference in those licenses between derived database and produced work?

A: Produced work might be the contents of the field and not the field names or structure. There is a little cross over as the definition of the database in the licence is fairly broad. So there is some blurriness about whether it's a produced work or a derived work. This has come up in the OpenStreetMap context. This can be quite a technical issue to look at. The licence acts as a constitution - a guiding document on how you should apply things. If a community isn't served by the licence they can also come up with their own guidelines that sit with it.

Q (Robin Rice): Could you talk a little more about the norms and particularly how that could work in a repository environment? When I think of my typical users - researchers - they may be happy with the public domain licence if they don't think about it too much. But if they felt someone could make commercial profit out of that data they would be upset and maybe Copyleft would be better.

A (Jordan): So GPL is not automatically not commercial. A non commercial clause can cause confusion about what counts as commercial. Who thinks that anything done by a charity is non commercial? It's not clear cut. Some view non profit or charity usage as non commercial BUT the law generally sees the type of use not the type of user as key. People have different views. People generally licence under CC licences and it is up to them (not CC) to enforce the licence they choose. Educational use is an interesting issue too. Is a university non commercial? Are CPD courses non commercial?

Comment (Peter Burnhill): In some ways the Tax Man decides what is and is not commercial. So the university is a charity, but their catering arm is a commercial company...

Response (Jordan): Really it is the person who shares their work that decides what is commercial and non commercial. This is a contract and a legal one so you need to know what all parties think the definition of the terms is when they make their agreement. But we (the Open Knowledge Foundation) are not going to do a non commercial licence because that's not our wish.

Follow Up (Robin Rice): So what does sharealike mean for content in these contexts?

A (Jordan): So I share a photo with a sharealike attribution open knowledge type licence. A blend of that photo with another would be a derivative work. Think of it in layers. Layer one is my copyright, layer two is your copyright. Sharealike makes you use the same sort of licence for what you do with my image. It creates a sense of commons. You have to give back to the community if you use the data. Now with the non commercial CC licence no-one can use an item derived from data for profit.

Q (Robin Rice): How do you communicate norms to your users?

A (Jordan): There is a document which provides an example of how you might want to lay out your expectations of how people behave with your database. You can be at the high end (like making a website work in such a way that users have to read terms and conditions) or you can be more casual about what you want. There are ways to make sure you are cited properly for instance - you can have rules or community pressure for that. If people publish on the web it is (semi) possible to detect a use that isn't fairly citing or attributing the originator. BT had a project with open source code in it which they didn't disclose, and this came to light through naming and shaming by the community online. Social pressure is very useful even without any legal restrictions.

Q (Peter Burnhill): If you do want to put restrictions on use you are into Copyright. If you go towards Copyleft and Open - it's not that commercial use is improper but that it is not attributed. Attribution is a really big issue if you go properly open. Modifying and changing work is scary. Hopefully there is an indication of the original version - data acknowledgement forms really.

A (Jordan): I have had Flickr photos with CC licences used in commercial contexts. I'm not that bothered as it's not my primary profession. Now academics do have a gut reaction to people making money from their work but there is an advantage to stuff being open. It's really useful to stop data being stored in just one place. Why wait around for the government to release a reformatted website of data? Just put it out there so that it can go out there. There are distinct benefits for use and reuse. There are economic benefits to making material available in this way: 1 + 1 = 5 in this context sometimes.

Comment (Jo): One of the issues for academics is impact, profile and reuse. Turning scholarly citation into data sharing and collaboration. Real benefits there.

Comment (Peter Burnhill): Statements that travel (or should travel) with data. One is a data disclaimer - public data in the US often includes a document about how to cite this data and also information stating that the responsibility of the data producer stops at the original data set. All you do is support what you put out there.

Comment (Jordan): The UK government does this too. It states that data is in the public domain but you can't misattribute this data to be official government data if it has been changed or analysed. Basically they disconnect their authority when you work with the data. Effectively it's a licence. Different licences can be a nightmare when you get past one dataset. Licences may conflict and you might have 100 datasets to deal with. With data in the public domain you know what you can do and how you can treat the data. The problem in the academic domain is that people can end up licensing data they don't have the rights for.
