CC0 for Data

Sunday, September 7, 2014

A few overarching, simple tenets are well known to, and recited by, those who work with intellectual property (IP): facts can't be copyrighted; data can't be copyrighted because data are facts; however, databases may be copyrighted as they express their makers' creativity in the arrangement of the data. These tenets, along with the idea-expression dichotomy (an idea can't be copyrighted, only its creative expression can be), make up the general common knowledge of IP in the current consciousness.

But let's examine what exactly happens in real life. A scientist, typically with no knowledge of law, decides, consciously or otherwise, to share her work with others by putting it on her web site. Like most, she gives not a second thought to her copyright in her work, sufficiently comfortable in her tacit knowledge that, as the creator, she is responsible for the integrity of that work, that she wants and expects her work to be used by others, and that, under normal usage, she would get credit for it, typically via a citation. Under most copyright regimes, however, by not explicitly choosing a license for her work, she has, by default, been co-opted into unwittingly applying an "all rights reserved" (ARR) license.

Alternatively, consider that our scientist protagonist has been following the current conversation on open source, open data, and open science, and even supports those movements, at least in spirit, if not yet in action. She has, perhaps, even heard of a few open licenses—GPL, MIT and BSD licenses for software, the ubiquitous Creative Commons (CC) licenses for other kinds of content, and perhaps even a few uncommon ones. As she prepares to upload her work to her web site, she decides to apply one of the CC licenses to her non-software content. The content she is sharing consists of a draft paper she is writing, a couple of slide presentations, perhaps a video interview or two, and a few comma-separated values (CSV) files of data.

Our protagonist really wants her work to be used by others. Having more users of her work is good for her own career, and it may also lead to productive feedback from and engagement with others. And, while she desires credit for her work, she never really gives that a second thought: giving and receiving credit, typically via citation, has been the norm in science for about as long as science has existed. In fact, using someone's work and not giving credit is considered so odious that it can lead to censure by one's peers and discipline, the ultimate fall from grace in academia. So, she goes about doing her bit to share her work without worrying about getting anything in return, because getting credit in return is the normal way of doing things.

On the other hand, consider that our scientist is somewhat knowledgeable about open science and open licenses. She has been sensitized to the idea of choosing a license explicitly, and now sharing has become an explicit act for her. As such, she is thinking about the fact that anyone anywhere is free to use her work. Forget the fact that really nothing has changed, that everyone, everywhere was always free to use her work within the bounds of fair use and the norms of use in academia. This newfound sensitivity gives her more channels for customizing reuse and, in return, puts a legal burden on the re-user to give her attribution above and beyond what may be provided by normal citation.

This superficial knowledge, combined with a fairly customizable licensing solution, has the potential for drawbacks that could be classified as category errors. While few would argue against the copyrightability of most original works, it is very difficult, if not impossible, to distinguish between creative and non-creative elements of data. Let's go back to the simple tenets stated at the beginning. Clearly, one could apply a copyright to a database, but what exactly is a database? Is the Microsoft Access file of bird sightings a database? Most would say "yes." If so, one could apply a copyright license to that file. One could download that file, export the data within, and use that data confident in the knowledge that the copyright didn't extend to that data. But what if the Microsoft Access file contained within it small pictures of birds? Surely, those pictures would be copyrightable. Perhaps the morphology, location and timing would be considered facts and not be copyrightable, but the more colorful description of the appearance and habitat of the birds would be considered creative enough to be copyrightable. There are countless variations of such scenarios, further compounded by the fact that the concept of a database itself has changed sufficiently to go well beyond the mental grasp afforded by the simplistic statement that "data can't be copyrighted but the databases that contain such data may be copyrightable."

It is close to impossible for the licensor not only to be correctly aware of which portions of her data are copyrightable and which are not, but also to apply suitable licenses to those individual components. As such, she may apply a license to the entire offering, thereby potentially burdening the user with legal obligations on portions that did not merit such a burden. Or, she may mistakenly apply a more liberal license to portions whose rights she actually wanted to reserve for herself. This over- or under-reach may be detrimental to her intent to fully share her work with others.

Similar category errors apply from the perspective of the user. He may see a work licensed more restrictively than he desires, even though that license was applied mistakenly, and decide to pass on using the work. Or, he may encounter a work that was mistakenly released under a more liberal license than the licensor intended. While the user is perfectly within his rights to continue using the work as originally licensed, the creator of the work is also perfectly within her rights to change the license to a more restrictive one.

In most cases, the scenarios detailed above may not mean much. Most academic creators would be content licensing their works with open licenses, mistakenly or not, and most academic users would be content using those works, providing credit via citation as per the norms of science. Little would the users realize that they may perhaps be under-attributing per the legal requirements of the license. However, worthy of more serious thought are the edge cases where the work is perceived to have high monetary or other kinds of value, enough to merit actually defending the chosen license or thumbing one's nose at it.

Consider what choices the licensor has if she discovers her work is being used in breach of the permissions allowed by her chosen license. She could send a note to the violator. The violator might comply and everything would be copacetic. Or, the "violator" might disregard the licensor's request. The licensor then might choose to write to a higher authority, perhaps arbiters from her learned society, or post the violation on a public forum, hoping for public censure. Those acts may have the desired effect, or they may backfire on her. Either way, the licensor or the licensee, or worse, both, are going to be subjected to ridicule. Alternatively, the licensor might decide to sue the licensee. How many academics in this world really have the wherewithal to file a court case? It would be possible if a really grave injustice were perceived, but the potential reparations would have to make up for the time, anguish and monetary cost of filing a lawsuit.

Contrary to what one might think, the best license is not no license. In fact, as mentioned earlier, applying no license automatically classifies the work as having all rights reserved, with only fair use being permitted. In effect, no copyright license defaults to the most restrictive license. The best license is the most liberal license, and the most liberal license is the waiver of licenses, the CC0 public domain dedication. In effect, CC0 opts out of the copyright system, waiving all legal and moral obligations. In jurisdictions where it may not be possible to waive all obligations, particularly moral obligations, CC0 reverts to a general public license allowing any and all use of one's work with no further obligation on one's part. Effectively, CC0 puts the work in the public domain.

Does anything change by putting one's work in the public domain? Not really. Those using the work academically will still give credit to the original author as that is the right thing to do. And, those who never intended to follow the norms will continue to not follow the norms, but now will be not following the norms legally. Everyone benefits by keeping the law out of academia.

Licensable parts of research

Output from a research experiment can be divided into three big buckets:

  1. software: both binary and source code
  2. literature: both formal, peer-reviewed and informal such as slide presentations, videos, audio files and images
  3. data: both raw, non-copyrightable data and interpreted data with some element of creativity that makes it copyrightable

The three buckets detailed above are together necessary to provide a complete picture of the research experiment. Let's consider each bucket and its components further.

Compiled binaries of software are necessary as they ease the burden on the user, but it is not always possible to provide compiled binaries for every major operating system. While source code is necessary both to examine and to extend the software, it may also be necessary so the user may compile it for her own particular combination of operating system and version. However, being able to compile source code implies ready availability of the compiler for one's platform.

Formal, peer-reviewed, published literature is considered the gold standard of science. It signifies a documented and generally accepted milestone in research. However, the process of arriving at the milestone is supported by informal literature that may shed additional light on the process, providing subtleties not possible within the limits of a ten-page academic paper. Moving images, sounds and anecdotes can further increase the understanding of a research experiment that might have lasted from a few months to several years.

As discussed earlier, most data are really a mix of both raw and creative elements. In a few cases it may be possible to differentiate between the two easily, but in most cases it may not be feasible.