What is a license?
At its simplest, a license is a statement of permission about how someone’s property may be used by others. A license may be implied, stated explicitly, or agreed upon via a contract between two parties. Licenses are governed by a branch of federal law called “property law.” Licenses apply to all kinds of property, be it real (such as land), personal (car, jewelry), or intellectual (music, novella, magazine article). Licenses that are applied to intellectual property are called copyright licenses. This paper is limited to the licensing of scientific information, and hence to intellectual property.
How do these licenses come about?
The standard, default copyright is a result of laws enacted by the government, specifically the copyright law enshrined in Title 17 of the United States Code (17 USC). This copyright law protects copyrightable works automatically at the moment such works are fixed in a tangible medium. Barring certain conditions and uses, this law effectively reserves all rights in such a work for the benefit of its creator.
Are there other licenses besides the U.S. Copyright?
The United States Copyright License is not the only license available to us. If we don’t want to use that standard license, we can use any of the many existing alternative licenses, or even make one of our own. Over time, the world of alternative licenses has seen just that happen. Many alternative licenses have been created, each differing from the others in some respect. This freedom to create one’s own license is a blessed freedom, but it also results in a surfeit of hard-to-understand, non-uniform licenses, a veritable babel of licenses.
Why is it important to license data?
While in most instances a user might not worry about the license for data, it becomes particularly important when data may be used to create something of monetary value. The data user will want to be sure no one is going to sue her for using data in violation of its license. A license is the only unambiguous way for the data creator/publisher to convey to the data user what may be legally done with the data.
A data set without a license attached is unfortunately not devoid of a license. In fact, since every intellectual property acquires the most restrictive U.S. Copyright by default at its instant of creation, a data set without an explicit license is licensed by default with the U.S. Copyright. In other words, if the data creator/publisher does not provide an explicit license with the data, the creator/publisher ends up having “all rights reserved.” Other than fair use, a data user cannot do much with the data without potentially violating the copyright.
How can one license one’s data?
The first step is to decide what you want others to be able to do with your data. Then choose a license that best expresses your wishes. Finally, mark your data so that a user has an easy-to-find and very clear indication of your license.
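The marking step can be illustrated with a minimal sketch. It assumes the data are distributed as JSON and that the license travels in a metadata envelope around the records. The function name `attach_license`, the field names, and the sample readings are hypothetical and chosen only for illustration; the identifier `CC0-1.0` follows the SPDX naming convention, which is an assumption and not something this paper prescribes.

```python
import json


def attach_license(records, license_id, license_url, attribution=None):
    """Wrap data records in an envelope that carries the license with them.

    The envelope layout ("metadata" / "data") is a hypothetical convention.
    """
    metadata = {"license": license_id, "license_url": license_url}
    if attribution is not None:
        metadata["attribution"] = attribution
    return {"metadata": metadata, "data": records}


# Illustrative sample data, not taken from any real data set.
readings = [{"site": "A", "temp_c": 21.4}, {"site": "B", "temp_c": 19.8}]

package = attach_license(
    readings,
    license_id="CC0-1.0",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
)

# Serialize so the license is stored in the same file as the data.
serialized = json.dumps(package, indent=2)
print(serialized)
```

Keeping the license inside the same file as the data means the indication cannot be separated from the data by accident, which a README placed next to the file cannot guarantee.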
How to decide what others may do with your data?
Sometimes you may simply have no say in the matter, for example, when by employment or funding obligations you may be required to make your data available to everyone. This is true of most federally funded projects.
Some data are in the public domain to begin with; they can’t be copyrighted because copyright applies only to works of original authorship.
Other data may be facts, and as such, they also can’t be licensed since facts have to remain free for everyone. For example, data on the temperature of the air or the composition of a rock can’t be licensed. An easy way to think about this: if you discovered it, you can’t protect it, but if you created it, you may be able to protect it with a license.
A work produced by you may be licensed, but keep in mind that if someone uses the work in violation of the conditions of your license, a court may judge the validity of your license by whether the work is sufficiently creative.
What is a good reason to place restrictions on the use of data?
There are several reasons why a restrictive license, or a delayed or restricted release of data, may be more suitable:
- researchers may want to delay releasing data until after they have published their results in journals;
- private parties may want to restrict access to their intellectual property for strategic reasons;
- national security and individual privacy and security are always valid reasons to restrict access to data.
How to choose a license?
Besides the default U.S. Copyright, there are many licenses that one can choose from. If none of the existing licenses fits your needs, you can always create a new license. But that may contribute to confusion (see below).
What are the problems with licensing?
Some of the problems with licensing are:
- too many choices;
- confusion re data vs. information;
- new conditions from mixed datasets;
- jurisdictional boundaries.
Why are too many choices a problem?
This is the first problem with licensing: on the one hand we have the default, automatically created and assigned, “all rights reserved” United States Copyright License, and on the other hand we have the “anything goes” custom licenses, each of which requires extra diligence to figure out what one can do with the content.
What confusion stems from data as opposed to information?
For the scope of the following discussion, let us assume that, for the most part, there is little difference between data and information. Information at one level, after all, is data at another level; hence the two terms will be used interchangeably.
Data can range from completely raw readings such as those coming out of sensors and getting recorded automatically, to completely interpreted data resulting from human interpretation and analysis. The former may not be copyrightable in most jurisdictions while the latter are likely to be protected in most jurisdictions.1
In the United States, raw data, or facts, have to remain free for everyone. As a matter of fact, copyright law does not protect facts or ideas, extending protection only to their creative expression. In reality, most data lie somewhere in between facts and expression, being a mix of raw and interpreted data. This can be a source of problems for both the creator and the user of data. The creator might worry about wrong categorization, that is, accidentally marking facts as copyrighted, or releasing copyrightable works as free. The user might worry about wrongful use, that is, using data that are copyrighted, or about underuse, that is, passing up data that were free but had been wrongly marked as protected.
What is the effect of jurisdictional boundaries?
As noted above, even though facts are not protected in the United States, the same may not be true in other jurisdictions where even unoriginal, raw data might be protected under certain circumstances. This problem can be fairly common in science since scientists work across national boundaries, frequently collaborating on problems that are international in scope. Sadly, there is no such thing as an international copyright. Protection against unauthorized use in a particular country depends on the national laws of that country.2
While copyright laws exist in most countries, there are countries where such protection either doesn’t exist, is unclear, or is not reciprocal in scope.3
How is mixing datasets a source of problems?
Scientists seldom work with a single data set; they typically mix one or more data sets with their own data. This creates a problem with respect to data licensing, because when data with different licenses are mixed, new data are created, and the resulting data carry a new license that results from the unique mix of the component licenses. The use conditions of the resulting data become as restrictive as the most restrictive license of the component data. In some cases, even though the data can be legally acquired, merely mixing them together might violate their licenses. The following matrix depicts the various choices that a user can make and their effect on interoperability of data.4
Effect of License on Interoperability of Data

License choices (columns): No license; U.S. Copyright; DIY license; A specific license; A small set of licenses; PD or equivalent

Interoperability outcomes (rows), with the marks recorded for each:
- No interoperability: X X X
- Limited but predictable interoperability: X X
- Limited and unpredictable interoperability: X
- Guaranteed maximum interoperability: X X X
At the very least we need to embed the license as metadata within the data to which it applies. Better yet, we need an automatic way to “calculate” the new license of the new dataset formed from the mixing of two or more datasets.5
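The idea of “calculating” the license of a mixed dataset can be sketched as follows, using the rule that the result is as restrictive as the most restrictive component. This is only an illustration under stated assumptions: the `RESTRICTIVENESS` ranking, the license labels, and the function name `combined_license` are hypothetical, and the ordering is not a legal determination.

```python
# Hypothetical ranking: a higher number means a more restrictive license.
# The ordering is an illustrative assumption, not legal advice.
RESTRICTIVENESS = {
    "CC0-1.0": 0,              # rights waived, public domain equivalent
    "CC-BY-4.0": 1,            # reuse allowed with attribution
    "all-rights-reserved": 2,  # default copyright, no permissions granted
}


def combined_license(component_licenses):
    """Return the most restrictive license among the component data sets."""
    licenses = list(component_licenses)
    if not licenses:
        raise ValueError("at least one component license is required")
    unknown = [name for name in licenses if name not in RESTRICTIVENESS]
    if unknown:
        raise KeyError(f"no ranking defined for: {unknown}")
    # The mixed data inherit the highest-ranked (most restrictive) license.
    return max(licenses, key=RESTRICTIVENESS.__getitem__)


mixed = combined_license(["CC0-1.0", "CC-BY-4.0"])
print(mixed)  # CC-BY-4.0
```

A single linear ranking is a simplification. Real licenses can impose conditions that are incompatible with one another rather than merely more or less restrictive, and, as noted above, some combinations may not be permitted at all, so a practical tool would need a compatibility table rather than a single number per license.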
So, what is a good solution for licensing data?
If at all possible, waive your rights to the data, effectively putting them in the Public Domain. This would eliminate any chance of license confusion and ensure maximum interoperability of data, thereby maximizing the chances of the data being reused by others.
If you want to reserve some rights to the data, consider choosing a popular, internationally accepted, liberal license that would be least restrictive to reuse of data. The licenses offered by Creative Commons are internationally accepted, available in a multitude of languages, and even suited for data. CC0 (pronounced CC-Zero) or CC-BY may make for suitable licenses.