Licensing Scientific Data Collections

a primer on licensing options

Note: I am not a lawyer. This is meant for scientists, such as myself, not lawyers. It may be faulty (though I don’t think it is), and it certainly is not legal advice.

Scientific collections

Let us consider a scientific collection of records contributed by an international group of collaborators, held in a complicated database designed to enable varied and speedy querying, analysis, and reporting, and mirrored on servers on different continents. The collaborators are all supported by and subject to the regulations of different institutions and laws of different countries. The group generally believes in allowing access to the data subject to certain conditions, but has no formal notification of conditions of use, or a prominently visible mark or license denoting what can or cannot be done with the data.

  • What are the reasons the group may want to formally establish official conditions of use for its data?
  • How would they determine and describe these conditions?
  • How would the group mark the web site so the conditions of use are apparent to any potential user?
  • How would they deal with the differences in:
      • the laws of the jurisdictions where the data are housed;
      • the institutions and countries that fund the contributors; and
      • the agencies that fund the infrastructure to house and support the database?

Let’s investigate the answers to the above questions by first understanding what a license is.

Definition of a license

A license is a legal instrument describing what the licensee may do with the licensor’s property. A license may be implied, stated explicitly, or agreed upon via a contract between two parties.

Licenses are governed by a branch of federal law called “property law.” Licenses apply to all kinds of property, be it real (such as land), personal (car, jewelry), or intellectual (music, novella, magazine article).

A copyright license applies to tangible intellectual property of sufficient originality. It may range from no rights reserved, that is, effectively in the public domain, to all rights reserved, that is, the licensee may not do anything with the creation other than what is permitted by fair use.

The standard, default copyright in the United States is a result of laws enacted by the government, specifically, the copyright law enshrined in Title 17 of the United States Code (17 USC), which protects original works automatically as soon as they are fixed in a tangible medium. Barring certain conditions and uses, this law effectively reserves all rights in such a work for the benefit of its creator.

Most, though not all, countries have copyright laws.

Why is it important to license data?

While in most instances a user might not worry about licensing data, it becomes particularly important when data may be used to create something of monetary value. The user will want to be sure no one is going to come out and sue her for using data in violation of its license. A license is an unambiguous means for the data creator/publisher to convey to its user what may be legally done with that data, and optionally, establish reciprocal obligations, and even disclaim liability.

A data set without a license is unfortunately not devoid of a license. In the United States, not specifying any license is effectively the same as specifying the “all rights reserved” clause. This is because in the US anything tangible we create is automatically protected by the US Copyright Law provided it is of sufficient creativity. In other words, if the data creator/publisher does not provide an explicit license with the data, a user cannot do much with the data, other than fair use, without potentially violating the copyright.

How to decide what others may do with your data?

Sometimes the data creator may simply have no say in the matter; for example, a work-made-for-hire[1] clause or funding conditions may require making one’s data available to everyone. This is true of most federally funded projects. Of course, contractual obligations may likewise prevent release of data because of strategic considerations.

Some data are in the public domain to begin with. They can’t be copyrighted, because copyright applies only to works of original authorship.

Other data may be facts, and as such, they also can’t be licensed since facts have to remain free for everyone. For example, access to the temperature of the air, or the composition of a rock can’t be restricted. An easy way to think about this: if you discovered it, you can’t restrict it, but if you created it, you may be able to protect it with a license.

A work produced by you may be licensed, but keep in mind, if someone uses the work in violation of the conditions of your license, the court may decide the validity of your license based on whether or not the work is of sufficient creativity.

Reasons to restrict use of data

There are several reasons why a restrictive license or delayed or restricted release of data may be appropriate:

  • Researchers may want to delay releasing data until after they have published their results in journals;
  • Private parties may want to restrict access to their intellectual property for strategic reasons;
  • National security and individual privacy and security are always valid reasons to restrict access to data.

International collections

There is no such thing as an international copyright. A license issued in one country may be respected and be defensible in another country only if the countries involved are party to either a bilateral or a global agreement.[2] Sorting out the overlapping agreements and bilateral arrangements can be confusing for scientists with little knowledge of law, as is evident from the sheer number of copyright treaties that other countries have with the US.

As noted above, even though facts are not protected in the United States, the same may not be true in other jurisdictions. For example, the Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases provides protection for even unoriginal, raw data under certain circumstances.

This problem can be fairly common in science, since scientists work across national boundaries and frequently collaborate on problems that are international in scope. Sadly, since there is no universal international copyright, protection against unauthorized use in a particular country depends on the laws of that country. While copyright laws exist in most countries, there are countries where such protection either doesn’t exist, is unclear, or is not reciprocal in scope.

A license is not a contract

While a license is a unilateral statement of what others may do with the work, a contract imposes obligations on the contracting parties. A license is commonly confused with a contract, but they are distinct legal instruments governed by different laws, and defensible under different conditions.

A contract is like a license, but requires at least two parties agreeing to it. Without at least two parties, a contract cannot exist. A contract specifically describes the obligations of both parties to the contract. For example, “if I give you this data file with experiment readings, you will give me a chart showing a scattergram” is a contract, provided both you and I agree to it. Contracts fall under the purview of state law.

Conflating a license with a contract can create problems by making the language of the license and the obligations of both the licensor and the licensee more complicated.

A babel of licenses

The United States Copyright License is not the only license available to us. If we don’t want to use that standard “all rights reserved” license, we can use any one of the many existing alternative licenses, or even make one of our own.

Over time, the world of alternative licenses has seen just that happen. Many alternative licenses have been created, each differing from the others in some respect. The ability to create one’s own license is a blessed freedom, but it also results in a surfeit of hard-to-understand, non-uniform licenses, a veritable Babel of Licenses. There is no consistency, and the user has to spend extra effort to decipher the conditions of use.

Needless to say, just attaching a license doesn’t physically stop anyone from doing what they want with your creation. If they do something in violation of your license, it is up to you to pursue them, implore them to cease and desist, or otherwise threaten them with potential legal action. And, if you do take legal action, it is up to the judge to determine whether or not your work was indeed protectable in the first place (was tangible and had sufficient originality), and whether your rights were indeed violated by the alleged user.

Problems with licensing

Based on the information above, we can summarize the problems with licensing as follows:

  • The licensor has the difficult job of choosing from among hundreds of different licenses, or worse, creating a brand new license and adding to the babel;
  • The licensee has the difficult job of trying to understand the legalese, which is further complicated by its inherent inconsistency;
  • While it is easy to attach a license to a self-contained, reasonably sized dataset such as a PDF or a spreadsheet, it becomes quite difficult to attach a license to a dataset made up of atomic components. If the data and their license get separated, the confusion can increase further;
  • Licenses are not valid internationally, a particularly pertinent issue with scientific projects that span international boundaries; and
  • Mixing datasets results in a new license as restrictive as the most restrictive license of the component datasets.

Creative Commons licenses

Creative Commons, an international non-profit dedicated to making open access to data a reality, created licenses for use primarily for creative content. Creative Commons licenses are made up of four building blocks: BY (attribution required) which permits use as long as the data provider is given attribution; NC (non-commercial use only) which allows use as long as the data are not used in a commercial venture; ND (no derivatives allowed) which permits use of data but prohibits any derivatives; and SA (share alike) which obligates the user to release any products incorporating the data under the same license as the data itself.

These four building blocks can be combined into six distinct licenses, ranging from CC BY, the most liberal, to CC BY-NC-ND and CC BY-NC-SA, the most restrictive. Not all of these blocks are combinable because some combinations would result in a no-op; for example, ND and SA are mutually exclusive by definition because the former prohibits any derivative works while the latter allows derivative works as long as the derivative is identically licensed.
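
The count of six can be checked with a short enumeration. This is only a sketch: it assumes that BY is part of every license (the text above does not say so, but the count of six only works out that way) and that the ND/SA exclusion is the only rule. The function name `valid_cc_licenses` is made up for illustration.

```python
from itertools import combinations

# Assumption: BY is always present; NC, ND and SA are optional add-ons.
OPTIONAL_BLOCKS = ("NC", "ND", "SA")


def valid_cc_licenses():
    """Enumerate block combinations, skipping any that pair ND with SA."""
    licenses = []
    for size in range(len(OPTIONAL_BLOCKS) + 1):
        for combo in combinations(OPTIONAL_BLOCKS, size):
            # ND forbids derivatives while SA regulates them, so the pair is a no-op.
            if "ND" in combo and "SA" in combo:
                continue
            licenses.append("CC " + "-".join(("BY",) + combo))
    return licenses


for name in valid_cc_licenses():
    print(name)
```

Of the eight possible subsets of the three optional blocks, the two that contain both ND and SA are dropped, which leaves six.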

These licenses solve all of the above problems by:

  • Creating a small set of customizable licenses that cover a rather broad range of use-cases (see table below);
  • Being easily recognizable, with a brand equity that makes them understandable all over the world. In fact, CC licenses are not only available in 79 countries in 33 different languages, they have also been vetted and adapted to national laws by lawyers from those jurisdictions;
  • Creating a web-based chooser that allows licensors to easily choose one of the available licenses;
  • Providing three forms of the chosen license: an easily recognizable pictograph accompanied by a plain-text description; a detailed legal code backing the license; and a version of the license embeddable in data in RDFa[3] format, so that the license can not only accompany the increasingly digital and arbitrarily divisible and recombinable data but also be machine parseable.

[Table: Creative Commons Licenses]
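
To illustrate what “machine parseable” can mean in practice, here is a minimal sketch that looks for a license link in a web page. It assumes the license is marked up as a link carrying `rel="license"`, and it is a simplification rather than a full RDFa parser. The sample page, the class name `LicenseLinkParser`, and the function `find_license_urls` are hypothetical.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a>/<link> tags whose rel includes 'license'."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely.
        rel_tokens = (attr_map.get("rel") or "").split()
        if tag in ("a", "link") and "license" in rel_tokens and attr_map.get("href"):
            self.license_urls.append(attr_map["href"])


def find_license_urls(html_text):
    parser = LicenseLinkParser()
    parser.feed(html_text)
    return parser.license_urls


# Hypothetical dataset landing page, for illustration only.
SAMPLE_PAGE = """
<p>Sensor readings, 2009 field season.
This work is licensed under
<a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>.</p>
"""

print(find_license_urls(SAMPLE_PAGE))
```

A program that harvests data can run a check like this before deciding whether and how the data may be reused.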

CC licenses and scientific data

As the name implies, Creative Commons licenses were designed primarily for creative content. By definition, creative content may be protected by copyright laws, so CC licenses were designed to customize the rights that the content creator wanted to keep versus the rights that the creator wanted to relinquish. Scientific data, if established as factual, that is, devoid of originality, may not be protected, and have to remain free for everyone. As such, CC licenses may be inappropriate, even unenforceable, when applied to facts.

Additionally, Creative Commons licenses are granted as public licenses, but may be interpreted as contracts depending on how the conditions of license are interpreted by the licensee.[8]

The problem with data

The licenses were designed for and are most applicable to creative data, that is, data that have some element of human creativity and interpretation, and thus carry creator’s rights that can be protected. Scientific data, that is, raw data that describe the world and are typically measured or discovered, are not protectable as they are considered facts, and in the United States facts have to remain free for everyone’s benefit. However, very few datasets are clearly facts. In a typical scientific data chain, such as the seismology data chain, the tree allometry data chain, or the stable isotope biogeochemistry data chain, raw data as they come out of sensors require some measure of processing before they become suitable for human use. Thus, they could be seen as containing protectable elements. This mixed composition of almost all scientific datasets can lead to confusion about what may and may not be protected.

Additionally, data can range from completely raw readings such as those coming out of sensors and getting recorded automatically, to completely interpreted data resulting from human interpretation and analysis. The former may not be copyrightable in most jurisdictions while the latter are likely to be protected in most jurisdictions.


Differences in legal conditions attached to datasets can give rise to uncertainty[4] with regard to their use. Scientists seldom work with a single data set; instead, they typically mix one or more data sets with their own data. When data with different licenses are mixed, new data are created, and the resulting data carry a new license that results from the unique mix of the component licenses.[5] The use conditions of the resulting data become as restrictive as the most restrictive license of the component data.[6]
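
One way to picture the “most restrictive wins” rule is to treat each license as a set of restrictions and take the union when datasets are mixed. The sketch below is a deliberate simplification: the restriction labels and the `RESTRICTIONS` table are illustrative assumptions, ND licenses are left out, and real license compatibility (for example between different share-alike licenses) is ignored.

```python
# Hypothetical mapping from license name to the restrictions it imposes.
RESTRICTIONS = {
    "CC0": frozenset(),
    "CC BY": frozenset({"attribution"}),
    "CC BY-NC": frozenset({"attribution", "non-commercial"}),
    "CC BY-SA": frozenset({"attribution", "share-alike"}),
}


def combined_restrictions(license_names):
    """Return the union of restrictions carried by all component datasets."""
    combined = set()
    for name in license_names:
        combined |= RESTRICTIONS[name]
    return combined


# Mixing a public-domain dataset with an attribution-only and a
# non-commercial dataset: the result inherits every restriction.
mix = ["CC0", "CC BY", "CC BY-NC"]
print(sorted(combined_restrictions(mix)))
```

Under this reading, adding a CC0 dataset to a mix never tightens the result, while a single restrictive component constrains everything derived from the mix.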

Under certain conditions, while it may be possible to legally acquire certain data, even mixing them together might be a violation of licenses. The following matrix depicts the various choices that a user can make and their effect on interoperability of data.

Effect of License on Interoperability of Data
No license U.S. Copyright DIY license A specific license A small set of licenses PD or equivalent
No interoperability X X X
Limited but predictable interoperability X X
Limited and unpredictable interoperability X
Guaranteed maximum interoperability X X X

To overcome this confusion, and to maximize interoperability, Creative Commons invented a waiver of license called CC0 (pronounced “CC Zero”). A CC0 waiver allows the data creator to waive any rights that the creator may hold in the data, effectively putting the data in the public domain. Clearly marking data with “do whatever your heart desires” eliminates uncertainty and maximizes interoperability.

The problem of databases

Smaller, logically single datasets such as PDFs, videos, spreadsheets, and other document-based datasets are easy to operate on. They are typically downloaded by the user, so it is also easy to attach a license to them. A database creates at least two problems:

  1. If it is very large, it can’t be copied easily, and is typically used via queries performed over networks. It may return the same result for the same query, or, if the query is time sensitive, even return different results for exactly the same query performed at different times; and
  2. If the database holds data from many contributors, it becomes difficult, if not impractical, to homogenize the different viewpoints, philosophies, world views, and resulting desired licenses into a single license. This is termed “attribution stacking.”[7]

One solution is to tag every atomic contribution with a license specific to the contributor. While that may sound like a practical compromise, not only does it increase the technological complexity of storing and transmitting the different licenses; more seriously, it shifts the complexity, and thus the uncertainty, to the user. The user is now responsible for determining all the licenses that may apply to a single result set composed of many different records, and for complying with these potentially varying conditions.
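
The burden on the user can be sketched as follows. Assuming each record in a query result carries its own license tag and contributor name (the field names and sample records here are hypothetical), the user first has to work out which licenses are in play and whom each one obligates them to.

```python
from collections import defaultdict


def licenses_in_result(records):
    """Group the contributors in a result set by the license on their records."""
    by_license = defaultdict(set)
    for record in records:
        by_license[record["license"]].add(record["contributor"])
    # Sort the names so the summary is stable from run to run.
    return {name: sorted(people) for name, people in by_license.items()}


# Hypothetical result of a single query against a multi-contributor database.
RESULT_SET = [
    {"id": 1, "contributor": "Lab A", "license": "CC0"},
    {"id": 2, "contributor": "Lab B", "license": "CC BY"},
    {"id": 3, "contributor": "Lab C", "license": "CC BY-NC"},
    {"id": 4, "contributor": "Lab A", "license": "CC0"},
]

for license_name, contributors in licenses_in_result(RESULT_SET).items():
    print(license_name, contributors)
```

Even this four-record result involves three different sets of conditions; a query across thousands of records only makes the bookkeeping worse.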

As computer techniques for data mining grow more sophisticated, it seems shortsighted to restrict extraction of potentially new knowledge from a dataset that was expensive to create in the first place. The more a dataset can be used and reused, the better the return on that investment, particularly at a time when budget cuts and belt-tightening make science and data collection harder to fund.

On the surface, denying big evil corporations access to data seems “fair.” Some believe that, after all, the corporations have done the same to scientists for years, and it doesn’t make sense to let them suck up all our data and then start charging others for it. However, it is not possible to discern the intent of a data user. By restricting access to data based on intent, we end up lumping everyone, from a casual enthusiast to a solo entrepreneur who has created a 99-cent app that uses our base data for a new, creative purpose, together with the potentially big evil corporation that might want to suck up all our data and lock it away from others.

Common questions and concerns

  1. What if someone takes my data, uses it to make decisions, and is harmed as a result of those decisions? Put a liability disclaimer on the data.
  2. What if someone takes my data and uses it for “nefarious” purposes? Beauty is in the eye of the beholder. Who are we to judge anyone on how they use any data? What we might view as nefarious may be a brilliant entrepreneurial opportunity in someone else’s eyes.
  3. What if someone takes my data and resells it? Why would anyone buy it if they could get the free version from you? That would be akin to paying a tax for not knowing how to conduct an effective web search.
  4. What if someone takes my data, improves it, and then sells the improved version, or otherwise restricts access to it? If they improved it substantially then it is really a different dataset, and they are well within their rights to make money off of it. Creating new business value through entrepreneurial activities is what constitutes progress just as much as science does.
  5. What if someone takes my data and doesn’t give me attribution? How come you don’t worry about that with your journal papers? Probably because science has, for centuries, depended upon its own norms to give credit to the works of others without contractually obligating anyone to do so. It has worked for scientific literature, so there is no reason it shouldn’t work for data.
  6. It is easy to give attribution to a logically single dataset such as a PDF, but how would one give attribution to a result of a query, especially if the result changes over time? It is a hard problem, and many bright minds are working to solve this problem. There are already a few useful data citation techniques that may be used to good effect.
  7. How would one attribute a data set that is a result of contributions from different individuals if each individual insists on separate attribution? This is a distinct problem termed “attribution stacking.” See above for more information.

Examples of collections released under CC0

A longer list of data collections released under a CC0 waiver demonstrates the sheer variety of data providers who have committed to making their collections available free of all restrictions to anyone. A few examples are listed below:

National Evolutionary Synthesis Center (NESCent)

NESCent has made “all evolutionary biology and any other data within the scope of (its data and software) policy readily available to the broader scientific community” under a CC0 waiver of license, and via “deposition in a public data repository (e.g. Dryad, Knowledge Network for Biocomplexity) or an established open database.”

Cooper-Hewitt National Design Museum

The Cooper-Hewitt National Design Museum has made its tombstones collection data available under a CC0 waiver of license. Their logic for releasing data is captured in the following quote:

Cooper-Hewitt is committed to making its collection data available for public access. To date we have made public approximately 60% of the documented collection available online. Whilst we have a web interface for searching the collection, we are now also making the dataset available for free public download. By being able to see ’everything’ at once, new connections and understandings may be able to be made.

Powerhouse Museum

The Powerhouse Museum has released its data under a CC BY-SA license. To quote them:

The Powerhouse Museum is committed to making its collection dataset available to the community in many forms. Traditionally we’ve done this by publishing collection records to our website, but now we also offer direct data access. This allows you to make your own interfaces to the collection and incorporate it into other services.

Universal Protein Resource (UniProt)

UniProt has chosen the CC BY-ND license. This gives the user the ability to use the data for commercial work as long as the data are not modified and UniProt is given credit.

Protein Databank

The RCSB Protein Databank proclaims:

Data files contained in the PDB archive are free of all copyright restrictions and made fully and freely available for both non-commercial and commercial use. Users of the data should attribute the original authors of that structural data.

  1. “The employer or other person for whom the work was prepared is the author” and “The employer or other person for whom the work was prepared is the initial owner of the copyright unless there has been a written agreement to the contrary signed by both parties.” See Circular 9
  2. “There is no such thing as an ‘international copyright’ that will automatically protect an author’s writings throughout the world. Protection against unauthorized use in a particular country depends on the national laws of that country.” See Circular 38a
  3. “Resource Description Framework in attributes” is a W3C Recommendation that adds a set of attribute-level extensions to XHTML for embedding rich metadata within Web documents. See the RDFa Wikipedia entry for more info
  4. EU’s Directive 96/9/EC clearly proclaims that “differences in legislation in the scope and conditions of protection (of databases) between the Member States… can have the effect of preventing the free movement of goods or services within the Community;”
  5. Hanson, Chris, Lalana Kagal, Tim Berners-Lee, Gerald Jay Sussman, and Daniel J. Weitzner. 2007. Data-Purpose Algebra: Modeling Data Usage Policies. IEEE Policy.
  6. Puneet Kishor, Oshani Seneviratne, and Noah Giansiracusa. 2009. Policy Aware Geospatial Data. working paper. Nov. 2009
  7. “In a world of database integration and federation… would a scientist need to attribute 40,000 data depositors in the event of a query across 40,000 data sets?” See Protocol for Implementing Open Access Data
  8. Hietanen, Herkko A., A License or a Contract, Analyzing the Nature of Creative Commons Licenses. NIR, Nordic Intellectual Property Law Review, Forthcoming.