punkish

 
 
e-Infrastructure for Scientific Data
modified on 2010-02-16 06:45:48
This resource may also be reached at http://punkish.org/649

Long-term data interoperability

Presented at the High Level Experts Group Meeting on e-Infrastructure for Scientific Data at the European Commission, Brussels, Belgium.

Feb 17-18, 2010

Puneet Kishor

Creative Commons: Science Commons

Notes

The European Commission wanted to learn about Creative Commons’ perspective on building a long-lasting scientific data infrastructure. I gave this presentation emphasizing CC’s focus on true and deep interoperability of scientific data.

Creative Commons objectives

  • Make data sharing legal, easy and scalable
  • Encourage and empower long-term interoperability

Notes

Creative Commons works to make knowledge sharing legal, easy and scalable

We work in the culture space (music, text, film, art), education (open educational resources, virtual textbooks), and science (biological materials transfer, data sharing, Open Access, semantic web, patents). We craft policy and legal tools to lower the barriers to knowledge sharing.

We believe that barriers to knowledge sharing are lowered by increasing interoperability.

Legal data sharing

  • A small menu of appropriate licenses

    CC-BY license CC-BY-SA license CC-BY-ND license
    CC-BY-NC license CC-BY-NC-SA license CC-BY-NC-ND license

  • Some rights reserved vs. all rights reserved

    • CC licenses allow the creator to choose what to keep and what to give away

Notes

On the one hand we have the traditional copyright system’s “one size fits all,” and on the other hand we have a plethora of licenses with no easily discernible differences. Finding both options lacking, Creative Commons created a small set of flexible copyright licenses that are appropriate for different uses.

We have nurtured the porting of those licenses to different jurisdictions around the world.

We have avoided writing licenses that don’t work across the different combinations of legal regimes and IP types.

Data sharing made easy

  • Easy-to-understand licenses
  • We provide a web-based license chooser

    License chooser

  • We have created the CC Network and the CC Mixter

Notes

We have explained those licenses in easy-to-understand terms, providing different versions of the same licenses that are readable by the general public, by lawyers, and by machines.

We have provided a license chooser that allows you to choose and apply a suitable license in just a few clicks of the mouse.

The CC Network includes: OpenID support allowing your CC Network profile to act as an OpenID; ability to identify your works with an official badge; ability to share your story.

ccMixter is a community music site featuring remixes licensed under Creative Commons where you can listen to, sample, mash-up, or interact with music in whatever way you want.

Making data sharing scalable

  • We provide a scalable infrastructure for the creation of a scalable digital commons
  • Operate across a range of IP types, from data to cultural works to scientific research to patented technologies
  • We use RDFa-based licenses that can be programmatically parsed by computer programs

Notes

We make sharing scalable in the sense that a few licenses can be used by a half-billion objects on the web (probably double that now, as the number is now eight months old).

Encoding our licenses in RDFa (Resource Description Framework in attributes) makes them machine-parseable.
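As a rough illustration of what machine-parseable means in practice, the sketch below uses only the Python standard library to find links marked with `rel="license"` in a page. It is not a full RDFa processor, and the page fragment and the names `LicenseLinkParser` and `find_licenses` are made up for this example.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect the href of every <a> or <link> element marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="license nofollow"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "license" in rel_tokens and attr_map.get("href"):
            self.licenses.append(attr_map["href"])


def find_licenses(html_text):
    """Return the license URLs declared in an HTML document, in page order."""
    parser = LicenseLinkParser()
    parser.feed(html_text)
    return parser.licenses


# Hypothetical page fragment, written for this example only.
SAMPLE_PAGE = """
<p>This dataset is available under a
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">Creative
Commons Attribution 3.0 license</a>.</p>
"""

print(find_licenses(SAMPLE_PAGE))
# ['http://creativecommons.org/licenses/by/3.0/']
```

A crawler or repository could run a check like this over every landing page to learn, without human review, under which terms each object is offered.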

Scaling through interoperability

  • An infrastructure that maximizes interoperability
  • Is an infrastructure that is scalable
  • Makes for an infrastructure that lasts
  • And benefits its users

Notes

We believe the lessons we have learned in making data sharing easy, legal and scalable can also be applied to e-Infrastructures for scientific data in general.

Enable interoperability

  • Interoperability is the absence of barriers
  • When barriers are lowered, interoperability increases
  • Technical, legal and semantic interoperability

Notes

Interoperability is the key concept here. It is the opposite of barriers. As barriers are lowered, interoperability increases.

Interoperability occurs at many levels.

Interoperability at three levels

  • Technical: readable by software
  • Legal: legally accessible
  • Semantic: understandable by programmatic logic

Notes

An underlying premise of an infrastructure for data is long term preservation.

In order to ensure accessibility, interoperability has to be a key design objective.

Truly interoperable data will be technologically, semantically and legally interoperable, thereby maximizing the chances for use, and thus, the returns on investment in building the infrastructure.

Benefits of technical interoperability

  • Reduces friction through format transparency
  • Results in URLs that don’t rot
  • Allows for namespaces that persist
  • Encourages open and published formats

Notes

Benefits of legal interoperability

  • Lowers barriers caused by incompatible licenses
  • No more category errors
    • What if I classify free data as protected and protected data as free?
  • No more decision paralysis
    • What if I use it but it wasn’t free?
    • What if I pass on it, but it was free to be used?
  • No more attribution stacking
    • Am I obligated to give attribution to everyone who contributed to the dataset?

Notes

Problems with data as property

  • Property is controlled and protected by licenses
  • License conditions are triggered by making a copy

Notes

Problems arise when data are treated as property rather than a shared resource. Works of creative authorship are intellectual property, and can be protected by applying licenses. Data, in particular, raw data, are naturally occurring facts that may be discovered, not created. They have to remain free for the benefit of everyone.

Licensing inappropriate for scientific data

  • Scientific data carry very different protection regimes around the world, as do the databases in which the data reside
    • This creates a complex ecosystem that is hard to unify with a single, scalable license system
  • Scientific data are usually not copyrightable
  • Digital scientific data are frequently hosted, not copied

Notes


Fact/expression dichotomy

  • In the U.S., facts are free but original creative expression is protected
  • What is fact and what is not?
    • You have 20 GB of processed meteorological data. Quickly now, is it “fact” or is it “creative content”?
  • What can be licensed and what cannot be covered by licenses?
    • Creative expression can be licensed, but facts cannot be covered by licenses

Notes

The fact-expression divide is a concept in copyright law which states that copyright does not protect ideas. Only the way in which an idea has been expressed is protectable by copyright.

Some courts have recognized that there are particular ideas that can only be expressed intelligibly in a limited number of ways. In these cases even the expression is unprotected, or protection is limited to verbatim copying only. This is called the merger doctrine in the United States.

Diminishing freedom

  • When mixing licenses, the final license is as restrictive as the most restrictive license
  • Scientific research almost always mixes data
  • At each mix, we would get progressively less free data

Notes

This can have a chilling effect on innovation. Businesses hate uncertainty, and not knowing what they might be liable for in the future because of the license of some dataset they used today creates uncertainty.

Mixing data results in new licenses

  • Licenses that allow derivative works or adaptation
Original license (below) may be licensed as → PD BY BY-NC BY-NC-ND BY-NC-SA BY-ND BY-SA
PD
BY
BY-NC
BY-NC-ND
BY-NC-SA
BY-ND
BY-SA

Notes

Any dataset licensed with a non-derivative clause will not allow creation of new datasets from it legally.

License to thrill innovation?

  • New licenses are as restrictive as the most restrictive licenses
License matrix
PD/CC0 BY BY-NC BY-NC-ND BY-NC-ND-SA BY-NC-SA BY-ND BY-SA ARR
PD/CC0 PD/CC0 BY BY-NC BY-NC-ND BY-NC-ND-SA BY-NC-SA BY-ND BY-SA BY-SA
BY BY BY BY-NC BY-NC-ND BY-NC-ND-SA BY-NC-SA BY-ND BY-SA
BY-NC BY-NC BY-NC BY-NC BY-NC-ND BY-NC-ND-SA BY-NC-SA
BY-NC-ND BY-NC-ND BY-NC-ND BY-NC-ND BY-NC-ND BY-NC-ND-SA BY-NC-ND
BY-NC-ND-SA BY-NC-ND-SA BY-NC-ND-SA BY-NC-ND-SA BY-NC-ND-SA BY-NC-ND-SA BY-NC-ND-SA BY-NC-ND-SA BY-NC-ND-SA
BY-NC-SA BY-NC-SA BY-NC-SA BY-NC-SA BY-NC-ND-SA BY-NC-SA
BY-ND BY-ND BY-ND BY-NC-ND BY-NC-ND-SA BY-ND
BY-SA BY-SA BY-SA BY-NC-ND-SA BY-SA
ARR ARR

Notes

When disparate datasets are mixed together, the license of the resulting dataset is as open as the most restrictive license of the component sets. Hence, licensed data tend toward fewer degrees of freedom as they are mixed with other data.
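The drift toward fewer freedoms can be sketched in a few lines. The model below is a deliberate simplification and does not reproduce the matrix on the slide: each license is treated as a set of conditions, mixing takes the union of those conditions, a no-derivatives source blocks mixing altogether, and a share-alike source is assumed to require that the result carry exactly its own conditions. The names `LICENSES`, `mix`, `mix_all` and `label` are invented for this example.

```python
from functools import reduce

# Each license is modelled as the set of conditions it imposes.
# This encoding is an assumption made for illustration only.
LICENSES = {
    "PD": frozenset(),
    "BY": frozenset({"BY"}),
    "BY-NC": frozenset({"BY", "NC"}),
    "BY-SA": frozenset({"BY", "SA"}),
    "BY-NC-SA": frozenset({"BY", "NC", "SA"}),
    "BY-ND": frozenset({"BY", "ND"}),
    "BY-NC-ND": frozenset({"BY", "NC", "ND"}),
}


def mix(first, second):
    """Return the conditions on a dataset derived from two sources.

    Raises ValueError when the combination is not allowed in this model.
    """
    combined = first | second  # the result is as restrictive as both inputs
    for source in (first, second):
        if "ND" in source:
            raise ValueError("a no-derivatives source cannot be mixed")
        if "SA" in source and combined != source:
            raise ValueError("a share-alike source needs an identical result")
    return combined


def mix_all(names):
    """Fold mix() over a sequence of license names."""
    return reduce(mix, (LICENSES[name] for name in names))


def label(conditions):
    """Turn a set of conditions back into a name such as 'BY-NC'."""
    order = ["BY", "NC", "ND", "SA"]
    return "-".join(c for c in order if c in conditions) or "PD"


print(label(mix_all(["PD", "BY"])))           # BY
print(label(mix_all(["PD", "BY", "BY-NC"])))  # BY-NC

for names in (["BY", "BY-ND"], ["BY-SA", "BY-NC"]):
    try:
        mix_all(names)
    except ValueError as error:
        print(names, "->", error)
```

Each additional source can only add conditions or make the mix impossible; nothing in the fold ever removes a condition, which is the point of the slide.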

License to chill innovation

  • Many license combinations are invalid, so those data can’t be mixed
License matrix
PD/CC0 BY BY-NC BY-NC-ND BY-NC-ND-SA BY-NC-SA BY-ND BY-SA ARR
PD/CC0
BY
BY-NC
BY-NC-ND
BY-NC-ND-SA
BY-NC-SA
BY-ND
BY-SA
ARR

Notes


Contracts unsuitable for data

  • Difficult to track
  • Difficult to enforce
  • Apply only between consenting parties (privity of contract)
  • Legal requirement potentially a disincentive to use

Notes

When data are treated as a networked, shared resource, users are encouraged to tap into data sources rather than copy them. This circumvents the issues triggered by copying. Note that this is applicable to large datasets which would be impractical to copy and replicate because of their large size.

Neither license nor contract

  • Licenses and contracts are not the same thing
  • Both are inappropriate for scientific data

Notes


Converge toward the public domain

  • Maximize legal interoperability
  • Allow commercial use
    • Science is innovative when the market is able to convert scientific outcome into affordable, commodity products (zipper, ballpoint pen, aspirin, penicillin)
  • Don’t impose share-alike
    • “Do as I do” leads to an impasse because I can’t foresee all possible future scenarios

Notes

Choose PD or CC0

  • Public Domain (PD) works best when data are completely free of copyrightable elements.
  • CC0 (pronounced “CC Zero”) is more appropriate when copyrightable elements and facts are mixed. This is the more usual case.

Notes

Reasoning Behind CC0

Protocol for implementing open data

http://sciencecommons.org/projects/publishing/open-access-data-protocol/

Notes

The protocol is motivated by interoperability of scientific data. The volume of scientific data, and their interconnectedness, makes integration a necessity. For example, life scientists must integrate data from across biology and chemistry to comprehend disease and discover cures, and climate change scientists must integrate data from wildly diverse disciplines to understand our current state and predict the impact of new policies.

The technical challenge of such integration is significant. The forest of terms and conditions around data make integration difficult legally. One approach might be to develop and recommend a single license: any data with this license can be integrated with any other data under this license.

Why a new protocol?

Database protocol

http://sciencecommons.org/resources/faq/database-protocol/

Notes

But this approach, which implicitly builds on intellectual property rights and the ideas of licensing as understood in software and culture, is difficult to scale for scientific uses. There are too many databases under too many terms already, and it is unlikely that any one license or suite of licenses will have the correct mix of terms to gain critical mass and allow massive-scale machine integration of data.

Therefore we instead lay out principles for open access data and a protocol for implementing those principles, and we distribute an Open Access Data Mark and metadata for use on databases and data available under a successful implementation of the protocol.

Problems with “data as property”

  • Property use is governed by licenses
    • Requiring attribution is a license
  • Science is governed by norms
    • Giving proper citation is a norm
  • (PD or CC0) + SC Norms help converge data toward public domain

Notes

Norms not Contracts

  • Encourage citation through community norms, not through contracts
  • Contracts are hard to implement, expensive to enforce

Notes

Requesting and encouraging a behaviour such as citation through norms and terms of use is preferable to requiring it legally through copyright or contracts.

We are aware that different disciplines and jurisdictions call for different approaches, and this is not always a one-size-fits-all solution. By requesting behaviour through norms and terms of use rather than a legal construct, various scientific disciplines have the ability to develop their own norms for citation, allowing for legal certainty without constraining one community to the norms of another.

Data deluge

  • Internet hosts are growing
  • Amount of public data is growing at an exponential rate
    • Projected to grow to almost 1,000 exabytes this year
    • LHC itself will produce up to 2 GB/second, 15 petabytes a year
  • Humanly impossible to make sense. Can computers help?

Notes

Data growth data courtesy IDC. 2007. The Expanding Digital Universe. EMC Corporation

Large Hadron Collider information from WLCG Worldwide LHC Computing Grid
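The two LHC figures on the slide can be related with a little arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) and reads "up to 2 GB/second" as a peak rate; both readings are assumptions, not statements from WLCG.

```python
PEAK_RATE_BYTES_PER_S = 2e9   # "up to 2 GB/second", decimal gigabytes assumed
ANNUAL_VOLUME_BYTES = 15e15   # "15 petabytes a year", decimal petabytes assumed
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

# How long would the peak rate have to be sustained to reach the annual volume?
seconds_at_peak = ANNUAL_VOLUME_BYTES / PEAK_RATE_BYTES_PER_S
days_at_peak = seconds_at_peak / SECONDS_PER_DAY

# What would a full year at the peak rate produce?
continuous_volume_pb = PEAK_RATE_BYTES_PER_S * SECONDS_PER_YEAR / 1e15

print(f"{days_at_peak:.1f} days at peak rate")         # 86.8 days at peak rate
print(f"{continuous_volume_pb:.1f} PB if continuous")  # 63.1 PB if continuous
```

Under these assumptions, 15 petabytes corresponds to roughly three months of recording at the peak rate, so the two numbers are consistent with an instrument that does not write at full speed all year.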

The web that thinks

  • Semantically structured data can assist
  • Computer programs can identify relevant data and string them together in ways that make sense
  • Instead of humans spending a lot of time looking through search results, search results are more targeted and meaningful

Notes

Semantic queries

  • Getting answers without knowing the detailed syntactic structure of a database
    • Find landlocked countries with population more than 15 million, then display the results sorted by population.

Notes

Normal search engine query

Find landlocked countries with population more than 15 million, sorted by population. 392,000 results.

Google search query

Notes

Example courtesy Feigenbaum and Prud’hommeaux. 2009. SPARQL By Example. http://www.cambridgesemantics.com/2008/09/sparql-by-example/ accessed Feb 9, 2010.

Semantic search engine query

Find landlocked countries with population more than 15 million, sorted by population. 8 results.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX type: <http://dbpedia.org/class/yago/>
PREFIX prop: <http://dbpedia.org/property/>

SELECT ?country_name ?population
WHERE {
    ?country a type:LandlockedCountries ;
             rdfs:label ?country_name ;
             prop:populationEstimate ?population .
    # Keep only English labels and populations above 15 million
    FILTER (
        ?population > 15000000 &&
        langMatches(lang(?country_name), "EN")
    ) .
} ORDER BY DESC(?population)
country_name    population
Ethiopia        82825000
Uganda          32710000
Nepal           29331000
Afghanistan     28150000
Uzbekistan      27606007
Burkina Faso    15757000
Niger           15290000
Malawi          15263000

Notes

Example courtesy Feigenbaum and Prud’hommeaux. 2009. SPARQL By Example. http://www.cambridgesemantics.com/2008/09/sparql-by-example/ accessed Feb 9, 2010.
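A program can issue the same query without a human in the loop. The sketch below, using only the Python standard library, prepares an HTTP request following the SPARQL protocol and converts a JSON result document into rows. The endpoint URL is an assumption (the slides do not name one), the sample response is hand-written in the standard SPARQL JSON results layout using one row from the slide, and there is no guarantee that the 2010 query still returns results today. The function names are invented for this example.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # assumed endpoint, not named in the slides

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX type: <http://dbpedia.org/class/yago/>
PREFIX prop: <http://dbpedia.org/property/>
SELECT ?country_name ?population
WHERE {
    ?country a type:LandlockedCountries ;
             rdfs:label ?country_name ;
             prop:populationEstimate ?population .
    FILTER (?population > 15000000 && langMatches(lang(?country_name), "EN")) .
} ORDER BY DESC(?population)
"""


def build_request(endpoint, query):
    """Prepare (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def extract_rows(results):
    """Turn a SPARQL JSON results document into (name, population) tuples."""
    return [
        (b["country_name"]["value"], int(float(b["population"]["value"])))
        for b in results["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Send the request and parse the reply (needs network access)."""
    with urllib.request.urlopen(build_request(endpoint, query)) as reply:
        return extract_rows(json.load(reply))


# Hand-written sample response, so the parsing step can be tried offline.
SAMPLE_RESPONSE = {
    "head": {"vars": ["country_name", "population"]},
    "results": {"bindings": [
        {"country_name": {"type": "literal", "xml:lang": "en", "value": "Ethiopia"},
         "population": {"type": "literal", "value": "82825000"}},
    ]},
}

print(extract_rows(SAMPLE_RESPONSE))
# [('Ethiopia', 82825000)]
```

Calling `run_query(ENDPOINT, QUERY)` would perform the live lookup; it is left uncalled here so the example runs without a network connection.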

Benefits of semantic interoperability

Notes

Semantic interoperability allows mixing structured and non-structured data. Humans can retrieve information using familiar syntax, and computers can be programmed to extract information programmatically, thereby increasing the payback from investments in the repository.

Building a semantically-aware e-Infrastructure

  • Data providers have to use existing vocabularies
  • Annotate their data properly in RDF
  • Make data available on the web as linked open data

Notes

The cost of semantically structuring data will drop so that it will become possible for everyone to provide and consume data easily.

But, getting there is not automatic or easy. We have to make conscious decisions today to get there tomorrow.
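As one small example of such a decision, a data provider can describe a dataset and its CC0 status in RDF using an existing vocabulary. The sketch below writes two N-Triples statements with Dublin Core Terms; the choice of that vocabulary, the dataset URI, the title, and the function names are assumptions made for illustration.

```python
DCTERMS = "http://purl.org/dc/terms/"
CC0_URI = "http://creativecommons.org/publicdomain/zero/1.0/"


def iri(value):
    """Wrap a URI in the angle brackets N-Triples expects."""
    return f"<{value}>"


def literal(text, lang=None):
    """Quote a string as an N-Triples literal, with an optional language tag."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"' + (f"@{lang}" if lang else "")


def describe_dataset(dataset_uri, title, license_uri=CC0_URI):
    """Return N-Triples giving the dataset a title and a license."""
    subject = iri(dataset_uri)
    triples = [
        (subject, iri(DCTERMS + "title"), literal(title, "en")),
        (subject, iri(DCTERMS + "license"), iri(license_uri)),
    ]
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical dataset URI and title, used only for this example.
print(describe_dataset("http://example.org/datasets/rainfall", "Monthly rainfall"))
```

Published next to the data, statements like these let a harvester discover both what a dataset is and that it may be reused without restriction.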

Visionary data: Proteome Commons

Proteome Commons CC0 waiver

https://proteomecommons.org/tranche/examples/sciencecommons/choose.jsp

Notes

ProteomeCommons is a public proteomics database for annotations and other information linked to the Tranche data repository and to other resources. It provides public access to free, open-source proteomics tools and data.

The ProteomeCommons.org Tranche network is a cloud of computers to which one can upload files and from which one can download them. All files uploaded to the network are replicated several times to protect against their accidental loss. Files uploaded to the network can be of any size, can be of any file type, and can be encrypted with a passphrase of your choosing.

ProteomeCommons makes available all its data only under a CC0 waiver.

Visionary data: Tropical Disease Initiative

Tropical Disease Initiative CC0 waiver

http://tropicaldisease.org/kernel/advanced-search/

Notes

The Tropical Disease Initiative aims to provide a “kernel” for open source drug discovery. Such a kernel should allow scientists from laboratories, universities, institutes, and corporations to work together for a common cause: finding new drugs against tropical diseases such as malaria or tuberculosis.

The TDI kernel (v1.0) includes 297 potential drug targets against the 10 selected genomes and is freely and publicly accessible.

Visionary data: SIDER

SIDER CC0 waiver

http://sideeffects.embl.de/download/

Notes

The SIDER Side Effect Resource represents an effort to aggregate dispersed public information on side effects. To our knowledge, no such resource exists in machine-readable form despite the importance of research on drugs and their effects.

Visionary data: Personal Genome Project

Personal Genome Project CC0 waiver

http://www.personalgenomes.org/

Notes

The mission of the Personal Genome Project is to encourage the development of personal genomics technology and practices that: are effective, informative, and responsible; yield identifiable and improvable benefits at manageable levels of risk; and are broadly available for the good of the general public.

To achieve this mission we will build a framework for prototyping and evaluating personal genomics technology and practices at increasing scales.

The Personal Genome Project is committed to making research data from the PGP freely available to the public under a CC0 waiver.

Visionary data: WisconsinView

WisconsinView CC0 waiver

http://www.wisconsinview.org/

Notes

Since 2004, WisconsinView has made aerial photography and satellite imagery of Wisconsin available to the public for free over the web. As part of the AmericaView consortium, WisconsinView supports access and use of these imagery collections through education, workforce development, and research.

Starting June 30, 2009, WisconsinView is making available all of its more than 6 Terabytes of imagery data under the CC0 Protocol provided by Creative Commons.

Visionary data: MichiganView

MichiganView CC0 waiver

http://michiganview.org/

Notes

The MichiganView consortium makes available aerial photography and satellite imagery of Michigan to the public for free over the Web. As part of the AmericaView consortium, MichiganView supports access and use of these imagery collections through education, workforce development, and research.

Starting Jan 28, 2010, MichiganView is making available all of its more than 93 Gigabytes of Landsat 5 and 7, and NAIP imagery data in the public domain using the CC0 Waiver provided by Creative Commons.

Key design principles

  • Resist the temptation to treat data as property
  • Embrace the potential to treat it as a networked resource
  • Aim for maximum reuse
  • Ensure freedom to integrate
  • Leverage existing open infrastructure
  • Build and nurture a community around open data

Notes

Stating the design principles will allow one to develop an e-Infrastructure that has been built from inside-out to meet those objectives.

Preserve data for reuse

The only reason we put data in a computer is so we can take them out again. The data that are easier to get and work with get reused more.

Notes

The only reason we put data in a computer is so we can take them out again. The data that are easier to get and work with get reused more.

Measuring success of e-Infrastructure

  • Is it easy to put data in?
  • Are data secure for the long-term?
  • Are private data private and public data easily accessible?
  • Is it easy to take data out?
  • Are the conditions under which the data may be used clear to understand and implement?
  • Can data be retrieved programmatically?
  • Is there an active community using the e-Infrastructure?

Notes

It is important to have indexes against which the success of an e-Infrastructure can be measured. These indexes allow one to use resources most efficiently.

The success of an e-Infrastructure can be measured against its objectives: Is it easy to put data in? Can data be kept securely for the long-term? Can private data be kept private and public data be easily accessible? Is it easy to take data out? Are the conditions under which the data may be used clear to understand and implement? Can data be retrieved programmatically?

Interoperability is a feature

technology + law + meaning + community working together for a successful e-Infrastructure for scientific data

Notes

A successful e-Infrastructure requires many components: technology, a legal framework, meaningful structure, and, more important than anything, a community that nourishes and uses the data.

The community that uses open data is a varied one: researchers, educators, students, government agencies, entrepreneurs, established businesses, and hackers. They don’t have an established identity in common except for a common need for unencumbered data. This group has to be nurtured.

It is expensive to make data open and available, and expensive to create a long-lasting e-Infrastructure. It is even more expensive to not do it. The old adage fits perfectly: if you think education is expensive, try ignorance.

Acknowledgment and Permalink

Notes