name: intro class: center, middle # eInfrastructure for Scientific Data ## Long-term data interoperability High Level Experts Group Meeting on *e-Infrastructure for Scientific Data* at the European Commission, Brussels, Belgium • February 2010 Puneet Kishor (Plazi) Released under a [CC0 Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/). --- layout: true --- ## Help * Notes are hidden, but may be seen by pressing **P** on your keyboard. * Press **C** to clone the show. * Press **H** for other keyboard shortcuts. ??? notes here --- ## Creative Commons objectives * Make data sharing *legal*, *easy* and *scalable* * Encourage and empower *long-term interoperability* ??? Creative Commons works to make knowledge sharing legal, easy and scalable. We work in the culture space (music, text, film, art), education (open educational resources, virtual textbooks), and science (biological materials transfer, data sharing, Open Access, semantic web, patents). We craft policy and legal tools to lower the barriers to knowledge sharing. We believe that barriers to knowledge sharing are lowered by increasing interoperability. --- ## Legal data sharing * A *small menu of appropriate licenses* * *Some rights reserved* vs. all rights reserved ![CC BY](../img/by.png) ![CC BY-SA](../img/by-sa.png) ![CC BY-ND](../img/by-nd.png) ![CC BY-NC](../img/by-nc.eu.png) ![CC BY-NC-SA](../img/by-nc-sa.eu.png) ![CC BY-NC-ND](../img/by-nc-nd.png) ??? On the one hand we have the traditional copyright system's "one size fits all," and on the other hand we have a plethora of licenses with no easily discernible differences. Finding both options lacking, Creative Commons created a few flexible copyright licenses that are appropriate for different uses. We have nurtured the porting of those licenses to different jurisdictions around the world. We have avoided writing licenses that don't work across the spectrum of legal regimes and IP types. 
--- ## Data sharing made easy * *Easy-to-understand* licenses * We provide a *web-based license chooser* * We have created the *CC Network* and *ccMixter* ![License chooser](../img/license_chooser.png) ??? We have explained those licenses in easy-to-understand terms, providing different versions of the same licenses that are readable by the general public, by lawyers, and by machines. We have provided a license chooser that allows you to choose and apply a suitable license in just a few clicks of the mouse. The CC Network includes: OpenID support allowing your CC Network profile to act as an OpenID; the ability to identify your works with an official badge; and the ability to share your story. ccMixter is a community music site featuring remixes licensed under Creative Commons where you can listen to, sample, mash up, or interact with music in whatever way you want. --- ## Making data sharing scalable * We provide a *scalable infrastructure* for the creation of a scalable digital commons * Operate across a *range of IP types*, from data to cultural works to scientific research to patented technologies * We use *RDFa-based licenses* that can be parsed programmatically ??? We make sharing scalable in the sense that a few licenses can be used by a half-billion objects on the web (probably double that now, as the number is eight months old). Using RDFa (Resource Description Framework in Attributes) to encode our licenses results in machine-parseable licenses. --- ## Scaling through interoperability * An infrastructure that *maximizes interoperability* * Is an infrastructure that is *scalable* * Makes for an infrastructure that *lasts* * And *benefits its users* ??? We believe the lessons we have learned in making data sharing easy, legal and scalable can also be applied to e-Infrastructures for scientific data in general. 
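To make the RDFa point concrete, here is a minimal, hypothetical snippet (the page, dataset URL, and attribution name are invented for illustration) showing how a CC license statement is embedded in ordinary HTML so that both people and programs can read it:

```html
<!-- Hypothetical page fragment: rel="license" plus the ccREL
     vocabulary makes the licensing statement machine-parseable. -->
<div about="http://example.org/dataset/42"
     xmlns:cc="http://creativecommons.org/ns#">
  This dataset is licensed under a
  <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">
     Creative Commons Attribution 3.0 License</a>.
  Please attribute <span property="cc:attributionName">Example Lab</span>.
</div>
```

A crawler that understands RDFa can extract the statement "dataset 42 has license CC BY 3.0" without any human reading the page.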
--- ## Enable interoperability * Interoperability is the *absence of barriers* * When barriers are lowered, interoperability increases * *Technical*, *legal* and *semantic* interoperability ??? Interoperability is the key concept here. It is the opposite of barriers. As barriers are lowered, interoperability increases. Interoperability occurs at many levels. --- ## Interoperability at three levels * Technical: *readable by software* * Legal: *legally accessible* * Semantic: *understandable by programmatic logic* ??? An underlying premise of an infrastructure for data is long-term preservation. In order to ensure accessibility, interoperability has to be a key design objective. Truly interoperable data will be technologically, semantically and legally interoperable, thereby maximizing the chances for use, and thus the returns on investment in building the infrastructure. --- ## Benefits of technical interoperability * Reduces friction through *format transparency* * Results in *URLs that don't rot* * Allows for *namespaces that persist* * Encourages *open and published formats* --- ## Benefits of legal interoperability * Lowers barriers caused by incompatible licenses * *No more category errors* * What if I classify free data as protected and protected data as free? * *No more decision paralysis* * What if I use it but it wasn't free? * What if I pass on it, but it was free to be used? * *No more attribution stacking* * Am I obligated to give attribution to everyone who contributed to the dataset? --- ## Problems with data as property * Property is *controlled and protected by licenses* * License conditions are *triggered by making a copy* ??? Problems arise when data are treated as property rather than a shared resource. Works of creative authorship are intellectual property, and can be protected by applying licenses. Data, in particular raw data, are naturally occurring facts that may be discovered, not created. 
They have to remain free for the benefit of everyone. --- ## Licensing inappropriate for scientific data * Scientific data carry very different protection regimes around the world, as do the databases in which the data reside * This creates a complex ecosystem that is hard to unify with a single, scalable license system * Scientific data are usually not copyrightable * Digital scientific data are frequently hosted, not copied ??? Problems arise when data are treated as property rather than a shared resource. Works of creative authorship are intellectual property, and can be protected by applying licenses. Data, in particular raw data, are naturally occurring facts that may be discovered, not created. They have to remain free for the benefit of everyone. --- ## Fact/expression dichotomy * In the U.S., *facts are free* but original creative expression is protected * What is fact and what is not? * You have 20 GB of processed meteorological data—quickly now, is it "fact" or is it "creative content"? * What can and what cannot be covered by licenses? * Creative expression can be licensed, but facts cannot ??? The fact/expression dichotomy is a concept in copyright law which states that copyright does not protect ideas. Only the way in which an idea has been expressed is protectable by copyright. Some courts have recognized that there are particular ideas that can only be expressed intelligibly in a limited number of ways. In these cases even the expression is unprotected, or protection is limited to verbatim copying only. This is called the merger doctrine in the United States. --- ## Diminishing freedom * When mixing licenses, the final license is as restrictive as the most restrictive license * Scientific research almost always mixes data * At each mix, we would get progressively less free data ??? This can have a chilling effect on innovation. 
Businesses hate uncertainty, and not knowing what they might be liable for in the future because of the license of some dataset they used today creates uncertainty. --- ## Mixing data → in new licenses
Orig license (↓) may be licensed as (→)
PD
BY
BY-NC
BY-NC-ND
BY-NC-SA
BY-ND
BY-SA
PD
✓
✓
✓
✓
✓
✓
✓
BY
✓
✓
✓
✓
✓
✓
BY-NC
✓
✓
✓
BY-NC-ND
BY-NC-SA
✓
BY-ND
BY-SA
✓
??? Any dataset licensed with a non-derivative clause does not legally allow the creation of new datasets from it. --- ## License to thrill innovation? * New licenses are as restrictive as the most restrictive licenses

| | PD/CC0 | BY | BY-NC | BY-NC-ND | BY-NC-ND-SA | BY-NC-SA | BY-ND | BY-SA | ARR |
|---|---|---|---|---|---|---|---|---|---|
| PD/CC0 | PD/CC0 | BY | BY-NC | BY-NC-ND | BY-NC-ND-SA | BY-NC-SA | BY-ND | BY-SA | ARR |
| BY | BY | BY | BY-NC | BY-NC-ND | BY-NC-ND-SA | BY-NC-SA | BY-ND | BY-SA | |
| BY-NC | BY-NC | BY-NC | BY-NC | BY-NC-ND | BY-NC-ND-SA | BY-NC-SA | | | |
| BY-NC-ND | BY-NC-ND | BY-NC-ND | BY-NC-ND | BY-NC-ND | BY-NC-ND-SA | | BY-NC-ND | | |
| BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | |
| BY-NC-SA | BY-NC-SA | BY-NC-SA | BY-NC-SA | | BY-NC-ND-SA | BY-NC-SA | | | |
| BY-ND | BY-ND | BY-ND | | BY-NC-ND | BY-NC-ND-SA | | BY-ND | | |
| BY-SA | BY-SA | BY-SA | | | BY-NC-ND-SA | | | BY-SA | |
| ARR | | | | | | | | | ARR |
??? When disparate datasets are mixed together, the license of the resulting dataset is only as open as the most restrictive license among the component sets. Hence, licensed data tend toward fewer degrees of freedom as they are mixed with other data. --- ## License to chill innovation * Many license combinations are invalid, so those data can't be mixed

| | PD/CC0 | BY | BY-NC | BY-NC-ND | BY-NC-ND-SA | BY-NC-SA | BY-ND | BY-SA | ARR |
|---|---|---|---|---|---|---|---|---|---|
| PD/CC0 | | | | | | | | | |
| BY | | | | | | | | | ✗ |
| BY-NC | | | | | | | ✗ | ✗ | ✗ |
| BY-NC-ND | | | | | | ✗ | | ✗ | ✗ |
| BY-NC-ND-SA | | | | | | | | | ✗ |
| BY-NC-SA | | | | ✗ | | | ✗ | ✗ | ✗ |
| BY-ND | | | ✗ | | | ✗ | | ✗ | ✗ |
| BY-SA | | | ✗ | ✗ | | ✗ | ✗ | | ✗ |
| ARR | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | |
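The "most restrictive license wins" rule behind these matrices can be sketched in a few lines. This is a hypothetical model, not a Creative Commons tool: each license is represented as the set of restriction elements it carries, and mixing takes the union of those sets. The validity question (the ✗ cells) is deliberately ignored here.

```python
# Hypothetical sketch of the "most restrictive license wins" rule:
# model each license as its set of restriction elements, mix by union.
# Whether a mix is legally valid at all (the crossed-out combinations)
# is ignored in this sketch.

LICENSES = {
    "PD/CC0":      frozenset(),
    "BY":          frozenset({"BY"}),
    "BY-NC":       frozenset({"BY", "NC"}),
    "BY-ND":       frozenset({"BY", "ND"}),
    "BY-SA":       frozenset({"BY", "SA"}),
    "BY-NC-ND":    frozenset({"BY", "NC", "ND"}),
    "BY-NC-SA":    frozenset({"BY", "NC", "SA"}),
    "BY-NC-ND-SA": frozenset({"BY", "NC", "ND", "SA"}),
}

def mix(*names):
    """License of a dataset built from datasets under the given
    licenses: the union of all their restriction elements."""
    combined = frozenset().union(*(LICENSES[n] for n in names))
    for name, elements in LICENSES.items():
        if elements == combined:
            return name
    # Fall back to spelling out an unnamed combination of elements.
    return "-".join(e for e in ("BY", "NC", "ND", "SA") if e in combined)

print(mix("PD/CC0", "BY"))          # BY
print(mix("BY-NC", "BY-ND"))        # BY-NC-ND
print(mix("BY-ND", "BY-NC-ND-SA"))  # BY-NC-ND-SA
```

Each mix only ever adds restriction elements, which is exactly the "diminishing freedom" point: repeated mixing drives data toward the least free license involved.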
??? When disparate datasets are mixed together, the license of the resulting dataset is only as open as the most restrictive license among the component sets. Hence, licensed data tend toward fewer degrees of freedom as they are mixed with other data. --- ## Contracts unsuitable for data * Difficult to track * Difficult to enforce * Apply only between consenting parties, a.k.a. *privity of contract* * Legal requirement potentially a disincentive to use ??? When data are treated as a networked, shared resource, users are encouraged to tap into data sources rather than copy them. This circumvents the issues triggered by copying. Note that this applies to large datasets that would be impractical to copy and replicate because of their size. --- ## Neither license nor contract * Licenses and contracts are not the same thing * Both are inappropriate for scientific data ??? When data are treated as a networked, shared resource, users are encouraged to tap into data sources rather than copy them. This circumvents the issues triggered by copying. Note that this applies to large datasets that would be impractical to copy and replicate because of their size. --- ## Converge toward the public domain * Maximize legal interoperability * Allow commercial use * Science is innovative when the market is able to convert scientific outcomes into affordable, commodity products * Don't impose share-alike * "Do as I do" leads to an impasse because I can't foresee all possible future scenarios --- ## Choose PD or CC0 * Public Domain (PD) works best when data are completely free of copyrightable elements. * CC0 (pronounced "CC Zero") is more appropriate when copyrightable elements and facts are mixed. This is the more usual case. --- ## Reasoning behind CC0 ![Protocol for implementing open data](../img/protocol_for_open_data.png) [Open Access Data Protocol](http://sciencecommons.org/projects/publishing/open-access-data-protocol/) ??? 
The protocol is motivated by the interoperability of scientific data. The volume of scientific data, and their interconnectedness, makes integration a necessity. For example, life scientists must integrate data from across biology and chemistry to comprehend disease and discover cures, and climate change scientists must integrate data from wildly diverse disciplines to understand our current state and predict the impact of new policies. The technical challenge of such integration is significant. The forest of terms and conditions around data makes integration legally difficult. One approach might be to develop and recommend a single license: any data with this license can be integrated with any other data under this license. --- ## Why a new protocol? ![Database protocol](../img/db_protocol.png) [Database protocol](http://sciencecommons.org/resources/faq/database-protocol/) ??? But this approach, which implicitly builds on intellectual property rights and the ideas of licensing as understood in software and culture, is difficult to scale for scientific uses. There are too many databases under too many terms already, and it is unlikely that any one license or suite of licenses will have the correct mix of terms to gain critical mass and allow massive-scale machine integration of data. Therefore we instead lay out principles for open access data and a protocol for implementing those principles, and we distribute an Open Access Data Mark and metadata for use on databases and data available under a successful implementation of the protocol. --- ## Problems with "data as property" * Property use is governed by licenses * Requiring attribution is a license term * Science is governed by norms * Giving proper citation is a norm * (PD or CC0) + Science Commons norms help converge data toward the public domain --- ## Norms, not contracts * Encourage citation through community norms, not through contracts * Contracts are hard to implement, expensive to enforce ??? 
Requesting and encouraging one type of behavior, such as citation, through norms and terms of use rather than as a legal requirement based on copyright or contracts is preferable. We are aware that different disciplines and jurisdictions call for different approaches, and this is not always a one-size-fits-all solution. By requesting behavior through norms and terms of use rather than a legal construct, various scientific disciplines have the ability to develop their own norms for citation, allowing for legal certainty without constraining one community to the norms of another. --- ## Data deluge * Internet hosts are growing
* Amount of public data is growing at an exponential rate * Projected to grow to almost *1,000 exabytes* this year * The LHC alone will produce up to *2 GB/second*, 15 petabytes a year * Humanly impossible to make sense of it all. Can computers help? ??? Data growth figures courtesy [IDC. 2007. The Expanding Digital Universe. EMC Corporation](http://ftp.isc.org/www/survey/reports/current/). Large Hadron Collider information from [WLCG Worldwide LHC Computing Grid](http://lcg.web.cern.ch/LCG/public/) --- ## The web that thinks * Semantically structured data can assist * Computer programs can identify relevant data and string them together in ways that make sense * Instead of humans spending a lot of time looking through search results, search results are more targeted and meaningful --- ## Semantic queries * Getting answers without knowing the detailed syntactic structure of a database * Find landlocked countries with population more than 15 million, then display the results sorted by population. --- ## Normal search engine query Find landlocked countries with population more than 15 million, sorted by population. *392,000 results*. ![Google search query](../img/google_query.png) ??? Example courtesy Feigenbaum and Prud'hommeaux. 2009. SPARQL By Example. http://www.cambridgesemantics.com/2008/09/sparql-by-example/ accessed Feb 9, 2010. --- ## Semantic search engine query Find landlocked countries with population more than 15 million, sorted by population. *8 results*

```sparql
PREFIX type: <http://dbpedia.org/class/yago/>
PREFIX prop: <http://dbpedia.org/property/>

SELECT ?country_name ?population
WHERE {
    ?country a type:LandlockedCountries ;
             rdfs:label ?country_name ;
             prop:populationEstimate ?population .
    FILTER ( ?population > 15000000 &&
             langMatches(lang(?country_name), "EN") ) .
}
ORDER BY DESC(?population)
```

| country_name | population |
|---|---|
| Ethiopia | 82825000 |
| Uganda | 32710000 |
| Nepal | 29331000 |
| Afghanistan | 28150000 |
| Uzbekistan | 27606007 |
| Burkina Faso | 15757000 |
| Niger | 15290000 |
| Malawi | 15263000 |
??? Example courtesy Feigenbaum and Prud'hommeaux. 2009. SPARQL By Example. http://www.cambridgesemantics.com/2008/09/sparql-by-example/ accessed Feb 9, 2010. --- ## Benefits of semantic interoperability… * Allows programmatic retrieval and consumption of information. Computers can be programmed to get
http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&should-sponge=&query=PREFIX+type%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fclass%2Fyago%2F%3E%0D%0APREFIX+prop%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2F%3E%0D%0ASELECT+%3Fcountry_name+%3Fpopulation%0D%0AWHERE+%7B%0D%0A++++%3Fcountry+a+type%3ALandlockedCountries+%3B%0D%0A+++++++++++++rdfs%3Alabel+%3Fcountry_name+%3B%0D%0A+++++++++++++prop%3ApopulationEstimate+%3Fpopulation+.%0D%0A++++FILTER+%28%3Fpopulation+%3E+15000000+%26%26+langMatches%28lang%28%3Fcountry_name%29%2C+%22EN%22%29%29+.%0D%0A%7D+ORDER+BY+DESC%28%3Fpopulation%29&format=text%2Fhtml&debug=on&timeout=)
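The unwieldy URL above is nothing more than the same SPARQL query percent-encoded into a GET request. Here is a short sketch (the endpoint and parameter names are taken from the URL above; treat them as assumptions about the DBpedia service) of how a program might construct such a request:

```python
# Sketch: build a machine-consumable request URL like the one above
# by percent-encoding a SPARQL query for the DBpedia endpoint.
from urllib.parse import urlencode

SPARQL_ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX type: <http://dbpedia.org/class/yago/>
PREFIX prop: <http://dbpedia.org/property/>
SELECT ?country_name ?population
WHERE {
    ?country a type:LandlockedCountries ;
             rdfs:label ?country_name ;
             prop:populationEstimate ?population .
    FILTER ( ?population > 15000000 &&
             langMatches(lang(?country_name), "EN") ) .
} ORDER BY DESC(?population)
"""

def sparql_url(query, fmt="text/html"):
    """Return a GET URL asking the endpoint to run `query` and reply
    in the given format (e.g. HTML for people, JSON for programs)."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": fmt})

# A program, not a person, consumes this URL -- e.g. pass it to
# urllib.request.urlopen() to fetch the result rows.
print(sparql_url(QUERY)[:60])
```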
--- ## Benefits of semantic interoperability * So humans can
find landlocked countries with population more than 15 million, sorted by population
??? Semantic interoperability allows mixing structured and unstructured data. Humans can retrieve information using familiar syntax, and computers can be programmed to extract information programmatically, thereby increasing the payback from investments in the repository. --- ## Building a semantically-aware e-Infrastructure * Data providers have to use existing vocabularies * Annotate their data properly in RDF * Make data available on the web as linked open data ??? The cost of semantically structuring data will drop so that it will become possible for everyone to provide and consume data easily. But getting there is not automatic or easy. We have to make conscious decisions today to get there tomorrow. --- ## Visionary data: Proteome Commons ![Proteome Commons CC0 waiver](../img/proteomecommons.png) [Proteome Commons](https://proteomecommons.org/tranche/examples/sciencecommons/choose.jsp) ??? ProteomeCommons is a public proteomics database for annotations and other information linked to the Tranche data repository and to other resources. It provides public access to free, open-source proteomics tools and data. The ProteomeCommons.org Tranche network is a cloud of computers to which one can upload files and from which one can download them. All files uploaded to the network are replicated several times to protect against accidental loss. Files uploaded to the network can be of any size and file type, and can be encrypted with a passphrase of your choosing. ProteomeCommons makes all its data available only under a CC0 waiver. --- ## Visionary data: Tropical Disease Initiative ![Tropical Disease Initiative CC0 waiver](../img/tropicaldisease.png) [Tropical Disease Initiative](http://tropicaldisease.org/kernel/advanced-search/) ??? The Tropical Disease Initiative aims to provide a "kernel" for open source drug discovery. 
Such a kernel should allow scientists from laboratories, universities, institutes, and corporations to work together for a common cause: finding new drugs against tropical diseases such as malaria or tuberculosis. The TDI kernel (v1.0) includes 297 potential drug targets against the 10 selected genomes and is freely and publicly accessible. --- ## Visionary data: SIDER ![SIDER CC0 waiver](../img/sider.png) [SIDER Side Effect Resource](http://sideeffects.embl.de/download/) ??? The SIDER Side Effect Resource represents an effort to aggregate dispersed public information on side effects. To our knowledge, no such resource exists in machine-readable form despite the importance of research on drugs and their effects. --- ## Visionary data: Personal Genome Project ![Personal Genome Project CC0 waiver](../img/personalgenomes.png) [Personal Genome Project](http://www.personalgenomes.org/) ??? The mission of the Personal Genome Project is to encourage the development of personal genomics technology and practices that: are effective, informative, and responsible; yield identifiable and improvable benefits at manageable levels of risk; and are broadly available for the good of the general public. To achieve this mission we will build a framework for prototyping and evaluating personal genomics technology and practices at increasing scales. The Personal Genome Project is committed to making research data from the PGP freely available to the public under a CC0 waiver. --- ## Visionary data: WisconsinView ![WisconsinView CC0 waiver](../img/wisconsinview.png) [WisconsinView](http://www.wisconsinview.org/) ??? Since 2004, WisconsinView has made aerial photography and satellite imagery of Wisconsin available to the public for free over the web. As part of the AmericaView consortium, WisconsinView supports access and use of these imagery collections through education, workforce development, and research. 
Starting June 30, 2009, WisconsinView is making available all of its more than 6 terabytes of imagery data under the CC0 Protocol provided by Creative Commons. --- ## Visionary data: MichiganView ![MichiganView CC0 waiver](../img/michiganview.png) [MichiganView](http://www.michiganview.org/) ??? The MichiganView consortium makes aerial photography and satellite imagery of Michigan available to the public for free over the web. As part of the AmericaView consortium, MichiganView supports access and use of these imagery collections through education, workforce development, and research. Starting Jan 28, 2010, MichiganView is making available all of its more than 93 gigabytes of Landsat 5 and 7 and NAIP imagery data in the public domain using the CC0 Waiver provided by Creative Commons. --- ## Key design principles * Resist the temptation to treat data as property * Embrace the potential to treat them as a networked resource * Aim for maximum reuse * Ensure freedom to integrate * Leverage existing open infrastructure * Build and nurture a community around open data ??? Stating the design principles will allow one to develop an e-Infrastructure that has been built from the inside out to meet those objectives. --- ## Preserve data for reuse The only reason we put data in a computer is so we can take them out again. The data that are easier to get and work with get reused more ??? The only reason we put data in a computer is so we can take them out again. The data that are easier to get and work with get reused more. --- ## Measuring success of e-Infrastructure * Is it easy to put data in? * Are data secure for the long term? * Are private data private and public data easily accessible? * Is it easy to take data out? * Are the conditions under which the data may be used easy to understand and implement? * Can data be retrieved programmatically? * Is there an active community using the e-Infrastructure? ??? 
It is important to have metrics against which the success of an e-Infrastructure can be measured. These metrics allow one to use resources most efficiently. The success of an e-Infrastructure can be measured against its objectives -- Is it easy to put data in? Can data be kept secure for the long term? Can private data be kept private and public data be easily accessible? Is it easy to take data out? Are the conditions under which the data may be used easy to understand and implement? Can data be retrieved programmatically? --- ## Interoperability is a feature technology + law + meaning + community working together for a successful e-Infrastructure for scientific data ??? A successful e-Infrastructure requires many components: technology, a legal framework, meaningful structure, and, more important than anything, a community that nourishes and uses the data. The community that uses open data is a varied one -- researchers, educators, students, government agencies, entrepreneurs, established businesses, and hackers. They have no established identity in common beyond a shared need for unencumbered data. This group has to be nurtured. It is expensive to make data open and available, and expensive to create a long-lasting e-Infrastructure. It is even more expensive not to do it. The old adage fits perfectly: if you think education is expensive, try ignorance.