name: intro
class: center, middle

# Legal Implications of Text and Data Mining

## Creative Commons • August 2015

Puneet Kishor (Plazi)

Released under a [CC0 Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).

---
layout: true

---

## Help

* Notes are hidden, but may be seen by pressing **P** on your keyboard.
* Press **C** to clone a show.
* Press **H** for other keyboard shortcuts.

???

notes here

---

## TDM Defined

.left-column[
### 1982
]

.right-column[
> Automatically generating logical representations of text passages… by means of an analysis of the coherence structure of the passages.

Jerry R. Hobbs, Donald E. Walker, and Robert A. Amsler. 1982. Natural language access to structured text. In Proceedings of the 9th conference on Computational linguistics - Volume 1 (COLING '82), Ján Horecký (Ed.), Vol. 1. Academia Praha, Czechoslovakia, 127-132. doi:10.3115/991813.991833
]

???

TDM defined over the years. Made easier by high-speed networks, cheap and fast CPUs, cheap memory, and market needs for search and behavioral analysis. Used in science to deduce new data from very large datasets.

---
## TDM Defined

.left-column[
### 1982
### 1999
]

.right-column[
> (semi)automated discovery of trends and patterns across very large datasets …

> Use of large online text collections to discover new facts and trends …

> (Automating) the tedious parts of the text manipulation process and (integrating) underlying computationally-driven text analysis with human-guided decision making within exploratory data analysis over text

Marti A. Hearst. 1999. Untangling text data mining. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics (ACL '99), Stroudsburg, PA, USA, 3-10. doi:10.3115/1034678.1034679
]

???

TDM defined over the years. Made easier by high-speed networks, cheap and fast CPUs, cheap memory, and market needs for search and behavioral analysis. Used in science to deduce new data from very large datasets.

---
## TDM Defined

.left-column[
### 1982
### 1999
### 2008
]

.right-column[
> The use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature.

Cohen, K. Bretonnel; Hunter, Lawrence (2008). "Getting Started in Text Mining". PLoS Computational Biology 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579. PMID 18225946.
]

???

TDM defined over the years. Made easier by high-speed networks, cheap and fast CPUs, cheap memory, and market needs for search and behavioral analysis. Used in science to deduce new data from very large datasets.

---

## Typical TDM Workflow

![TDM workflow](../img/tdm-workflow.png)

???

Analyze a corpus of text (A) using TDM algorithms such as optical character recognition, natural language processing, named-entity tagging, phonetic analysis, stemming, n-grams, etc. (B) to derive some data (C). Humans manually inspect the derived data (D) against some truth (E) to determine whether the analysis produces data as good as if the corpus had been analyzed by humans (F). If not, the analysis is tweaked and run again on the corpus. If the derived data are as good as or better than what humans produce, the derived data become the new truth. Research papers are then published, which, of course, become part of the corpus.

---
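## Typical TDM Workflow: A Sketch

The loop described in the workflow notes can be sketched in a few lines of Python. This is a minimal illustration, not the method of any project cited in this deck: the toy `corpus`, the hand-made `truth` set, the bigram-counting `mine` function, and the 0.9 precision target are all assumptions made for the example.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Return the word n-grams of a text as tuples of lowercase tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mine(corpus, min_count):
    """Steps B to C: keep bigrams seen at least min_count times in the corpus."""
    counts = Counter(gram for doc in corpus for gram in ngrams(doc))
    return {gram for gram, count in counts.items() if count >= min_count}


def precision_recall(derived, truth):
    """Steps D and E: compare the derived data with a human-curated truth set."""
    hits = len(derived & truth)
    precision = hits / len(derived) if derived else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Step A: a toy corpus (hypothetical) and a hand-made truth set (hypothetical).
corpus = [
    "Text mining supports data mining.",
    "Data mining and text mining find patterns.",
]
truth = {("text", "mining"), ("data", "mining")}

# Step F: tweak the analysis and rerun it on the whole corpus
# until the derived data are good enough.
for min_count in (1, 2, 3):
    derived = mine(corpus, min_count)
    precision, recall = precision_recall(derived, truth)
    if precision >= 0.9:
        break
```

With this toy corpus the first pass (`min_count=1`) keeps every bigram and precision is low; raising the threshold to 2 leaves only the two bigrams in `truth`, so the loop stops there. Each tweak reruns the analysis over the entire corpus.

---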
## TDM Examples

.left-column[
### GeoDeepDive
]

.right-column[
a system that helps geoscientists discover information and knowledge buried in the text, tables, and figures of geology journal articles

Zhang, C., V. Govindaraju, J. Borchardt, T. Foltz, C. Ré, and S. Peters. 2013. GeoDeepDive: Statistical inference using familiar data-processing languages. SIGMOD ’13, New York, New York.
]

---
## TDM Examples

.left-column[
### GeoDeepDive
### Curation
]

.right-column[
Leveraging text mining to improve human curation

Thomas C Wiegers, Allan Peter Davis, K Bretonnel Cohen, Lynette Hirschman and Carolyn J Mattingly. Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD). BMC Bioinformatics 2009, 10:326. doi:10.1186/1471-2105-10-326
]

---
## TDM Examples

.left-column[
### GeoDeepDive
### Curation
### Osteoporosis
]

.right-column[
Discovering a New Link between Genes and Osteoporosis

Varun K. Gajendran, Jia-Ren Lin, David P. Fyhrie, An application of bioinformatics and text mining to the discovery of novel genes related to bone biology, Bone, Volume 40, Issue 5, May 2007, Pages 1378-1388, ISSN 8756-3282, DOI: 10.1016/j.bone.2006.12.067. (http://www.sciencedirect.com/science/article/B6T4Y-4MVVSS1 1/2/df681f901acd33d5f3eceedb36fe441e)
]

---

## Law and TDM

### United States

> (TDM is a kind of non-consumptive use) facilitated by new technologies and increasing computer power, that (does) not directly trade on the underlying creative and expressive purpose of the work being used.

> Copying may include only the non-copyrightable aspects of the works, such as ideas, facts, or algorithms—in which case fair use need not come into play—or it may entail copying some expressive aspects of the work, but only as a means to a non-consumptive end.

Urban, Jennifer. 2010. Updating Fair Use for Innovators and Creators in the Digital Age.
---

## Law and TDM

### United Kingdom

> Researchers want to use every technological tool available, and they want to develop new ones. However, the law can block valuable new technologies, like text and data mining, simply because those technologies were not imagined when the law was formed. In teaching, the greatly expanded scope of what is possible is often unnecessarily limited by uncertainty about what is legal. Many university academics – along with teachers elsewhere in the education sector – are uncertain what copyright permits for themselves and their students.

Hargreaves Report. Copyright Exceptions for the Digital Age.

---

## Law and TDM

### Australia

> There is no specific exception in the Copyright Act for text or data mining. Where the text or data mining process involves the copying, digitisation, or reformatting of copyright material without permission, it may give rise to copyright infringement.

> One issue is whether text mining, if done for the purposes of research or study, would be covered by the fair dealing exceptions. The reach of the fair dealing exceptions may not extend to text mining if the whole dataset needs to be copied and converted into a suitable format. Such copying would be more than a ‘reasonable portion’ of the work concerned.

Non-consumptive Use, Australian Law Reform Commission.
---

## Law and TDM

### Not a lot of case law

> Judge Baer: (Defendants’) participation in the (Mass Digitization Project) and the present application of the (HathiTrust Digital Library) are protected under fair use.

Authors Guild, Inc. v. HathiTrust, 902 F. Supp. 2d 445 (S.D.N.Y. 2012)

> Judge Chin: Google Books provides significant public benefits. It advances the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders… Google's actions in providing the libraries with the ability to engage in activities that advance the arts and sciences constitute fair use.

Authors Guild, Inc. et al. v. Google Inc., U.S. District Court, Southern District of New York, No. 05-08136.

---

## Law and TDM

### Not a lot of case law

> Judge Nelson: We hold that Arriba’s reproduction of Kelly’s images for use as thumbnails in Arriba’s search engine is fair use under the Copyright Act.

Kelly v. Arriba Soft Corp., 336 F.3d 811, 818 (9th Cir. 2003)

> Judge Ikuta: We conclude that Google's fair use defense is likely to succeed at trial, and therefore we reverse the district court's determination that Google's thumbnail versions of Perfect 10's images likely constituted a direct infringement.

Perfect 10, Inc. v. Amazon.com, Inc., 508 F.3d 1146, 1165 (9th Cir. 2007)
---

![TDM workflow](../img/tdm-workflow1.png)

A CC license does not apply to uses such as TDM that qualify as Exceptions and Limitations, and the user does not need to comply with the terms and conditions of the license if TDM doesn’t implicate copyright or similar rights covered by the CC license.

---

![TDM workflow](../img/tdm-workflow1.png)

The CC 4.0 ND licenses specifically grant the rights to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database, provided any Adapted Material is not Shared.

---

**Note:** Data sets resulting from TDM are not necessarily adaptations of the original licensed material. For example, an analysis based on findings would not constitute an adaptation. The data set mined may be an adaptation if it is a modified version of the original.

---
Permissions granted (✔︎ permitted; ✘ not permitted)

| 4.0 license | To mine licensed material for commercial purposes | To produce adapted material | To share licensed material | To share adapted material |
|---|:---:|:---:|:---:|:---:|
| BY | ✔︎ | ✔︎ | ✔︎ | ✔︎ |
| BY-SA | ✔︎ | ✔︎ | ✔︎ | ✔︎ |
| BY-NC | ✘ | ✔︎ | ✔︎ | ✔︎ |
| BY-NC-SA | ✘ | ✔︎ | ✔︎ | ✔︎ |
| BY-ND | ✔︎ | ✔︎ | ✔︎ | ✘ |
| BY-NC-ND | ✘ | ✔︎ | ✔︎ | ✘ |

**Note:** The above chart applies only if permission is needed as a matter of copyright and similar rights. If permission is not needed, there is no need to comply with the CC License Terms and Conditions when doing TDM.

---

![TDM workflow](../img/tdm-workflow2.png)

Any modification to the TDM analysis requires running it again on the corpus, a time-consuming process made easier by a persistent cache of the corpus. Publisher contracts can create hurdles in creating such a persistent cache.

---

![TDM workflow](../img/tdm-workflow3.png)

Licenses/contracts on the corpus can affect the license under which the final results and data are published. What would happen if the corpus were made of entities under different kinds of licenses?

---
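## Mixed-License Corpus: A Sketch

One way to reason about the mixed-license question is to encode the permissions chart as a lookup table. The sketch below is a hypothetical illustration: the `PERMISSIONS` dictionary transcribes the chart shown earlier, while the rule in `corpus_permissions` (a mixed corpus is only as permissive as its most restrictive license) is an assumption of this example, not something the licenses state. It ignores attribution, ShareAlike compatibility, and the case where no permission is needed at all.

```python
ACTIONS = (
    "mine_commercially",
    "produce_adapted",
    "share_licensed",
    "share_adapted",
)

# Rows transcribed from the CC 4.0 permissions chart (True = permitted).
PERMISSIONS = {
    "BY":       (True,  True, True, True),
    "BY-SA":    (True,  True, True, True),
    "BY-NC":    (False, True, True, True),
    "BY-NC-SA": (False, True, True, True),
    "BY-ND":    (True,  True, True, False),
    "BY-NC-ND": (False, True, True, False),
}


def corpus_permissions(licenses):
    """Assumption: an action is allowed only if every license in the corpus allows it."""
    rows = [PERMISSIONS[name] for name in licenses]
    return {
        action: all(row[i] for row in rows)
        for i, action in enumerate(ACTIONS)
    }


# A hypothetical corpus whose items carry three different licenses.
mixed = corpus_permissions(["BY", "BY-NC", "BY-ND"])
```

Under that conservative assumption, the example corpus could be used to produce adapted material and its licensed material could be shared, but it could not be mined for commercial purposes and adapted material could not be shared.

---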
## Force11 Data Citation Principles

.left-column[
#### Importance
]

.right-column[
Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
]

---

## Force11 Data Citation Principles

.left-column[
#### Importance
#### Credit and attribution
]

.right-column[
Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.
]

---

## Force11 Data Citation Principles

.left-column[
#### Importance
#### Credit and attribution
#### Evidence
]

.right-column[
Where a specific claim rests upon data, the corresponding data citation should be provided.
]

---

## Force11 Data Citation Principles

.left-column[
#### Importance
#### Credit and attribution
#### Evidence
#### Unique identification
]

.right-column[
A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
]

---
## Force11 Data Citation Principles

.left-column[
#### Importance
#### Credit and attribution
#### Evidence
#### Unique identification
#### Access
]

.right-column[
Data citations should facilitate access to the data themselves and to such associated metadata, documentation, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.
]

---

## Force11 Data Citation Principles

.left-column[
#### Importance
#### Credit and attribution
#### Evidence
#### Unique identification
#### Access
#### Persistence
]

.right-column[
Metadata describing the data, and unique identifiers should persist, even beyond the lifespan of the data they describe.
]

---

## Force11 Data Citation Principles

.left-column[
#### Importance
#### Credit and attribution
#### Evidence
#### Unique identification
#### Access
#### Persistence
#### Versioning and granularity
]

.right-column[
Data citations should facilitate identification and access to different versions and/or subsets of data. Citations should include sufficient detail to verifiably link the citing work to the portion and version of data cited.
]

---
## Force11 Data Citation Principles

.left-column[
#### Importance
#### Credit and attribution
#### Evidence
#### Unique identification
#### Access
#### Persistence
#### Versioning and granularity
#### Interoperability and flexibility
]

.right-column[
Data citation methods should be sufficiently flexible to accommodate the variant practices among communities but should not differ so much that they compromise interoperability of data citation practices across communities.
]

---

![TDM workflow](../img/tdm-workflow4.png)

**Importance Principle:** Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.

*Possible only if the corpus and the data extracted from it are made available to the public.*

---

![TDM workflow](../img/tdm-workflow5.png)

**Credit and Attribution Principle:** Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.

*Possible only if the published literature and the corpus that it cites are made available to the public.*

---

![TDM workflow](../img/tdm-workflow1.png)

**Note:** A CC License does not apply to uses (such as TDM) that qualify as Exceptions and Limitations, and the user does not need to comply with the terms and conditions of the license if the TDM doesn’t implicate copyright.

---

![TDM workflow](../img/tdm-workflow6.png)

**Evidence Principle:** Where a specific claim rests upon data, the corresponding data citation should be provided.

*Possible only if the corpus is made available to the public.*

---

![TDM workflow](../img/tdm-workflow6.png)

**Access Principle:** Data citations should facilitate access to the data themselves and to such associated metadata, documentation, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.

*Possible only if the corpus can be made available to the public.*