Text-mining limited access corpuses

Tuesday, January 30, 2018

For scientists who have no other option but to do analysis on closed/limited access corpuses, a discussion on the merits and drawbacks of text-mining such corpuses is irrelevant. For them, it is a choice of either doing their science or not doing it at all.

But let us consider the scientists who do have a choice. This discussion really applies only to this latter group, who can choose either to conduct their research using alternative corpuses or to dig their heels in and insist upon more liberal licensing terms in line with the spirit of open science.

Restricting myself to just this group of scientists, I see no strength or opportunity in conducting text mining on closed/limited/restricted access text under onerous/restrictive licensing:

  • licenses that restrict sharing of text mining output undermine the ideals and even the practical motivations of academia
  • agreeing to the restrictive terms of publishers potentially weakens subsequent claims to access scientific corpuses under open terms

Friends of mine have been working on a text-mining project in the course of which they are building a very large corpus that is off-limits to the general public. This is because of a special and restrictive license negotiated between the university and a big-name publisher. I put three questions to them; their replies follow each question:

  1. If the publisher had put very onerous terms in the license agreement that allowed you to download the papers, would you have done the project?

This describes our starting point. The university had already signed draconian agreements that gave up rights that we would normally have had under copyright law.

  2. Let’s assume that the terms were onerous enough to be unacceptable to you. But doing that research was really important for you. Would you have accepted the terms or would you have chosen to forgo that research and do something else?

Renegotiating terms is what we have spent a lot of time doing, so #2 is part of what we are still doing now. It is a pain in the ass and slow, as multiple back-and-forths are required. What we are doing is new.

  3. Do you in any way feel hamstrung by the terms now that you are well into your project? Any regrets about having signed them?

To some extent #3 applies, but for the most part we have the range of uses we would like. There are gray areas about what we can and cannot distribute or do because of the multiple steps of separation between the original document and a result or product. Nobody has access to the original documents, and nobody has access to “full corpus downloads” of first-step software output from those documents (e.g., OCR, NLP). I think that satisfies the contracts. Use after that point falls outside the restrictions entirely, though we expect citation.


The above experience seems quite in line with the kind of outcome we can expect if we negotiate these one-off terms with the publishers. If one compromises on a principle, if the principle is negotiable, then it is not a principle. I do believe that one has to take a truly principled stance against the publishers. This is particularly meaningful (and problematic) for a researcher who has to either stay true to the principle and forgo the research or do the research at the cost of the principle.