For scientists who have no other option but to do analysis on closed/limited access corpuses, a discussion on the merits and drawbacks of text-mining such corpuses is irrelevant. For them, it is a choice of either doing their science or not doing it at all.
But let us consider the scientists who do have a choice. This discussion really applies only to this latter group, who can choose either to conduct their research using alternative corpuses or to dig their heels in and insist upon more liberal licensing terms in line with the spirit of open science.
Restricting myself to just this group of scientists, I see no strength or opportunity in conducting text mining on closed/limited/restricted access text under onerous/restrictive licensing:
Friends of mine have been working on a text-mining project in the course of which they are building a very large corpus that is off-limits to the general public. This is because of a special and restrictive license negotiated between the university and a big-name publisher:
This describes our starting point. The university had already signed draconian agreements that gave up rights that we would have normally had under copyright law.
Re-negotiation of terms is what we have spent a lot of time doing, so #2 is part of what we are still doing now. It is a pain in the ass and slow, as multiple back-and-forths are required. What we are doing is new.
To some extent #3 applies, but for the most part we have the range of uses we would like. There are gray areas about what we can and cannot distribute/do because of the multiple steps of separation between original document and a result/product. Nobody has access to the original documents, nobody has access to “full corpus downloads” of first-step software output from those documents (e.g., OCR, NLP). I think that satisfies the contracts. Use after that point is outside of the restrictions of any kind, though we expect citation.
The above experience seems quite in line with the kind of outcome we can expect if we negotiate these one-off terms with the publishers. If one compromises on principle, if the principle is negotiable, then it is not a principle. I do believe that one has to take a truly principled stance against the publishers. This is particularly meaningful (and problematic) for a researcher who has to either stay true to the principle and forgo the research or do the research at the cost of the principle.