Three Strategies for TDM

Sunday, May 4, 2014

Increase awareness of text and data mining (TDM) among researchers who may be curious about it and might benefit from it, but don't know enough about it.
- The idea here is to create a groundswell of TDM demand, so that demand itself drives this area. As more and more researchers want to do TDM, both their institutions and publishers will come under pressure to make TDM possible and easy. Researchers will drive the agenda, as I believe they should;
- Explain to them the consequences of restrictive licenses and contracts, and educate them on why permissive legal instruments are good for TDM (everyone who publishes is potentially a consumer as well, so if we publish our stuff under restrictive licenses, it will come back to bite us);
Develop TDM-friendly contractual boilerplate that institutions can use to negotiate favorable contracts with publishers for non-OA corpora;
Develop a pre-processed, TDM-ready archive with a uniform API, with all of its open source tools and lexicons published under a completely free and open license, and hosted in a distributed way at respected institutions such as the Internet Archive and CERN. Think of it this way: what the Google syntactic n-gram database is to scanned books, this archive would be to scientific literature.
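To make "pre-processed" a little more concrete, here is a minimal Python sketch of the kind of n-gram counting such an archive might do ahead of time, so that researchers would not have to. It is only an illustration: the function names, the naive whitespace tokenization, and the two-sentence corpus are all assumptions made up for this example, not part of any existing archive or API.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-token tuples found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(documents, n=2):
    """Count n-grams across an iterable of plain-text documents."""
    counts = Counter()
    for text in documents:
        # Assumption: lowercase plus whitespace splitting stands in for
        # real tokenization, which would be far more careful.
        tokens = text.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical corpus, invented purely for illustration.
corpus = [
    "text mining needs open licenses",
    "open licenses help text mining",
]
counts = ngram_counts(corpus, n=2)
```

With the counts computed once and published openly, a researcher could look up something like `counts[("text", "mining")]` directly instead of negotiating access to, and re-processing, the underlying full text.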