The Rise of Text Mining and Growing Corruption in STM Publishing

What's a few billion snippets among friends?

Technologist Carl Malamud, who once challenged U.S. state governments over the practice of charging the public to read public statutes, is taking on the scientific, technical and medical (STM) publishing industry by releasing a gigantic index of the words and short phrases contained in more than 100 million journal articles — including many papers one would normally have to pay to read.

The project is intended to unlock the world’s research papers to computerized analysis while evading copyright protections.

The catalogue, which was released on Oct. 7 and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. Malamud has described it as an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers. He released the files under the auspices of Public Resource, a non-profit corporation in Sebastopol, Calif., that he founded.

Malamud says that because his index doesn’t contain the full text of articles, but only sentence snippets up to five words long, releasing it does not breach publishers’ copyright restrictions on the reuse of paywalled articles. However, legal experts expect that publishers might question the legality of how Malamud created the index in the first place.
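To make the structure of such an index concrete, here is a minimal Python sketch. It is not Malamud's code or file format, neither of which is described here. The function names, article IDs and sample text are all hypothetical, and the only detail taken from the description above is the five-word cap on snippet length.

```python
from collections import defaultdict


def ngrams(text, max_n=5):
    """Yield every run of 1 to max_n consecutive words in text."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(articles, max_n=5):
    """Map each snippet to the set of article IDs in which it appears."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for gram in ngrams(text, max_n):
            index[gram].add(article_id)
    return index


# Made-up article IDs and text, for illustration only.
articles = {
    "article-1": "the gene regulates cell growth",
    "article-2": "cell growth slows under stress",
}
index = build_index(articles)
print(sorted(index["cell growth"]))  # ['article-1', 'article-2']
```

A lookup returns only the identifiers of the articles that contain a phrase. Because no stored snippet runs longer than five words, the index on its own holds no continuous passage of an article longer than that.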

Computer scientists already text mine papers to build databases of genes, drugs and chemicals found in the literature, and to explore papers’ content faster than a human could read. But they often note that publishers ultimately control the speed and scope of their work, and that scientists are restricted to mining only open-access papers, or those articles they (or their institutions) have subscriptions to.

And although free search engines such as Google Scholar have, with publishers’ agreement, indexed the text of paywalled literature, they allow users to search with only certain types of text queries, and they restrict automated searching, making large-scale analysis impossible.

Malamud had to get copies of the 107 million articles referenced in the index in order to create it. He’s not saying how he got them. Instead he emphasizes that researchers will not have access to the full texts of the papers, which are stored in a secured, undisclosed location in the United States. Protections aside, publishers will be interested to know if Malamud used illegal sources such as Sci-Hub to acquire the copies, but a legal challenge will be costly and is not likely to permanently quash Malamud’s index.

Scientific Articles for Sale on Black Market

Pressure to publish and the emergence of the author-pays open access model have led to growing corruption in STM publishing. The latest threat is the “co-authorship publisher,” a shady operation that hawks ready-made papers written by international experts on a wide range of scholarly topics.

An author can buy a position or an entire article. The papers have already been written, translated, proofread and formatted, and the journal has been chosen for publication. The customer simply needs to choose a topic and a position in the article, and pay. The cost per position in the article depends on the journal's publication fee policy.

The publisher oversees the article's publication and indexing in the Scopus and Web of Science databases, making changes based on reviewer comments. The publisher ensures the confidentiality of the article position purchase by performing a scientific rewrite of the article title and abstract during the journal publication process. Any dishonest scientist can co-author a publication. Information about co-authors is made available during final manuscript coordination, before submission. To be the sole author, the purchaser must purchase the entire article.

Until payment is made, the publisher withholds the journal and article titles. After the author team is assembled and approved, the publisher sends the article to the journal's editorial office for review. An article review takes 1 to 3 months on average. After article acceptance, publication usually takes 1 to 2 months.

Corruption of this type allows a researcher who buys research to win promotions and positions that cost them nothing but money.

A Closer Look at the STM Publishing Industry

These are just a couple of recent developments in the STM publishing industry. For a more comprehensive look at the market, check out reports from Simba Information, a leading authority for market intelligence on the education and professional publishing industries.

About the author: Dan Strempel is a Senior Analyst at Simba Information, where he has authored more than 26 studies over the past 14 years. His research has been cited in numerous publications including CNBC, Newsweek, Publishing Executive, The Association of American Publishers, and The Society for Scholarly Publishing. You can follow Dan on Twitter, where he shares industry news and analysis.
