
The use of scientific software tools often goes unmentioned in research articles. Credit: BalanceFormCreative/Shutterstock
Software is a key component of modern scientific research. But in many cases the software has not been formally released or cited in the literature, making it difficult for researchers and developers, and the organizations that fund them, to quantify its impact. A newly released data set aims to fill that gap.
Developed by the Chan Zuckerberg Initiative (CZI), a scientific funder based in Redwood City, California, the CZ Software Mentions data set catalogs mentions of software in the text of scientific articles, rather than formal citations1. With 67 million mentions from nearly 20 million full-text research articles, the data set, released on 28 September, is the largest ever database of mentions of scientific software, says Dario Taraborelli, a science program officer at CZI.
“If you look at the major breakthroughs in science over the past decade, not just in biomedicine, they have consistently been computational in nature,” says Taraborelli, pointing to examples such as predicting protein folding and imaging black holes. “And scientific open-source software in particular has been at the heart of these breakthroughs.”
Through its Essential Open Source Software for Science (EOSS) program, CZI has committed US$40 million over three years to support programmers developing such software in the biological sciences. But the organization, like other prospective funders, wants to know where its funding will have the greatest impact, says Taraborelli.
Impact measurement
To create the data set, Taraborelli’s team started with an artificial-intelligence language model called SciBERT, a neural network trained on research papers to process text and fill in missing sections. The researchers further trained SciBERT to determine whether a word or phrase is the name of a piece of scientific software. To do so, they used an existing data set of about 5,000 scientific papers, called SoftCite, in which all software mentions had been manually labelled. The researchers then applied the refined model to a collection of nearly 20 million articles that CZI obtained from the online repository PubMed Central and directly from publishers.
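The fine-tuning step described above relies on turning manually labelled mentions into token-level labels. Here is a minimal, illustrative sketch of that preprocessing in Python, using the common BIO tagging scheme; the sentence, spans, and tag names are invented for illustration and are not taken from the SoftCite corpus or CZI's actual pipeline.

```python
# Minimal sketch: converting manually labelled software mentions (as in a
# SoftCite-style corpus) into token-level BIO tags, the format typically
# used to fine-tune a model such as SciBERT for named-entity recognition.
# All data here are invented for illustration.

def bio_tags(tokens, mention_spans):
    """Label each token B-SOFT, I-SOFT or O, given (start, end) token spans."""
    tags = ["O"] * len(tokens)
    for start, end in mention_spans:   # end is exclusive
        tags[start] = "B-SOFT"
        for i in range(start + 1, end):
            tags[i] = "I-SOFT"
    return tags

tokens = ["We", "analysed", "the", "data", "with", "scikit", "-", "learn", "."]
# The mention "scikit - learn" covers token positions 5 to 7.
tags = bio_tags(tokens, [(5, 8)])
print(list(zip(tokens, tags)))
```

A classifier trained on such pairs can then predict, for every token in an unlabelled paper, whether it begins or continues a software name.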
The team then tried to identify the specific software tool that each mention referred to, which CZI research scientist Ana-Maria Istrate says was one of the biggest challenges. For example, a set of tools for data analysis called scikit-learn may appear in text as “Scikit learn”, “sklearn”, “scikit-learn81” or other phrasings. The researchers first applied a clustering algorithm to group software mentions by similarity, with each cluster representing one piece of software. They then picked the most common term for each cluster and searched online software repositories such as GitHub to map software names to online locations. Finally, the researchers manually cleaned the data to remove phrases that did not actually refer to software.
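To make the clustering idea concrete, here is a toy sketch, assuming a simple greedy pass with string similarity; CZI's actual algorithm is not described in detail, so the normalization, threshold, and greedy strategy below are all assumptions for illustration.

```python
# Illustrative sketch (not CZI's actual pipeline): group software-name
# variants by string similarity, then take the most common raw spelling
# in each cluster as its canonical name.
from collections import Counter
from difflib import SequenceMatcher

def normalise(name):
    # Hypothetical normalization: lowercase, keep only letters and digits.
    return "".join(c for c in name.lower() if c.isalnum())

def cluster_names(mentions, threshold=0.75):
    clusters = []  # each cluster is a list of raw mention strings
    for name in mentions:
        for cluster in clusters:
            ratio = SequenceMatcher(None, normalise(name),
                                    normalise(cluster[0])).ratio()
            if ratio >= threshold:   # similar enough: same software
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no match: start a new cluster
    # Canonical name = most frequent raw spelling in each cluster.
    return [Counter(c).most_common(1)[0][0] for c in clusters]

mentions = ["scikit-learn", "Scikit learn", "sklearn", "scikit-learn",
            "NumPy", "numpy"]
print(cluster_names(mentions))   # ['scikit-learn', 'NumPy']
```

At the scale of tens of millions of mentions, a pairwise greedy pass like this would be far too slow; it only illustrates the grouping-then-canonicalization idea.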
Applied to a subset of 2.4 million papers, the model detected about 10 million mentions, equating to 97,600 unique pieces of software. People can use these data, for example, to identify the most frequently mentioned tools by field of study, to find software titles that appear together, or to chart the popularity of software over time (see ‘Rise of software’). These potential uses are described in computational notebooks attached to the Software Mentions data set repository on GitHub. “We are thrilled to note that some of the software ranked near the top are tools funded through the EOSS program,” says Taraborelli. These include titles such as Seurat, GSVA, IQ-TREE and Monocle.

Source: CZI/Ref.1
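Analyses like those described above, such as ranking the most frequently mentioned tools overall or by field, amount to simple aggregations over mention records. A hypothetical sketch, with invented records (the real data set and notebooks live in the GitHub repository mentioned above):

```python
# Hypothetical sketch of one analysis: ranking the most frequently
# mentioned tools in a set of (field, software) mention records.
# The records below are invented for illustration.
from collections import Counter, defaultdict

mentions = [
    ("genomics", "Seurat"), ("genomics", "Seurat"), ("genomics", "IQ-TREE"),
    ("phylogenetics", "IQ-TREE"), ("phylogenetics", "IQ-TREE"),
    ("transcriptomics", "Monocle"),
]

# Tally mentions per field, and overall.
by_field = defaultdict(Counter)
for field, software in mentions:
    by_field[field][software] += 1

overall = Counter(software for _, software in mentions)
print(overall.most_common(2))           # top tools across all fields
print(by_field["genomics"].most_common(1))
```

Grouping the same records by year instead of by field would give the popularity-over-time series shown in the chart.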
Frank Krueger, a computer scientist at Wismar University of Applied Sciences in Germany, completed a similar project last year2, and says that the CZI team “does a great job of establishing a great resource covering mentions of software”.
Michelle Barker, who is based in Australia and heads the Research Software Alliance, a non-profit organization that brings together developers and funders of scientific software, says the data set is an important contribution. “We’re at this wonderful juncture where research software is being recognized as an important part of modern research,” she says, but researchers need data such as these to analyse its use. Documenting references to software not only helps to direct funds appropriately, she adds; it also gives developers credit and helps organizations to know whom to hire and promote.
It also helps developers to know how their work is being used, allows researchers to indicate which specific tools were used to perform published computational analyses, and improves reproducibility.
Need new norms
Tools such as the CZ Software Mentions data set are just one element of recognizing developers’ work; new standards are also needed, researchers say. One example is the Amsterdam Declaration on Funding Research Software Sustainability3, produced by the Research Software Alliance last November.
And in November, Taraborelli and colleagues published ‘Ten simple rules for funding scientific open-source software’4, which encourages diversity, promotes transparent governance of software projects, and advises funders to support not only the creation of new tools but also the maintenance of existing ones.
Ironically, the more widely a tool is used, the less often it tends to be mentioned explicitly in papers. Taraborelli points to the ubiquity of Matplotlib and NumPy, popular libraries for graph plotting and numerical analysis in the Python programming language, whose use often goes unmentioned. Yet on GitHub, hundreds of thousands of other software packages depend on these libraries. “If you count software dependencies as citations, some of these projects would be the most influential artifacts ever produced in science,” he says. “Yet, until a few years ago, major funding agencies refused to fund these projects, saying they didn’t have enough impact.”
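Counting dependencies as citations, as Taraborelli suggests, amounts to inverting a dependency graph and tallying dependents. A toy sketch with an invented dependency map (in reality such data could come from a source like GitHub's dependency graph):

```python
# Sketch of "dependencies as citations": given a map from each package to
# the packages it depends on (invented here for illustration), count how
# many packages depend on each library.
from collections import Counter

depends_on = {
    "analysis-tool": ["numpy", "matplotlib"],
    "plotting-app":  ["matplotlib", "numpy"],
    "sim-package":   ["numpy"],
}

# Invert the graph: one "citation" per package that lists the library.
dependents = Counter(dep for deps in depends_on.values() for dep in deps)
print(dependents.most_common())   # numpy cited 3 times, matplotlib 2
```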
Robert Lanfear, a biologist at the Australian National University in Canberra and a co-developer of the IQ-TREE software, says that extra usage data are always welcome: “These will help us better understand how, and to what extent, each software package is being used.”