My 2018 proposal for the Better Scientific Software Fellowship

science
software
reproducibility
Author

Sean K. Maden

Published

November 12, 2018

I wanted to share my proposal for the 2018 Better Scientific Software (BSSw) Fellowship. BSSw aims to increase and preserve integrity and standards for publishing computer code in science, and their fellowship program recognizes and supports advocates of this cause. Science currently lacks standard ways of referencing published code as independently citable units. Furthermore, vital source code for experiments can be scattered across many places, including supplemental materials sections behind paywalls, personal websites that may go offline over time, and repositories on GitHub or elsewhere that may not include inherent and persistent identifiers. I propose using an autocompilation technology to aggregate published scientific computer code and code metadata into a new database, called PubSrc. This will enable novel assessments of scientific code use, including automatic generation of dependency usage networks, tracking the impact of newly discovered software bugs throughout research, and making scientific code independently citable. I hope you enjoy reading my proposal, and please share or tweet this post if you support this cause.


Section: Experience

Prompt: Describe your work relevant to scientific software.

My Response: My main scientific mission is to reduce friction and latencies in data analysis. In my experience practicing bioinformatics, I grappled with the many difficulties surrounding analysis of genomics data, so I designed and implemented new scientific software to address them. I worked for 3 years in Prof. Bill Grady’s clinical epigenetics lab at Fred Hutch in Seattle, where I cultivated expertise in the epigenetic and molecular bases of gastroesophageal cancers, working my way from intern to Data Analysis Assistant. I collaborated with physicians, wet lab researchers, and fellow science analysts. My experiences led me to design and publish preprocessing and analysis workflows for methylation arrays, which probe chemical modifications to DNA that affect gene expression. I also developed a series of tools and resources to aid other researchers (see https://github.com/metamaden, in particular the methyPre, cgmappeR, and cgageR repositories). I further made intellectual contributions at an NIH hackathon event, where I pushed commits to the open-source PubCode project, a resource for indexing published scientific code (see project repository at https://github.com/NCBI-Hackathons/PubCode). These past efforts inspired the present BSSw fellowship proposal, as described below.

Prompt: Describe your background and experience relevant to being a BSSw Fellow.

My Response: The BSSw Fellowship supports preserving code functionality with proper versioning and documentation, and creating resources that aid researchers in doing code-based science effectively. I have extensive professional experience as a bioinformatician working with public data repositories and scientific code. This has enabled me to tackle systemic problems around code access, publication, and reuse for reproducing findings in computational science. At an NIH hackathon event, I contributed to PubCode, which aimed to index scientific software similar to how PubMed indexes journal articles in biomedical science. PubCode would have archived and made citable scientific code used in research papers. This is important: if a developer took down their code or its associated executables, portions of papers using them could become irreproducible. Moreover, when no publication is associated with software, authors may cite a GitHub or Bitbucket repository, or some other website lacking a persistent digital object identifier (DOI). This distributes citations across multiple resources, making it challenging to find all usages of a software package in the literature. Finally, when a publication uses new code, that code is often not released as standalone software. This means the code is not widely identified as an independently citable object. Bureaucratic issues at the NIH unfortunately prevented PubCode from realizing its full potential, and the project I propose here will subsume its ideas.

Section: Proposed work and potential impact

Prompt: What would you do as a BSSw Fellow?

My Response: The problems I seek to solve are: (1) insufficient attribution of scientific code to its authors, because no resource tracks usages of published scripts or software tools (independent from any associated journal article) across the scientific literature; (2) bugs in software can invalidate analyses in scientific papers, but it is difficult to track the papers impacted by bugs; (3) software and its versions can disappear from the internet, making the results that rely on it challenging or impossible to reproduce. To solve these problems, I propose to develop PubSrc, an open-access resource that crawls scientific papers for computer code. Code, along with affiliated release dates and version information, will be made queryable independently of its original publication or website. PubSrc will thereby empower web searches to rapidly obtain lists of scientific papers that use a given piece of software, by version and platform. Importantly, PubSrc will recognize that scripts and software themselves have software dependencies, and it will plot dependencies across papers, scripts, and software. In these ways, PubSrc will address problems (1) and (2). To address problem (3), PubSrc will automatically archive the source code of open-source software found by crawling scientific papers. To achieve my goals, I plan on involving my PhD adviser Abhinav Nellore and former collaborators on PubCode, especially Ben Busby at NIH/NLM; the $25K will be used for travel and protected time to work on this project.
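To make the dependency idea above concrete, here is a minimal Python sketch of how papers, scripts, and software versions might be linked so that a bug in one software version can be traced to the papers that rely on it, directly or through an intermediate script. Everything in it is an assumption for illustration only: the `DependencyGraph` class, the `kind:name@version` identifier scheme, and the example records (`libalign`, `qc_pipeline.py`, `example-A`) are hypothetical and do not describe an existing PubSrc implementation or real publications.

```python
from collections import defaultdict, deque


class DependencyGraph:
    """Hypothetical index of papers, scripts, and software versions."""

    def __init__(self):
        # Maps each node to the set of nodes that depend on it.
        self._dependents = defaultdict(set)
        # Maps each node to its kind: "paper", "script", or "software".
        self._kinds = {}

    def add_node(self, node_id, kind):
        self._kinds[node_id] = kind

    def add_dependency(self, dependent, dependency):
        """Record that `dependent` uses `dependency`."""
        self._dependents[dependency].add(dependent)

    def affected_papers(self, buggy_node):
        """Return papers that depend on `buggy_node`, directly or transitively."""
        seen = {buggy_node}
        queue = deque([buggy_node])
        papers = set()
        while queue:
            current = queue.popleft()
            for dependent in self._dependents.get(current, ()):
                if dependent in seen:
                    continue
                seen.add(dependent)
                if self._kinds.get(dependent) == "paper":
                    papers.add(dependent)
                queue.append(dependent)
        return sorted(papers)


def build_example_graph():
    """Build a tiny graph from made-up records (all names are hypothetical)."""
    graph = DependencyGraph()
    graph.add_node("software:libalign@1.2.0", "software")
    graph.add_node("software:libalign@1.3.0", "software")
    graph.add_node("script:qc_pipeline.py", "script")
    graph.add_node("paper:example-A", "paper")
    graph.add_node("paper:example-B", "paper")
    graph.add_node("paper:example-C", "paper")

    # Paper A uses a custom script, which in turn depends on libalign 1.2.0.
    graph.add_dependency("script:qc_pipeline.py", "software:libalign@1.2.0")
    graph.add_dependency("paper:example-A", "script:qc_pipeline.py")
    # Paper B uses libalign 1.2.0 directly; paper C uses the later 1.3.0 release.
    graph.add_dependency("paper:example-B", "software:libalign@1.2.0")
    graph.add_dependency("paper:example-C", "software:libalign@1.3.0")
    return graph


if __name__ == "__main__":
    example = build_example_graph()
    # List the papers that could be affected by a bug in libalign 1.2.0.
    print(example.affected_papers("software:libalign@1.2.0"))
```

Storing the edges in the "who depends on me" direction keeps the bug-impact query a simple breadth-first walk, and treating each software version as its own node is what would let a search distinguish papers by version, as the proposal describes.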

Prompt: What impact do you foresee from your efforts?

My Response: I anticipate the following positive developments from PubSrc: (1) Sets of scripts that were previously not recognized as independent units of scientific work (e.g., custom scripts for analyses accompanying a lab’s paper) will bear unique identities permitting (a) assessing their impact across the literature, (b) pinpointing bugs and other errors due to software dependencies, and (c) citability, whether or not these scripts were collected and assigned a DOI through other resources such as Figshare. (2) Software developers who resolve bugs will be able to find and therefore apprise authors of scientific works whose results may be affected. Moreover, authors of scientific literature will be able to readily search for any bugs resolved since their papers came out. (3) Over time, results from papers that depend on scientific software will likely become more reproducible, independent of software availability outside PubSrc. I anticipate that, in some cases, code will not be open-source and only executables will be available. Indexing these may be permitted or limited by law. Further, some executables may depend on a particular computing environment for proper execution. A clear next step for PubSrc would then be to archive executables and/or make software containers available for reproduction of scientific results.