Learning Models Of Disease

Modern drug discovery remains an artisanal pursuit, driven in large part by luck and expert knowledge. While this approach has worked spectacularly in the past, the last few years have seen a systematic decrease in the number of new drugs discovered per dollar spent. Eroom’s law is the empirical observation that the number of new drugs approved per dollar of R&D spending has been falling exponentially year over year. Eroom is of course Moore spelled backward: Moore’s law observes that transistor densities on computer chips have increased exponentially year over year for the past fifty years. The opposing trends, increasing computational power per dollar versus a decreasing number of drugs discovered per dollar, serve as a reminder that naive computation is insufficient to solve hard biological problems (a topic I’ve written about previously). To reverse Eroom’s law, scientists must combine deep biological insight with computational modeling, and I hypothesize that the best path forward is to systematically learn causal models of human disease and drug action from available experimental data.
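
To see what the trend looks like written down, here is a small illustrative calculation of Eroom’s law treated as a constant exponential decline in drugs approved per dollar. The baseline and halving time below are placeholder values chosen only to show the shape of the curve, not measured figures.

```python
# Illustrative only: Eroom's law summarized as a constant exponential decline
# in R&D efficiency. Baseline and halving time are placeholder values.

def drugs_per_billion_dollars(year, baseline=30.0, start_year=1950, halving_years=9.0):
    """Hypothetical approved drugs per billion R&D dollars in a given year."""
    return baseline * 0.5 ** ((year - start_year) / halving_years)

for year in range(1950, 2011, 20):
    print(year, round(drugs_per_billion_dollars(year), 2))
```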

To explain this choice, let’s take a quick detour into the history of drug discovery. Early efforts were driven by phenotypic screening, in which drugs were selected based on demonstrated efficacy in sick humans or animals. While this approach was highly effective, it suffered from low throughput: only a few sick humans or animals were available, making it difficult to test many potential drugs. As a result, the pharmaceutical industry increasingly shifted towards targeted drug discovery. Biologists’ growing knowledge of the molecular mechanisms underlying life made it possible to single out individual biomolecules and hypothesize that diseases could be stopped by chemically inhibiting these targets with drugs. The targeted approach also facilitated very large scale in vitro (in the test tube) experiments, since solutions of a target biomolecule could be prepared easily, and it made it possible to discover unexpected treatments by exploiting similarities between molecular machinery in disparate parts of the body.

While the shift to targeted drug discovery streamlined many procedural aspects of the pharmaceutical industry, it also explains in large part the drop in drugs yielded per dollar. Biology is hard, and human understanding of the body’s regulatory mechanisms is highly incomplete at best, and completely wrong at worst. Any pharmaceutical veteran can share stories of a potential drug that looked highly promising in early-stage targeted tests, only to cause catastrophic deaths in the first animal trials. The question for forward progress then remains: how can we regain the efficacy of early phenotypic screening while retaining the scale and precision of targeted discovery campaigns?

A number of innovative experimental solutions have started to mature. For example, PerlsteinLab uses CRISPR to introduce mutations into non-mammalian organisms (such as nematodes or flies) so that they better model human genetic disorders. This approach offers scale, since non-mammalian organisms can often be raised to maturity quickly, allowing for large-scale testing of hypotheses. Other companies such as Transcriptic or Emerald Cloud Lab promise to facilitate complex biological experiments, allowing researchers to perform more sanity checks on their proposed drugs. While these advances will likely prove very useful, I suspect that in isolation they won’t be sufficient to regain the efficacy of phenotypic screening at scale. Any experimental model is limited by human understanding of the biology underlying diseased systems, and our understanding remains too weak. As Donald Rumsfeld famously noted, it’s hard to plan for unknown unknowns, in life and in drug discovery. What then is the path forward?

I propose that we systematically learn algorithmic models of disease: models capable of extracting sophisticated insights not obvious to human scientists, and of predicting responses to introduced drugs. Although the pharmaceutical industry already routinely uses simple statistical models of molecular properties such as solubility or toxicity to guide drug discovery, pharmaceutical scientists typically stop short of computationally modeling diseases themselves. The systems biology community has made significant progress on this front by creating complex systems of differential or logical equations to model biological dynamics. While these models are quite useful, they can be brittle and hard to create, requiring significant domain expertise to tune parameters until they match experimental results. The path forward requires directly learning disease models from experimental data.
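
To make the flavor of those hand-built systems biology models concrete, here is a minimal sketch (my illustration, not taken from any published model) of the kind of differential-equation system described above: two coupled ODEs for a hypothetical drug target and a downstream disease marker, with a drug represented as partial inhibition of the target’s production. Every rate constant is invented for illustration, which is exactly the hand-tuning problem just mentioned.

```python
# A toy systems-biology model: target protein drives a downstream disease
# marker; a drug partially inhibits target production. All parameter values
# are invented and would normally be hand-tuned to match experiments.
from scipy.integrate import solve_ivp

def dynamics(t, state, k_prod, k_deg, k_act, k_clear, inhibition):
    target, marker = state
    d_target = k_prod * (1.0 - inhibition) - k_deg * target
    d_marker = k_act * target - k_clear * marker
    return [d_target, d_marker]

params = dict(k_prod=1.0, k_deg=0.1, k_act=0.5, k_clear=0.2)

for inhibition in (0.0, 0.8):  # untreated vs. 80% target inhibition
    sol = solve_ivp(dynamics, (0, 100), [0.0, 0.0],
                    args=(*params.values(), inhibition))
    print(f"inhibition={inhibition:.1f} -> marker level at t=100: {sol.y[1, -1]:.2f}")
```

Even in this toy, small changes to any of the invented rate constants can change the predicted response, and nothing in the model itself tells you which values are right; that fragility is what learning from data is meant to remove.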

Machine learning has traditionally been the science of learning simple functions that explain observed correlations in datasets. Significant advances in recent years have allowed practitioners to learn far more sophisticated functions capable of modeling visual and auditory data, but today’s techniques still struggle to extract complex causal models from data. Modern machine learning systems remain incapable of modeling the complex biophysical processes that underlie, say, tumor state or tumor response to targeted medication. For scientists to make progress, we need to systematically curate large datasets of high-fidelity disease measurements from a variety of sources, with an emphasis on data from human patients. This collection can then be used to learn causal disease models, which should be iteratively tested against experiment. Most importantly, hand-tweaking of parameters should be forbidden; progress must come through algorithmic advancement and better data, not through boutique applications of isolated human insight.
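
As a contrast with the hand-tuned sketch above, here is an equally minimal sketch of the alternative this paragraph argues for: rather than setting rate constants by hand, fit them to measurements. The data below are synthetic and the model class is a toy of my own construction; a serious effort would use curated patient-derived measurements and far richer causal model classes.

```python
# A toy of "learning the model from data": fit rate constants to measurements
# instead of hand-tweaking them. Data here are synthetic for illustration.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def marker_response(t, k_act, k_clear):
    # Closed-form marker level under a constant unit drive: rises to k_act / k_clear.
    return (k_act / k_clear) * (1.0 - np.exp(-k_clear * t))

# Pretend these are measured marker levels at a few time points.
t_obs = np.array([1.0, 5.0, 10.0, 20.0, 40.0])
true_params = (0.5, 0.2)
y_obs = marker_response(t_obs, *true_params) + rng.normal(0.0, 0.05, size=t_obs.shape)

fitted, _ = curve_fit(marker_response, t_obs, y_obs, p0=(1.0, 1.0))
print("fitted k_act, k_clear:", np.round(fitted, 2))  # learned from data, not hand-set
```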

Written on July 26, 2016