Machine Learning for Scientific Datasets
I just read the fascinating paper “Could a neuroscientist understand a microprocessor?” The paper simulates a simple microprocessor (the MOS 6502, used in the Apple I and in the Atari video game system) and uses neuroinformatics techniques (mostly statistics/machine-learning) to analyze the simulated microprocessor. More specifically, the authors analyze the connections between transistors on the microprocessor (see Connectomics), the effects of ablating single transistors, covariances of transistor activities, and whole-chip recordings (analogous to whole-brain recordings).
Some of the basic analysis techniques yield interesting insight. For example, PCA recovers the processor clock signal. Nevertheless, the reconstructed insight is very incomplete; none of the techniques succeeds in reconstructing standard microprocessor primitives, such as logic gates (AND, OR) or adder circuits. The analysis totally fails to recover the causal structure that underlies the design of the chip. The authors conclude that current neuroinformatics techniques can’t provide the insight required to build new microprocessors. It follows naturally that current algorithmic techniques remain insufficient for analyzing the brain (a system significantly more sophisticated than the MOS 6502).
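To make the PCA observation concrete, here is a minimal sketch on toy data. It is not the paper's pipeline and does not use a 6502 simulation: the synthetic "recordings", the function names, and every parameter (number of units, clock period, noise level) are assumptions chosen for illustration. Each simulated unit is just a noisy, randomly scaled copy of a shared square-wave clock, and the first principal component is computed with a plain SVD.

```python
import numpy as np


def simulate_recordings(n_units=50, n_steps=2000, period=20, noise=0.5, seed=0):
    """Toy 'whole-chip recording': every unit follows a shared clock plus noise.

    All names and parameters here are hypothetical, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    # Square-wave clock: high for the first half of each period, low otherwise.
    clock = ((t % period) < period // 2).astype(float)
    # Each unit couples to the clock with a random gain and a random sign.
    gains = rng.uniform(0.5, 1.5, size=n_units) * rng.choice([-1.0, 1.0], size=n_units)
    recordings = np.outer(clock, gains) + noise * rng.standard_normal((n_steps, n_units))
    return clock, recordings


def first_principal_component(recordings):
    """Return the time course of PC1 and its share of the total variance."""
    centered = recordings - recordings.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, 0] * s[0]
    explained = float(s[0] ** 2 / np.sum(s ** 2))
    return scores, explained


clock, recordings = simulate_recordings()
pc1, explained = first_principal_component(recordings)
# The sign of a principal component is arbitrary, so compare absolute correlation.
clock_corr = abs(float(np.corrcoef(pc1, clock)[0, 1]))
print(f"PC1 explains {explained:.0%} of variance; |corr(PC1, clock)| = {clock_corr:.3f}")
```

In this toy setting the clock dominates the shared variance by construction, so PC1 tracks it closely. That is the easy case: a global signal is recovered, but nothing about gates or adders falls out of the decomposition.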
Stepping back from the specifics, the paper raises a basic issue in the computational sciences. Although an abundance of scientific data is starting to become available, the techniques available to analyze this data and gain scientific insight are sorely lacking. This point has been made elsewhere in the literature (for example, see this review for similar points made in the protein simulation literature).
This inadequacy of scientific algorithms is curiously at odds with the recent bonanza of deep-learning successes. What explains the discrepancy? Much of modern machine-learning is designed to solve well-defined problems (predicting the objects in an image, the sentiment of a sentence, the proteins a compound interacts with). However, in the scientific context, the goal is to discover the problems worth analyzing! It’s not clear at all what learning “tasks” could be associated with the microprocessor analysis setting without hand-coding the solution to the scientific discovery problem in this dataset.
A scientist often searches for causal structure in her data. For example, in the case of the microprocessor, these causal structures could include logic gates or adders. It’s not at all clear that current machine-learning algorithms could have identified logic gates given signals measurable from the MOS 6502. Although progress has been made in machine-learning “latent-variable” models that explain hidden structure in data, the state-of-the-art remains primitive compared to the needs of scientific datasets. (That said, recent progress in this space has been encouraging; see this fascinating work for example.)
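One small illustration of why pairwise statistics struggle to expose gate-level structure, offered as a sketch rather than as anything from the paper: for two independent, uniformly random input bits, the output of an XOR gate is uncorrelated with either input on its own, even though it is fully determined by the pair. The signals and variable names below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Two independent, uniformly random binary inputs (hypothetical gate inputs).
a = rng.integers(0, 2, size=n_samples)
b = rng.integers(0, 2, size=n_samples)

xor_out = a ^ b  # output of an XOR gate
and_out = a & b  # output of an AND gate, for comparison


def corr(x, y):
    """Pearson correlation between two 1-D signals."""
    return float(np.corrcoef(x, y)[0, 1])


xor_corr = corr(a, xor_out)  # near 0: the dependence is invisible to correlation
and_corr = corr(a, and_out)  # clearly positive (about 0.58 in expectation)

print(f"corr(a, a XOR b) = {xor_corr:+.3f}")
print(f"corr(a, a AND b) = {and_corr:+.3f}")
```

A covariance-based analysis would see the AND relationship only as a graded statistical association and would miss the XOR relationship entirely, which hints at how far such summaries are from recovering the causal circuit.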
It’s incumbent on the machine-learning and scientific communities to build more sophisticated machine-learning benchmarks that rise to the challenge of learning hidden structure from scientific data. There’s been a spate of sophisticated benchmarks released recently in other parts of machine-learning (for example, OpenAI Gym looks promising for reinforcement learning). Doing the same for scientific machine-learning could trigger the same dramatic advances that have been seen elsewhere.