Machine Learning with Small Data

Machine learning and big data are broadly believed to be synonymous. The story goes that large amounts of training data are needed for algorithms to discern signal from noise. As a result, machine learning techniques have been used most heavily by web companies with troves of user data. For Google, Facebook, Microsoft, Amazon, and Apple (or the “Frightful Five,” as Farhad Manjoo of the New York Times has dubbed them), obtaining large amounts of user data is no issue. Data usage policies have become increasingly broad, allowing these companies to make use of everything from our keystrokes to our locations as we use their products. As a result, web companies have been able to offer very useful, but intrusive, products and services that rely on large datasets. Datasets with billions to trillions of datapoints are not unusual for these companies.

However, in the academic world, machine learning has been making large inroads into the sciences, and the situation there with respect to data is significantly different. It’s not so easy to obtain large amounts of scientific or medical data. The largest barrier is cost. Traditionally, machine learning researchers have relied on tools like Amazon’s Mechanical Turk to harvest data. There, low-paid workers (rates are far below the US federal minimum wage, averaging out at something like $1/hr) perform repetitive tasks such as labeling objects and faces in images or annotating speakers in text. These tasks rely on fundamental human skills typically mastered by kindergarten. Performing scientific experiments, however, requires significantly greater expertise. As a result, going rates for experimental workers are much, much higher than for Mechanical Turkers.

One way around this problem is to brute-force a solution with money. Google recently published a landmark study on building deep learning systems to identify signs of diabetic retinopathy in eye scans. To obtain data for this study, Google paid trained physicians to annotate large numbers of scans. The resulting work likely cost hundreds of thousands or even millions of dollars to complete. For Google, the expenditure would have amounted to a rounding error in its financials. For academic researchers, performing such a study would have required writing and winning a large grant from funding agencies. Needless to say, in today’s troubled scientific funding environment, few researchers can hope to obtain such resources.

What does this state of affairs entail? Are we doomed to live in a world where the best research can only be performed by large corporations with the required monetary resources? Money will always provide an advantage, but perhaps the situation is not as dire as it may seem. Recently, there has been a surge of work in low-data machine learning. Work from MIT a few years ago [1] demonstrated that it was possible to build “one-shot” image recognition systems, capable of learning new classes of visual objects from a single example, using probabilistic programming. Follow-up work from DeepMind [2] demonstrated that standard deep learning toolkits like TensorFlow could replicate the feat. More recent work has shown that one-shot learning extends to drug discovery [3] (work by my collaborators and me), robotics [4], and other areas.

The emerging theme is that it is sometimes possible to transfer information between different datasets. Although only very limited data may be available for a particular machine learning problem, if there are large amounts of data available for related problems, clever techniques can allow models to transfer useful information between the two. These techniques may help scientific machine learning overcome its low-data problem by transferring knowledge from data-rich to data-poor problem spaces.
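To make the idea concrete, here is a minimal sketch of one simple form of transfer (not taken from any of the cited papers): a representation is learned on a data-rich source problem and reused to classify a handful of labeled examples from a data-poor target problem. The dataset, the source/target split, and the choice of PCA plus a nearest-neighbor classifier are all illustrative assumptions.

```python
# Sketch: learn a representation on a data-rich "source" task, reuse it on a
# data-poor "target" task. Purely illustrative; not from the cited papers.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X, y = digits.data, digits.target

# Pretend digits 0-4 form the data-rich source problem and digits 5-9 the
# data-poor target problem.
source_mask = y < 5
target_mask = ~source_mask

# Learn a low-dimensional representation from the plentiful source data.
pca = PCA(n_components=20).fit(X[source_mask])

# Only a handful of labeled target examples are available for training.
rng = np.random.RandomState(0)
target_idx = rng.permutation(np.where(target_mask)[0])
train_idx, test_idx = target_idx[:25], target_idx[25:]

# Classify target examples in the transferred representation.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(pca.transform(X[train_idx]), y[train_idx])
print("target accuracy:", clf.score(pca.transform(X[test_idx]), y[test_idx]))
```

The same structure carries over when the representation comes from a deep network pretrained on a large related dataset rather than from PCA.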

To gain an intuitive understanding of how these techniques work, let’s consider the fable of the baby and the giraffe. Suppose you’re taking your cute baby niece to the zoo. You take her to the giraffe exhibit, where you show her a giraffe. She’s never seen a giraffe before, so she’s very excited and learns to say “Giwaf!” (close enough…). A few weeks later, you take her to the zoo again and pass by the giraffe exhibit. Lo and behold, she says, “Giwaf, giwaf!” How does this happen? How can a baby learn to identify giraffes after having seen one only once before?

While the developmental psychology and cognitive science of how babies learn to recognize animals remain unsettled, we now have working mathematical models that can (roughly) explain the process. The key insight is that although your niece has never seen a giraffe before, she’s seen plenty of other visual objects. In particular, she’s likely learned to tell when objects are the same and when they’re different. Mathematically speaking, this amounts to a metric on image space, where a metric is a notion of distance between two objects. To tell whether a new object she sees is a giraffe, she simply needs to pull up the giraffe from her memory, then use the metric to decide whether the new object is close enough to be called a giraffe as well. The series of one-shot learning papers discussed previously has shown that this basic insight can be effectively implemented on real-world datasets and extended from the visual domain to molecular and robotic machine learning problems.
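As a toy illustration of this metric-based view (a sketch in the spirit of [1, 2], not an implementation of either), a one-shot classifier can embed the query and each remembered example and return the label of the nearest one. Here the embedding is a placeholder identity function and the feature vectors are made up; in the real systems, the embedding is a neural network learned from related data.

```python
# Toy metric-based one-shot classification. The embedding is a stand-in for a
# learned network, and all names and vectors below are illustrative.
import numpy as np

def embed(x):
    # Placeholder for a learned embedding that maps raw inputs (images,
    # molecules, ...) into a space where distance is meaningful.
    return np.asarray(x, dtype=float)

def one_shot_classify(query, support_examples, support_labels):
    """Assign `query` the label of the closest single stored example."""
    q = embed(query)
    distances = [np.linalg.norm(q - embed(s)) for s in support_examples]
    return support_labels[int(np.argmin(distances))]

# One remembered example per class (e.g., "giraffe" vs. "zebra" features).
support = [[1.0, 0.9, 0.1], [0.2, 0.1, 1.0]]
labels = ["giraffe", "zebra"]
print(one_shot_classify([0.9, 1.0, 0.2], support, labels))  # -> "giraffe"
```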

How far do these techniques go? Is the era of big data machine learning over? Not quite. Analysis of one-shot drug discovery [3] has shown that there are currently many limitations to information transfer. For molecules at least, current algorithms can’t generalize to dramatically new systems: the learned metrics are relatively inflexible and break down on datapoints that differ substantially from the data they were trained on. There’s good reason to suspect that similar limitations hold for other machine learning applications. While one-shot and low-data techniques allow for some information transfer, they don’t allow for the broad, flexible transfer of experience that humans can perform.

There are some grounds to believe that one of the major roadblocks separating today’s AI from general human-level intelligence is this low-data information transfer problem. Human scientists are capable of drawing far-reaching insights from very limited amounts of information. As the (apocryphal) story goes, Newton could generalize from a single falling apple to a general principle of gravitation explaining how the planets orbit the sun. What a feat of one-shot learning! In this view, physics itself is an extreme form of low-data learning that seeks to extract general principles from limited data points.

Could we perhaps draw lessons and inspiration from the physicists? How can we design learning systems with similarly desirable properties? Physicists rely critically on invariances and aesthetics as they design theories. From long experience, physicists know that scientific theories often satisfy certain mathematical criteria. Einstein’s search for the equations of general relativity depended critically on his belief in the notion of covariance, the idea that a derived law should not depend on the particular coordinate system at hand. Similarly, we can expect that broadly generalizable learning algorithms must leverage hidden structure in the world.

How might we encode algorithms to extract these broadly generalizable laws? This is a major research question, but my personal hunch is that we will need to find a way to teach our learning systems to understand beauty. Mathematicians, physicists, and other scientists train themselves to perceive a sense of beauty in nature’s laws. Algorithms that learn (perhaps from human demonstration) to value this sense of beauty in hidden structure might one day succeed in discovering successors to our greatest scientific theories.

Acknowledgements: Thanks to Sravya Tirukkovalur for the conversation that spurred me to write up my thoughts on low data machine learning.

[1] Lake, Brenden M., Ruslan Salakhutdinov, and Joshua B. Tenenbaum. “Human-level concept learning through probabilistic program induction.” Science 350.6266 (2015): 1332-1338.

[2] Vinyals, Oriol, et al. “Matching networks for one shot learning.” Advances in Neural Information Processing Systems. 2016.

[3] Altae-Tran, Han, et al. “Low Data Drug Discovery with One-Shot Learning.” ACS Central Science 3.4 (2017): 283-293.

[4] Duan, Yan, et al. “One-Shot Imitation Learning.” arXiv preprint arXiv:1703.07326 (2017).

Written on June 13, 2017