Data Diversity

16 December, 2016

Preserving variety in subsets of unmanageably large data sets should aid machine learning.

When data sets get too big, sometimes the only way to do anything useful with them is to extract much smaller subsets and analyze those instead.

Those subsets have to preserve certain properties of the full sets, however, and one property that’s useful in a wide range of applications is diversity. If, for instance, you’re using your data to train a machine-learning system, you want to make sure that the subset you select represents the full range of cases that the system will have to confront.

Last week at the Conference on Neural Information Processing Systems, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory and its Laboratory for Information and Decision Systems presented a new algorithm that makes the selection of diverse subsets much more practical.

With standard algorithms, selecting a diverse subset from a set of, say, a million data points would have been effectively impossible on a desktop computer. With the researchers’ new algorithm, it would take minutes.

“We want to pick sets that are diverse,” says Stefanie Jegelka, the X-Window Consortium Career Development Assistant Professor in MIT’s Department of Electrical Engineering and Computer Science and senior author on the new paper. “Why is this useful? One example is recommendation. If you recommend books or movies to someone, you maybe want to have a diverse set of items, rather than 10 little variations on the same thing. Or if you search for, say, the word ‘Washington.’ There’s many different meanings that this word can have, and you maybe want to show a few different ones. Or if you have a large data set and you want to explore — say, a large collection of images or health records — and you want a brief synopsis of your data, you want something that is diverse, that captures all the directions of variation of the data.”

