Succinct Representations of Big Data: Binary Embeddings

02 Apr 2015

Low-distortion embeddings, which map high-dimensional points into a low-dimensional space, play an important role in storage, information retrieval, and machine learning problems for modern, large-scale datasets. Xinyang Yi and Profs. Constantine Caramanis and Eric Price develop novel algorithms with the best-known results for this important problem.

Indeed, perhaps one of the most famous results along these lines is the Johnson-Lindenstrauss (JL) lemma, which shows that N points can be embedded into an (approximately) log N-dimensional space while preserving pairwise Euclidean distances up to some small distortion. Because the interest stems from massive-scale problems, significant recent effort has focused on fast algorithms for computing these embeddings.
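To see the JL lemma in action, consider the classic construction: projecting through a scaled random Gaussian matrix approximately preserves pairwise Euclidean distances. The sketch below is illustrative (the dimensions and tolerance are chosen for demonstration, not taken from the paper):

```python
import numpy as np

# JL-style dimension reduction: project N points in R^d down to k
# dimensions with a scaled random Gaussian matrix. With k on the order
# of log(N) / eps^2, pairwise distances are preserved up to (1 +/- eps)
# with high probability. All parameters here are illustrative.
rng = np.random.default_rng(1)
N, d, k = 100, 10_000, 1_000
X = rng.standard_normal((N, d))        # N high-dimensional points

A = rng.standard_normal((d, k)) / np.sqrt(k)  # scaling keeps norms unbiased
Y = X @ A                              # embedded points, N x k

# Compare one pairwise distance before and after embedding
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
ratio = proj / orig                    # close to 1 for small distortion
```

Here the distortion concentrates around 1 at rate roughly sqrt(2/k), so with k = 1000 the ratio is typically within a few percent of 1.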

Binary embedding is a nonlinear dimension-reduction methodology in which high-dimensional data are mapped to short strings of 0s and 1s -- this is called an embedding into the Hamming cube. The goal is to do this while preserving the structure of the original space. Embedding into the binary cube has two advantages in practice: (i) As each data point is represented by a binary code, the disk space for storing the entire dataset is reduced considerably. (ii) Distance in the binary cube is some function of the Hamming distance, which can be computed quickly using efficient bit-wise operations. As a consequence, binary embedding can be applied to a large number of domains such as biology, finance and computer vision where the data are usually high-dimensional.
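One standard construction of such an embedding, shown below as a minimal sketch, takes the sign of a random Gaussian projection (the "random hyperplane" method); the Hamming distance between the resulting codes then tracks the angle between the original points. Note this is a textbook baseline, not the paper's fast algorithm, and all names and parameters are illustrative:

```python
import numpy as np

def binary_embed(X, m, seed=0):
    """Map points in R^d to m-bit codes: sign of a random Gaussian projection."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[1], m))  # m random hyperplane normals
    return X @ A >= 0                         # boolean n x m bit matrix

def hamming_distance(b1, b2):
    """Number of bit positions where two codes differ."""
    return int(np.count_nonzero(b1 != b2))

# Nearby points receive similar codes; a point and its negation
# land on opposite sides of every hyperplane.
X = np.array([[1.0, 0.0, 0.0],
              [0.99, 0.01, 0.0],
              [-1.0, 0.0, 0.0]])
B = binary_embed(X, m=64)
```

On packed integer codes the same distance is a bit-wise XOR followed by a population count, which is what makes nearest-neighbor search over binary codes so fast in practice.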

While the problem of binary embedding is therefore important, it is also difficult. Existing binary embedding algorithms either lack theoretical guarantees or suffer from running times orders of magnitude too large. We make two contributions: (1) we establish a lower bound on the number of bits required by any binary embedding; (2) we propose a novel fast binary embedding algorithm with provably optimal bit complexity and near-linear running time.

* Paper:

This research was partially funded by the National Science Foundation (NSF) and the U.S. Department of Transportation through the Data-Supported Transportation Operations and Planning (D-STOP) Tier 1 University Transportation Center.