In this talk, I will discuss some of my recent and surprising findings on the use of hashing algorithms for large-scale estimation. Locality Sensitive Hashing (LSH) is a hugely popular algorithm for sub-linear near-neighbor search. However, it turns out that, fundamentally, LSH is a constant-time (amortized) adaptive sampler, of which efficient near-neighbor search is only one of many applications. This observation adds another feather to the cap of LSH: it offers a unique capability for smart sampling and statistical estimation at the cost of a few hash lookups, bridging probabilistic data structures (hash tables) with efficient unbiased statistical estimation. I will demonstrate how this dynamic and efficient sampling breaks the computational barriers in adaptive estimation: we pay roughly the cost of uniform sampling but get the benefits of adaptive sampling. We will demonstrate the power of this one simple idea on three favorite problems: 1) partition function estimation for large NLP models such as word2vec, 2) adaptive gradient estimation for efficient SGD, and 3) sub-linear deep learning with huge parameter spaces.
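To make the "LSH as an adaptive sampler" idea concrete, here is a minimal sketch (my own illustrative reconstruction, not the talk's implementation) that uses SimHash tables to sample items likely to have a large inner product with a query and then corrects for the non-uniform sampling with a Horvitz-Thompson estimator of a partition function. The parameters K, L and the function name `estimate_partition_function` are assumptions.

```python
# Illustrative sketch: SimHash (signed random projections) tables used as an
# adaptive sampler for unbiased partition-function estimation.
import numpy as np

rng = np.random.default_rng(0)

class SimHashTables:
    def __init__(self, dim, K=8, L=16):
        self.K, self.L = K, L
        self.planes = rng.standard_normal((L, K, dim))  # random hyperplanes
        self.tables = [dict() for _ in range(L)]
        self.items = None

    def _codes(self, x):
        # K-bit signed-projection code per table, packed into an integer key
        bits = (np.einsum('lkd,d->lk', self.planes, x) >= 0).astype(np.uint64)
        return (bits << np.arange(self.K, dtype=np.uint64)).sum(axis=1)

    def index(self, W):
        self.items = W
        for i, w in enumerate(W):
            for l, c in enumerate(self._codes(w)):
                self.tables[l].setdefault(int(c), []).append(i)

    def query(self, x):
        # union of the L buckets the query falls into (a few hash lookups)
        hit = set()
        for l, c in enumerate(self._codes(x)):
            hit.update(self.tables[l].get(int(c), []))
        return hit

    def inclusion_prob(self, x, w):
        # SimHash per-bit collision probability: 1 - angle(x, w) / pi
        cos = np.clip(x @ w / (np.linalg.norm(x) * np.linalg.norm(w)), -1, 1)
        p_bit = 1.0 - np.arccos(cos) / np.pi
        return 1.0 - (1.0 - p_bit ** self.K) ** self.L  # prob. of >= 1 table hit


def estimate_partition_function(tables, x):
    """Unbiased (Horvitz-Thompson) estimate of Z = sum_i exp(w_i . x)."""
    hit = tables.query(x)
    return sum(np.exp(tables.items[i] @ x) / tables.inclusion_prob(x, tables.items[i])
               for i in hit)

# toy usage: compare against the exact partition function
W = rng.standard_normal((5000, 32)) * 0.3
x = rng.standard_normal(32) * 0.3
lsh = SimHashTables(dim=32)
lsh.index(W)
print(estimate_partition_function(lsh, x), np.exp(W @ x).sum())
```

The sampling is adaptive because items with large inner product to the query collide with it more often, yet it costs only a handful of hash lookups per query rather than a pass over all items.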
I will show the power of these randomized algorithms by introducing the SLIDE system. SLIDE is a striking illustration of how smart randomized algorithms on CPUs can outperform the best available GPU, with an optimized implementation, for training large neural networks. Our evaluations on large industry-scale datasets, with some large fully connected architectures, show that training with SLIDE on a 44-core CPU is more than 2.7 times faster (2 hours vs. 5.5 hours) than the same network trained using TensorFlow on a Tesla V100, at any given accuracy level.
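The sketch below illustrates the kind of trick SLIDE relies on (this is an assumed, simplified reconstruction, not the SLIDE codebase): keep the weight vectors of a wide layer in LSH tables and, for each input, compute only the handful of neurons retrieved by a few hash lookups instead of the full dense layer. The names K, L and `sparse_layer_forward` are illustrative.

```python
# Illustrative sketch: LSH-indexed neurons so that a forward pass touches only
# the neurons that share a hash bucket with the input.
import numpy as np

rng = np.random.default_rng(1)

def build_tables(W, planes):
    """Index each neuron's weight vector (rows of W) in L SimHash tables."""
    tables = [dict() for _ in range(len(planes))]
    for j, w in enumerate(W):
        for table, P in zip(tables, planes):
            key = tuple((P @ w) >= 0)
            table.setdefault(key, []).append(j)
    return tables

def sparse_layer_forward(x, W, b, planes, tables):
    """Forward pass that touches only the neurons retrieved by hash lookups."""
    active = set()
    for table, P in zip(tables, planes):
        active.update(table.get(tuple((P @ x) >= 0), []))
    active = sorted(active)
    out = np.zeros(W.shape[0])
    out[active] = np.maximum(W[active] @ x + b[active], 0.0)  # ReLU on active set only
    return out, active  # the same active set is updated in backprop

# toy usage: a 16,384-neuron layer where only a tiny fraction fires per input
dim, n_neurons, K, L = 128, 16384, 9, 8
W = rng.standard_normal((n_neurons, dim)) / np.sqrt(dim)
b = np.zeros(n_neurons)
planes = [rng.standard_normal((K, dim)) for _ in range(L)]
tables = build_tables(W, planes)
x = rng.standard_normal(dim)
out, active = sparse_layer_forward(x, W, b, planes, tables)
print(f"{len(active)} of {n_neurons} neurons touched")
```

Because the per-input work is a few hash lookups plus a small sparse matrix product, this style of computation maps naturally onto multi-core CPUs with asynchronous updates, which is where the reported speedups over a dense GPU implementation come from.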
In the end, if time permits, we will switch to memory cost and see how a simple hashing scheme can shrink the memory requirements of large-scale classification problems exponentially! Using our algorithms, we can train a model with 100,000 classes and 400,000 features on a single Titan X, while needing 5% or less of the memory required to store all the weights. In contrast, plain logistic regression on this data has an unavoidable model size of 160 GB.
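One way such hashing can compress the output layer (an illustrative sketch under my own assumptions, not necessarily the talk's exact construction): hash the C classes into B << C buckets with R independent hash functions, train R small B-way classifiers, and recover a class score by averaging its buckets' probabilities across repetitions, so that memory scales with R*B*d instead of C*d. Names like B, R, and `ClassHashModel` are assumptions.

```python
# Illustrative sketch: class hashing to shrink the parameter count of an
# extreme-classification softmax layer.
import numpy as np

rng = np.random.default_rng(2)

class ClassHashModel:
    def __init__(self, dim, n_classes, B=64, R=4, lr=0.1):
        self.B, self.R, self.lr = B, R, lr
        # a fixed random bucket per (repetition, class) stands in for a universal hash
        self.class_to_bucket = rng.integers(0, B, size=(R, n_classes))
        self.W = np.zeros((R, B, dim))  # R small softmax layers instead of one huge one

    def _softmax(self, z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def train_step(self, x, y):
        # one SGD step on each of the R independent B-way softmax classifiers
        for r in range(self.R):
            p = self._softmax(self.W[r] @ x)
            p[self.class_to_bucket[r, y]] -= 1.0  # cross-entropy gradient w.r.t. logits
            self.W[r] -= self.lr * np.outer(p, x)

    def scores(self, x):
        # average each class's bucket probability over the R repetitions
        probs = np.stack([self._softmax(self.W[r] @ x) for r in range(self.R)])
        return probs[np.arange(self.R)[:, None], self.class_to_bucket].mean(axis=0)

# toy usage: 10,000 classes stored with 4 * 64 weight rows instead of 10,000
dim, n_classes = 128, 10_000
model = ClassHashModel(dim, n_classes)
x, y = rng.standard_normal(dim), 1234
model.train_step(x, y)
print(model.scores(x).argmax())
```

Because different classes rarely collide in the same bucket under all R hash functions, averaging over repetitions disambiguates them while the stored weights stay a small fraction of the full class-by-feature matrix.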