WNCG Prof. Constantine Caramanis along with Ph.D. student Ioannis Mitliagkas and MSR Bangalore researcher Dr. Prateek Jain, have obtained the first-ever linear-memory algorithm for Principal Component Analysis. Their algorithm is efficient to implement, needs to see each data point only once, and works even in the setting of many missing entries.
Principal component analysis is a fundamental tool for dimensionality reduction, clustering, classification, and many more learning tasks. It is a basic preprocessing step for learning, recognition, and estimation procedures. The core computational element of PCA is performing a (partial) singular value decomposition, and much work over the last half century has focused on efficient algorithms and hence on computational complexity. The recent focus on understanding high-dimensional data (examples: video or image data, medical or DNA data), where the dimensionality of the data scales together with the number of available sample points, has led to an exploration of the sample complexity of covariance estimation. What has not been considered is the memory complexity of PCA algorithms. The only algorithms with known performance guarantees thus far, require O(p2) memory, in p dimensions. This can be prohibitive for modern high-dimensional applications.
This work fills precisely this need. We develop an algorithm with O(p) memory requirement (the best possible) and with performance matching state-of-the-art memory-intensive algorithms. Moreover, in followup work, we also develop an algorithm that works even when each data point has suffered a vast number of deletions or erasures.
- Paper 1: Memory-Limited Streaming PCA
- Paper 2: Streaming PCA with Many Missing Entries
This research was partiall funded by the National Science Foundation (NSF) and the Defense Threat Reduction Agency (DTRA).