Robustness in High-Dimensional Data Analysis
The analysis of very high-dimensional data sets, where the dimensionality of each observation is comparable to or even larger than the number of observations, has drawn increasing attention in the last decade due to a broad array of applications, from DNA microarrays to SmartGrid, to consumer preference modeling and collaborative filtering, and beyond. This exciting new regime poses severe challenges computationally, statistically, and conceptually. As a result, many of our tried-and-true statistical techniques fail in this regime, and new tools specifically designed for high dimensionality are needed. In this talk, we focus on an essential property of algorithms in high-dimensional data analysis: robustness to parameter uncertainty and corrupted observations. Robustness is not only an important property in its own right but, as we show, also provides remarkable new insights unifying other previously seemingly disconnected properties, including consistency and sparsity.
In the first part of the talk, we revisit perhaps the most widely used statistical technique for dimensionality reduction: Principal Component Analysis (PCA). PCA is well known to be exceptionally brittle: even a single corrupted point can lead to arbitrarily bad PCA output. We consider PCA in the high-dimensional regime, where a constant fraction of the observations in the data set are arbitrarily corrupted. We show that many existing techniques for controlling the effect of outliers fail in this setting, and discuss some of the unique challenges (and also opportunities) that the high-dimensional regime poses. We then propose a High-dimensional Robust Principal Component Analysis (HR-PCA) algorithm that is computationally tractable, provably robust to contaminated points, and easily kernelizable. The resulting subspace has a bounded deviation from the desired one, achieves maximal robustness, and, unlike ordinary PCA algorithms, achieves optimality (i.e., exact recovery) in the limiting case where the proportion of corrupted points goes to zero. To the best of our knowledge, this is the first robust PCA algorithm that works in the high-dimensional regime.
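The brittleness of ordinary PCA can be seen in a toy sketch (this illustrates only the failure mode, not the HR-PCA algorithm itself; the data and the outlier below are hypothetical). The clean data lie near the x-axis, yet a single gross outlier drags the leading principal component to a nearly orthogonal direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean data: 200 points spread along the x-axis with tiny y-noise,
# so the true leading principal component is close to (1, 0).
n = 200
X = np.column_stack([rng.normal(0, 3.0, n), rng.normal(0, 0.1, n)])

def leading_pc(X):
    """Top principal component (unit vector) of mean-centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

u_clean = leading_pc(X)

# Append ONE gross outlier far along the orthogonal direction.
X_bad = np.vstack([X, [0.0, 1000.0]])
u_bad = leading_pc(X_bad)

print(abs(u_clean[0]))  # near 1: clean PC aligns with the x-axis
print(abs(u_bad[1]))    # near 1: one corrupted point flips the PC to the y-axis
```

A robust variant would aim to keep the recovered subspace close to `u_clean` even in the presence of such corruption.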
In the second part we turn to ℓ1-regularized regression, also known as Lasso. Lasso is arguably the most widely applied feature-selection algorithm in the high-dimensional regime due to its remarkable sparsity properties. We show that the solution to Lasso is also the solution to a robust optimization problem. By connecting the regularizer to a physical robustness property, we obtain a rich algorithmic generalization of Lasso. Robustness also allows us to provide a new analysis re-deriving the consistency and sparsity of Lasso. In short, we show that Lasso is consistent because it is robust; similarly, it is sparse because it is robust. This points to the beginning of a much broader theory, which shows that robustness, beyond its own merit, provides a unifying algorithmic and analytic perspective for understanding the high-dimensional regime.
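The Lasso–robustness connection can be checked numerically in a small sketch, assuming the uncertainty set where each feature column is perturbed by a vector of ℓ2-norm at most c (one choice of uncertainty set for this equivalence; the data here are synthetic). For a fixed coefficient vector β, the worst-case residual under such perturbations equals the nominal ℓ2 residual plus c times the ℓ1 norm of β, which is exactly the Lasso penalty:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, c = 30, 5, 0.3
X = rng.normal(size=(n, p))   # synthetic design matrix
y = rng.normal(size=n)        # synthetic response
beta = rng.normal(size=p)     # an arbitrary fixed coefficient vector

r = y - X @ beta              # nominal residual

# Worst-case disturbance: perturb each column by a vector of norm c,
# aligned against the residual and signed by the matching coefficient.
Delta = np.outer(-r / np.linalg.norm(r), c * np.sign(beta))

worst = np.linalg.norm(y - (X + Delta) @ beta)      # worst-case loss
regularized = np.linalg.norm(r) + c * np.abs(beta).sum()  # Lasso-style objective

print(np.isclose(worst, regularized))  # True: the two objectives coincide
```

Since the identity holds for every β, minimizing the worst-case residual over β is the same as minimizing the ℓ2 loss plus an ℓ1 penalty with weight c.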
Huan Xu is a postdoctoral associate in the Department of Electrical and Computer Engineering at The University of Texas at Austin. He received the B.Eng. degree in automation from Shanghai Jiaotong University, Shanghai, China in 1997, the M.Eng. degree in electrical engineering from the National University of Singapore in 2003, and the Ph.D. degree in electrical engineering from McGill University, Canada in 2009. His research interests include high-dimensional data analysis, machine learning, robust optimization, and decision making and control under uncertainty.