Statistical inference of information measures beyond i.i.d.
Recent years have witnessed significant progress in entropy estimation, in particular in the large alphabet regime. Concretely, there exist efficiently computable information theoretically optimal estimators whose performance with n samples is essentially that of the maximum likelihood estimator with n log(n) samples, a phenomenon termed ``effective sample size boosting''. Generalizations to processes with memory (estimation of the entropy rate) and continuous distributions (estimation of the differential entropy) have remained largely open. This talk is about the challenges behind those generalizations and recent progress in this direction. For estimating the entropy rate of a Markov chain, we show that when the mixing time is not too slow, the information theoretic limit is S^2/log(S) samples, where S is the size of the state space. In contrast, the empirical entropy rate requires S^2 samples to achieve consistency even if the Markov chain is memoryless. We propose a general approach to achieve the S^2/log(S) sample complexity, and illustrate our results through estimating the entropy rate of the English language from the Google 1 Billion Word Dataset. For differential entropy estimation, we characterize the minimax behavior over Lipschitz balls in both the compactly supported case and sub-Gaussian case without assuming the usual assumption that the density is bounded away from zero, and show that a fixed-k nearest neighbor estimator adaptively achieves the minimax rates up to logarithmic factors without knowing the smoothness of the density. The "effective sample size boosting" phenomenon holds in both the Markov case and the case of continuous distributions.