Miro Dudík
Maximum Entropy Density Estimation

We consider the problem of estimating probability densities. The maximum entropy principle (maxent) states that a density estimate should respect empirical information, expressed as constraints, and otherwise be as close to the uniform distribution as possible, thus avoiding any bias beyond the constraints. Constraints are specified in terms of real-valued "features" defined over a sample space. Most commonly, they require that the means of the features with respect to the density estimate match the empirical means determined from data. This approach is, however, bound to overfit when the number of constraints is large and the data are scarce.
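
In symbols, the basic (unsmoothed) problem can be sketched as follows. The notation is an assumption made here for illustration: a finite sample space X, features f_1, ..., f_n, samples x_1, ..., x_m, and p ranging over probability distributions on X.

```latex
\max_{p}\; H(p) = -\sum_{x \in X} p(x)\,\ln p(x)
\quad\text{subject to}\quad
\mathbb{E}_{p}[f_j] = \tilde{\mathbb{E}}[f_j], \qquad j = 1,\dots,n,
\qquad\text{where}\qquad
\tilde{\mathbb{E}}[f_j] = \frac{1}{m}\sum_{i=1}^{m} f_j(x_i).
```

On a finite space, maximizing H(p) is the same as minimizing the relative entropy to the uniform distribution u, since D(p || u) = ln |X| - H(p). This is the sense in which the estimate stays as close to uniform as the constraints allow.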

There are many ways to smooth maxent and thus avoid overfitting. The purpose of this work is to understand relationships between various smoothing techniques and, more importantly, to derive performance guarantees. Our result is that smoothing by regularization is equivalent to relaxation of constraints. We also provide guarantees that give insights into which types of relaxed constraints will lead to good performance.
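
The equivalence between regularization and relaxed constraints can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the sample space, the two features, the sample, and the value of BETA are all made up, and the solver is plain proximal gradient descent chosen for brevity, not the algorithm from the publications. It minimizes the l1-regularized log loss over Gibbs distributions and then verifies that every feature mean ends up within BETA of its empirical mean, which is the relaxed (box) form of the constraints.

```python
import math

# Hypothetical toy problem: sample space {0, ..., 9}, two features with values in [0, 1].
POINTS = list(range(10))
FEATS = [(x / 9.0, (2.0 * x / 9.0 - 1.0) ** 2) for x in POINTS]
SAMPLE = [1, 2, 2, 3, 7]  # made-up observations (indices into POINTS)
BETA = 0.05               # regularization weight, also the width of the relaxed constraints


def gibbs(lam, feats):
    """Gibbs distribution q(x) proportional to exp(sum_j lam[j] * f_j(x))."""
    scores = [sum(l * f for l, f in zip(lam, fx)) for fx in feats]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def feature_means(q, feats):
    """Expected value of each feature under the distribution q."""
    return [sum(qx * fx[j] for qx, fx in zip(q, feats)) for j in range(len(feats[0]))]


def soft_threshold(value, amount):
    """Proximal operator of amount * |value|: shrink toward zero by `amount`."""
    return math.copysign(max(abs(value) - amount, 0.0), value)


def fit_l1_maxent(feats, sample, beta, step=1.0, iters=5000):
    """Minimize (average log loss) + beta * ||lam||_1 by proximal gradient steps."""
    n = len(feats[0])
    emp = [sum(feats[x][j] for x in sample) / len(sample) for j in range(n)]
    lam = [0.0] * n
    for _ in range(iters):
        model = feature_means(gibbs(lam, feats), feats)
        # The gradient of the average log loss is E_q[f] - empirical mean.
        lam = [soft_threshold(l - step * (mq - e), step * beta)
               for l, mq, e in zip(lam, model, emp)]
    return lam, emp


lam, emp = fit_l1_maxent(FEATS, SAMPLE, BETA)
q = gibbs(lam, FEATS)
gaps = [mq - e for mq, e in zip(feature_means(q, FEATS), emp)]

print("weights:", lam)
print("feature-mean gaps:", gaps)
print("all gaps within BETA:", all(abs(g) <= BETA + 1e-6 for g in gaps))
```

Setting BETA to 0 turns the update into plain gradient descent on the log loss, which corresponds to the unsmoothed problem with exact equality constraints. Larger values of BETA loosen the constraints and pull the estimate toward the uniform distribution.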

Publications

  • Maximum entropy density estimation and modeling geographic distributions of species, PhD thesis, Department of Computer Science, Princeton University, 2007, [pdf] [tech report link]
  • Maximum entropy density estimation with generalized regularization and an application to species distribution modeling, with S. J. Phillips and R. E. Schapire, Journal of Machine Learning Research 8, 2007, 1217-1260, [journal]
  • Maximum entropy distribution estimation with generalized regularization, with R. E. Schapire, Proceedings of the 19th Annual Conference on Learning Theory, 2006, 123-138, [pdf]
  • Performance guarantees for regularized maximum entropy density estimation, with S. J. Phillips and R. E. Schapire, Proceedings of the 17th Annual Conference on Learning Theory, 2004, 472-486, [ps] [pdf]

Modeling Geographic Distributions of Species

Our goal is to model geographic distributions of biological species based on (i) their observed occurrence localities and (ii) environmental characteristics of a given region. Such models are used in conservation biology, ecology and land-use planning. The richest sources of data are museums and herbaria, but the number of occurrence records for many species of interest (e.g. endangered species) is quite small by machine learning standards (20-50 or even fewer), and the records are often collected in a highly biased manner. These issues pose a significant challenge for statistical methods. Coping with this challenge has been the focus of this work.

Together with Rob Schapire and Steven Phillips, we proposed using the maximum entropy approach to model species distributions. We developed the program MaxEnt, which is available for download.

Publications

  • Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation, with S. J. Phillips, Ecography 31:2, 2008, 161-175, [pdf]
  • Maximum entropy density estimation and modeling geographic distributions of species, PhD thesis, Department of Computer Science, Princeton University, 2007, [pdf] [tech report link]
  • Novel methods improve prediction of species' distributions from occurrence data, with J. Elith, C. Graham et al., Ecography 29:2, 2006, 129-151, [pdf]
  • Correcting sample selection bias in maximum entropy density estimation, with R. E. Schapire and S. J. Phillips, Advances in Neural Information Processing Systems 18, 2005, [ps] [pdf]
  • A maximum entropy approach to species distribution modeling, with S. J. Phillips and R. E. Schapire, Proceedings of the 21st International Conference on Machine Learning, 2004, 655-662, [ps] [pdf]

Last modified: Apr 19th, 2008