S. Guiasu and A. Shenitzer. The principle of maximum entropy. The Mathematical Intelligencer, 7(1), 1985. (An overview paper)
E. Jaynes. Notes on present status and future prospects. In W.T. Grandy and L.H. Schick, editors, Maximum Entropy and Bayesian Methods, pages 1-13. Kluwer, 1990. (Depending on your viewpoint, Jaynes deserves credit for either inventing maxent or, at the very least, formalizing it, in 1957.)
S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, April 1997. (Introduces an iterative algorithm for constructing an exponential model from ``informative'' features selected automatically from a large candidate set.)
D. Brown. A note on approximations to discrete probability distributions. Information and Control, 2:386-392, 1959.
I. Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146-158, 1975.
I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. Statistics & Decisions, Supplemental Issue:1, pages 205-237, 1984.
I. Csiszár. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3):1409-1413, 1989.
J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43:1470-1480, 1972.
The [Della Pietra, Della Pietra, Lafferty] reference above also formally introduces the improved iterative scaling algorithm, a procedure for computing maximum-likelihood estimates of the parameters in a maxent distribution.
The proceedings of the yearly conference Maximum Entropy and Bayesian Methods have been published by Kluwer for at least the last ten years and always contain interesting applications of maxent to areas as diverse as portfolio optimization, signal processing, nuclear physics, and, of all things, the ``two envelope'' paradox.
A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996. (Covers selected applications in machine translation, including word-sense disambiguation and word reordering)
R. Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer Speech and Language, 1996. (Uses exponential models to construct a conditional model of language which improves upon the standard ``trigram'' model.)
A. Ratnaparkhi. A maximum entropy part of speech tagger. Proceedings of the Conference on Empirical Methods in Natural Language Processing, May 1996, University of Pennsylvania. (Adwait has applied maxent to several problems in natural language processing; see his web page for a more complete list.)