A comparison of algorithms for maximum entropy parameter estimation

ABSTRACT
Conditional maximum entropy (ME) models provide a general purpose machine learning technique which has been successfully applied to fields as diverse as computer vision and econometrics, and which is used for a wide variety of classification problems in natural language processing. However, the flexibility of ME models is not without cost. While parameter estimation for ME models is conceptually straightforward, in practice ME models for typical natural language tasks are very large, and may well contain many thousands of free parameters. In this paper, we consider a number of algorithms for estimating the parameters of ME models, including iterative scaling, gradient ascent, conjugate gradient, and variable metric methods. Surprisingly, the standardly used iterative scaling algorithms perform quite poorly in comparison to the others, and for all of the test problems, a limited-memory variable metric algorithm outperformed the other choices.
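The paper's experiments were run with the PETSc/TAO toolkit; as a self-contained illustration of what is being compared, the sketch below estimates a small conditional ME (multinomial logistic) model two ways: with a Generalized Iterative Scaling update and with SciPy's L-BFGS-B as the limited-memory variable metric method. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a small dense binary feature matrix, class-conditional indicator features, and hypothetical toy data; the GIS step omits the usual slack feature for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def probs(W, X):
    """Conditional class probabilities p(y|x) under weights W (classes x features)."""
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)  # shift for numerical stability
    P = np.exp(scores)
    return P / P.sum(axis=1, keepdims=True)

def neg_log_likelihood(w, X, Y):
    """Negative conditional log-likelihood; w is the flattened weight matrix."""
    W = w.reshape(Y.shape[1], -1)
    return -np.sum(Y * np.log(probs(W, X)))

def gradient(w, X, Y):
    """Gradient = model feature expectations minus empirical expectations."""
    W = w.reshape(Y.shape[1], -1)
    return ((probs(W, X) - Y).T @ X).ravel()

def gis_step(W, X, Y):
    """One Generalized Iterative Scaling update (Darroch and Ratcliff 1972).
    Assumes non-negative features; C bounds the total feature mass per example
    (the usual slack feature is omitted here for brevity)."""
    C = X.sum(axis=1).max()
    emp = Y.T @ X                       # empirical feature expectations
    model = probs(W, X).T @ X           # model feature expectations
    return W + np.log(emp / model) / C  # log-space update, step shrinks as C grows

# Hypothetical toy data: 200 examples, 6 binary features, 3 classes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)
Y = np.eye(3)[rng.integers(0, 3, size=200)]  # one-hot labels

# Limited-memory variable metric estimate via SciPy's L-BFGS-B.
res = minimize(neg_log_likelihood, np.zeros(3 * 6), jac=gradient,
               args=(X, Y), method="L-BFGS-B")

# The same number of GIS iterations typically leaves a worse likelihood.
W = np.zeros((3, 6))
for _ in range(res.nit):
    W = gis_step(W, X, Y)
print(res.fun, neg_log_likelihood(W.ravel(), X, Y))
```

The contrast mirrors the paper's finding: both methods only need the gradient (the gap between empirical and model feature expectations), but GIS takes conservative steps scaled by 1/C, while a limited-memory variable metric method uses recent gradients to approximate curvature and converges in far fewer iterations.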