DOI: 10.3115/1118853.1118871
COLING Conference Proceedings
Article · Free Access

A comparison of algorithms for maximum entropy parameter estimation

Published: 31 August 2002

ABSTRACT

Conditional maximum entropy (ME) models provide a general-purpose machine learning technique which has been successfully applied to fields as diverse as computer vision and econometrics, and which is used for a wide variety of classification problems in natural language processing. However, the flexibility of ME models is not without cost. While parameter estimation for ME models is conceptually straightforward, in practice ME models for typical natural language tasks are very large, and may well contain many thousands of free parameters. In this paper, we consider a number of algorithms for estimating the parameters of ME models, including iterative scaling, gradient ascent, conjugate gradient, and variable metric methods. Surprisingly, the standardly used iterative scaling algorithms perform quite poorly in comparison to the others, and for all of the test problems, a limited-memory variable metric algorithm outperformed the other choices.
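All of the estimation algorithms the abstract compares work from the same quantity: the gradient of the conditional log-likelihood, which for an ME model is the observed feature counts minus the expected counts under the current parameters. A minimal sketch of that idea, using a toy two-class ME model (equivalent to logistic regression) on synthetic data and plain gradient ascent (the simplest of the methods compared; the data, step size, and variable names are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Toy two-class conditional maxent model on synthetic data.
# The gradient of the conditional log-likelihood is
#   observed feature counts - expected feature counts,
# the quantity shared by iterative scaling, gradient ascent,
# conjugate gradient, and variable metric methods alike.

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                            # 40 events, 3 features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # binary outcomes

def loglik_and_grad(w, X, y):
    s = X @ w                             # score for class 1 (class 0 fixed at 0)
    p1 = 1.0 / (1.0 + np.exp(-s))         # model probability P(y=1 | x)
    ll = np.sum(y * s - np.log1p(np.exp(s)))
    grad = X.T @ (y - p1)                 # observed minus expected counts
    return ll, grad

w = np.zeros(3)
lls = []
for _ in range(200):                      # plain (fixed-step) gradient ascent
    ll, g = loglik_and_grad(w, X, y)
    lls.append(ll)
    w += 0.01 * g
```

A limited-memory variable metric method such as L-BFGS would take this same log-likelihood and gradient but choose search directions using an approximation to the inverse Hessian, which is why it converges in far fewer function evaluations on large problems.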


Published in

COLING-02: proceedings of the 6th conference on Natural language learning - Volume 20
August 2002, 207 pages

Publisher: Association for Computational Linguistics, United States

Overall acceptance rate: 1,537 of 1,537 submissions (100%)
