A multistrategy approach for digital text categorization from imbalanced documents

Published: 01 June 2004
Abstract

The goal of the research described here is to develop a multistrategy classifier system for document categorization. The system automatically discovers classification patterns by applying several empirical learning methods to different representations of preclassified documents drawn from an imbalanced sample. The learners work in parallel: each learner carries out its own feature selection based on evolutionary techniques and then obtains a classification model. To classify documents, the system combines the predictions of the learners, again by applying evolutionary techniques. The system relies on a modular, flexible architecture that makes no assumptions about the design of the learners or the number of learners available, and that is independent of the thematic domain.
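
The per-learner feature-selection step described in the abstract can be pictured with a small genetic algorithm over feature masks. The abstract does not specify the genetic operators, the fitness function, or the learners, so everything below is an assumption made for illustration: the toy imbalanced corpus, the keyword-match "learner", and the hypothetical helpers `f1_of_mask` and `evolve`. It is a minimal sketch of GA-based feature subset selection scored by F1 on the minority class, not the authors' system.

```python
import random

# Toy imbalanced sample (assumed data): 2 positive documents, 6 negative.
DOCS = [
    ({"goal", "match", "team"}, 1),
    ({"match", "league"}, 1),
    ({"stock", "market", "team"}, 0),
    ({"market", "bank"}, 0),
    ({"election", "vote"}, 0),
    ({"bank", "vote", "stock"}, 0),
    ({"weather", "rain"}, 0),
    ({"rain", "market"}, 0),
]
VOCAB = sorted(set().union(*(terms for terms, _ in DOCS)))


def f1_of_mask(mask):
    """F1 on the positive class for a stand-in learner that labels a
    document positive if it contains any selected term."""
    selected = {t for t, bit in zip(VOCAB, mask) if bit}
    tp = fp = fn = 0
    for terms, label in DOCS:
        pred = 1 if terms & selected else 0
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evolve(fitness, n_bits, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Generational GA over bit masks with elitism; returns the best mask."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    pop[0] = [1] * n_bits  # seed with "use every feature" as a baseline
    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = [best[:]]  # elitism: the best mask always survives
        while len(nxt) < pop_size:
            parents = []
            for _ in range(2):
                a, b = rng.sample(pop, 2)  # binary tournament selection
                parents.append(a if fitness(a) >= fitness(b) else b)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = parents[0][:cut] + parents[1][cut:]
            # bit-flip mutation
            child = [1 - g if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
        best = max(pop, key=fitness)
    return best


best = evolve(f1_of_mask, len(VOCAB))
print("selected terms:", [t for t, bit in zip(VOCAB, best) if bit])
print("F1 with selected terms:", f1_of_mask(best))
print("F1 with all terms:", f1_of_mask([1] * len(VOCAB)))
```

Scoring by F1 rather than accuracy is a deliberate choice for this sketch: on an imbalanced sample, a mask that predicts the majority class everywhere would score well on accuracy but gets an F1 of zero. Because the initial population includes the all-features mask and the best individual is carried forward each generation, the returned mask never scores below the all-features baseline. The combination stage mentioned in the abstract could in principle reuse the same `evolve` loop with a fitness defined over learner votes, though the abstract gives no detail on how that is done.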

Published in

ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1
Special issue on learning from imbalanced datasets
June 2004, 117 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1007730

Copyright © 2004 Authors

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: article