A multistrategy approach for digital text categorization from imbalanced documents

Authors:
M. Dolores del Castillo, Instituto de Automática Industrial (CSIC), Madrid, Spain
José Ignacio Serrano, Instituto de Automática Industrial (CSIC), Madrid, Spain

ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1, June 2004, pp 70–79, https://doi.org/10.1145/1007730.1007740

Published: 01 June 2004

Abstract

The goal of the research described here is to develop a multistrategy classifier system for document categorization. The system automatically discovers classification patterns by applying several empirical learning methods to different representations of preclassified documents drawn from an imbalanced sample. The learners work in parallel: each learner carries out its own feature selection based on evolutionary techniques and then obtains a classification model. When classifying documents, the system combines the learners' predictions, again by applying evolutionary techniques. The system relies on a modular, flexible architecture that makes no assumptions about the design or number of the learners available and guarantees independence from the thematic domain.
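
The abstract does not give the details of the evolutionary feature selection, so the following is only a minimal sketch of the general idea for a single learner, not the authors' method. It assumes a genetic algorithm over 0/1 masks of the vocabulary, a toy term-weight classifier, and balanced accuracy as the fitness so that the minority class counts as much as the majority. The sample documents, the names `term_weights`, `predict`, `fitness` and `evolve_mask`, and all parameter values are hypothetical.

```python
import random

# Hypothetical toy sample: (set of terms, label). Label 1 is the minority class.
DOCS = [
    ({"goal", "match", "team", "the"}, 1),
    ({"team", "score", "the"}, 1),
    ({"market", "stock", "the"}, 0),
    ({"stock", "price", "team"}, 0),
    ({"bank", "market", "the"}, 0),
    ({"price", "bank", "score"}, 0),
    ({"market", "price", "the"}, 0),
    ({"bank", "stock", "match"}, 0),
]
VOCAB = sorted(set().union(*(terms for terms, _ in DOCS)))


def term_weights(docs):
    """Weight of a term = its document rate in class 1 minus its rate in class 0."""
    n_pos = sum(1 for _, y in docs if y == 1)
    n_neg = len(docs) - n_pos
    weights = {}
    for term in VOCAB:
        pos = sum(1 for terms, y in docs if y == 1 and term in terms)
        neg = sum(1 for terms, y in docs if y == 0 and term in terms)
        weights[term] = pos / n_pos - neg / n_neg
    return weights


WEIGHTS = term_weights(DOCS)


def predict(mask, terms):
    """Predict class 1 when the selected terms present in the document sum to a positive weight."""
    score = sum(WEIGHTS[t] for t, keep in zip(VOCAB, mask) if keep and t in terms)
    return 1 if score > 0 else 0


def fitness(mask, docs=DOCS, penalty=0.01):
    """Balanced accuracy (mean per-class recall) minus a small cost per selected term."""
    hits = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for terms, y in docs:
        totals[y] += 1
        hits[y] += predict(mask, terms) == y
    balanced = (hits[0] / totals[0] + hits[1] / totals[1]) / 2
    return balanced - penalty * sum(mask)


def evolve_mask(generations=30, pop_size=12, p_mut=0.1, seed=0):
    """Evolve a 0/1 mask over VOCAB: tournament selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    n = len(VOCAB)
    # Start from the full vocabulary plus random subsets.
    population = [[1] * n] + [
        [rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        best = max(population, key=fitness)
        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            a = max(rng.sample(population, 3), key=fitness)
            b = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


best_mask = evolve_mask()
selected = [t for t, keep in zip(VOCAB, best_mask) if keep]
print(selected, round(fitness(best_mask), 3))
```

Because the full-vocabulary mask is in the initial population and the best mask is carried over each generation, the evolved subset never scores worse on this fitness than using every term. In a multistrategy setting, each learner would run its own search of this kind over its own document representation.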

Index Terms

A multistrategy approach for digital text categorization from imbalanced documents
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Hardware
  1. Power and energy
    1. Power estimation and optimization
      1. Platform power issues

Index terms have been assigned to the content through auto-classification.


Published in

ACM SIGKDD Explorations Newsletter Volume 6, Issue 1
Special issue on learning from imbalanced datasets
June 2004
117 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/1007730

Copyright © 2004 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
feature selection
genetic algorithms
multistrategy learning
Qualifiers
- article