Impact of Sampling on Neural Network Classification Performance in the Context of Repeat Movie Viewing

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 383)

Abstract

This paper assesses the impact of different sampling approaches on neural network classification performance in the context of repeat movie-going. The results showed that synthetic oversampling of the minority class, either on its own or combined with undersampling and removal of noisy examples from the majority class, offered the best overall performance. Identifying the best sampling approach for this data set is not trivial, because the choice depends heavily on the metric used: the accuracy rankings of the approaches did not agree across the different accuracy measures. In addition, the findings suggest that including examples generated by the oversampling procedure in the holdout sample leads to a significant overestimation of the accuracy of the neural network. Further research is necessary to understand the relationship between the degree of synthetic oversampling and the efficacy of the holdout sample as an estimator of neural network accuracy.
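
To make the holdout finding concrete, the following is a minimal sketch, not the authors' implementation, of one way to keep synthetic examples out of the holdout sample: split first, then oversample only the training part. The interpolation step is a simplified SMOTE-style procedure written for illustration. The function names `smote_like` and `split_then_oversample`, the label convention (1 = minority, 0 = majority), the neighbour count `k`, and the 30% holdout fraction are all assumptions, not details taken from the paper.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority points (illustrative, SMOTE-style).

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority examples")
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances; a point is not its own neighbour.
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                 # seed points
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                          # position on segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])

def split_then_oversample(X, y, holdout_frac=0.3, seed=0):
    """Hold out real examples first, then balance only the training part.

    Assumes label 1 is the minority class and label 0 the majority class.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_hold = int(round(holdout_frac * len(X)))
    hold, train = idx[:n_hold], idx[n_hold:]
    X_tr, y_tr = X[train], y[train]
    X_min = X_tr[y_tr == 1]
    n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
    if n_new > 0 and len(X_min) >= 2:
        X_syn = smote_like(X_min, n_new, seed=seed)
    else:
        X_syn = np.empty((0, X.shape[1]))
    X_bal = np.vstack([X_tr, X_syn])
    y_bal = np.concatenate([y_tr, np.ones(len(X_syn), dtype=y.dtype)])
    # The holdout contains only original, untouched examples.
    return X_bal, y_bal, X[hold], y[hold]
```

Oversampling before the split would instead let interpolated points, which sit close to training points by construction, land in the holdout sample; that is the situation the abstract associates with overestimated accuracy.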

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fitkov-Norris, E., Folorunso, S.O. (2013). Impact of Sampling on Neural Network Classification Performance in the Context of Repeat Movie Viewing. In: Iliadis, L., Papadopoulos, H., Jayne, C. (eds) Engineering Applications of Neural Networks. EANN 2013. Communications in Computer and Information Science, vol 383. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41013-0_22

  • DOI: https://doi.org/10.1007/978-3-642-41013-0_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41012-3

  • Online ISBN: 978-3-642-41013-0

  • eBook Packages: Computer Science, Computer Science (R0)