Abstract
Standard pattern discovery techniques, such as association rules, suffer an extreme risk of finding very large numbers of spurious patterns for many knowledge discovery tasks. The direct-adjustment approach to controlling this risk applies a statistical test during the discovery process, using a critical value adjusted to take account of the size of the search space. However, a problem with the direct-adjustment strategy is that it may discard numerous true patterns. This paper investigates the assignment of different critical values to different areas of the search space as an approach to alleviating this problem, using a variant of a technique originally developed for other purposes. This approach is shown to be effective at increasing the number of discoveries while still maintaining strict control over the risk of false discoveries.
Article PDF
Similar content being viewed by others
References
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining associations between sets of items in massive databases. In Proceedings of the 1993 ACM-SIGMOD international conference on management of data (pp. 207–216). Washington, DC, May 1993.
Aumann, Y., & Lindell, Y. (1999). A statistical theory for quantitative association rules. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (KDD-99) (pp. 261–270).
Bastide, Y., Pasquier, N., Taouil, R., Stumme, G., & Lakhal, L. (2000). Mining minimal non-redundant association rules using frequent closed itemsets. In First international conference on computational logic—CL 2000 (pp. 972–986). Berlin: Springer.
Bay, S. D., & Pazzani, M. J. (2001). Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5(3), 213–246.
Bayardo, R. J. Jr., Agrawal, R., & Gunopulos, D. (2000). Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery, 4(2/3), 217–240.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A new and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300.
Brijs, T., Swinnen, G., Vanhoof, K., & Wets, G. (1999). Using association rules for product assortment decisions: A case study. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 254–260). New York: ACM.
Brin, S., Motwani, R., & Silverstein, C. (1997). Beyond market baskets: Generalizing association rules to correlations. In J. Peckham (Ed.), SIGMOD 1997, Proceedings ACM SIGMOD international conference on management of data, May 1997 (pp. 265–276). New York: ACM.
Dong, G., & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (KDD-99) (pp. 15–18). New York: ACM.
DuMouchel, W., & Pregibon, D. (2001). Empirical Bayes screening for multi-item associations. In KDD-2001: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, August 2001 (pp. 76–76). New York: ACM.
Gionis, A., Mannila, H., Mielikainen, T., & Tsaparas, P. (2006). Assessing data mining results via swap randomization. In 12th international conference on knowledge discovery and data mining (KDD) (pp. 167–176).
Han, J., Wang, J., Lu, Y., & Tzvetkov, P. (2002). Mining top-K frequent closed patterns without minimum support. In International conference on data mining (pp. 211–218).
Hettich, S., & Bay, S. D. (2007). The UCI KDD archive. Department of Information and Computer Science, University of California, Irvine, CA. http://kdd.ics.uci.edu.
Holland, B. S., & Copenhaver, M. D. (1988). Improved Bonferroni-type multiple testing procedures. Psychological Bulletin, 104(1), 145–149.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
International Business Machines (1996). IBM intelligent miner user’s guide, version 1, release 1.
Jaroszewicz, S., & Simovici, D. A. (2004). Interestingness of frequent itemsets using Bayesian networks as background knowledge. In R. Kohavi, J. Gehrke, & J. Ghosh (Eds.), KDD-2004: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, August 2004 (pp. 178–186). New York: ACM.
Klösgen, W. (1996). Explora: A multipattern and multistrategy discovery assistant. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 249–271). Menlo Park: AAAI.
Liu, B., Hsu, W., & Ma, Y. (1999). Pruning and summarizing the discovered associations. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (KDD-99), August 1999 (pp. 125–134). New York: AAAI.
Megiddo, N., & Srikant, R. (1998). Discovering predictive association rules. In Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98) (pp. 27–78). Menlo Park: AAAI.
Možina, M., Demšar, J., Žabkar, J., & Bratko, I. (2006). Why is rule learning optimistic and how to correct it. In Machine learning: ECML 2006 (pp. 330–340). Berlin: Springer.
Newman, D. J., Hettich, S., Blake, C., & Merz, C. J. (2007). UCI repository of machine learning databases (Machine-readable data repository). Department of Information and Computer Science, University of California, Irvine, CA.
Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro & J. Frawley (Eds.), Knowledge discovery in databases (pp. 229–248). Menlo Park: AAAI/MIT Press.
Scheffer, T. (1995). Finding association rules that trade support optimally against confidence. Intelligent Data Analysis, 9(4), 381–395.
Scheffer, T., & Wrobel, S. (2002). Finding the most interesting patterns in a database quickly by using sequential sampling. Journal of Machine Learning Research, 3, 833–862.
Webb, G. I. (1995). OPUS: An efficient admissible algorithm for unordered search. Journal of Artificial Intelligence Research, 3, 431–465.
Webb, G. I. (2001). Discovering associations with numeric variables. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (KDD-2001) (pp. 383–388). New York: ACM.
Webb, G. I. (2002). Magnum opus, version 1.3. Software. Melbourne: G.I. Webb & Associates.
Webb, G. I. (2006). Discovering significant rules. In L. Ungar, M. Craven, D. Gunopulos, & T. Eliassi-Rad (Eds.), Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining, KDD-2006 (pp. 434–443). New York: ACM.
Webb, G. I. (2007). Discovering significant patterns. Machine Learning, 68(1), 1–33.
Webb, G. I., & Zhang, S. (2005). K-optimal rule discovery. Data Mining and Knowledge Discovery, 10(1), 39–79.
Zaki, M. J. (2004). Mining non-redundant association rules. Data Mining and Knowledge Discovery, 9(3), 223–248.
Zhang, H., Padmanabhan, B., & Tuzhilin, A. (2004). On the discovery of significant statistical quantitative rules. In Proceedings of the tenth international conference on knowledge discovery and data mining (KDD-2004), August 2004 (pp. 374–383). New York: ACM.
Zheng, Z., Kohavi, R., & Mason, L. (2001). Real world performance of association rule algorithms. In Proceedings of the seventh international conference on knowledge discovery and data mining (KDD-2001), August 2001 (pp. 401–406). New York: ACM.
Author information
Authors and Affiliations
Corresponding author
Additional information
Editor: Johannes Fuernkranz
Rights and permissions
About this article
Cite this article
Webb, G.I. Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Mach Learn 71, 307–323 (2008). https://doi.org/10.1007/s10994-008-5046-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-008-5046-x