research-article

Catalysis Clustering with GAN by Incorporating Domain Knowledge

Authors:
Olga Andreeva, University of Massachusetts Boston, Boston, MA, USA
Wei Li, Jiangnan University & Jiangsu Key Laboratory of Media Design and Software Technology, Wuxi, Jiangsu, China
Wei Ding, University of Massachusetts Boston, Boston, MA, USA
Marieke Kuijjer, University of Oslo, Oslo, Norway
John Quackenbush, Harvard T. H. Chan School of Public Health, Boston, MA, USA
Ping Chen, University of Massachusetts Boston, Boston, MA, USA

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 2020, Pages 1344–1352. https://doi.org/10.1145/3394486.3403187

Published: 20 August 2020

ABSTRACT

Clustering is an important unsupervised learning method, but it faces serious challenges when data are sparse and high-dimensional. Generated clusters are often evaluated with general measures, which may not be meaningful or useful for practical applications and domains. Using a distance metric, a clustering algorithm searches through the data space, groups nearby items into one cluster, and assigns distant samples to different clusters. In many real-world applications, the number of dimensions is high and the data space becomes very sparse. Selecting a suitable distance metric is difficult, and it becomes even harder when categorical data are involved. Moreover, existing distance metrics are mostly generic, and clusters built on them will not necessarily make sense for domain-specific applications. One way to address these challenges is to integrate domain-defined rules and guidelines into the clustering process. In this work we propose a GAN-based approach called Catalysis Clustering to incorporate domain knowledge into the clustering process. With GANs we generate catalysts, which are special synthetic points drawn from the original data distribution and verified to improve clustering quality when measured by a domain-specific metric. We then perform clustering analysis using both catalysts and real data. Final clusters are produced after catalyst points are removed. Experiments on two challenging real-world datasets clearly show that our approach is effective and can generate clusters that are meaningful and useful for real-world applications.
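The catalyst loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the GAN is replaced by a hypothetical `propose_candidates` function that merely jitters real points with Gaussian noise, the clustering algorithm is a plain k-means, and `domain_score` is a made-up domain metric (the gap between per-cluster mean outcomes). All function and parameter names are illustrative only.

```python
import random


def kmeans(points, k, iters=25, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def domain_score(labels, outcomes):
    """Hypothetical domain metric: gap between the highest and lowest
    per-cluster mean outcome (a crude stand-in for a domain-specific measure)."""
    groups = {}
    for lab, y in zip(labels, outcomes):
        groups.setdefault(lab, []).append(y)
    means = [sum(v) / len(v) for v in groups.values()]
    return max(means) - min(means) if len(means) > 1 else 0.0


def propose_candidates(real, n, rng, noise=0.3):
    """Assumed stand-in for the GAN generator: jitter randomly chosen real points."""
    return [tuple(x + rng.gauss(0.0, noise) for x in rng.choice(real)) for _ in range(n)]


def catalysis_clustering(real, outcomes, k, rounds=20, batch=5, seed=0):
    """Accept synthetic points only when they improve the domain score."""
    rng = random.Random(seed)
    best_labels = kmeans(real, k)  # baseline: real data only
    best_score = domain_score(best_labels, outcomes)
    catalysts = []
    for _ in range(rounds):
        candidates = propose_candidates(real, batch, rng)
        # Cluster real data together with accepted catalysts and new candidates,
        # then drop the synthetic points by keeping the real points' labels only.
        labels = kmeans(real + catalysts + candidates, k)[: len(real)]
        score = domain_score(labels, outcomes)
        if score > best_score:  # candidates are verified to help, so keep them
            catalysts += candidates
            best_labels, best_score = labels, score
    return best_labels, best_score, catalysts
```

Calling `catalysis_clustering(real, outcomes, k=2)` with `real` as a list of coordinate tuples and `outcomes` as a parallel list of numbers returns labels for the real points only, the domain score reached, and the accepted synthetic points. The accept-if-better rule is one simple way to read "verified to improve clustering quality"; the paper itself may verify catalysts differently.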

Index Terms

Catalysis Clustering with GAN by Incorporating Domain Knowledge
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
    2. Machine learning approaches
      1. Neural networks

Published in
KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020
3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486
General Chairs: Rajesh Gupta (UC San Diego, USA), Yan Liu (USC, USA)
Program Chairs: Mohak Shah (LG Electronics, USA), Suju Rajan (LinkedIn, USA)
Publications Chairs: Jiliang Tang (Michigan State, USA), B. Aditya Prakash (Georgia Tech, USA)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GAN
cancer subtyping
clustering evaluation
domain-informed clustering
Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%