ABSTRACT
We propose Perform-ML, the first Machine Learning (ML) framework for the analysis of massive, dense data that customizes the algorithm to the underlying platform in order to optimize resource efficiency. Perform-ML builds a performance model that quantifies the computational cost of iterative analysis algorithms on a given platform in terms of FLOPs, communication, and memory, which together characterize runtime, storage, and energy. The core of Perform-ML is a novel parametric data projection algorithm, called Elastic Dictionary (ExD), that produces versatile, sparse representations of the data and thereby helps minimize the performance cost. We show that Perform-ML achieves the optimal performance objective, according to our cost model, through platform-aware tuning of the ExD parameters. An accompanying API automates the application of Perform-ML to a variety of algorithms, datasets, and platforms. Proof-of-concept evaluations on massive, dense datasets across different platforms demonstrate more than an order of magnitude improvement in performance over the state of the art, within guaranteed user-defined error bounds.
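To make the dictionary-projection idea concrete, the sketch below is our own minimal illustration, not the authors' implementation: it factors a dense matrix A into a dictionary D of sampled data columns and a sparse coefficient matrix V with A ≈ DV, then compares the per-matrix-vector-product FLOP cost of the factored form against the dense form. The function names (build_exd, flops_per_matvec), the uniform column-sampling rule, and the parameters L and n_nonzero are hypothetical choices for exposition; the actual ExD parameterization and tuning procedure are described in the paper itself.

```python
# A minimal sketch (assumed, not the authors' code) of an ExD-style
# factorization: represent a dense matrix A (d x n) as A ~= D @ V, where
# D holds L sampled data columns and V is a column-sparse coefficient matrix.
import numpy as np
from sklearn.linear_model import orthogonal_mp  # greedy sparse coding (OMP)

def build_exd(A, L, n_nonzero=8, seed=None):
    """Factor A ~= D @ V: D = L uniformly sampled columns, V sparse codes."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    cols = rng.choice(n, size=L, replace=False)   # uniform column sampling
    D = A[:, cols]                                # dictionary of data columns
    # Code every column of A over D with at most n_nonzero coefficients.
    V = orthogonal_mp(D, A, n_nonzero_coefs=n_nonzero)
    return D, np.asarray(V)

def flops_per_matvec(d, n, L, nnz_V):
    """Approximate FLOP counts for one product A @ x, dense vs. factored."""
    dense = 2 * d * n                 # A @ x directly
    factored = 2 * (nnz_V + d * L)    # y = V @ x (sparse), then D @ y
    return dense, factored

# Usage: pick L and the sparsity level so the factored cost -- a proxy for
# runtime, storage, and energy on the target platform -- beats the dense
# cost while the residual stays within a user-defined error bound.
A = np.random.rand(512, 4096)
D, V = build_exd(A, L=64, seed=0)
print(flops_per_matvec(*A.shape, 64, np.count_nonzero(V)))
print(np.linalg.norm(A - D @ V) / np.linalg.norm(A))  # relative error
```

In this toy setup, shrinking L (or the per-column sparsity) reduces both the memory footprint and the FLOPs of every subsequent iteration, at the price of a larger residual; the abstract's claim is that Perform-ML navigates exactly this trade-off automatically, tuning the projection parameters against a platform-specific cost model rather than by hand.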