ABSTRACT
We propose Perform-ML, the first Machine Learning (ML) framework for the analysis of massive, dense data that customizes the algorithm to the underlying platform in order to optimize resource efficiency. Perform-ML builds a performance model that quantifies the computational cost of iterative analysis algorithms on a given platform in terms of FLOPs, communication, and memory, which together characterize runtime, storage, and energy. The core of Perform-ML is a novel parametric data projection algorithm, called Elastic Dictionary (ExD), that produces versatile, sparse representations of the data and thereby helps minimize the performance cost. We show that Perform-ML achieves the optimal performance objective, according to our cost model, through platform-aware tuning of the ExD parameters. An accompanying API automates the application of Perform-ML to a variety of algorithms, datasets, and platforms. Proof-of-concept evaluations on massive, dense datasets across different platforms demonstrate more than an order of magnitude improvement in performance over the state of the art, within guaranteed user-defined error bounds.
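To make the dictionary-projection idea concrete, the sketch below is our own minimal illustration, not the authors' implementation: it factors a dense matrix A into a dictionary D of sampled data columns and a sparse coefficient matrix V with A ≈ DV, then compares the per-matrix-vector-product FLOP cost of the factored form against the dense form. The function names (build_exd, flops_per_matvec), the uniform column-sampling rule, and the parameters L and n_nonzero are hypothetical choices for exposition; the actual ExD parameterization and tuning procedure are described in the paper itself.

```python
# A minimal sketch (assumed, not the authors' code) of an ExD-style
# factorization: represent a dense matrix A (d x n) as A ~= D @ V, where
# D holds L sampled data columns and V is a column-sparse coefficient matrix.
import numpy as np
from sklearn.linear_model import orthogonal_mp  # greedy sparse coding (OMP)

def build_exd(A, L, n_nonzero=8, seed=None):
    """Factor A ~= D @ V: D = L uniformly sampled columns, V sparse codes."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    cols = rng.choice(n, size=L, replace=False)   # uniform column sampling
    D = A[:, cols]                                # dictionary of data columns
    # Code every column of A over D with at most n_nonzero coefficients.
    V = orthogonal_mp(D, A, n_nonzero_coefs=n_nonzero)
    return D, np.asarray(V)

def flops_per_matvec(d, n, L, nnz_V):
    """Approximate FLOP counts for one product A @ x, dense vs. factored."""
    dense = 2 * d * n                 # A @ x directly
    factored = 2 * (nnz_V + d * L)    # y = V @ x (sparse), then D @ y
    return dense, factored

# Usage: pick L and the sparsity level so the factored cost -- a proxy for
# runtime, storage, and energy on the target platform -- beats the dense
# cost while the residual stays within a user-defined error bound.
A = np.random.rand(512, 4096)
D, V = build_exd(A, L=64, seed=0)
print(flops_per_matvec(*A.shape, 64, np.count_nonzero(V)))
print(np.linalg.norm(A - D @ V) / np.linalg.norm(A))  # relative error
```

In this toy setup, shrinking L (or the per-column sparsity) reduces both the memory footprint and the FLOPs of every subsequent iteration, at the price of a larger residual; the abstract's claim is that Perform-ML navigates exactly this trade-off automatically, tuning the projection parameters against a platform-specific cost model rather than by hand.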