
Double-Parallel Monte Carlo for Bayesian analysis of big data


Abstract

This paper proposes a simple, practical, and efficient MCMC algorithm for Bayesian analysis of big data. The proposed algorithm divides the big dataset into smaller subsets and provides a simple method for aggregating the subset posteriors to approximate the full-data posterior. To further speed up computation, the proposed algorithm employs the population stochastic approximation Monte Carlo algorithm, a parallel MCMC algorithm, to simulate from each subset posterior. Since the algorithm involves two levels of parallelism, data parallelism and simulation parallelism, it is coined “Double-Parallel Monte Carlo.” The validity of the proposed algorithm is justified both mathematically and numerically.
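The following is a minimal sketch of the divide-and-aggregate workflow, written for this presentation and assuming a toy Gaussian mean model. A plain random-walk Metropolis sampler stands in for the population stochastic approximation Monte Carlo sampler used in the paper, the subset log-likelihood is raised to the power k so that each subset posterior is calibrated to the full-data scale (an assumption of this sketch), and the recentering rule used for aggregation is only an illustration consistent with the mixture mean and variance formulas in Appendix A, not necessarily the paper's exact aggregation step.

```python
# Illustrative sketch only: a generic Metropolis sampler and a recentering
# aggregation rule stand in for the paper's pop-SAMC sampler and its
# aggregation scheme.  Model: X_i ~ N(theta, 1), flat prior on theta.
import numpy as np
from multiprocessing import Pool

def metropolis_on_subset(subset, k, n_iter=5000, step=0.01, seed=0):
    """Random-walk Metropolis targeting one subset posterior.

    The subset log-likelihood is multiplied by k so that each subposterior
    is calibrated to the full-data scale (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    def log_post(t):
        return -0.5 * k * np.sum((subset - t) ** 2)
    theta, lp = subset.mean(), log_post(subset.mean())
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        draws[i] = theta
    return draws[n_iter // 2:]                     # discard burn-in

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(1.0, 1.0, size=20000)
    k = 4
    subsets = np.array_split(data, k)              # divide the data into k subsets
    args = [(s, k, 5000, 0.01, j) for j, s in enumerate(subsets)]
    with Pool(k) as pool:                          # one worker per subset
        sub_draws = pool.starmap(metropolis_on_subset, args)
    # Aggregation: recenter each subposterior sample at the average of the
    # subposterior means, then pool all recentered samples.
    means = [d.mean() for d in sub_draws]
    center = np.mean(means)
    pooled = np.concatenate([d - m + center for d, m in zip(sub_draws, means)])
    print("approximate posterior mean:", pooled.mean())
    print("approximate posterior sd:  ", pooled.std())
```

The `Pool` above realizes only the data-parallel level (one worker per subset); the paper's second, simulation-parallel level, in which multiple interacting chains are run within each subset sampler, is not shown.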



Acknowledgements

Liang’s research was supported in part by the Grants DMS-1545202, DMS-1612924 and R01-GM117597. The authors thank the Editor, Associate Editor and two referees for their constructive comments, which have led to significant improvement of this paper.

Corresponding author

Correspondence to Faming Liang.

Appendix A Proof of Theorem 1

Under conditions A1 and A2, we can expand the mean of each subposterior at the corresponding MLE, \(\hat{\varvec{\theta }}^{(j)}\), as follows:

$$\begin{aligned} \varvec{\mu }^{(j)} = E_{\tilde{\pi }_j} (\varvec{\theta }) = \hat{\varvec{\theta }}^{(j)}+\frac{\hat{I}^{(j)-1}}{n} \left[ \frac{\partial \log g({\varvec{\theta }})}{\partial {\varvec{\theta }}}\Big |_{\hat{{\varvec{\theta }}}^{(j)}} -\frac{1}{2}\hat{H}^{(j)}\hat{I}^{(j)-1} \right] +O(n^{-2}), \end{aligned}$$

where \(\tilde{\pi }_j=\tilde{\pi }({\varvec{\theta }}|{\varvec{X}}_{[j]})\), \(\hat{I}^{(j)}=-\frac{1}{m} \frac{\partial ^{2} \log f({\varvec{X}}_{[j]}|\varvec{\theta })}{\partial \varvec{\theta }\partial \varvec{\theta }^{T}}|_{\varvec{\theta }=\hat{\varvec{\theta }}^{(j)}}\), \(\hat{H}^{(j)}=-\frac{1}{m}\frac{\partial ^{3} \log f({\varvec{X}}_{[j]}|\varvec{\theta })}{\partial \varvec{\theta } \partial \varvec{\theta }^{T}\partial \varvec{\theta }}|_{\varvec{\theta }=\hat{\varvec{\theta }}^{(j)}}\), and \(\hat{H}^{(j)}\hat{I}^{(j)-1}\) is a vector whose rth element equals \(\sum _{st}\hat{H}_{rst}^{(j)}\hat{I}_{st}^{(j)-1}\). To simplify the notation, we denote \(\hat{I}^{(j)-1}[\partial \log g({\varvec{\theta }})/\partial {\varvec{\theta }}|_{\hat{{\varvec{\theta }}}^{(j)}} -\frac{1}{2}\hat{H}^{(j)}\hat{I}^{(j)-1}]\) by \(\varvec{\nu }^{(j)}\). Although A1 and A2 are stated for the Laplace approximation of the full posterior, they also imply the conditions guaranteeing the Laplace approximation of each subposterior, as long as m goes to \(\infty \) with n.
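As a quick sanity check of this expansion (an illustration added here, not part of the original proof), consider a scalar Gaussian model \(X_{ji}\sim N(\theta ,\sigma ^{2})\) with \(\sigma ^{2}\) known and a flat prior \(g(\theta )\propto 1\). Then

$$\begin{aligned} \log f({\varvec{X}}_{[j]}|\theta ) = -\frac{1}{2\sigma ^{2}}\sum _{i=1}^{m}(X_{ji}-\theta )^{2}+\mathrm{const}, \qquad \frac{\partial \log g(\theta )}{\partial \theta }=0, \qquad \hat{H}^{(j)}=0, \end{aligned}$$

since the log-likelihood is quadratic in \(\theta \). Hence \(\varvec{\nu }^{(j)}=0\) and \(\varvec{\mu }^{(j)}=\hat{\theta }^{(j)}=\bar{X}_{[j]}\) exactly: the subposterior is Gaussian and centered at the subset MLE, in agreement with the expansion above.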

Moreover, for each \(\hat{\varvec{\theta }}^{(j)}\), we have

$$\begin{aligned} \hat{\varvec{\theta }}^{(j)}=\varvec{\theta }^{*}+\frac{\varvec{\xi }^{(j)}}{\sqrt{m}}+O_{p}(m^{-1}), \end{aligned}$$

where

$$\begin{aligned} \varvec{\xi }^{(j)}= & {} \frac{1}{\sqrt{m}}I^{-1}\sum _{i=1}^{m}\frac{\partial \log f(X_{ji}|\varvec{\theta }^{*})}{\partial \varvec{\theta }},\\ I= & {} -E_{{\varvec{X}}|\varvec{\theta }^{*}}\frac{\partial ^2 \log f({\varvec{X}}|\varvec{\theta }^{*}) }{\partial \varvec{\theta } \partial \varvec{\theta }^{T} }. \end{aligned}$$

Therefore, the mean of the mixture distribution \(\tilde{\pi }(\varvec{\theta }|{\varvec{X}})\) is given by

$$\begin{aligned} E_{\tilde{\pi }} (\varvec{\theta }) = \frac{1}{k}\sum _{j=1}^{k}\varvec{\mu }^{(j)} = \varvec{\theta }^{*}+\frac{1}{k}\sum _{j=1}^{k}\frac{\varvec{\nu }^{(j)}}{n}+\frac{1}{k}\sum _{j=1}^{k} \frac{\varvec{\xi }^{(j)}}{\sqrt{m}} +O_{p}(m^{-1})+O(n^{-2}). \end{aligned}$$

Applying a similar expansion to the full data posterior \(\pi (\varvec{\theta }|{\varvec{X}})\), we obtain

$$\begin{aligned} E_{\pi }(\varvec{\theta }) = \varvec{\theta }^{*}+\frac{\varvec{\nu }}{n} +\frac{\varvec{\xi }}{\sqrt{n}}+O_{p}(n^{-1})+O(n^{-2}), \end{aligned}$$

where \(\varvec{\nu }=\hat{I}^{-1}[\partial \log g({\varvec{\theta }})/\partial {\varvec{\theta }}|_{\hat{{\varvec{\theta }}}} -\frac{1}{2}\hat{H}\hat{I}^{-1}]\), \(\varvec{\xi }=\frac{1}{\sqrt{n}}I^{-1}\sum _{i=1}^{n}\frac{\partial \log f(X_{i}|\varvec{\theta }^{*})}{\partial \varvec{\theta }}\), \(\hat{I}=-\frac{1}{n} \frac{\partial ^{2} \log f({\varvec{X}}|\varvec{\theta })}{\partial \varvec{\theta }\partial \varvec{\theta }^{T}}|_{\varvec{\theta }=\hat{\varvec{\theta }}}\), \(\hat{H}=-\frac{1}{n}\frac{\partial ^{3} \log f({\varvec{X}}|\varvec{\theta })}{\partial \varvec{\theta } \partial \varvec{\theta }^{T}\partial \varvec{\theta }}|_{\varvec{\theta }=\hat{\varvec{\theta }}}\), and \(\hat{\varvec{\theta }}\) denotes the MLE of \(\varvec{\theta }\) calculated from the full dataset. By noting that \(\frac{\varvec{\xi }}{\sqrt{n}}=\frac{1}{k}\sum _{j=1}^{k}\frac{\varvec{\xi }^{(j)}}{\sqrt{m}}\), that \(\varvec{\nu }^{(j)}\) and \(\varvec{\nu }\) are O(1), and that \(m<n\), Eq. (4) in Theorem 1 is thereby verified:

$$\begin{aligned} E_{\tilde{\pi }} (\varvec{\theta })-E_{\pi } (\varvec{\theta }) &= \frac{1}{k}\sum _{j=1}^{k}\frac{\varvec{\nu }^{(j)}}{n} -\frac{\varvec{\nu }}{n} +O_{p}(m^{-1}) +O_{p}(n^{-1})+O(n^{-2}) \\ &= O(n^{-1})+O_{p}(m^{-1})+O(n^{-1}) +O_{p}(n^{-1})+O(n^{-2}) \\ &= O_{p}(m^{-1}). \end{aligned}$$
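To make this \(O_{p}(m^{-1})\) bound concrete, the following numerical illustration (added here, not from the paper) uses an exponential likelihood with a Gamma prior on the rate, with each subposterior built from the subset likelihood raised to the power k (an assumption of this sketch), so that every posterior mean has a closed form; the gap between the averaged subposterior mean and the full-data posterior mean shrinks as the subset size m grows.

```python
# Numerical illustration of Eq. (4) (added here, not from the paper):
# exponential data with a Gamma(a, b) prior on the rate lambda.  Each
# subposterior uses the subset likelihood raised to the power k (an
# assumption of this sketch), so every posterior mean has a closed form.
import numpy as np

def mean_gap(n, k, a=2.0, b=2.0, seed=0):
    """|averaged subposterior mean - full-data posterior mean| for m = n/k."""
    rng = np.random.default_rng(seed)
    x = rng.exponential(scale=1.0, size=n)            # true rate lambda = 1
    m = n // k
    subsets = x[: m * k].reshape(k, m)
    # Subposterior j is Gamma(a + k*m, b + k*sum_j); its mean is closed form.
    sub_means = (a + k * m) / (b + k * subsets.sum(axis=1))
    full_mean = (a + n) / (b + x.sum())               # full posterior mean
    return abs(sub_means.mean() - full_mean)

n = 100000
for k in (10, 100):   # larger k means smaller subsets (smaller m)
    print(f"k = {k:3d}, m = {n // k:6d}, |mean gap| = {mean_gap(n, k):.2e}")
```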

The variance of each subposterior can be approximated as follows:

$$\begin{aligned} \mathrm{Var}_{\tilde{\pi }_{j}}(\varvec{\theta })=\frac{\hat{I}^{(j)-1}}{n}+O(n^{-2}). \end{aligned}$$

Therefore, the variance of the mixture distribution \(\tilde{\pi }(\varvec{\theta }|{\varvec{X}})\) is given by

$$\begin{aligned} \mathrm{Var}_{\tilde{\pi }}(\varvec{\theta })=\frac{1}{k}\sum _{j=1}^{k}\mathrm{Var}_{\tilde{\pi }_{j}}(\varvec{\theta }) =\frac{1}{k}\sum _{j=1}^{k}\frac{\hat{I}^{(j)-1}}{n}+O(n^{-2}). \end{aligned}$$

In addition, we have

$$\begin{aligned} \mathrm{Var}_{\pi }(\varvec{\theta })=\frac{\hat{I}^{-1}}{n}+O(n^{-2}). \end{aligned}$$

Since \(\hat{I}^{(j)}=I+o_{p}(1)\) and \(\hat{I}=I+o_{p}(1)\), we further have \(\hat{I}^{(j)-1}=I^{-1}+o_{p}(1)\) and \(\hat{I}^{-1}=I^{-1}+o_{p}(1)\). Equation (5) in Theorem 1 is thereby verified:

$$\begin{aligned} \mathrm{Var}_{\tilde{\pi }}(\varvec{\theta })-\mathrm{Var}_{\pi }(\varvec{\theta }) &= \frac{1}{k}\sum _{j=1}^{k}\frac{\hat{I}^{(j)-1}}{n}-\frac{\hat{I}^{-1}}{n}+O(n^{-2}) \\ &= \frac{I^{-1}}{n}+o_{p}(n^{-1})-\frac{I^{-1}}{n} +o_{p}(n^{-1})+O(n^{-2}) \\ &= o_{p}(n^{-1}). \end{aligned}$$

Finally, consider the square of the Wasserstein distance of order 2. Since the only coupling between a distribution and a point mass is their product, the distance to \(\delta _{{\varvec{\theta }}^{*}}\) reduces to the expected squared Euclidean distance from \(\varvec{\theta }^{*}\), and we have

$$\begin{aligned} &\left| d^{2}(\tilde{\pi }(\varvec{\theta }|{\varvec{X}}),\delta _{{\varvec{\theta }}^{*}})-d^{2}(\pi (\varvec{\theta }|{\varvec{X}}),\delta _{{\varvec{\theta }}^{*}})\right| \\ &\quad =\left| \int _{\Theta }\Vert \varvec{\theta }- \varvec{\theta }^{*}\Vert _{2}^{2}\,\tilde{\pi }(\varvec{\theta }|{\varvec{X}})\,\mathrm{d}\varvec{\theta } -\int _{\Theta }\Vert \varvec{\theta }- \varvec{\theta }^{*}\Vert _{2}^{2}\,\pi (\varvec{\theta }|{\varvec{X}})\,\mathrm{d}\varvec{\theta }\right| \\ &\quad =\left| \Vert E_{\tilde{\pi }}(\varvec{\theta })-\varvec{\theta }^{*}\Vert _2^2+\mathrm{tr}\left( \mathrm{Var}_{\tilde{\pi }}(\varvec{\theta }) \right) -\Vert E_{\pi }(\varvec{\theta })-\varvec{\theta }^{*}\Vert _2^2 -\mathrm{tr}\left( \mathrm{Var}_{\pi }(\varvec{\theta }) \right) \right| \\ &\quad \le \left| \Vert E_{\tilde{\pi }}(\varvec{\theta })-E_{\pi }(\varvec{\theta })+E_{\pi }(\varvec{\theta })-\varvec{\theta }^{*}\Vert _2^2 -\Vert E_{\pi }(\varvec{\theta })-\varvec{\theta }^{*}\Vert _2^2\right| +\left| \mathrm{tr}\left( \mathrm{Var}_{\tilde{\pi }}(\varvec{\theta }) -\mathrm{Var}_{\pi }(\varvec{\theta }) \right) \right| \\ &\quad \le \Vert E_{\tilde{\pi }}(\varvec{\theta })-E_{\pi }(\varvec{\theta })\Vert _2^2+2\Vert E_{\tilde{\pi }}(\varvec{\theta })-E_{\pi }(\varvec{\theta })\Vert _{2}\,\Vert E_{\pi }(\varvec{\theta })-\varvec{\theta }^{*}\Vert _{2}+o_{p}(n^{-1}) \\ &\quad =O_{p}(m^{-2})+2\,O_{p}(m^{-1})\,O_{p}(n^{-1/2})+o_{p}(n^{-1}) \\ &\quad =o_{p}(n^{-1}), \end{aligned}$$

where the last equality follows from the fact that \(n=o(m^{2})\), which implies \(O_{p}(m^{-2})=o_{p}(n^{-1})\) and \(O_{p}(m^{-1})O_{p}(n^{-1/2})=o_{p}(n^{-1})\). This verifies the Wasserstein distance bound in Theorem 1 and completes the proof.
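In practice, the same bias-plus-trace decomposition used in the third line above, \(d^{2}(\pi ,\delta _{\varvec{\theta }^{*}})=\Vert E_{\pi }(\varvec{\theta })-\varvec{\theta }^{*}\Vert _2^2+\mathrm{tr}(\mathrm{Var}_{\pi }(\varvec{\theta }))\), also gives a simple way to estimate the squared distance from posterior draws; a minimal sketch is given below, with a hypothetical `draws` array standing in for actual posterior samples.

```python
# Estimate the squared Wasserstein-2 distance between a posterior and a point
# mass at theta_star from posterior draws, using the bias-plus-trace
# decomposition d^2 = ||E(theta) - theta_star||^2 + tr(Var(theta)).
import numpy as np

def w2_squared_to_point(draws, theta_star):
    """draws: (num_samples, dim) array of posterior draws."""
    mean = draws.mean(axis=0)
    cov = np.cov(draws, rowvar=False)
    # np.cov returns a 0-d array for dim == 1; np.trace needs a 2-D array.
    trace = float(cov) if np.ndim(cov) == 0 else float(np.trace(cov))
    return float(np.sum((mean - theta_star) ** 2)) + trace

# Hypothetical usage with pooled draws such as those from the sketch after
# the abstract:
# print(w2_squared_to_point(pooled[:, None], np.array([1.0])))
```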


Cite this article

Xue, J., Liang, F. Double-Parallel Monte Carlo for Bayesian analysis of big data. Stat Comput 29, 23–32 (2019). https://doi.org/10.1007/s11222-017-9791-1
