Published August 30, 2017 | Version v1
Dataset Open

Supplementary Data: UFBoot2: Improving the Ultrafast Bootstrap Approximation

  • 1. University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
  • 2. Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University Vienna, Vienna, Austria

Description

Supplementary Data
UFBoot2: Improving the Ultrafast Bootstrap Approximation
doi: https://doi.org/10.1101/153916
http://www.biorxiv.org/content/early/2017/06/22/153916

This record contains the PANDIT-based dataset and the TreeBASE dataset (Nguyen et al. 2015), which were analyzed with different bootstrap methods in the study "UFBoot2: Improving the Ultrafast Bootstrap Approximation". The PANDIT-based dataset (compressed in the file data_pandit.tar.gz) is used to benchmark the accuracy of bootstrap estimates. The TreeBASE dataset (compressed in the file data_treebase.tar.gz) is used to benchmark runtimes.

After being uncompressed, the PANDIT-based dataset comprises:

  • 5,690 numbered directories corresponding to 5,690 DNA MSAs simulated by Seq-Gen (Rambaut and Grassly 1997), where the model parameters and the true tree were inferred from the original MSAs downloaded from the PANDIT database (Whelan et al. 2006). Note that the numbering of these directories is not consecutive, because we kept only MSAs that can be tested under the mild and severe model violations as defined in the UFBoot paper (Minh et al. 2013).
  • Each numbered directory N contains three files: (1) data.N, the simulated MSA in PHYLIP format; (2) model.N, the best-fit model detected from the corresponding original MSA; (3) tree.N, the tree (in Newick format) inferred from the corresponding original MSA. Seq-Gen used tree.N and model.N to simulate the MSA in data.N.
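The directory layout described above can be walked with a few lines of Python. This is only a minimal sketch: the function name `iter_pandit_directories` and the `root` argument (the folder that holds the numbered directories after extraction) are assumptions for illustration, since this record does not state the name of the top-level folder inside the archive.

```python
from pathlib import Path
from typing import Dict, Iterator, Tuple


def iter_pandit_directories(root: Path) -> Iterator[Tuple[int, Dict[str, Path]]]:
    """Yield (N, paths) for every numbered directory under `root`.

    `paths` maps "data", "model" and "tree" to data.N, model.N and tree.N.
    `root` is an assumption: pass whatever folder contains the numbered
    directories once the archive has been uncompressed.
    """
    numbered = sorted(
        (p for p in Path(root).iterdir() if p.is_dir() and p.name.isdecimal()),
        key=lambda p: int(p.name),  # numeric order; the numbering has gaps
    )
    for directory in numbered:
        name = directory.name
        paths = {kind: directory / f"{kind}.{name}" for kind in ("data", "model", "tree")}
        missing = [str(path) for path in paths.values() if not path.is_file()]
        if missing:
            raise FileNotFoundError(f"directory {name} is missing: {missing}")
        yield int(name), paths
```

Because the directory numbers are not consecutive, the sketch discovers directories by listing `root` instead of counting from 1 to 5,690.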

After being uncompressed, the TreeBASE dataset comprises 115 files corresponding to 115 MSAs. There are:

  • 70 DNA MSAs in PHYLIP format. These files follow the naming scheme dna_[number of sequences]_[number of sites].phy.
  • 45 protein MSAs in PHYLIP format. These files follow the naming scheme prot_[number of sequences]_[number of sites].phy.
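The naming scheme above encodes the data type and the alignment dimensions, so they can be recovered from a file name without opening the file. The following is a small sketch; `MsaInfo` and `parse_treebase_name` are hypothetical helper names, and the file name in the usage note is made up to illustrate the pattern, not taken from the archive.

```python
import re
from typing import NamedTuple


class MsaInfo(NamedTuple):
    data_type: str       # "dna" or "prot"
    num_sequences: int
    num_sites: int


# Pattern for dna_[number of sequences]_[number of sites].phy and the prot_ variant.
_NAME_PATTERN = re.compile(r"(dna|prot)_(\d+)_(\d+)\.phy")


def parse_treebase_name(filename: str) -> MsaInfo:
    """Split a file name that follows the naming scheme into its three parts."""
    match = _NAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the naming scheme")
    data_type, num_sequences, num_sites = match.groups()
    return MsaInfo(data_type, int(num_sequences), int(num_sites))
```

For a hypothetical file named `dna_100_1500.phy`, `parse_treebase_name` returns `MsaInfo(data_type='dna', num_sequences=100, num_sites=1500)`.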

Files

Files (44.0 MB)

  • md5:69b5bba573068d98ab3b578c6d4dbca4 (26.7 MB)
  • md5:78b42861ca561761032a2c0bcf8a9acc (17.3 MB)
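The MD5 checksums listed above can be used to confirm that a downloaded archive is intact. The sketch below is one way to do this with the Python standard library; `md5_of_file` and `matches_listed_checksum` are hypothetical helper names. The listing does not say which checksum belongs to which archive, so the check only tests membership in the set of listed values.

```python
import hashlib
from pathlib import Path
from typing import Union

# Checksums copied from the file listing of this record.
LISTED_MD5 = {
    "69b5bba573068d98ab3b578c6d4dbca4",
    "78b42861ca561761032a2c0bcf8a9acc",
}


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path: Union[str, Path]) -> bool:
    """True if the file's MD5 digest is one of the two listed values."""
    return md5_of_file(path) in LISTED_MD5
```

Reading in chunks keeps memory use low for archives of this size (26.7 MB and 17.3 MB).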

Additional details

Related works

Is supplement to
10.1101/153916 (DOI)

References

  • Minh BQ, Nguyen MAT, von Haeseler A. 2013. Ultrafast approximation for phylogenetic bootstrap. Mol. Biol. Evol. 30:1188–1195.
  • Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32:268–274.
  • Rambaut A, Grassly NC. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics 13:235–238.
  • Whelan S, de Bakker PIW, Quevillon E, Rodriguez N, Goldman N. 2006. PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res. 34:D327–D331.