Similarity Score Estimation and Gaps Trimming of Multiple Sequence Alignment for Phylogenetic Tree Analysis
Abstract
Phylogenetic tree analysis is the process of inferring the most likely evolutionary tree history of an organism of interest. An important step in the process is multiple sequence alignment (MSA), which can be performed with any MSA tool that produces its result as blocks in the PHYLIP format. Bioinformaticians currently have to inspect the MSA blocks and trim their gaps manually, using the relevant tools of a software package in off-line mode, and must cut and paste the data blocks between these tools. These working steps tend to be error-prone and time-consuming. In addition, selecting a tree-inference algorithm without consulting an MSA similarity score tends to produce a phylogenetic tree with low accuracy and to take much more time. In this work, we present a new practical approach to phylogenetic tree analysis that applies our enhancements for similarity score estimation and gap trimming of MSA blocks. We propose in-silico algorithms that automate similarity score estimation and gap trimming, and we deploy them as web services. We demonstrate the use of the web services by composing them into an integrated stateful WSDL workflow. Our case study datasets are complete coding sequences (CDS) and sets of complete genomes of Dengue virus 2, fetched from the NCBI RefSeq nucleotide database. Our proposed algorithms returned correct results, which were verified and accepted by our bioinformaticians. Our distributions, the user manuals and endpoints of the web services, and the open source programs are available at https://bioservices.sci.psu.ac.th.
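The abstract does not spell out the proposed algorithms, so the following is only a minimal illustrative sketch of the two operations it names, not the authors' method. It assumes the alignment is already parsed into a mapping from sequence name to aligned string, that gaps are written as "-", that the similarity score is the mean pairwise identity, and that trimming removes columns whose gap fraction exceeds a threshold. The function names and the max_gap_fraction parameter are hypothetical.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol


def trim_gap_columns(alignment, max_gap_fraction=0.0):
    """Drop every alignment column whose gap fraction exceeds the threshold.

    `alignment` maps sequence names to equal-length aligned strings.
    """
    if not alignment:
        return {}
    names = list(alignment)
    rows = [alignment[name] for name in names]
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*rows) iterates over columns; keep those with few enough gaps.
    kept = [
        col for col in zip(*rows)
        if col.count(GAP) / len(col) <= max_gap_fraction
    ]
    return {
        name: "".join(col[i] for col in kept)
        for i, name in enumerate(names)
    }


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither has a gap."""
    compared = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def similarity_score(alignment):
    """Mean pairwise identity over all pairs of aligned sequences."""
    pairs = list(combinations(alignment.values(), 2))
    if not pairs:
        return 1.0  # a single sequence is trivially identical to itself
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    msa = {"s1": "AC-GT", "s2": "ACTGT", "s3": "A--GA"}
    trimmed = trim_gap_columns(msa)
    print(trimmed)
    print(round(similarity_score(trimmed), 4))
```

In a workflow such as the one described, a score of this kind could be computed before tree inference to inform the choice of algorithm, with the trimmed block passed on to the tree-building step. The exact scoring and trimming rules used by the published web services may differ.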