
Virus Classification Based On Alignment-Free Methods

Files in this item

File: Zheng_Hui.pdf (2MB)
Description: (no description provided)
Format: PDF
Title: Virus Classification Based On Alignment-Free Methods
Author(s): Zheng, Hui
Advisor(s): Yau, Stephen S.
Contributor(s): Yang, Jie; Nicholls, David; He, Rong L.; Jia, Lixing
Department / Program: Mathematics, Statistics, and Computer Science
Graduate Major: Mathematics
Degree Granting Institution: University of Illinois at Chicago
Degree: PhD, Doctor of Philosophy
Genre: Doctoral
Subject(s): Virus classification; Alignment; Ebolavirus; Vector; Virus database
Abstract: We compared the advantages and disadvantages of alignment-based and alignment-free sequence analysis methods. We analyzed and classified the reference sequences of all single-segmented viruses by the natural vector method. Natural graphs for each Baltimore group are displayed; they show that different families and genera are clearly separated. We derived the distance matrix for multi-segmented viruses using the Hausdorff distance and natural vectors. West Nile virus and Influenza viruses were included in the dataset, and the natural vector method classified them into the correct family and genus. Building on previous work, we applied natural vectors to the Ebola viruses of the 2014 outbreak. The accuracy rates for family and genus label classification are as high as 100%. We also display the phylogenetic relationships between species of EBOV based on their whole genome sequences and 7 proteins (Nucleoprotein (NP), VP35, VP40, Glycoprotein (GP), VP30, VP24, and RNA polymerase (L)). The phylogenetic trees indicate that VP24 is the most consistent with the variation in virulence, suggesting that VP24 is a pharmaceutical target for treating or preventing Ebola virus infection. Based on a Markov model, we proposed a new alignment-free sequence analysis method, the Q-vector. It keeps the sequence length information and reflects the relation between lower-order and higher-order k-mers. When the Q-vector, the k-mer method, and the composition vector were applied to classify the viruses' reference sequences, the Q-vector showed clear advantages in both effectiveness and accuracy. By combining the distance matrices derived through the Q-vector and the natural vector method, we defined a new distance matrix that minimizes the classification error. Based on this new distance, we display the phylogenetic trees.
We built a virus database called VirusDB (with separate sites for users in the USA and for users in China) and an online system to serve those who are interested in virus classification and prediction based on the natural vector method. The database stores the nucleotide sequences, natural vectors, and classification information of the single-segmented and multi-segmented reference viruses, which were downloaded from NCBI. The online inquiry system computes natural vectors and the distances between sequences, provides backend processes for automatic and manual updating of the database content to keep it synchronized with GenBank, and provides an online interface for accessing and using the database for classification and prediction.
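The abstract refers to computing natural vectors and distances between sequences, including a Hausdorff distance for multi-segmented viruses. A minimal sketch of how such a computation might look follows. It assumes the 12-dimensional form of the natural vector (count, mean position, and normalized second central moment for each of A, C, G, T) as commonly described for this method, with Euclidean distance between vectors; the function names `natural_vector` and `hausdorff` are hypothetical and are not taken from the thesis or from VirusDB.

```python
import math
from typing import List, Sequence

NUCLEOTIDES = "ACGT"


def natural_vector(seq: str) -> List[float]:
    """Return an assumed 12-dimensional natural vector for a nucleotide sequence.

    Layout: counts for A, C, G, T, then mean positions (1-indexed), then
    second central moments normalized by (n_k * n). Characters other than
    A, C, G, T are ignored for counting but still occupy a position.
    """
    seq = seq.upper()
    n = len(seq)
    counts: List[float] = []
    means: List[float] = []
    moments: List[float] = []
    for base in NUCLEOTIDES:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Convention assumed here: an absent nucleotide contributes zeros.
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def hausdorff(set_a: Sequence[Sequence[float]],
              set_b: Sequence[Sequence[float]]) -> float:
    """Hausdorff distance between two non-empty sets of vectors (Euclidean metric)."""
    def directed(xs, ys):
        # Largest distance from a point in xs to its nearest point in ys.
        return max(min(math.dist(x, y) for y in ys) for x in xs)
    return max(directed(set_a, set_b), directed(set_b, set_a))


if __name__ == "__main__":
    # Two toy "multi-segmented genomes", each a list of segment sequences.
    genome_a = [natural_vector(s) for s in ("ACGTACGT", "GGCCAATT")]
    genome_b = [natural_vector(s) for s in ("ACGTACGA", "TTAACCGG", "ACGT")]
    print(hausdorff(genome_a, genome_b))
```

Under these assumptions, a single-segmented genome maps to one vector and two genomes are compared by the distance between their vectors, while a multi-segmented genome maps to a set of vectors, one per segment, and two such sets are compared with the Hausdorff distance.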
Issue Date: 2015-10-21
Genre: thesis
Rights Information: Copyright 2015 Hui Zheng
Date Available in INDIGO: 2017-10-22
Date Deposited: 2015-08
