Vectorization Generalizations in Genomics and Transportation

Bookmark or cite this item: http://hdl.handle.net/10027/10009

Files in this item

File: Hernandez_Troy.pdf (13MB)
Description: (no description provided)
Format: PDF
Title: Vectorization Generalizations in Genomics and Transportation
Author(s): Hernandez, Troy A.
Advisor(s): Yang, Jie
Contributor(s): Yau, Stephen; Wang, Jing; Wang, Junhui; He, Rong
Department / Program: Mathematics, Statistics, and Computer Science
Graduate Major: Mathematics
Degree Granting Institution: University of Illinois at Chicago
Degree: PhD, Doctor of Philosophy
Genre: Doctoral
Subject(s): Statistical Learning Machine Learning Genomics Bioinformatics Virology Transportation Bus Arrival Time Prediction
Abstract: The process of transforming a sample into a pair of input and output vectors is sometimes referred to as "vectorization". Those samples and their respective vectorizations are used within various learning algorithms to create a model that predicts unknown output vectors from known input vectors. This thesis aims to compare, generalize, and improve existing vectorizations within the fields of bioinformatics and transportation. We extend the natural vector description of genomes to handle viruses and various issues unique to viral genomes. We provide an alternative definition of the natural vector that is able to handle ambiguous nucleotides. We provide a bound on the distance induced by the natural vector between a genome and a mutation of that genome due to a single-nucleotide polymorphism. We then present a new family of alignment-free vectorizations. This new alignment-free vectorization uses the frequency of genomic words (k-mers), as is done in the composition vector, and incorporates descriptive statistics of those k-mers' positional information, as inspired by the natural vector. We provide a comparison of 5 popular characterizations of genome similarity using k-nearest neighbor classification, and evaluate these on two collections of viruses. The prediction of bus arrival times is important for users of public transportation. We first generalize existing vectorizations and representations. We then propose a method of recovering the schedule and show, using 3 weeks of Chicago Transit Authority bus data, that the use of this schedule uniformly improves all existing methods. Lastly, we analyze data usage from reporting real-time GPS traces. The problem of tracking a GPS device relies upon predicting vehicle location in general, as opposed to predicting vehicle location on fixed routes as above. We compare 12 different tracking methods on two data sets. We show that at low error tolerances the methods are equivalent, but at higher error tolerances the proposed method is substantially more efficient.
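The idea of combining k-mer frequencies with descriptive statistics of k-mer positions can be illustrated with a minimal Python sketch. This is not the thesis's definition: the function name `kmer_position_vector`, the choice of statistics (count, mean position, and a second central moment scaled by count times sequence length, in the style of the natural vector), and the lack of any handling for ambiguous nucleotides are all assumptions made for illustration.

```python
from collections import defaultdict


def kmer_position_vector(sequence, k=2):
    """Hypothetical sketch: map each k-mer to (count, mean position, scaled variance).

    Positions are 1-based start indices. The variance is divided by
    count * len(sequence), an assumed normalization modeled on the
    natural vector; the thesis may define its statistics differently.
    """
    n = len(sequence)
    positions = defaultdict(list)
    for i in range(n - k + 1):
        positions[sequence[i:i + k]].append(i + 1)

    features = {}
    for word, pos in positions.items():
        count = len(pos)
        mean = sum(pos) / count
        scaled_var = sum((p - mean) ** 2 for p in pos) / (count * n)
        features[word] = (count, mean, scaled_var)
    return features


if __name__ == "__main__":
    # With k=1 this reduces to per-nucleotide statistics.
    for word, stats in sorted(kmer_position_vector("ACGAC", k=1).items()):
        print(word, stats)
```

Under these assumptions, the per-word tuples could be concatenated in a fixed word order to give one vector per genome, which a k-nearest neighbor classifier could then compare with a distance such as the Euclidean distance.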
Issue Date: 2013-06-28
Genre: thesis
URI: http://hdl.handle.net/10027/10009
Rights Information: Copyright 2013 Troy A. Hernandez
Date Available in INDIGO: 2013-06-28
Date Deposited: 2013-05
 

Statistics

Country Code Views
United States of America 369
China 134
Russian Federation 21
Germany 8
Ukraine 8
