High-throughput next-generation sequencing (NGS) technologies generate enormous amounts of fragmented genome sequence, revolutionizing genetics and genomics research. Thousands of individual genomes, as well as metagenomes consisting of natural mixtures of organisms from various environments, have been sequenced using NGS. These developments play essential roles in understanding the genetic basis of complex diseases, the effects of the environment on public health, the impacts of environmental changes such as global warming and pollution, and the detection of pathogens including viruses. Developing analytical methods that make full use of NGS data is essential to advancing public health, improving the environment, and strengthening national security. Although significant progress has been made in the analysis of NGS data, wide gaps remain between the currently available analytical tools and the full potential of such analysis. This research project aims to further advance recently developed statistical and computational methods for comparing genomes and metagenomes directly from NGS reads, without assembling the reads into genomes, thereby avoiding many of the pitfalls that make assembly problematic. The research will make the computational tools more efficient and powerful and will employ them to analyze metagenomic data to study the effects of environmental factors on marine microbial communities. Both the algorithms and the results will be disseminated through the web. The results of this study will be important for genomics and metagenomics studies across a variety of environments.

In more detail, statistical and computational methods for inferring the Markovian properties of molecular sequences from NGS short reads will be developed, and the methods will then be used to study the relationships among individual genomes and metagenomic samples.
First, methods will be developed to estimate the order and the transition probability matrix of a Markov chain (MC) from NGS reads, together with the asymptotic distributions of these estimates. Methods to infer variable-length Markov chains (VLMCs) will also be developed. Second, new alignment-free statistics that take into account the MC properties of the sequences will be developed to study the relationships among genome sequences, along with iterative approaches for choosing the word length. Third, MC models derived from NGS reads will be used to identify species or strains in metagenomic communities and to compare metagenomic samples. Finally, a suite of computer algorithms for inferring MCs from NGS reads, with applications to genome and metagenome data analysis, will be developed. The broader impacts of the project include computational tools for genome and metagenome comparison based on NGS data, together with software packages for public use; graduate and undergraduate training across the disciplines of statistics and biology; and outreach lectures for K-12 teachers and students.
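
To illustrate the kind of estimate involved, the transition probability matrix of a fixed-order Markov chain can be approximated by counting words of length order + 1 across the reads and normalizing within each context. The following is a minimal sketch under stated assumptions, not the project's method: the function names (`count_words`, `transition_matrix`) and the toy reads are hypothetical, reads are treated as single-stranded and independent, and words containing symbols other than A, C, G, T are skipped.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def count_words(reads, k):
    """Count every length-k word in the reads (hypothetical helper).

    Words containing symbols outside ACGT (for example N) are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if all(c in ALPHABET for c in word):
                counts[word] += 1
    return counts


def transition_matrix(reads, order):
    """Plug-in estimate of P(next base | previous `order` bases).

    Returns a dict mapping each observed context to a dict of
    next-base probabilities. Contexts never seen in the reads are omitted.
    """
    counts = count_words(reads, order + 1)
    matrix = {}
    for context in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[context + b] for b in ALPHABET)
        if total > 0:
            matrix[context] = {
                b: counts[context + b] / total for b in ALPHABET
            }
    return matrix


# Toy reads, invented purely for illustration.
reads = ["ACGTAC", "GTACGT", "TACGTA"]
P1 = transition_matrix(reads, order=1)  # first-order chain
P0 = transition_matrix(reads, order=0)  # order 0: base frequencies
```

Here `P1["A"]` holds the estimated distribution of the base that follows an A, and `P0[""]` holds the overall base frequencies. Real read data would also require handling of both strands and uneven coverage, which this sketch ignores.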

