Population-based studies identifying common genetic variants that affect complex human diseases have relied heavily on population-genetic principles in important tasks such as study design, quality control, and genotype imputation. As the emphasis of mapping studies has now shifted to investigating rare variants in next-generation sequencing projects, new opportunities exist for leveraging population genetics to maximize the return from these investigations. Because studies thus far have often focused on populations of European descent, it is critical that new methods provide tools to analyze data from a greater diversity of populations. This project builds on productive efforts in the first funding period, proposing methods that capitalize on the study of human population genetics to enhance the design, analysis, and interpretation of genome sequencing studies, and focusing on analysis of rare risk variants in diverse human populations. (1) We will devise methods for selecting subsamples of individuals for genome and exome sequencing, particularly in admixed and structured populations. Such subsamples will make it possible for researchers to maximize their potential for achieving statistical power to detect rare disease variants. (2) We will enhance variant-calling accuracy, particularly in low-coverage data and for challenging indels and copy-number variants, by including in the variant-calling pipeline evidence accumulated from closely related haplotypes in the population. This approach will be particularly beneficial in admixed and genetically diverse populations, in which haplotype variation is especially significant and selecting an informative haplotype subset to assist in variant-calling is of greatest value. (3) We will use population-genetic principles to improve sample quality control in sequencing studies. First, we address the common challenge of sample contamination, which adversely affects variant-calling and downstream analyses.
We will produce a method to estimate the genotypes of the minor contributor of a mixed sample, thus enabling the population of origin of a contaminating signal to be identified. This identification further facilitates variant-calling and permits in silico deconvolution of mixed samples. Second, to enhance the sharing of samples in large projects, we will devise methods to uncover duplicate or related samples from non-overlapping marker sets. Our approach will reduce the risk of expending effort to obtain sequence that will not be fully utilized, and will also assist in making use of historical low-density data in understudied populations. (4) We will incorporate new advances in the study of human population growth and natural selection for evaluating rare-variant tests and identifying powerful testing strategies. Evaluations of current tools often ignore important population-genetic factors such as selection or accelerating growth; our methods will enhance models for analyzing rare-variant testing methods, tailoring them to populations of interest. Throughout the project, we will use multi-population genome sequence data from the TOPMed and InPSYght studies to test our approaches. To facilitate use of our methods, we will produce, test, and distribute new publicly available software programs.
NATIONAL HUMAN GENOME RESEARCH INSTITUTE