Duplication count distributions in DNA sequences.


We study quantitative features of complex repetitive DNA in several genomes by studying sequences that are sufficiently long that they are unlikely to have repeated by chance. For each genome we study, we determine the number of identical copies, the "duplication count," of each sequence of length 40, that is of each "40-mer." We say a 40-mer is "repeated" if its duplication count is at least 2. We focus mainly on "complex" 40-mers, those without short internal repetitions. We find that we can classify most of the complex repeated 40-mers into two categories: one category has its copies clustered closely together on one chromosome, the other has its copies distributed widely across multiple chromosomes. For each genome and each of the categories above, we compute N(c), the number of 40-mers that have duplication count c, for each integer c. In each case, we observe a power-law-like decay in N(c) as c increases from 3 to 50 or higher. In particular, we find that N(c) decays much more slowly than would be predicted by evolutionary models where each 40-mer is equally likely to be duplicated. We also analyze an evolutionary model that does reflect the slow decay of N(c).

MIDAS Network Members