Culturomics as a data playground for tests of selection: Mathematical approaches to detecting selection in word use.


In biological evolution, traits may rise and fall in frequency through genetic drift, where variant frequencies change by chance, or through selection, where advantageous variants rise in frequency. The neutral model of evolution, first developed by Kimura in the 1960s, has become the standard against which selection is detected. While the balance between these two forces, drift and selection, has been well established in biology, there are other domains where the relative contribution of each process is still being worked out. Although the idea of natural selection has been applied to the cultural domain since the time of Darwin, it has proven more challenging to positively identify cultural traits under selection, both because of a lack of established tests for selection and because of a lack of large cultural data sets. In recent years, however, as large cultural data sets have accumulated, many cultural features, from prehistoric pottery to modern baby names, have been shown to evolve according to the neutral theory. At the same time, accumulating empirical evidence from cultural processes suggests that the neutral theory alone cannot account for all features of the data. This has renewed interest in determining whether there is selection amidst drift. Here we analyze a subset of English word frequencies and determine whether frequency change reveals processes of selection. Inspired by the Moran and Wright-Fisher models in population genetics, we developed a neutral model of word frequency variation to assess when linguistic data appear to depart from neutral evolution. As such, our model represents a possible "test for selection" in the linguistic domain. We explore how the distribution of word use has changed for sets of English words over more than 100 years (1901-2008), as expressed in vocabulary usage in published books made available by Google Ngram.
When we compare empirical word frequency changes to our neutral model, we find pervasive and systematic departures from neutrality.
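
To make the idea of a drift-only baseline concrete, the following is a minimal sketch of a Wright-Fisher-style neutral model for word counts. It is not the authors' model, whose details are not given here. It assumes a fixed number of word tokens per generation and no new words entering the vocabulary, and the function names (`wright_fisher_step`, `simulate_neutral`, `neutral_change_interval`) and all parameter values are hypothetical.

```python
import random
from collections import Counter


def wright_fisher_step(counts, rng):
    """Resample one generation of tokens in proportion to current counts.

    Assumption: the total number of tokens stays fixed and no new word
    types appear, so frequency changes come from sampling noise (drift) only.
    """
    words = list(counts)
    weights = [counts[w] for w in words]
    n_tokens = sum(weights)
    sampled = Counter(rng.choices(words, weights=weights, k=n_tokens))
    return {w: sampled.get(w, 0) for w in words}


def simulate_neutral(counts, generations, rng):
    """Return the list of count dictionaries over successive generations."""
    trajectory = [dict(counts)]
    for _ in range(generations):
        counts = wright_fisher_step(counts, rng)
        trajectory.append(counts)
    return trajectory


def neutral_change_interval(counts, word, generations,
                            replicates=200, alpha=0.05, seed=0):
    """Empirical (1 - alpha) interval for a word's final frequency under drift."""
    rng = random.Random(seed)
    n_tokens = sum(counts.values())
    finals = sorted(
        simulate_neutral(counts, generations, rng)[-1][word] / n_tokens
        for _ in range(replicates)
    )
    lower = finals[int((alpha / 2) * (replicates - 1))]
    upper = finals[int((1 - alpha / 2) * (replicates - 1))]
    return lower, upper
```

Under these assumptions, an observed final frequency that falls outside the simulated interval would be a candidate departure from neutrality for that word. A real analysis would also need to handle changing corpus size and vocabulary turnover, which this sketch ignores.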
