Efficient n-gram analysis in R with cmscu.


We present a new R package, cmscu, which implements a Count-Min-Sketch with conservative updating (Cormode and Muthukrishnan, Journal of Algorithms, 55(1), 58-75, 2005) and applies it to n-gram analyses (Goyal et al. 2012). By writing the core implementation in C++ and exposing it to R via Rcpp, we are able to provide a memory-efficient, high-throughput, and easy-to-use library. As a proof of concept, we implemented the modified Kneser-Ney n-gram smoothing algorithm, which is computationally challenging (Heafield et al. 2013), using cmscu as the querying engine. We then explore information density measures (Jaeger, Cognitive Psychology, 61(1), 23-62, 2010) computed from n-gram frequencies (for n = 2, 3) derived from a corpus of over 2.2 million reviews in a dataset provided by Yelp, Inc. We demonstrate that these text data are at a scale beyond the reach of other more common, more general-purpose libraries available through CRAN. Using the cmscu library and the smoothing implementation, we find a positive relationship between review information density and reader review ratings. We end by highlighting the importance of new, efficient tools for exploring behavioral phenomena in large, relatively noisy data sets.
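
To make the central data structure concrete, the following is a minimal sketch of a Count-Min-Sketch with conservative updating applied to bigram counting. It is written in Python for illustration only and is not the cmscu implementation or its API: the class name `CountMinSketchCU`, the helper `ngrams`, the table dimensions, and the choice of salted BLAKE2b hashing are all assumptions made for this example.

```python
import hashlib


class CountMinSketchCU:
    """Illustrative Count-Min-Sketch with conservative updating.

    Hypothetical class for exposition; not the cmscu API.
    """

    def __init__(self, width=2**16, depth=4):
        self.width = width    # number of counters per row
        self.depth = depth    # number of hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One (row, column) cell per row, using a row-specific salt so
        # that each row behaves like an independent hash function.
        data = key.encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def query(self, key):
        # The estimate is the minimum over the key's cells; collisions
        # can only inflate a cell, so this never underestimates.
        return min(self.table[r][c] for r, c in self._cells(key))

    def update(self, key, count=1):
        # Conservative updating: raise each cell only as far as the new
        # estimate (old minimum + count), instead of adding to every cell.
        cells = list(self._cells(key))
        target = min(self.table[r][c] for r, c in cells) + count
        for r, c in cells:
            if self.table[r][c] < target:
                self.table[r][c] = target


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sketch = CountMinSketchCU()
tokens = "the food was great and the food was cheap".split()
for gram in ngrams(tokens, 2):
    sketch.update(gram)

# "the food" occurs twice, so the estimate is at least 2.
print(sketch.query("the food"))
```

Memory use is fixed by `width * depth` regardless of how many distinct n-grams are seen, which is the property that makes this kind of structure attractive for corpora of the size described above. The price is that counts are approximate and can be overestimated when keys collide in every row.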
