University of Virginia
Agencies such as the Centers for Disease Control and Prevention (CDC) currently release incidence data (e.g., for Influenza), along with descriptive summaries of simple spatio-temporal patterns and trends. However, public health researchers, government agencies, and the general public are often interested in deeper patterns and insights into how a disease is spreading, with additional context. Deriving such insights from incidence data currently requires analysis by domain experts.
We apply our methods to find spatio-temporal patterns in the spread of seasonal Influenza in the United States (US), using state-level influenza-like illness (ILI) activity indicator data from the CDC. When approximate descriptions and negative clauses are allowed, the compression ratio exceeds 2.5 for 50% of the chosen sets. Sets with high compression ratios (e.g., over 2.5) correspond to interesting patterns in the spatio-temporal dynamics of ILI. Our approach also outperforms the baselines in terms of compression ratio.
Our goal is to develop an automated approach for finding interesting spatio-temporal patterns in the spread of a disease over a large region. Examples include regions with specific characteristics (e.g., high incidence in a particular week), regions that showed a sudden change in incidence, and regions whose incidence differs significantly from earlier seasons.
Our approach, an unsupervised machine learning method, can provide new insights into the patterns and trends in disease spread in an automated manner. Our results show that description complexity is an effective measure for characterizing sets of interest. The approach can be easily extended to other diseases and regions beyond Influenza in the US, and can be easily adapted for automated generation of narratives.
We develop techniques from the area of transactional data mining for characterizing and finding interesting spatio-temporal patterns in disease spread in an automated manner. A key part of our approach uses the principle of minimum description length (MDL) to represent a given target set in terms of combinations of attributes (referred to as clauses). We consider both positive and negative clauses, as well as relaxed descriptions that represent the set only approximately, and we use integer programming to find such descriptions. Finally, we design an automated approach that examines a large space of sets corresponding to different spatio-temporal patterns and ranks them by the ratio of their size to their description length (referred to as their compression ratio).
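To make the notions of clause, relaxed description, and compression ratio concrete, the following is a minimal Python sketch on a toy example. It is not the authors' implementation: it replaces integer programming with a brute-force search, and it assumes that a clause is a conjunction of attribute=value literals, that a description is a union of positive clauses minus a union of negative clauses, and that description length is the total number of literals. The state attributes, `STATES`, `best_description`, and `compression_ratio` are hypothetical names and data introduced only for illustration.

```python
from itertools import combinations

# Hypothetical toy data (not CDC data): each state has categorical attributes.
STATES = {
    "VA": {"region": "South", "activity": "high"},
    "NC": {"region": "South", "activity": "high"},
    "GA": {"region": "South", "activity": "high"},
    "FL": {"region": "South", "activity": "low"},
    "NY": {"region": "Northeast", "activity": "high"},
    "MA": {"region": "Northeast", "activity": "low"},
}


def candidate_clauses(items, max_len=2):
    """Enumerate conjunctions of up to max_len attribute=value literals."""
    literals = sorted({(a, v) for attrs in items.values() for a, v in attrs.items()})
    clauses = []
    for k in range(1, max_len + 1):
        for combo in combinations(literals, k):
            # Skip clauses that constrain the same attribute twice.
            if len({a for a, _ in combo}) == k:
                clauses.append(combo)
    return clauses


def matches(clause, items):
    """Return the set of items satisfying every literal in the clause."""
    return frozenset(
        name for name, attrs in items.items()
        if all(attrs.get(a) == v for a, v in clause)
    )


def best_description(target, items, max_clauses=2, max_errors=0):
    """Brute-force the shortest description of target.

    A description is (positive clauses, negative clauses); it describes
    union(positive) minus union(negative). max_errors > 0 allows a relaxed
    description that differs from target in at most that many items.
    Returns (length, positive, negative), or None if nothing qualifies.
    """
    target = frozenset(target)
    clauses = candidate_clauses(items)
    cover = {c: matches(c, items) for c in clauses}
    best = None
    for n_pos in range(1, max_clauses + 1):
        for pos in combinations(clauses, n_pos):
            pos_set = frozenset().union(*(cover[c] for c in pos))
            for n_neg in range(0, max_clauses - n_pos + 1):
                for neg in combinations(clauses, n_neg):
                    neg_set = frozenset().union(*(cover[c] for c in neg))
                    described = pos_set - neg_set
                    if len(described ^ target) > max_errors:
                        continue
                    # Assumed cost: total number of literals used.
                    length = sum(len(c) for c in pos + neg)
                    if best is None or length < best[0]:
                        best = (length, pos, neg)
    return best


def compression_ratio(target, description):
    """Size of the target set divided by its description length."""
    return len(target) / description[0]


if __name__ == "__main__":
    target = {"VA", "NC", "GA"}
    exact = best_description(target, STATES, max_errors=0)
    relaxed = best_description(target, STATES, max_errors=1)
    print("exact:", exact, compression_ratio(target, exact))
    print("relaxed:", relaxed, compression_ratio(target, relaxed))
```

In this toy setting, allowing one error lets a single-literal clause stand in for the target set, which raises the compression ratio. Ranking many candidate sets would amount to calling `best_description` on each and sorting by `compression_ratio`; the exhaustive search here only scales to very small inputs.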