Close

Using spatiotemporal models to generate synthetic data for public use.

Abstract

When agencies release public-use data, they must be cognizant of the potential risk of disclosure associated with making their data publicly available. This issue is particularly pertinent in disease mapping, where small counts pose both inferential challenges and potential disclosure risks. While the small area estimation, disease mapping, and statistical disclosure limitation literatures are individually robust, there have been few intersections between them. Here, we formally propose the use of spatiotemporal data analysis methods to generate synthetic data for public use. Specifically, we analyze ten years of county-level heart disease death counts for multiple age-groups using a Bayesian model that accounts for dependence spatially, temporally, and between age-groups; generating synthetic data from the resulting posterior predictive distribution will preserve these dependencies. After demonstrating the synthetic data's privacy-preserving features, we illustrate their utility by comparing estimates of urban/rural disparities from the synthetic data to those from data with small counts suppressed.

MIDAS Network Members

Citation: