Data Mining & Transfer Learning for Modeling COVID

Abstract

As of October 2020, more than 8,000 peer-reviewed transmission models of the ongoing COVID-19 epidemic had been published, according to a recent search of the NIH NCBI LitCovid online database. The predominant modeling approach is to construct a compartmental SEIR-type model, fit it to published case series, and quantify key parameters such as the basic reproduction number (R0). Other approaches include cross-scale modeling, agent-based modeling, and more recent machine learning and deep learning models. However, current modeling efforts face major challenges. First, most mechanistic models, including SEIR-type models, are not flexible enough to incorporate new epidemiological and clinical knowledge without substantially changing the model structure and the estimation of R0. This is especially challenging during the emerging and ongoing COVID-19 pandemic, in which our knowledge is rapidly updated. The original Kermack-McKendrick compartmental model was applied retrospectively, after the epidemic had ended and most epidemiological mechanisms had been revealed. Developing a mechanistic model while the details of the disease are still unclear may therefore lead to incorrect projections and conclusions. Second, there is no holistic view or approach for incorporating behavioral, societal, and cultural factors. An individual host's behavior is not only influenced by the surrounding environment but also influences it. Human hosts' perception of risk, attitude towards the epidemic, trust in government interventions, and willingness to comply with various non-pharmaceutical interventions (NPIs) all shape transmission dynamics. The inability to incorporate these factors makes it extremely difficult to compare and generalize models across socio-cultural backgrounds. The third and most fundamental challenge is the lack of data sources that accurately characterize the epidemic from multiple angles.
This is an under-explored yet critical aspect of modeling efforts. We currently do not have adequate, accurate, and comprehensive data sources for the complex clinical, societal, and epidemiological system of the COVID-19 pandemic. Relying on a few case series can introduce large uncertainty and seriously bias subsequent conclusions. It also hinders efforts to transfer insights effectively from one region to another with a distinct socio-cultural background, to assess uncertainty, and to evaluate the effectiveness of various interventions. To address the challenges of modeling today's complex epidemiological systems, including COVID-19, I propose a novel data mining and annotation pipeline, followed by a data-driven, multivariate deep learning and transfer learning technique.
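
For readers unfamiliar with the compartmental approach discussed above, the following is a minimal sketch of a basic SEIR model in Python. It is not the proposed method and is not taken from any specific published model. The function names, the forward Euler integration, and all parameter values are illustrative assumptions. In this simple form (no births, deaths, or interventions), the basic reproduction number reduces to R0 = beta / gamma.

```python
def simulate_seir(beta, sigma, gamma, population, initial_infectious, days, dt=0.1):
    """Integrate a basic SEIR model with forward Euler steps.

    beta  : transmission rate per day (hypothetical value supplied by caller)
    sigma : rate of leaving the exposed compartment (1 / latent period)
    gamma : recovery rate (1 / infectious period)
    Returns a list of (S, E, I, R) tuples, one per day, starting at day 0.
    """
    s = float(population - initial_infectious)
    e = 0.0
    i = float(initial_infectious)
    r = 0.0
    steps_per_day = round(1 / dt)
    trajectory = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # Flows between compartments over one small time step
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        trajectory.append((s, e, i, r))
    return trajectory


def basic_reproduction_number(beta, gamma):
    """R0 for the basic SEIR model without demography: beta / gamma."""
    return beta / gamma


# Illustrative parameter values only; these are NOT estimates for COVID-19.
trajectory = simulate_seir(beta=0.5, sigma=0.2, gamma=0.25,
                           population=1_000_000, initial_infectious=10, days=180)
r0 = basic_reproduction_number(beta=0.5, gamma=0.25)
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][2])
print(f"R0 = {r0}, infectious compartment peaks on day {peak_day}")
```

The sketch also illustrates the first challenge raised above: adding a new mechanism, such as asymptomatic transmission or behavior-dependent contact rates, requires new compartments and flows, and the expression for R0 changes with them.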