SaTC: CORE: Small: Multi-Party High-dimensional Machine Learning with Privacy


Individuals and organizations can frequently benefit from combining their data to learn collective models. However, combining data to enable multi-party learning is often not possible: privacy policies may prohibit it, or a business may consider it too risky to expose its own data to others. In addition, high-dimensional data are prevalent in modern data-driven applications. Learning from high-dimensional data owned by different organizations is even more challenging because of the bias introduced by high-dimensional machine learning methods. The overarching goal of this project is to address these challenges by developing methods that enable a group of mutually distrusting parties to securely collaborate in applying high-dimensional machine learning methods to produce a joint model without exposing their own data. This project enables owners of sensitive data to jointly learn models across their datasets without exposing that data, while providing meaningful privacy guarantees. It produces open-source software tools and has many important societal applications, including the analysis of electronic health records across multiple hospitals to identify medical correlations that could not be found by any individual hospital.

The key to multi-party high-dimensional machine learning is finding an efficient way to produce an accurate aggregate model that reflects all of the data by combining local models that are developed independently on individual data sets. The strategy of this project is to combine two emerging research directions: distributed machine learning, which seeks to distribute machine learning algorithms across hosts and produce an aggregate model by combining multiple local models; and secure multi-party computation, which enables a group of mutually distrusting parties to jointly compute a function without leaking information about their private inputs or any intermediate results. The project also incorporates differential privacy-based mechanisms into multi-party high-dimensional learning, which further protect the individual data points held by each party. The results of this research have the potential to impact both the machine learning and security research communities. The education plan of this project includes developing open course materials that integrate privacy and machine learning, and providing research-based training opportunities for both undergraduate and graduate students in computer science, systems engineering, and medical informatics. The project actively involves underrepresented groups in research and trains a new generation of interdisciplinary researchers.
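
As a rough illustration of the aggregation strategy described above, the following Python sketch averages several parties' local model weights using additive secret sharing, one of the simplest building blocks of secure multi-party computation. It is a toy example under stated assumptions, not the project's actual protocol: the names `share`, `secure_average`, `PRIME`, `SCALE`, and `noise_std` are hypothetical, all parties are simulated in one process, and the optional Gaussian noise is not calibrated to any formal differential privacy budget.

```python
import random
import secrets

PRIME = 2**61 - 1   # modulus for share arithmetic (assumption for this sketch)
SCALE = 10**6       # fixed-point scaling factor for real-valued weights (assumption)


def encode(x):
    """Map a real number to a fixed-point element modulo PRIME."""
    return round(x * SCALE) % PRIME


def decode(v):
    """Map an element modulo PRIME back to a signed real number."""
    if v > PRIME // 2:
        v -= PRIME
    return v / SCALE


def share(value, n_parties):
    """Split an encoded value into n additive shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_average(local_models, noise_std=0.0):
    """Average local weight vectors without revealing any single vector.

    Each party splits every weight into shares, one per party. A party only
    ever sees one share of each remote weight, which on its own looks
    uniformly random. Only the per-party partial sums are combined.
    """
    n = len(local_models)
    dim = len(local_models[0])
    partial_sums = [[0] * dim for _ in range(n)]  # partial_sums[j]: held by party j
    for model in local_models:
        for k, w in enumerate(model):
            if noise_std > 0:
                # Illustrative local perturbation only; not a calibrated
                # differential privacy mechanism, and random.gauss is not
                # a cryptographically secure source of randomness.
                w += random.gauss(0.0, noise_std)
            for j, s in enumerate(share(encode(w), n)):
                partial_sums[j][k] = (partial_sums[j][k] + s) % PRIME
    totals = [sum(p[k] for p in partial_sums) % PRIME for k in range(dim)]
    return [decode(t) / n for t in totals]


# Three hypothetical parties, each holding a two-weight local model.
print(secure_average([[0.5, -1.0], [1.5, 2.0], [1.0, -4.0]]))  # prints [1.0, -1.0]
```

With `noise_std` left at zero, the result equals the plain average of the local models up to fixed-point rounding, which shows that the sharing step hides the individual inputs without changing the aggregate.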

