Casper van Aarle graduates on Federated Regression Analysis

Federated Regression Analysis on Personal Data Stores: Improving the Personal Health Train

by Casper van Aarle

Due to regulations and increased privacy awareness, patients may be reluctant to share data with any institution. The Personal Health Train is an initiative to connect different data institutions for data analysis while each data holder retains full authority over its own data. The Personal Health Train may connect not only larger institutions but also smaller, possibly on-device personal data stores, where data is stored safely and separately.
This thesis explores possible solutions in the literature that guarantee data-privacy and model-privacy, and it shows their practical feasibility when learning over a large number of personal data stores. We focus specifically on training linear regression and logistic regression models over personal data stores. We experiment with different design choices to optimise the convergence of our training architecture.
We discuss the PrivFL protocol, which takes into account both data-privacy and model-privacy when learning a regression model and is applicable to personal data stores. We further propose a standardisation protocol, the Secure Scaling Operation, that guarantees data-privacy for patients; our experiments show that it improves convergence more than an adaptive gradient does.
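To make the idea of privacy-preserving standardisation concrete, the following is a minimal sketch and not the Secure Scaling Operation itself. It assumes each personal data store holds a single numeric value and that only masked sums of the values and their squares are revealed. The zero-sum masks are a simplified stand-in for a real secure aggregation scheme, and all function names are hypothetical.

```python
import math
import random


def zero_sum_masks(n, rng):
    """Return n random masks that sum to zero.

    Assumption: a stand-in for the pairwise secret masks of a real
    secure aggregation protocol.
    """
    masks = [rng.uniform(-1e3, 1e3) for _ in range(n - 1)]
    masks.append(-sum(masks))
    return masks


def secure_sum(private_values, rng):
    """Sum values so that the aggregator only ever sees masked inputs."""
    masks = zero_sum_masks(len(private_values), rng)
    masked = [v + m for v, m in zip(private_values, masks)]
    return sum(masked)  # the masks cancel in the total


def federated_mean_std(private_values, rng):
    """Derive the global mean and standard deviation from two secure sums."""
    n = len(private_values)
    total = secure_sum(private_values, rng)
    total_sq = secure_sum([v * v for v in private_values], rng)
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)


rng = random.Random(0)
ages = [34.0, 51.0, 47.0, 29.0, 62.0]  # one value per (hypothetical) data store
mean, std = federated_mean_std(ages, rng)

# Each store can now standardise its own value locally.
scaled = [(a - mean) / std for a in ages]
```

Only the two aggregate statistics leave the protocol, so each store can scale its own features locally before training. The thesis's actual protocol may differ in how the statistics are computed and protected.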
We implement an architecture, in FedLinReg-v2 and FedLogReg-v2, that can learn over personal data stores while preserving user privacy. Although convergence is not guaranteed in theory, training over various datasets shows a loss difference of 0 to 0.33% on both training and test sets compared to centrally optimised models. No parameter optimisation was necessary. The coefficients, however, may deviate from those of centrally trained models. We were able to train regression models over 150 personal data stores in minutes while preserving data-privacy. An even higher level of data-privacy causes a strong linear increase in computation time in relation to the number of personal data stores included.
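The comparison between federated and centrally optimised models can be illustrated with a toy sketch. This is not FedLinReg-v2: it assumes one record per store, a single feature, synthetic data, and gradients shared in the clear, so it shows only the training loop and not the privacy protection. All names and hyperparameters are hypothetical.

```python
import random


def local_gradient(w, b, x, y):
    """Gradient of 0.5 * (w*x + b - y)^2 for one store's record."""
    err = w * x + b - y
    return err * x, err


def federated_fit(stores, rounds=500, lr=0.1):
    """Server averages per-store gradients each round (no privacy layer here)."""
    w, b = 0.0, 0.0
    n = len(stores)
    for _ in range(rounds):
        grads = [local_gradient(w, b, x, y) for x, y in stores]
        w -= lr * sum(g[0] for g in grads) / n
        b -= lr * sum(g[1] for g in grads) / n
    return w, b


def central_fit(stores):
    """Closed-form ordinary least squares for a single feature."""
    n = len(stores)
    mx = sum(x for x, _ in stores) / n
    my = sum(y for _, y in stores) / n
    sxx = sum((x - mx) ** 2 for x, _ in stores)
    sxy = sum((x - mx) * (y - my) for x, y in stores)
    w = sxy / sxx
    return w, my - w * mx


def mse(w, b, stores):
    """Mean squared error of the line w*x + b over all records."""
    return sum((w * x + b - y) ** 2 for x, y in stores) / len(stores)


# Synthetic data: 150 stores, each holding one (x, y) record.
rng = random.Random(1)
stores = []
for _ in range(150):
    x = rng.gauss(0.0, 1.0)
    stores.append((x, 2.0 * x - 1.0 + rng.gauss(0.0, 0.1)))

w_fed, b_fed = federated_fit(stores)
w_cen, b_cen = central_fit(stores)
loss_gap = abs(mse(w_fed, b_fed, stores) - mse(w_cen, b_cen, stores))
```

In this convex, well-conditioned toy problem, the averaged-gradient model ends up close to the least-squares solution. It therefore does not reproduce the coefficient deviations reported above, which arise in the thesis's own setting.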
