Ph.D. Dissertation Defense “Towards Data and Model Confidentiality in Outsourced Machine Learning” By Sagar Sharma
Ph.D. Committee: Drs. Keke Chen (Advisor), Krishnaprasad Thirunarayan, Junjie Zhang, and Xiaoyu Liu (Mathematics & Statistics)
ABSTRACT
With massive data collections and the need to build powerful predictive models, data owners may choose to outsource storage and expensive machine learning computations to public cloud providers, often because they lack in-house storage and computation resources and/or the expertise to build models. Similarly, users who subscribe to specialized services such as movie streaming and social networking voluntarily upload their data to the service providers' sites for storage, analytics, and better services. The service providers may, in turn, choose to benefit from ubiquitous cloud computing.
However, outsourcing to the public cloud may raise privacy concerns when sensitive personal or corporate data is involved. A cloud provider (the Cloud) may mishandle sensitive data and models. Moreover, the Cloud's resources, if poorly maintained, are vulnerable to privacy breaches by external and internal adversaries. Such threats are beyond the control of data owners and general users. One way to address these privacy concerns is confidential machine learning (CML), in which data owners protect their data with encryption or other methods before outsourcing, and the Cloud learns predictive models from the protected data.
Existing cryptographic and privacy-protection methods cannot be directly applied to building CML frameworks in the outsourced setting. Although theoretically sound, naïve adaptations of fully homomorphic encryption (FHE) and garbled circuits (GC), which enable privacy-preserving evaluation of arbitrary functions, are impractically expensive. Differential privacy (DP), on the other hand, does not fit the outsourced setting well, as the data and the learned models are still exposed to the Cloud; moreover, DP can significantly degrade model quality. A practical CML framework must also minimize the client-side cost (e.g., for data owners) by moving the expensive and scalable components to the Cloud, to justify the choice of outsourcing. Thus, novel solutions are needed to construct privacy-preserving learning algorithms that strike a good balance among privacy protection, cost, and model quality.
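To make the cost contrast concrete, the minimal sketch below uses the python-paillier (phe) package, a partially (additively) homomorphic scheme. This is an illustration only, not a component specified by the dissertation: linear operations such as sums and plaintext scalings are cheap under such schemes, which is what makes "crypto-friendly" subcomponents attractive, whereas arbitrary nonlinear functions would require far costlier FHE or GC machinery.

    # Sketch: additively homomorphic operations with python-paillier (pip install phe).
    # Illustrative only; the dissertation's frameworks are not tied to this library.
    from phe import paillier

    public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

    gradients = [0.25, -1.5, 3.0]                          # hypothetical client-side values
    encrypted = [public_key.encrypt(g) for g in gradients]

    # The Cloud can aggregate and scale ciphertexts without seeing plaintexts:
    enc_sum = encrypted[0] + encrypted[1] + encrypted[2]   # homomorphic addition
    enc_avg = enc_sum * (1.0 / len(gradients))             # plaintext scalar multiplication

    # Only the key holder (the client) can decrypt the result.
    print(private_key.decrypt(enc_avg))                    # 0.5833...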
In this dissertation, I present three confidential machine learning frameworks for the outsourced setting: 1) PrivateGraph for unsupervised learning (e.g., graph spectral analysis), 2) SecureBoost for supervised learning (e.g., boosting), and 3) DisguisedNets for deep learning (e.g., convolutional neural networks). The first two frameworks provide semantic security and follow the decomposition-mapping-composition (DMC) process, which consists of three critical steps: 1) decomposition of the target machine learning algorithm into its subcomponents, 2) mapping of the selected subcomponents to appropriate cryptographic and privacy primitives, and 3) composition of the CML protocols. It is critical to identify the "crypto-unfriendly" subcomponents and to alter or replace them with "crypto-friendly" ones before the final composition of the CML framework. The DisguisedNets framework, however, relies on a perturbation-based CML construction because of the intrinsically expensive nature of deep neural networks (DNNs) and the size of the training images. By relaxing the overall security guarantee and disguising the training images with cheaper transformations, DisguisedNets enables efficient training of confidential DNN models over the protected images, as sketched below.
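The following toy sketch conveys the flavor of perturbation-based image disguising; it is a hypothetical example with made-up parameters (block size, noise scale), not the dissertation's exact transformations: an image is split into blocks, the blocks are permuted under a secret key, and keyed random noise is added before outsourcing.

    # Toy sketch of image disguising: keyed block permutation plus additive noise.
    # Hypothetical parameters; the actual DisguisedNets transformations are
    # defined in the dissertation.
    import numpy as np

    def disguise_image(image, block, rng):
        h, w = image.shape
        assert h % block == 0 and w % block == 0
        # Split the image into non-overlapping block-size tiles.
        tiles = [image[i:i + block, j:j + block]
                 for i in range(0, h, block)
                 for j in range(0, w, block)]
        # Permute the tiles; the RNG seed acts as the secret key.
        order = rng.permutation(len(tiles))
        tiles = [tiles[k] for k in order]
        # Reassemble the permuted tiles and add keyed Gaussian noise.
        cols = w // block
        out = np.block([[tiles[r * cols + c] for c in range(cols)]
                        for r in range(h // block)])
        return out + rng.normal(scale=5.0, size=out.shape)

    rng = np.random.default_rng(seed=2024)          # the seed plays the role of the key
    img = rng.integers(0, 256, size=(28, 28)).astype(float)
    protected = disguise_image(img, block=7, rng=rng)

In such a scheme, the Cloud trains the DNN directly over the protected images, and the client applies the same keyed transformation to new images at prediction time.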
I have conducted formal cost and security analyses and performed extensive experiments for all three CML frameworks. The results show that the costs are practical in real-world scenarios and that the quality of the resulting models is comparable to that of models learned over unprotected data.