In general, the more data you have, the better your machine learning model is going to be. But stockpiling vast amounts of data also carries certain privacy, security, and regulatory risks. With new privacy-preserving techniques, however, data scientists can move forward with their AI projects without putting privacy at risk.
To get the lowdown on privacy-preserving machine learning (PPML), we talked to Intel's Casimir Wierzynski, a senior director in the office of the CTO in the company's AI Platforms Group. Wierzynski leads Intel's research efforts to "identify, synthesize, and incubate" emerging technologies for AI.
According to Wierzynski, Intel is offering several techniques that data science practitioners can use to preserve private data while still benefiting from machine learning. What's more, data science teams don't have to make major sacrifices in terms of performance or accuracy of the models, he said.
It sometimes sounds too good to be true, Wierzynski admits. "When I describe some of these new techniques that we're making available to developers, on their face, they're like, really? You can do that?" he said. "That sounds kind of magical."
But it's not magic. In fact, the three PPML techniques that Wierzynski explained to Datanami (federated learning, homomorphic encryption, and differential privacy) are all available today.
Federated Learning
Data scientists have long known about the advantages of combining multiple data sets into one massive collection. By pooling the data together, it's easier to spot new correlations, and machine learning models can be built to take advantage of the novel connections.
But pooling large amounts of data into a data lake carries its own risks, including the possibility of the data falling into the wrong hands. There are also the logistical hassles of ETL-ing large amounts of data around, which opens up further chances for security lapses. For those reasons, some organizations deem it too risky to pool certain data at all.
With federated learning, data scientists can build and train machine learning models using data that's physically stored in separate silos, which eliminates the risk of bringing all the data together. This is an important breakthrough for certain data sets that organizations could not otherwise pool together.
"One of the things that we're trying to enable with these privacy-preserving ML techniques is to unlock these data silos, to make use of data sources that previously couldn't be pooled together," Wierzynski said. "Now it's OK to do that, but still preserve the underlying privacy and security."
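To make the idea concrete, here is a minimal sketch of federated averaging written for this article (a hypothetical toy example using NumPy and a linear model, not Intel's implementation): each simulated silo runs a few gradient steps on its own private data, and only the model weights, never the raw records, are sent back for averaging.

# Minimal federated averaging sketch: three data silos jointly fit a
# linear model without ever pooling their raw data. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Each silo holds its own private (X, y) data that never leaves the silo.
silos = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    silos.append((X, y))

def local_update(w, X, y, lr=0.1, steps=5):
    """Run a few gradient-descent steps on one silo's local data."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Federated averaging: the coordinating server only ever sees weights.
w_global = np.zeros(2)
for _ in range(20):
    local_weights = [local_update(w_global, X, y) for X, y in silos]
    w_global = np.mean(local_weights, axis=0)

print("learned weights:", w_global)  # converges toward [2.0, -1.0]

The key point is what crosses the silo boundary: model parameters go back and forth, while the training records themselves stay put.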
Homomorphic Encryption
Intel is working with others in industry, government, and academia to develop homomorphic encryption techniques, which essentially allow sensitive data to be processed and statistical operations to be performed while it's encrypted, thereby eliminating the need to expose the data in plain text.
"It means that you can move your sensitive data into this encrypted space, do the math in this encrypted space that you were hoping to do in the raw data space, and then when you bring the answer back to the raw data space, it's actually the answer you would have gotten if you just stayed in that space the whole time," he said.
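The property Wierzynski describes can be demonstrated with a textbook Paillier cryptosystem, an additively homomorphic scheme: multiplying two ciphertexts produces a ciphertext of the sum of the plaintexts. The sketch below uses deliberately tiny, insecure parameters purely for illustration and is unrelated to Intel's HE Transformer library.

# Toy Paillier cryptosystem: "doing the math" on ciphertexts (here, a
# multiplication) yields the encrypted sum of the plaintexts.
# Tiny, insecure parameters; illustration only.
import math
import random

p, q = 17, 19                      # toy primes, far too small for real use
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
g = n + 1

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

a, b = 12, 30
c = (encrypt(a) * encrypt(b)) % n2   # addition performed in the encrypted space
print(decrypt(c))                    # 42 == a + b, recovered only after decryption

Fully homomorphic schemes extend this idea to both addition and multiplication, which is what makes running neural network inference over encrypted data possible, at a performance cost.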
Homomorphic encryption isn't new. According to Wierzynski, the cryptographic schemes that support homomorphic encryption have been around for 15 to 20 years. But there have been a number of improvements in the last five years that enable this technique to run faster, and so it's increasingly one of the tools that data scientists can turn to when handling sensitive data.
"One of the things my team has done specifically around homomorphic encryption is to provide open source libraries," Wierzynski said. "One is called HE Transformer, which lets data scientists use their usual tools like TensorFlow and PyTorch and deploy their models under the hood using homomorphic encryption without having to change their code."
There are no standards yet around homomorphic encryption, but progress is being made on that front, and Wierzynski anticipates a standard being established perhaps in the 2023-24 timeframe. The chipmaker is also working on hardware acceleration options for homomorphic encryption, which would further boost performance.
Differential Privacy
One of the bizarre characteristics of machine learning models is that details of the data used to train a model can sometimes be extracted just by exercising the model itself. That's not a big issue in some domains, but it certainly is a problem when some of the training set contains private information.
"You definitely want your machine learning system to learn the key trends and the core relationships," Wierzynski said. "But you don't want them to take that a step too far and now kind of overlearn in some sense and learn aspects of the data that are very idiosyncratic and specific to one person, which can then be teased out by a bad person later and violate privacy."
For example, say a text prediction algorithm was developed to accelerate typing on a mobile phone. The system should be smart enough to be able to predict the next word with some level of accuracy, but it should not return a value when a phrase like "Bob's Social Security number is…" is typed in. If it does that, then it's not only learned the rules of English, "but it's learned very specific things about individuals in the data set, and that's too far," Wierzynski said.
The most common way to implement differential privacy is to add some noise to the training process, or to "fuzz" the data in some way, Wierzynski said. "And if you do that in the right amount, then you are still able to extract the key relationships and obscure the idiosyncratic information, the individual data," he continued. "You can imagine if you add a lot of noise, if you take it too far, you'll end up obscuring the key relationships too, so the trick with these use cases is to find that sweet spot."
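For a flavor of how that noise is calibrated, here is a generic Laplace-mechanism sketch in NumPy (a standard differential privacy construction, not any particular Intel tool): an aggregate query over a sensitive column is released with noise scaled by the query's sensitivity and a privacy budget epsilon.

# Laplace mechanism: answer an aggregate query with calibrated noise so
# that no single individual's record can be teased back out.
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=10_000)   # pretend this column is sensitive

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean of values clipped to [lower, upper]."""
    clipped = np.clip(values, lower, upper)
    n = len(clipped)
    # One person can shift the clipped mean by at most this much.
    sensitivity = (upper - lower) / n
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

print("true mean:   ", ages.mean())
print("eps = 1.0:   ", dp_mean(ages, 18, 90, epsilon=1.0))    # still useful
print("eps = 0.001: ", dp_mean(ages, 18, 90, epsilon=0.001))  # heavily fuzzed

A moderate epsilon keeps the answer close to the truth while masking any one individual's contribution; shrink epsilon too far and the noise swamps the key relationships, which is exactly the trade-off Wierzynski describes.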
ML Data Combos
Every organization is different, and chief data officers should be ready to explore multiple privacy-preserving techniques to fit their specific use cases. "There's no single technology that's a silver bullet for privacy," Wierzynski said. "It's usually a combination of techniques."
For example, you might want to fuzz the data a bit when utilizing federated learning techniques, Wierzynski said. "When you decentralize the learning, the machine learning model usually needs additional privacy protection just because the intermediate calculations that go between users in federated learning can actually reveal something about the model or reveal something about the underlying data," he said.
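One common way to combine the two techniques along the lines Wierzynski sketches is to bound and fuzz each silo's update before the server averages them. The snippet below is a hypothetical extension of the earlier federated-averaging sketch, not a description of Intel's approach.

# Hypothetical noisy aggregation step for federated learning: each
# client's update is clipped and Gaussian noise is added before
# averaging, so the intermediate updates reveal less about local data.
import numpy as np

def private_average(updates, clip_norm=1.0, noise_std=0.1, rng=None):
    rng = rng or np.random.default_rng()
    clipped = []
    for u in updates:
        norm = max(np.linalg.norm(u), 1e-12)
        clipped.append(u * min(1.0, clip_norm / norm))  # bound each contribution
    avg = np.mean(clipped, axis=0)
    return avg + rng.normal(scale=noise_std, size=avg.shape)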
As data privacy laws like CCPA and GDPR proliferate, organizations will be forced to account for the privacy of their customers' data. The threat of steep fines and public shaming for mishandling sensitive data is a strong motivator for organizations to enact strong data privacy and security standards.
But these laws also potentially have a dampening effect on advanced analytics and AI use cases. With PPML, organizations can continue to explore these powerful AI techniques while working to minimize some of the security risks associated with handling large amounts of sensitive data.
Related Items:
Weighing the Impact of a Facial Recognition Ban
Data Privacy Day: Putting Good (and Bad) Practices in the Spotlight
Keeping Your Models on the Straight and Narrow