In general, the more data you have, the better your machine learning model is going to be. But stockpiling vast amounts of data also carries certain privacy, security, and regulatory risks. With new privacy-preserving techniques, however, data scientists can move forward with their AI projects without putting privacy at risk.
To get the lowdown on privacy-preserving machine learning (PPML), we talked to Intel's Casimir Wierzynski, a senior director in the office of the CTO in the company's AI Platforms Group. Wierzynski leads Intel's research efforts to "identify, synthesize, and incubate" emerging technologies for AI.
According to Wierzynski, Intel is offering several techniques that data science practitioners can use to preserve private data while still benefiting from machine learning. What's more, data science teams don't have to make major sacrifices in terms of performance or accuracy of the models, he said.
It sometimes sounds too good to be true, Wierzynski admits. "When I describe some of these new techniques that we're making available to developers, on their face, they're like, really? You can do that?" he said. "That sounds kind of magical."
But it's not magic. In fact, the three PPML techniques that Wierzynski explained to Datanami (federated learning, homomorphic encryption, and differential privacy) are all available today.
Federated Learning
Data scientists have long known about the advantages of combining multiple data sets into one massive collection. By pooling the data together, it's easier to spot new correlations, and machine learning models can be built to take advantage of the novel connections.
But pooling large amounts of data into a data lake carries its own risks, including the possibility of the data falling into the wrong hands. There are also the logistical hassles of ETL-ing large amounts of data around, which opens up further chances for security lapses. For those reasons, some organizations deem it too risky to pool certain data at all.
With federated learning, data scientists can build and train machine learning models using data that's physically stored in separate silos, which eliminates the risk of bringing all the data together. This is an important breakthrough for certain data sets that organizations could not otherwise pool together.
"One of the things that we're trying to enable with these privacy-preserving ML techniques is to unlock these data silos, to make use of data sources that previously couldn't be pooled together," Wierzynski said. "Now it's OK to do that, but still preserve the underlying privacy and security."
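To make the idea concrete, here is a minimal sketch of federated averaging written for this article (a hypothetical toy example using NumPy and a linear model, not Intel's implementation): each simulated silo runs a few gradient steps on its own private data, and only the model weights, never the raw records, are sent back for averaging.

# Minimal federated averaging sketch: three data silos jointly fit a
# linear model without ever pooling their raw data. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Each silo holds its own private (X, y) data that never leaves the silo.
silos = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    silos.append((X, y))

def local_update(w, X, y, lr=0.1, steps=5):
    """Run a few gradient-descent steps on one silo's local data."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Federated averaging: the coordinating server only ever sees weights.
w_global = np.zeros(2)
for _ in range(20):
    local_weights = [local_update(w_global, X, y) for X, y in silos]
    w_global = np.mean(local_weights, axis=0)

print("learned weights:", w_global)  # converges toward [2.0, -1.0]

The key point is what crosses the silo boundary: model parameters go back and forth, while the training records themselves stay put.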
Homomorphic Encryption
Intel is working with others in industry, government, and academia to develop homomorphic encryption techniques, which essentially allow sensitive data to be processed and statistical operations to be performed while it's encrypted, thereby eliminating the need to expose the data in plain text.
"It means that you can move your sensitive data into this encrypted space, do the math in this encrypted space that you were hoping to do in the raw data space, and then when you bring the answer back to the raw data space, it's actually the answer you would have gotten if you just stayed in that space the whole time," he said.
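The property Wierzynski describes can be demonstrated with a textbook Paillier cryptosystem, an additively homomorphic scheme: multiplying two ciphertexts produces a ciphertext of the sum of the plaintexts. The sketch below uses deliberately tiny, insecure parameters purely for illustration and is unrelated to Intel's HE Transformer library.

# Toy Paillier cryptosystem: "doing the math" on ciphertexts (here, a
# multiplication) yields the encrypted sum of the plaintexts.
# Tiny, insecure parameters; illustration only.
import math
import random

p, q = 17, 19                      # toy primes, far too small for real use
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
g = n + 1

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

a, b = 12, 30
c = (encrypt(a) * encrypt(b)) % n2   # addition performed in the encrypted space
print(decrypt(c))                    # 42 == a + b, recovered only after decryption

Fully homomorphic schemes extend this idea to both addition and multiplication, which is what makes running neural network inference over encrypted data possible, at a performance cost.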
Homomorphic encryption isn't new. According to Wierzynski, the cryptographic schemes that support homomorphic encryption have been around for 15 to 20 years. But there have been a number of improvements in the last five years that enable this technique to run faster, and so it's increasingly one of the tools that data scientists can turn to when handling sensitive data.
"One of the things my team has done specifically around homomorphic encryption is to provide open source libraries," Wierzynski said. "One is called HE Transformer, which lets data scientists use their usual tools like TensorFlow and PyTorch and deploy their models under the hood using homomorphic encryption without having to change their code."
There are no standards yet around homomorphic encryption, but progress is being made on that front, and Wierzynski anticipates a standard being established perhaps in the 2023-24 timeframe. The chipmaker is also working on hardware acceleration options for homomorphic encryption, which would further boost performance.
Differential Privacy
One of the bizarre characteristics of machine learning models is that details of the data used to train a model can sometimes be extracted just by exercising the model itself. That's not a big issue in some domains, but it certainly is a problem when some of the training set contains private information.
"You definitely want your machine learning system to learn the key trends and the core relationships," Wierzynski said. "But you don't want them to take that a step too far and now kind of overlearn in some sense and learn aspects of the data that are very idiosyncratic and specific to one person, which can then be teased out by a bad person later and violate privacy."
For example, say a text prediction algorithm was developed to accelerate typing on a mobile phone. The system should be smart enough to be able to predict the next word with some level of accuracy, but it should not return a value when a phrase like "Bob's Social Security number is…" is typed in. If it does that, then it's not only learned the rules of English, "but it's learned very specific things about individuals in the data set, and that's too far," Wierzynski said.
The most common way to implement differential privacy is to add some noise to the training process, or to "fuzz" the data in some way, Wierzynski said. "And if you do that in the right amount, then you are still able to extract the key relationships and obscure the idiosyncratic information, the individual data," he continued. "You can imagine if you add a lot of noise, if you take it too far, you'll end up obscuring the key relationships too, so the trick with these use cases is to find that sweet spot."
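For a flavor of how that noise is calibrated, here is a generic Laplace-mechanism sketch in NumPy (a standard differential privacy construction, not any particular Intel tool): an aggregate query over a sensitive column is released with noise scaled by the query's sensitivity and a privacy budget epsilon.

# Laplace mechanism: answer an aggregate query with calibrated noise so
# that no single individual's record can be teased back out.
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=10_000)   # pretend this column is sensitive

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean of values clipped to [lower, upper]."""
    clipped = np.clip(values, lower, upper)
    n = len(clipped)
    # One person can shift the clipped mean by at most this much.
    sensitivity = (upper - lower) / n
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

print("true mean:   ", ages.mean())
print("eps = 1.0:   ", dp_mean(ages, 18, 90, epsilon=1.0))    # still useful
print("eps = 0.001: ", dp_mean(ages, 18, 90, epsilon=0.001))  # heavily fuzzed

A moderate epsilon keeps the answer close to the truth while masking any one individual's contribution; shrink epsilon too far and the noise swamps the key relationships, which is exactly the trade-off Wierzynski describes.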
ML Data Combos
Every organization is different, and chief data officers should be ready to explore multiple privacy-preserving techniques to fit their specific use cases. "There's no single technology that's a silver bullet for privacy," Wierzynski said. "It's usually a combination of techniques."
For example, you might want to fuzz the data a bit when utilizing federated learning techniques, Wierzynski said. "When you decentralize the learning, the machine learning model usually needs additional privacy protection just because the intermediate calculations that go between users in federated learning can actually reveal something about the model or reveal something about the underlying data," he said.
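One common way to combine the two techniques along the lines Wierzynski sketches is to bound and fuzz each silo's update before the server averages them. The snippet below is a hypothetical extension of the earlier federated-averaging sketch, not a description of Intel's approach.

# Hypothetical noisy aggregation step for federated learning: each
# client's update is clipped and Gaussian noise is added before
# averaging, so the intermediate updates reveal less about local data.
import numpy as np

def private_average(updates, clip_norm=1.0, noise_std=0.1, rng=None):
    rng = rng or np.random.default_rng()
    clipped = []
    for u in updates:
        norm = max(np.linalg.norm(u), 1e-12)
        clipped.append(u * min(1.0, clip_norm / norm))  # bound each contribution
    avg = np.mean(clipped, axis=0)
    return avg + rng.normal(scale=noise_std, size=avg.shape)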
As data privacy laws like CCPA and GDPR proliferate, organizations will be forced to account for the privacy of their customers' data. The threat of steep fines and public shaming for mishandling sensitive data is a strong motivator for organizations to enact strong data privacy and security standards.
But these laws also potentially have a dampening effect on advanced analytics and AI use cases. With PPML, organizations can continue to explore these powerful AI techniques while working to minimize some of the security risks associated with handling large amounts of sensitive data.
Related Items:
Weighing the Impact of a Facial Recognition Ban
Data Privacy Day: Putting Good (and Bad) Practices in the Spotlight
Keeping Your Models on the Straight and Narrow