Introduction

Differential privacy has been gaining significant attention in research. Working in the tech industry, I have mixed feelings about topics related to data privacy. There’s no doubt that we deal with huge amounts of data from users, and privacy should be valued. However, “data” is a concept that covers basically everything, which leaves us wondering what kind of information should be private and how to monitor the whole cycle from collecting data to using it. Some people feel uncomfortable about being a record in a database, especially one tied to their identity. Once raw data is in a database, access control is always necessary to prevent malicious people from seeing or using it. And of course there’s the black box where data is leveraged to learn your behavior, which raises plenty of concerns.

While there are many methods for dealing with privacy, the overall procedure of processing raw data to derive information is exactly where differential privacy kicks in. It provides a rigorous framework without which we may feel confused about what we are talking about when we talk about privacy.

A simple method is to add noise to every piece of data, so that you can never recover an individual’s activity exactly, yet useful insights can still be derived from a large number of records.
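As a toy illustration (not a calibrated mechanism; the noise scale below is arbitrary), per-record noise masks individuals while the aggregate survives:

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = rng.binomial(1, 0.3, size=100_000)  # hypothetical 0/1 user attribute
noisy_values = true_values + rng.laplace(scale=2.0, size=true_values.size)

print(true_values.mean())   # ~0.30, the real rate
print(noisy_values.mean())  # still ~0.30: the noise cancels out in the average
# Any single noisy_values[i] says almost nothing about true_values[i].
```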

Definition

Differential privacy is defined over randomized algorithms \(M : A \to \Delta(B)\), which map a domain \(A\) to the probability simplex over a discrete range \(B\). For each \(a \in A\): \(Pr(M(a) = b) = (M(a))_{b}\).

A randomized algorithm \(M\) is \((\epsilon, \delta)\)-differentially private if \(\forall S \subseteq B\):

\(Pr[M(x) \in S] \le \exp(\epsilon) \, Pr[M(y) \in S] + \delta\),

where \(x, y \in A\) with \(\|x-y\|_{1} \le 1\) (\(\ell_1\) distance), i.e., neighboring databases differing in at most one record.

Another concept is privacy loss, which quantifies the privacy guarantee for a particular output \(\xi\):

\[\mathcal{L}^{(\xi)}_{M(x) \| M(y)} = \ln \frac{Pr[M(x) = \xi]}{Pr[M(y) = \xi]}\]

The definition is pretty straightforward. The core of protecting privacy is that no individual’s data should be exposed, so the bound above guarantees that a single record makes little difference to the overall outcome. \(\epsilon\) indicates how strong the privacy guarantee is; any sufficiently small value behaves much the same (like every \(\epsilon\) in math). \(\delta\) allows a small probability that the privacy loss is not kept within the bound, and in many circumstances \(\delta\) is set to 0.

This definition also makes a promise in economic terms: each person’s utility will be the same whether or not they participate in the survey. I find this aligns with the GDPR principle that data subjects have the right to decide whether their data can be used for certain tasks. In theory, if individuals understand that contributing their data does not affect their future utility, they should consent to the collection, and it also becomes easier to gather data for a differentially private survey.

Techniques

Randomized Response

This is a simple mechanism. The classic example: flip a coin before answering a yes/no question; on heads, give the true answer; on tails, flip a second coin and answer “Yes” on heads and “No” on tails. Each individual answer has plausible deniability, yet the true rate can still be estimated from the aggregate.
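A minimal sketch in Python of this two-coin protocol, which satisfies \((\ln 3, 0)\)-differential privacy, plus the estimator that recovers the population rate (both function names are mine):

```python
import random

def randomized_response(truth: bool) -> bool:
    """Answer truthfully on heads; otherwise answer by a second coin flip."""
    if random.random() < 0.5:       # first coin: heads -> tell the truth
        return truth
    return random.random() < 0.5    # tails -> random answer (plausible deniability)

def estimate_rate(answers) -> float:
    """Invert the noise: Pr[yes] = 0.5*p + 0.25, so p = 2*Pr[yes] - 0.5."""
    return 2 * sum(answers) / len(answers) - 0.5
```

A respondent whose true answer is “Yes” reports “Yes” with probability 0.75, while one whose true answer is “No” reports “Yes” with probability 0.25; the ratio 0.75/0.25 = 3 gives \(\epsilon = \ln 3\).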

The Laplace Mechanism

For any query function \(f\) that returns numeric values, we add noise to the result, where each noise term \(Y_{i}\) follows the Laplace distribution, a symmetric version of the exponential distribution, with scale calibrated to the query’s \(\ell_1\) sensitivity, \(b = \Delta f / \epsilon\):

\[M(x) = f(x) + (Y_{1}, \ldots, Y_{k})\]
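A minimal sketch, assuming the \(\ell_1\) sensitivity \(\Delta f\) of the query is known (function and parameter names are mine):

```python
import numpy as np

def laplace_mechanism(f_x, sensitivity, epsilon, rng=None):
    """Release f(x) plus Lap(sensitivity/epsilon) noise on each coordinate."""
    rng = rng or np.random.default_rng()
    f_x = np.asarray(f_x, dtype=float)
    return f_x + rng.laplace(scale=sensitivity / epsilon, size=f_x.shape)

# A counting query has sensitivity 1: one record changes the count by at most 1.
private_count = laplace_mechanism([42.0], sensitivity=1.0, epsilon=0.5)
```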

SmallDB

For a series of queries, the privacy budget can be spent more efficiently if, say, \(Q_1\) and \(Q_2\) ask similar questions with small differences. In that setting, adding correlated noise to the queries makes more sense than independent random noise.

The idea of the SmallDB algorithm is that, for a series of queries, we sample a small synthetic database that answers those queries almost as well as the entire database, preserving a given accuracy along with privacy. The size of this small database is determined by the number of queries and the required accuracy, and the algorithm uses the exponential mechanism to generate the sampling distribution, with a utility function measuring the error incurred by the approximation.
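The sampling step is the exponential mechanism itself, so a minimal sketch of that primitive may help; in SmallDB, `candidates` would range over small databases and `utility` would be the negative approximation error (names here are mine):

```python
import numpy as np

def exponential_mechanism(candidates, utility, sensitivity, epsilon, rng=None):
    """Sample a candidate with probability proportional to
    exp(epsilon * utility(c) / (2 * sensitivity))."""
    rng = rng or np.random.default_rng()
    scores = np.array([utility(c) for c in candidates], dtype=float)
    logits = epsilon * scores / (2 * sensitivity)
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```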

There is also an online mechanism with the same purpose: private multiplicative weights.

Differentially Private SGD

In deep learning, parameters are learned inside a black box consisting of neural networks. Differentially private SGD (DP-SGD) was proposed to make this black box differentially private by controlling the influence of individual training examples during the SGD computation.

The algorithm keeps the normal SGD procedure: while minimizing the loss function, the gradient is computed for a randomly chosen subset of the data; each example’s gradient is clipped to a maximum \(\ell_2\) norm to bound its influence, and Gaussian noise is added to the aggregated gradient.
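A minimal sketch of one update step in that spirit (simplified from the usual DP-SGD recipe; real implementations also track the cumulative privacy budget across steps):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier,
                rng=None):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    rng = rng or np.random.default_rng()
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]          # bound each example's influence
    grad_sum = np.sum(clipped, axis=0)
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=grad_sum.shape)
    noisy_mean = (grad_sum + noise) / len(per_example_grads)
    return params - lr * noisy_mean
```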

In Industry

Plenty of organizations are starting to utilize differential privacy to address privacy concerns. Unsurprisingly, the US 2020 Census also leverages it. I did a little research on the current progress at some companies.

Apple

According to Apple’s Differential Privacy Overview, differential privacy is adopted when Apple collects data from users: noise is added locally to a user’s data before it is sent to Apple (i.e., local differential privacy).

Google

Google open-sourced a library which supports differential privacy mechanisms, and has also rolled out support for differentially private queries in its query engine.

Facebook

Opacus was released for training PyTorch models with differential privacy. It uses the DP-SGD approach discussed above.
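For reference, wiring it up looks roughly like this in recent Opacus versions (the exact API may differ across releases; check the library’s docs):

```python
import torch
from opacus import PrivacyEngine

model = torch.nn.Linear(20, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(256, 20),
                                   torch.randint(0, 2, (256,))),
    batch_size=32,
)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,   # Gaussian noise scale
    max_grad_norm=1.0,      # per-example clipping bound
)
# Training then proceeds as usual; gradients are clipped and noised per step.
```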

Now -> Future

Differential privacy theory is promising in research. Dozens of new ideas are worth exploring under the current framework, and I would be happy to see algorithms that perform well while preserving privacy for everyone.

While data has been weaponized (an exaggeration, though) and privacy talk is prevailing, as I finish up this blog I’m still holding back on applying the concept in practice. Unless stricter regulations closely tied to differential privacy are enforced, companies may not see enough benefit to be motivated to adopt it widely, especially smaller businesses. Even assuming efficiency and accuracy stay the same, I’m not sure how much privacy concern there is over current data query models. But as with many other research areas, step-by-step improvements can lead to huge changes.