Data privacy (DP), Privacy-Preserving Data Mining (PPDM), Statistical Disclosure Control (SDC), and Inference Control (IC) are disciplines whose goal is to allow the dissemination or transfer of respondent data while preserving respondent privacy.
To that end, techniques have been defined that transform an original dataset into a protected dataset such that:
i) analyses on the original and protected datasets yield similar results (data utility);
ii) information in the protected dataset is unlikely to be linkable to the particular respondent it originated from (data safety).
I like to classify protection procedures into data-driven (or general-purpose), computation-driven (or specific-purpose), and result-driven protection procedures.
We are working on data-driven protection methods. In more detail, the three categories are:
- Data-driven or general-purpose: when the intended use of the data to be prepared for publication is not known. E.g., some users might apply regression, others might compute means, and AI practitioners might run clustering or association rule mining. Perturbative methods are appropriate for this purpose.
- Computation-driven or specific-purpose: when the type of analysis to be performed on the data is known (e.g., association rules). In this case, protection can be done so that the results on the protected data are the same as on the original data. Nevertheless, in this case, the best approach is for the data owner and the data analyst to agree on a cryptographic protocol so that the analysis can be done with no information loss. The case of distributed data with a specific goal also falls into this class.
- Result-driven: when privacy concerns the result of applying a particular data mining method to some particular data. Protection methods have been designed so that, e.g., the association rules resulting from a dataset do not disclose sensitive information about a particular individual.
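To illustrate the computation-driven setting, here is a hedged sketch of a toy additive secret-sharing protocol: two data owners let an analyst compute the exact sum (and hence the mean) of their values without revealing any individual value. All names and the modulus choice are ours; this is a minimal illustration of the idea, not a production secure multiparty computation protocol.

```python
import random

PRIME = 2_147_483_647  # modulus for additive secret sharing (toy choice)

def share(value, n_parties=3):
    """Split an integer into n additive shares modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME

# Two data owners with private values (hypothetical toy data)
alice_salary, bob_salary = 52_000, 61_000

# Each owner splits its value; each party receives one share per owner
alice_shares = share(alice_salary)
bob_shares = share(bob_salary)

# Each party adds the shares it holds, never seeing a raw value
partial_sums = [(a + b) % PRIME for a, b in zip(alice_shares, bob_shares)]

# The analyst combines partial sums: the exact total, no information loss
total = reconstruct(partial_sums)
print(total)      # 113000
print(total / 2)  # mean: 56500.0
```

Because the shares cancel exactly, the analyst recovers the true sum, which is the sense in which a cryptographic protocol avoids the information loss inherent to perturbation.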
Data-driven methods can themselves be classified into three families, according to how they manipulate the original data:
- Perturbative: the data are distorted in some way, so the protected dataset contains some errors. The simplest approach is to add noise (additive noise). Other methods exist, e.g., microaggregation, rank swapping, multiplicative noise, and PRAM.
- Non-perturbative: the data are transformed, but no errors are introduced into the protected dataset. Protection is achieved by replacing values with less specific ones (e.g., a number is replaced by an interval). In short, non-perturbative methods reduce the level of detail of the dataset.
- Synthetic data generators: the data are not distorted; instead, new data are created and used to replace the original ones. Some claim that synthetic data avoids disclosure risk, but this is not so if the synthetic data has enough quality. See our paper at PSD 2006 (full reference here).
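To make the three families concrete, the following sketch applies the simplest representative of each to an invented toy attribute: additive noise (perturbative), replacing numbers by intervals (non-perturbative), and sampling fresh records from a distribution fitted to the original data (synthetic). The data and parameter choices are ours, purely for illustration.

```python
import random
import statistics

random.seed(0)

ages = [23, 25, 31, 38, 38, 41, 47, 52, 58, 64]  # toy original attribute

# Perturbative: additive noise -- the protected data contain errors
noisy = [a + random.gauss(0, 3) for a in ages]

# Non-perturbative: generalization -- each value replaced by its
# 10-year interval, reducing detail without introducing errors
intervals = [(a // 10 * 10, a // 10 * 10 + 9) for a in ages]

# Synthetic: new data sampled from a model fitted to the original
# (here, a normal distribution with the original mean and stdev)
mu, sigma = statistics.mean(ages), statistics.stdev(ages)
synthetic = [round(random.gauss(mu, sigma)) for _ in ages]

print(intervals[0])                      # (20, 29)
print(round(statistics.mean(noisy), 1))  # close to the original mean 41.7
```

Note the utility/safety trade-off from the introduction at work: the noisy and synthetic versions preserve aggregate statistics approximately, while the intervals preserve them only up to the chosen level of detail.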
Torra, V. (2017) Data Privacy: Foundations, New Developments and the Big Data Challenge, Springer. (access)
Torra, V., Navarro-Arribas, G. (2016) Big Data Privacy and Anonymization, in Privacy and Identity Management, 15-26. (open access)