
Data privacy

Privacy-preserving data mining

Statistical disclosure control

  • Transparency and privacy: This includes transparency-aware masking methods and disclosure risk assessment under the transparency principle. Details here.

  • Information disclosure. There are two main types of disclosure:
    • Identity disclosure. We have identity disclosure when we are able to identify someone in a database. Several privacy models focus on this type of disclosure (e.g., privacy from re-identification, k-anonymity, uniqueness). Identity disclosure is a type of disclosure that we need to avoid in most cases.
    • Attribute disclosure. We have attribute disclosure when we increase our knowledge about a particular attribute of a particular individual. Some authors distinguish attribute disclosure from inference (or inferential) disclosure. The difference is whether this additional knowledge is obtained from the database directly or inferred from the database using, e.g., a statistical model. While it is usual to avoid identity disclosure, the extent to which we need to avoid attribute disclosure is not always so clear. When we build statistical and machine learning models, it is usual to expect some increase in the knowledge about particular individuals.
      Several privacy models focus on this type of disclosure. An example is privacy for interval disclosure. Differential privacy can also be seen from this perspective: its goal is to prevent inferring whether an individual was present or absent in the database used to compute a function. Integral privacy has a similar goal. Secure multiparty computation ensures that the parties involved learn nothing but the outcome of the computation; in particular, they do not learn any unauthorized attribute of any individual whose data is used in the computation. Some extensions of k-anonymity, such as l-diversity, are defined to avoid attribute disclosure.

Privacy models:
  • A privacy model is a formal definition of privacy that lets us then build algorithms and validate them against that formal definition. Examples of privacy models include the following:
    • Privacy from re-identification. When a database is made available for analysis, we want to prevent anyone from identifying an individual in the database. This applies to any type of database, from standard SQL databases to, e.g., NoSQL ones. For example, for graphs representing social networks, re-identification occurs when we know that a node of the graph is a particular person.
    • k-Anonymity. This privacy model is also related to privacy from re-identification. Here we require that, when an intruder searches for an individual using some prior knowledge, at least k individuals in the database match that knowledge exactly, so any of them could be the one being sought. For a standard database, this means that, given the values to be searched for, there are at least k records in the database with exactly those values.
    • Differential privacy. This model concerns the computation of a query or a function on a database. Its objective is to prevent anyone from learning, from the output of the function or query, that the data of a particular individual was used. A function satisfies differential privacy when its output (more precisely, the probability distribution of its output) does not change much with the presence or absence of any single individual.
    • Integral privacy. We proposed this privacy model as an alternative to differential privacy. Information and results here.
    • Secure multiparty computation. The goal is to compute a function in a distributed way, i.e., using data from different parties, so that the only knowledge the parties acquire from the process is the result of the function. No additional knowledge should be obtained. The parties in a secure multiparty computation should learn only what they would learn if, instead of a distributed approach, a centralized one were used (with a trusted third party computing the result of the function).
    We have worked with all these models. Some of our results are reported below focusing on the protection procedures and measures of risk and utility.
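
As an illustration of the k-anonymity definition above, the following minimal sketch checks whether a table satisfies k-anonymity with respect to a chosen set of attributes. The function name `satisfies_k_anonymity`, the attribute names, and the toy records are all hypothetical and introduced only for this example; which attributes an intruder can search on (the quasi-identifiers) is an assumption that has to be made for each database.

```python
from collections import Counter

def satisfies_k_anonymity(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values that
    appears in `records` is shared by at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical toy table: values are invented and already generalized.
records = [
    {"zip": "081**", "age": "30-39", "illness": "flu"},
    {"zip": "081**", "age": "30-39", "illness": "asthma"},
    {"zip": "082**", "age": "40-49", "illness": "flu"},
    {"zip": "082**", "age": "40-49", "illness": "diabetes"},
]

print(satisfies_k_anonymity(records, ["zip", "age"], k=2))  # True
print(satisfies_k_anonymity(records, ["zip", "age"], k=3))  # False
```

Each (zip, age) combination appears exactly twice, so the table is 2-anonymous but not 3-anonymous for these two attributes.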

Protection methods:
  • Protection methods implement privacy models. This is why there is a close relationship between families of protection methods and privacy models.
    Examples of data protection methods include:
    • Masking methods / data anonymization procedures. They are methods for achieving privacy from re-identification and k-anonymity. They are also used for local differential privacy. Masking methods are applied to databases (standard SQL and non-standard ones) to reduce their quality so that disclosure of information is avoided. The three main groups of methods are perturbative methods, non-perturbative methods, and synthetic data generators.
    • Methods to achieve differential privacy. These include additive noise using the Laplace distribution, multiplicative noise, and randomization.
    • Cryptographic protocols for secure multiparty computation. Each function to be computed under the secure multiparty computation privacy model needs a specific cryptographic protocol.
    Information and results here.
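
To make the mention of additive Laplace noise concrete, here is a minimal sketch of a noisy counting query. It assumes the usual setting in which a count changes by at most 1 when one individual is added or removed, so noise with scale 1/epsilon is used; the privacy parameter `epsilon`, the function names, and the toy data are assumptions of this example and do not come from the text above. It is an illustration only and is not hardened against the floating-point issues that a real implementation has to consider.

```python
import random

def laplace_noise(scale, rng=random):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=random):
    """Counting query with additive Laplace noise.

    Adding or removing one record changes the count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical data: ages of seven individuals.
ages = [23, 35, 41, 52, 67, 29, 48]
rng = random.Random(0)  # seeded only to make the sketch repeatable
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(noisy)  # the true count is 4; the printed value is 4 plus noise
```

Smaller values of `epsilon` give larger noise and therefore stronger protection at the cost of a less accurate answer.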

  • Information loss and utility measures. Masking methods apply a transformation to a database to reduce its quality with the aim of avoiding the disclosure of confidential information. Information loss measures quantify the information that is lost.
    As we report in our book, masking does not always entail information loss. Several authors have shown that, for some types of perturbation and data, disclosure risk can be reduced with no information loss. E.g., machine learning models learnt from masked data do not necessarily lose accuracy and can even be slightly more accurate in some cases. Of course, this depends on the type of data and the way the model is built. We also have results in this direction (here).
    Information loss measures depend on the type of data we have (e.g., standard numerical database vs. graphs or documents) and the data uses (e.g., regression, clustering):
    • Generic information loss measures. They are based on statistics (such as means, variances, correlations, and contingency tables). They give a general measure of the perturbation suffered by the data, especially when we do not know much about the possible uses of the database.
    • Specific information loss measures. They are based on an actual analysis of the data. E.g., compare the accuracy of a model extracted from the original database with the accuracy of a model extracted from the masked database.
    We can formulate information loss between a database X and a masked database X' for an analysis f as follows:
    IL_f(X, X') = divergence(f(X), f(X'))
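
The formulation above can be sketched in a few lines. In this example the analysis f is the vector of per-attribute means and the divergence is the mean absolute difference between the two vectors; both choices, the function names, the synthetic data, and the additive-noise masking are assumptions made only for illustration. Any other analysis (a regression, a clustering) and any suitable divergence can be plugged in the same way.

```python
import random
from statistics import mean

def information_loss(f, divergence, X, X_masked):
    """IL_f(X, X') = divergence(f(X), f(X'))."""
    return divergence(f(X), f(X_masked))

def column_means(X):
    # Analysis f: the mean of each attribute (column) of the database.
    return [mean(col) for col in zip(*X)]

def mean_abs_diff(u, v):
    # Divergence: average absolute difference between two result vectors.
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

# Hypothetical numerical database and a masked version (additive noise).
rng = random.Random(0)
X = [[rng.gauss(50, 10), rng.gauss(170, 8)] for _ in range(200)]
X_masked = [[x + rng.gauss(0, 2) for x in row] for row in X]

print(information_loss(column_means, mean_abs_diff, X, X_masked))
```

With these choices the measure is 0 when the masked file preserves every attribute mean and grows as the means drift apart.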

  • Disclosure risk measures. Masking methods apply a transformation to a database to reduce its quality with the aim of avoiding the disclosure of confidential information. When the privacy model is privacy from re-identification, an arbitrary modification of the database does not guarantee that re-identification is impossible. Risk measures aim to assess this risk.
    There are two main types of disclosure in a database:
    • Identity disclosure occurs when we are able to identify someone in a (masked) database. Uniqueness and measures based on record linkage are examples of disclosure risk measures for identity disclosure.
    • Attribute disclosure occurs when we increase our knowledge about an attribute of an individual. An example of an attribute disclosure measure is the difference (for a given individual) between the value inferred for an attribute and its real value.
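
A record-linkage-based measure of identity disclosure can be sketched as follows. The sketch assumes an intruder who holds the original file, links each masked record to its nearest original record under the (unweighted) Euclidean distance, and that the i-th masked record was derived from the i-th original record. The function names and toy data are hypothetical, and this is only one of many possible ways to set up such a measure.

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two numerical records.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def reidentification_rate(X, X_masked):
    """Fraction of masked records whose nearest original record is the
    record they were derived from (X_masked[i] is assumed to mask X[i])."""
    hits = 0
    for i, masked in enumerate(X_masked):
        nearest = min(range(len(X)), key=lambda j: sq_dist(masked, X[j]))
        hits += (nearest == i)
    return hits / len(X)

# Hypothetical original file and two masked versions of it.
X = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]]
lightly_masked = [[0.5, -0.3], [9.6, 10.2], [20.1, 19.7]]
swapped = [[10.0, 10.0], [0.0, 0.0], [20.0, 20.0]]

print(reidentification_rate(X, lightly_masked))  # 1.0
print(reidentification_rate(X, swapped))         # 0.3333333333333333
```

A rate close to 1 means that almost every masked record can be linked back to its source, i.e., the masking gives little protection against this kind of intruder.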
    We have worked on different topics related to risk assessment. E.g.,
    • on disclosure risk assessment for the worst-case scenario using supervised ML for estimating risk using record linkage with distances based on [1] bilinear forms [2], Choquet integral [3], as well as using the weighted Euclidean distance,
    • on disclosure risk assessment under transparency attacks (for microaggregation [4], for rank swapping [5], [6]),
    • on disclosure risk assessment for synthetic data, showing that in some cases re-identification is possible [7].


Cite this website as:
V. Torra, Data privacy, Springer, 2017. Associated website:

Vicenç Torra, Last modified: 13 : 13 March 03 2020.