
Data Privacy

Privacy Preserving Data Mining (PPDM)

Statistical Disclosure Control (SDC)


Large Data Sets:
  • This web page provides two very large continuous data sets for evaluating entity and attribute disclosure risk. The first data set, called very large Census, was extracted using the Data Extraction System of the U.S. Census Bureau and initially contained 149,642 records with 13 attributes. It was created in the European CASC project [2]; a complete description of its attributes can be found on the CASC project web page. The second data set, called Forest, was extracted from the Forest CoverType dataset, available at the UCI KDD data repository [3]. Of the original 54 attributes, we selected only the ten real-valued ones.

    For both datasets, we discarded the records that are duplicated when the last attribute of the dataset is not considered. The resulting datasets have the following sizes: very large Census has 124,998 records with 13 attributes, and Forest has 581,009 records with 10 attributes.

    We treat the last attribute of each dataset as confidential; therefore, the SDC protection methods are always applied to all attributes of the original dataset except the last one. For simplicity, we relabeled this last attribute in each dataset in two ways, so as to obtain either 5 or 10 possible values for it. For example, in the case of 5 values, we sorted the dataset X (which contains n records) by this confidential attribute and then assigned the first confidential value to the first n/5 records, the second value to the following n/5 records, and so on.
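The deduplication and relabeling steps described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the released files. The function names are hypothetical, and two details are assumptions that the description does not settle: the first record of each duplicate group is kept, and when n is not divisible by the number of values the group sizes differ by at most one record.

```python
def drop_duplicates_ignoring_last(records):
    """Drop records that repeat an earlier one on all attributes but the last.

    Assumption: the first record of each duplicate group is kept.
    """
    seen = set()
    kept = []
    for rec in records:
        key = tuple(rec[:-1])  # compare everything except the confidential attribute
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def relabel_confidential(records, k):
    """Return a copy where the last attribute is replaced by a label in 1..k.

    Records are ranked by the confidential (last) attribute; the first n/k
    records in that order get label 1, the next n/k get label 2, and so on.
    Ties keep their original relative order because sorted() is stable.
    """
    n = len(records)
    order = sorted(range(n), key=lambda i: records[i][-1])
    relabeled = [list(rec) for rec in records]
    for rank, i in enumerate(order):
        relabeled[i][-1] = rank * k // n + 1
    return relabeled
```

Calling `relabel_confidential(data, 5)` and `relabel_confidential(data, 10)` on the deduplicated records would give the two relabeled variants mentioned in the text. The input records are not modified.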

  • In [1] we protected these datasets with different parameterizations of classical SDC methods such as noise addition and rank swapping. The aim of that work was to evaluate both the entity and attribute disclosure risk of classical SDC methods using kNN techniques. Four examples of these protections can be downloaded here(link), and the results obtained are described in full in [1].

    We have released these data sets to make it easy to compare our work with results obtained using different disclosure risk methods.
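To make the protection setup concrete, here is a minimal sketch of additive noise applied to every attribute except the confidential last one. It is a generic textbook form of noise addition, not necessarily the variant or parameterization used in [1]. The function name, the parameter `alpha` (noise standard deviation as a fraction of each attribute's standard deviation), and the fixed `seed` are assumptions made for illustration.

```python
import random
import statistics


def add_noise(records, alpha, seed=0):
    """Add zero-mean Gaussian noise to all attributes except the last.

    Hypothetical parameterization: for attribute j the noise standard
    deviation is alpha times the standard deviation of attribute j.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    m = len(records[0]) - 1    # number of non-confidential attributes
    sds = [statistics.pstdev([rec[j] for rec in records]) for j in range(m)]
    protected = []
    for rec in records:
        noisy = [rec[j] + rng.gauss(0.0, alpha * sds[j]) for j in range(m)]
        protected.append(noisy + [rec[-1]])  # confidential attribute is left as is
    return protected
```

Larger values of `alpha` perturb the data more, which is the kind of trade-off between protection and disclosure risk that different parameterizations explore.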


Cite this site as:
V. Torra, Data privacy, Springer, 2017. Associated website: http://www.ppdm.cat/dp/

Vicenç Torra. Last modified: 15:34, December 11, 2014.