Large Data Set:
This web page provides two very large continuous data sets for entity and
attribute disclosure risk evaluation.
The first data set, called the very large Census dataset, was extracted using
the Data Extraction System of the U.S. Census Bureau and initially contains
149,642 records composed of 13 attributes. It was created in the European
CASC project [2]. A complete description of its attributes can be found on
the CASC project web page. The second data set, called Forest, has been
extracted from the Forest CoverType dataset, available at the UCI KDD data
repository [3]. From the original 54 attributes we have selected only the
real-valued ones (ten attributes).
For both datasets, we have discarded those records which become duplicates
when the last attribute of the dataset is not considered. The
resulting datasets have the following numbers of records and
attributes: the very large Census dataset has 124,998 records with 13
attributes, and Forest has 581,009 records with 10 attributes.
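
A minimal sketch of this de-duplication step, assuming the raw data are
loaded into a pandas DataFrame whose columns follow the order described
above; the file name is hypothetical, and keeping the first occurrence of
each duplicate group is our assumption, since the text does not state which
copy is retained:

    import pandas as pd

    # Hypothetical file name; the actual raw files are those described above.
    raw = pd.read_csv("census_raw.csv")

    # Drop records that become duplicates once the last (confidential)
    # attribute is ignored.  Keeping the first occurrence of each duplicate
    # group is an assumption.
    key_columns = list(raw.columns[:-1])   # all attributes except the last one
    deduplicated = raw.drop_duplicates(subset=key_columns, keep="first")

    print(len(raw), "->", len(deduplicated), "records after de-duplication")
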
We have considered the last attribute of each dataset as a
confidential one; therefore, the SDC protection methods are always
applied to all the attributes of the original dataset except the
last one. For the sake of simplicity, in each dataset we have
relabeled this last attribute in two ways, so as to obtain either 5 or
10 different possible values for it. To do this, for example in the
case of 5 different values, we have sorted the dataset X (which
contains n records) according to this confidential attribute, and
then assigned the first confidential value to the first n/5 records,
the second value to the following n/5 records, and so on.
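
A minimal sketch of this relabeling, assuming the (de-duplicated) dataset is
held in a pandas DataFrame whose last column is the confidential attribute;
the function name and the default k = 5 are illustrative:

    import numpy as np
    import pandas as pd

    def relabel_confidential(X: pd.DataFrame, k: int = 5) -> pd.DataFrame:
        """Sort X by its last (confidential) attribute and replace it with
        k labels: label 0 for the first n/k records, label 1 for the next
        n/k records, and so on."""
        confidential = X.columns[-1]
        X = X.sort_values(confidential).reset_index(drop=True)
        n = len(X)
        X[confidential] = (np.arange(n) * k) // n   # block index of each record
        return X

Calling relabel_confidential(X, k=10) produces the 10-value variant in the
same way.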