With access to vast amounts of data, privacy protection is more important than ever. Among various de-identification (anonymization) techniques, k-anonymous microaggregation has been widely studied since it allows a balance between confidentiality and data utility. Although many microaggregation methods have been proposed to reduce information loss and/or computational complexity, machine learning (ML) models trained on the resulting aggregated data are often less effective than expected. Motivated by the fact that ML models can be heavily influenced by even slightly distorted training data, we examine the performance of microaggregation in terms of not only data privacy but also data utility. In this paper, we propose Util-MA, a new utility-embraced microaggregation framework for effective ML applications. Specifically, unlike prior studies that apply microaggregation techniques directly to raw data, we design a unified framework that can potentially enhance data utility while preserving k-anonymity through preprocessing steps including dimensionality reduction and clustering. Using real-world datasets, we empirically demonstrate the superiority of Util-MA over benchmark microaggregation methods in terms of classification accuracy. Moreover, we investigate the importance of preprocessing by measuring key performance indicators (KPIs) of clustering; the clustering stage of Util-MA leads to high classification performance when the clustering results substantially coincide with the ground truth labels. We also establish a close relationship between the KPIs of clustering and the classification accuracies, which tends to emerge when Util-MA shows a gain over the benchmark method. Our framework is microaggregation-model-agnostic; thus, the underlying microaggregation model can be chosen according to one's needs and ML tasks.
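The idea of clustering before microaggregating can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed-size grouping heuristic (sorting by coordinate sum), and the decision to reject clusters smaller than k are all assumptions made for illustration. The dimensionality-reduction step is omitted, and the cluster labels are taken as given.

```python
from collections import defaultdict
from statistics import fmean


def microaggregate(records, k):
    """Replace each record with the centroid of a group of at least k records.

    Hypothetical fixed-size heuristic (not from the paper): sort records by
    coordinate sum, cut the ordering into runs of k, and fold a short final
    run into the previous group.
    """
    if len(records) < k:
        raise ValueError("need at least k records to satisfy k-anonymity")
    order = sorted(range(len(records)), key=lambda i: sum(records[i]))
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    dims = len(records[0])
    out = [None] * len(records)
    for group in groups:
        centroid = tuple(fmean(records[i][d] for i in group) for d in range(dims))
        for i in group:
            out[i] = centroid
    return out


def cluster_then_microaggregate(records, labels, k):
    """Microaggregate separately inside each precomputed cluster.

    Assumption: `labels` comes from an earlier clustering step; clusters with
    fewer than k records raise an error instead of being merged.
    """
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)
    out = [None] * len(records)
    for idxs in by_cluster.values():
        aggregated = microaggregate([records[i] for i in idxs], k)
        for i, row in zip(idxs, aggregated):
            out[i] = row
    return out


# Toy data: two well-separated clusters, k = 3.
records = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8),
           (8.0, 9.0), (8.5, 9.5), (7.9, 9.1), (8.2, 8.8)]
labels = [0, 0, 0, 1, 1, 1, 1]
anonymized = cluster_then_microaggregate(records, labels, k=3)
```

Because centroids are computed only within a cluster, every released record is shared by at least k originals while records from different clusters are never averaged together. Under the assumptions above, this is one way clustering that agrees with the class labels could limit the distortion seen by a downstream classifier.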
|Number of pages|12|
|Publication status|Published - 2022|
Bibliographical note
Funding Information:
This work was supported in part by the National Research Foundation of Korea (NRF) Grant by the Korean Government through MSIT under Grant 2021R1A2C3004345, in part by the Institute of Information and Communications Technology Planning and Evaluation (IITP).
© 2013 IEEE.
All Science Journal Classification (ASJC) codes
- Computer Science(all)
- Materials Science(all)
- Electrical and Electronic Engineering