Privacy Preservation in the Age of Synthetic Data - Part II

Yashwardhan Rathore

January 24, 2024

Synthetic data

Trustworthy AI


Welcome back to the second instalment of our exploration into the intricate world of privacy risk metrics on synthetic data. If you haven't already, make sure to catch up on Part I, where we delved into the necessity of these metrics post-generation and explored their theoretical and practical implementation. In an era dominated by data breaches and privacy concerns, understanding and implementing effective privacy risk metrics has become more crucial than ever.

In Part II of the series, we continue our journey by exploring specific tools and methods in privacy assessment. We'll look at Anonymeter, delve into the details of Anonymity Tests Using AryaXAI, and examine a case study that explores how privacy risk and data impurity are interconnected.

Anonymeter: Unified Framework for Quantifying Privacy Risk in Synthetic Data

https://github.com/statice/anonymeter

Anonymeter is a unified statistical framework to jointly quantify different types of privacy risks in synthetic tabular datasets. It is equipped with attack-based evaluations for the singling out, linkability, and inference risks, which are the three key indicators of factual anonymization according to the Article 29 Working Party.

The framework contains privacy evaluators that measure the singling out, linkability, and inference risks that data donors might incur after a synthetic dataset is released. Under the European General Data Protection Regulation (GDPR), these three risks are treated as the key indicators of factual anonymization.

Anonymeter takes a holistic approach to privacy risk, focusing on three pivotal aspects:

Singling Out Risk: The singling out risk is the likelihood that an attacker can use the synthetic dataset to isolate one individual in the original data, that is, to find a combination of attribute values that matches exactly one real record. Anonymeter does not add noise to defend against this. It measures the risk by generating such guesses from the synthetic data and checking how many of them single out a record in the original data. The premise: "There is only one person with attributes X, Y, and Z."
Linkability Risk: The linkability risk is the likelihood that an attacker can link two records that belong to the same individual, for example records from two external datasets that each hold a different subset of that person's attributes. Anonymeter measures it by using the synthetic data as a bridge: when both partial records are closest to the same synthetic record, the attacker treats them as one person. The challenge: "Records A and B belong to the same person."
Inference Risk: The inference risk is the probability that an attacker who already knows some attributes of an individual can use the synthetic dataset to infer the value of a secret attribute. Anonymeter measures it by letting the attacker look up the synthetic records closest to the target on the known attributes and use their secret value as the guess. The query: "A person with attributes X and Y also has Z."
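To make the singling out premise concrete, here is a minimal sketch of a univariate check in plain Python. It is not Anonymeter's implementation: the function name `singling_out_rate`, the toy rows, and the rule "guess every value that is unique in the synthetic data" are assumptions chosen for illustration.

```python
from collections import Counter


def singling_out_rate(original, synthetic, column):
    """Fraction of guesses taken from `synthetic` that match exactly one original row.

    A guess is a value of `column` that occurs once in the synthetic data
    (a hypothetical, simplified attack; not Anonymeter's own algorithm).
    """
    syn_counts = Counter(row[column] for row in synthetic)
    ori_counts = Counter(row[column] for row in original)
    guesses = [value for value, count in syn_counts.items() if count == 1]
    if not guesses:
        return 0.0
    # A guess succeeds only if it isolates exactly one real record.
    hits = sum(1 for value in guesses if ori_counts.get(value, 0) == 1)
    return hits / len(guesses)


# Toy data with made-up values.
original = [
    {"age": 39, "hours": 40},
    {"age": 50, "hours": 13},
    {"age": 38, "hours": 40},
    {"age": 53, "hours": 40},
]
synthetic = [
    {"age": 39, "hours": 40},
    {"age": 41, "hours": 40},
    {"age": 38, "hours": 45},
    {"age": 38, "hours": 40},
]

rate_age = singling_out_rate(original, synthetic, "age")
rate_hours = singling_out_rate(original, synthetic, "hours")
```

In this toy example the synthetic data leaks one unique age that also isolates a real person, while the only unique `hours` value does not exist in the original data at all.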

The singling out risk, linkability risk, and inference risk in Anonymeter can be loosely related to k-anonymity, l-diversity, and t-closeness. The correspondence is an intuition rather than a formal equivalence.

1. The singling out risk tends to fall as k rises. A higher k means each record shares its quasi-identifier values with more records, so it is less likely for an attacker to identify an individual in a synthetic dataset.

2. The linkability risk tends to fall as l rises. A higher l means each group of similar records contains more distinct sensitive values, so it is less likely for an attacker to confirm that two records belong to the same person.

3. The inference risk tends to fall as t falls. A lower t means the distribution of the sensitive attribute within each group stays closer to its overall distribution, so it is less likely for an attacker to infer sensitive information about an individual in a synthetic dataset.
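For readers who want to see what k and l measure, the following sketch computes both for a small table. The helper `k_and_l`, the choice of quasi-identifiers, and the rows are hypothetical. The column names are borrowed from the Adult Census dataset only for flavour.

```python
from collections import defaultdict


def k_and_l(rows, quasi_identifiers, sensitive):
    """Return (k, l): the smallest group size and the smallest number of
    distinct sensitive values over all groups sharing the same quasi-identifiers."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row[sensitive])
    k = min(len(values) for values in groups.values())
    l_div = min(len(set(values)) for values in groups.values())
    return k, l_div


# Made-up rows for illustration.
rows = [
    {"education": "Bachelors", "workclass": "Private", "income": "<=50K"},
    {"education": "Bachelors", "workclass": "Private", "income": ">50K"},
    {"education": "HS-grad", "workclass": "Private", "income": "<=50K"},
    {"education": "HS-grad", "workclass": "Private", "income": "<=50K"},
    {"education": "HS-grad", "workclass": "Private", "income": "<=50K"},
]

k, l_div = k_and_l(rows, ["education", "workclass"], "income")
```

Here the table is 2-anonymous, yet the HS-grad group holds a single income value, so knowing someone falls in that group reveals their income. This is the gap l-diversity was introduced to close.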

Anonymity Tests Using AryaXAI:

Step 1:

Open AryaXAI, create a new project and give it a name. Open said project and upload your data. Save your initial data configurations and true label to be predicted (income).

Step 2:

Go to the Synthetic AI tab on the left and select your preferred model. Here, we have used GPT2. Train the model and wait for the notification of Model Training completion.

Step 3:

Move on to the “Anonymity Test” tab, which appears after you click the show option on the right. Here, select the Auxiliary columns, which are the columns that may be available to the attacker.

Select - workclass, education, hours-per-week, capital-loss and capital-gain and click Submit.

Step 4:

You will be able to see the values associated with each metric for Privacy Evaluation on the screen. The lower each risk value, the better.

1. SinglingOutEvaluator (The lower the risk, the better):

The SinglingOutEvaluator measures how much the synthetic data can help an attacker find a combination of attributes that single out records in the training data.

We can infer the robustness of the synthetic data to "univariate" singling out attacks, which try to find unique values of some attributes that single out an individual.

The risk estimate is accompanied by a confidence interval (at the 95% level by default), which accounts for the finite number of attacks performed, 500 in this case.
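To see why a finite number of attacks calls for an interval rather than a single number, here is a sketch of the Wilson score interval, a standard interval for a binomial success rate. Treat it as an illustration under assumptions: the function name, the count of 10 successful attacks, and the confidence level passed in are made up, and the interval shown on the AryaXAI screen may be computed differently.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n_attacks, confidence_level):
    """Wilson score interval for the success rate of `n_attacks` independent attacks."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    p_hat = successes / n_attacks
    denom = 1 + z**2 / n_attacks
    centre = (p_hat + z**2 / (2 * n_attacks)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / n_attacks + z**2 / (4 * n_attacks**2)) / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical outcome: 10 of 500 attacks succeeded.
low, high = wilson_interval(successes=10, n_attacks=500, confidence_level=0.95)

# Fewer attacks give a wider interval for the same observed rate.
low_small, high_small = wilson_interval(successes=1, n_attacks=50, confidence_level=0.95)
```

The second call uses the same observed success rate (2%) but only 50 attacks, which yields a visibly wider interval. That is the effect the confidence interval is meant to communicate.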

The SinglingOutEvaluator can also attack the dataset using predicates, which combine different attributes. These are the so-called multivariate predicates.

As visible, the attack is not very successful, and the singling out risk is low.

2. LinkabilityEvaluator (The lower the risk, the better):

To run the LinkabilityEvaluator, one needs to specify which columns of auxiliary information are available to the attacker. This is done using the Auxiliary columns field, which we filled in earlier.

As visible, the attack is not very successful, and the linkability risk is low.
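The idea behind a linkability attack can be sketched in a few lines. This is a simplified illustration, not the evaluator's code: it assumes the attacker holds two disjoint column sets for each original record, uses a plain count of matching values as the similarity, and considers only the single closest synthetic row. All names and rows are hypothetical.

```python
def nearest_index(record, synthetic, cols):
    """Index of the synthetic row sharing the most values with `record` on `cols`,
    or None when no synthetic row shares any value."""
    def overlap(i):
        return sum(1 for c in cols if synthetic[i][c] == record[c])

    best = max(range(len(synthetic)), key=overlap)
    return best if overlap(best) > 0 else None


def linkability_success_rate(originals, synthetic, cols_a, cols_b):
    """Fraction of original records whose two halves point to the same synthetic row."""
    hits = 0
    for record in originals:
        a = nearest_index(record, synthetic, cols_a)
        b = nearest_index(record, synthetic, cols_b)
        if a is not None and a == b:
            hits += 1
    return hits / len(originals)


# Made-up rows; the two column sets play the role of two external datasets.
synthetic = [
    {"workclass": "Private", "education": "Bachelors", "hours-per-week": 40, "capital-gain": 0},
    {"workclass": "State-gov", "education": "Masters", "hours-per-week": 35, "capital-gain": 5000},
]
originals = [
    {"workclass": "Private", "education": "Bachelors", "hours-per-week": 40, "capital-gain": 0},
    {"workclass": "State-gov", "education": "Masters", "hours-per-week": 40, "capital-gain": 0},
]

rate = linkability_success_rate(
    originals,
    synthetic,
    cols_a=["workclass", "education"],
    cols_b=["hours-per-week", "capital-gain"],
)
```

Only the first original record is linked, because both of its halves land on the same synthetic row. The second record's halves point to different synthetic rows, so the attacker cannot join them.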

3. InferenceEvaluator (The lower the risk, the better):

Finally, the Anonymity Test allows us to measure the inference risk. It does so by measuring the success of an attacker who tries to discover the value of some secret attribute for a set of target records.
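As a rough illustration of such an attack, the sketch below guesses a target's secret attribute from the synthetic row that agrees with the target on the most auxiliary columns. The matching rule, function names, and rows are assumptions made for this example. The actual evaluator may use a different distance and additional normalisation.

```python
def infer_secret(target, synthetic, aux_cols, secret):
    """Guess `secret` for `target` from the closest synthetic row on `aux_cols`."""
    def overlap(row):
        return sum(1 for c in aux_cols if row[c] == target[c])

    # max() keeps the first row in case of ties.
    return max(synthetic, key=overlap)[secret]


def inference_success_rate(targets, synthetic, aux_cols, secret):
    """Fraction of targets whose secret value is guessed correctly."""
    hits = sum(
        1 for t in targets if infer_secret(t, synthetic, aux_cols, secret) == t[secret]
    )
    return hits / len(targets)


# Made-up rows for illustration.
synthetic = [
    {"workclass": "Private", "education": "Bachelors", "income": ">50K"},
    {"workclass": "Private", "education": "HS-grad", "income": "<=50K"},
    {"workclass": "State-gov", "education": "Masters", "income": ">50K"},
]
targets = [
    {"workclass": "Private", "education": "Bachelors", "income": ">50K"},
    {"workclass": "Private", "education": "HS-grad", "income": ">50K"},
    {"workclass": "State-gov", "education": "Masters", "income": ">50K"},
    {"workclass": "Self-emp", "education": "HS-grad", "income": "<=50K"},
]

rate = inference_success_rate(
    targets, synthetic, aux_cols=["workclass", "education"], secret="income"
)
```

A raw success rate like this one overstates the privacy risk when the secret is easy to guess from general patterns alone, which is why such a rate is normally compared against a baseline before it is read as a risk.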

Using these privacy metrics, we can judge whether our synthetic dataset meets the privacy requirements we have set for it.

Exploring the Intricate Relationship Between Privacy Risk and Data Impurity: A Case Study

As our journey into understanding data privacy and its intricate interplay with data quality continues, we embark on a captivating case study that delves into the realms of the Adult Census dataset.

This study aims to unravel the fascinating correlation between privacy scores and data quality metrics as more and more real data is added to synthetic data. Through a series of carefully designed experiments, we aim to discern whether the sensitivity of privacy risks alters as we progressively integrate varying proportions of real data with synthetically generated information.

Intuition suggests that two things should happen as more of the real data (the data on which the generative model was trained) is added to the synthetic data, making it impure:

1. Privacy risk should increase, and thus the privacy metrics reported by Anonymeter should increase in value.
2. The distribution of the mixed dataset should become statistically closer to that of the real data, and hence the data quality score in the SDV quality report should increase.
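One way to build such impure datasets is sketched below. It is an assumed procedure, not necessarily the one used for the spreadsheet: the helper `mix_real_into_synthetic` keeps the dataset size fixed by swapping a chosen fraction of synthetic rows for rows sampled from the real training data.

```python
import random


def mix_real_into_synthetic(synthetic, real, real_fraction, seed=0):
    """Return a dataset of the same size as `synthetic` in which a fraction
    `real_fraction` of the rows has been replaced by sampled real rows."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    rng = random.Random(seed)
    # Cannot leak more real rows than exist.
    n_real = min(round(real_fraction * len(synthetic)), len(real))
    kept = rng.sample(synthetic, len(synthetic) - n_real)
    leaked = rng.sample(real, n_real)
    mixed = kept + leaked
    rng.shuffle(mixed)
    return mixed


# Placeholder rows; the "source" field only makes the mixture easy to inspect.
real = [{"id": i, "source": "real"} for i in range(10)]
synthetic = [{"id": i, "source": "synthetic"} for i in range(10)]

mixed = mix_real_into_synthetic(synthetic, real, real_fraction=0.3, seed=42)
n_real_rows = sum(1 for row in mixed if row["source"] == "real")
```

Running the privacy and quality evaluations on the output for a range of `real_fraction` values (for example 0.0, 0.1, 0.2, and so on) would reproduce the shape of the experiment described here.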

Let us put this to the test and check out the analysis of this case study:

Try this on your own and tally it with the results in the Google sheet provided: https://docs.google.com/spreadsheets/d/1_g0s4T8Tng-PtDXBwdhO4oqM4be1ogYIAyx1P5eDmgQ/edit?usp=sharing

Analysis of the experimentation:

1. Privacy Risk Dynamics:

As anticipated, our findings showed that mixing real data into the synthetic data went hand in hand with increased privacy risk. Specifically, the privacy metrics provided by Anonymeter (SinglingOutEvaluator, LinkabilityEvaluator, and InferenceEvaluator) exhibited a noticeable upward trend in their values as the proportion of real data rose. This observation aligns with our understanding that introducing genuine personal records enhances an attacker's ability to single out individuals, link records, and infer sensitive attributes from the released dataset.

2. Impact on Data Quality:

Beyond privacy concerns, we dived into data quality metrics using the SDV quality report as our guiding beacon. Intriguingly, our results affirmed that as we integrated a greater share of real data with the synthetically generated counterpart, the data quality metric experienced a consistent rise. This phenomenon finds its roots in the fact that the statistical distribution of the synthetic data becomes progressively similar to the real data distribution, thereby enhancing the alignment of data quality attributes.
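To give a feel for how a distribution-similarity score behaves, here is a sketch of one such score for a categorical column: one minus the total variation distance between the real and synthetic category frequencies. The function name `tv_complement` and the toy columns are assumptions for illustration. The SDV quality report aggregates several metrics, and this sketch does not reproduce its overall score.

```python
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """1 - total variation distance between two categorical columns.

    Returns 1.0 for identical frequency distributions and 0.0 for disjoint ones.
    """
    real_freq = Counter(real_values)
    syn_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(syn_freq)
    n_real, n_syn = len(real_values), len(synthetic_values)
    tvd = 0.5 * sum(
        abs(real_freq[c] / n_real - syn_freq[c] / n_syn) for c in categories
    )
    return 1.0 - tvd


# Made-up categorical column.
real_col = ["A", "A", "B", "B"]
synthetic_col = ["A", "A", "A", "B"]
# Same synthetic column after one row has been replaced by a real "B" row.
mixed_col = ["A", "A", "B", "B"]

score_synthetic = tv_complement(real_col, synthetic_col)
score_mixed = tv_complement(real_col, mixed_col)
```

The score rises once a real row replaces a synthetic one, which mirrors the trend reported in the case study on a much smaller scale.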

Conclusion:

In exploring privacy risk techniques, we've navigated the intricacies of k-anonymity, l-diversity, and t-closeness, understanding their roles in protecting individual privacy while preserving data utility. The journey deepened as we engaged with Anonymeter, a powerful framework that quantifies privacy risks across singling out, linkability, and inference domains. Our practical implementation using the Adult Census dataset highlighted the intrinsic relationship between privacy risks and data quality metrics, underscoring the delicate equilibrium between safeguarding sensitive information and maintaining dataset accuracy.

Our case study echoed the larger narrative: as genuine data seeped into synthetically generated datasets, privacy risks rose in step with data quality scores. This correlation exposes the trade-off between privacy preservation and data utility, since the same change that made the data more faithful also made it more revealing, and it emphasizes the pivotal role of ethical data practices in our data-driven world. As we tread the path of privacy risk and mitigation, we embrace the challenge of balancing these two imperatives, working toward a future where privacy and data quality walk hand in hand.

You can refer to Part I of the blog for the foundation.
