Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Papers
Why data anonymization has not taken off
MJ Schneider, JB, D Iacobucci. Consumer Needs and Solutions, 2025. Companies are looking to data anonymization research – including differentially private and synthetic data methods – for simple and straightforward compliance solutions. But data anonymization has not taken off in practice because it is anything but simple to implement. For one, it requires making complex choices which are case-dependent, such as the domain of the dataset to anonymize; the units to protect; the scope to which the data protection should extend; and the standard of protection. Each variation of these choices changes the very meaning, as well as the practical implications, of differential privacy (or of any other measure of data anonymization). Yet differential privacy is frequently branded as the same privacy guarantee regardless of variations in these choices. Some data anonymization methods can be effective, but only when the insights required are much larger than the unit of protection. Given that businesses care about profitability, any solution must preserve the patterns between a firm’s data and that profitability. As a result, data anonymization solutions usually need to be bespoke and case-specific, which reduces their scalability. Companies should not expect easy wins, but rather recognize that anonymization is just one approach to data privacy with its own particular advantages and drawbacks, while the best strategies leverage the full range of approaches to data privacy and security in combination.
The Five Safes as a privacy context
JB, R Gong. Preprint, 2025. The Five Safes is a framework used by national statistical offices (NSOs) for assessing and managing the disclosure risk of data sharing. This paper makes two points. Firstly, the Five Safes can be understood as a specialization of a broader concept – contextual integrity – to the situation of statistical dissemination by an NSO. We demonstrate this by mapping the five parameters of contextual integrity onto the five dimensions of the Five Safes. Secondly, the Five Safes contextualizes narrow, technical notions of privacy within a holistic risk assessment. We demonstrate this with the example of differential privacy (DP). This contextualization allows NSOs to place DP within their Five Safes toolkit while also guiding the design of DP implementations within the broader privacy context, as delineated by both their regulation and the relevant social norms.
Property elicitation on imprecise probabilities
JB\(^\dagger\), R Derr\(^\dagger\). Working Paper, 2025. Property elicitation studies which attributes of a probability distribution can be determined by minimising a risk. We investigate a generalisation of property elicitation to imprecise probabilities (IP). This investigation is motivated by multi-distribution learning, which takes the classical machine learning paradigm of minimising a single risk over a (precise) probability and replaces it with \(\Gamma\)-maximin risk minimisation over an IP. We provide necessary conditions for elicitability of an IP-property. Furthermore, we explain what an elicitable IP-property actually elicits through Bayes pairs – the elicited IP-property is the corresponding standard property of the maximum Bayes risk distribution.
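For readers new to the topic, the display below is a minimal sketch of the two objectives being contrasted; the notation (a generic loss \(L\), an action \(a\), a credal set \(\mathcal{P}\)) is illustrative rather than taken from the paper. A property \(\Gamma\) is elicitable when some loss recovers it by risk minimisation, and the imprecise analogue replaces the single expectation with the \(\Gamma\)-maximin criterion over a credal set:
\[
\Gamma(P) \in \operatorname*{arg\,min}_{a} \, \mathbb{E}_{Y \sim P}\big[L(a, Y)\big],
\qquad
a^{\star} \in \operatorname*{arg\,min}_{a} \, \sup_{P \in \mathcal{P}} \mathbb{E}_{Y \sim P}\big[L(a, Y)\big].
\]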
Generalization bounds and stopping rules for learning with self-selected data
J Rodemann, JB. Preprint, 2025. Many learning paradigms self-select training data in light of previously learned parameters. Examples include active learning, semi-supervised learning, bandits, and boosting. Rodemann et al. (2024) unify them under the framework of 'reciprocal learning'. In this article, we address the question of how well these methods can generalize from their self-selected samples. In particular, we prove universal generalization bounds for reciprocal learning using covering numbers and Wasserstein ambiguity sets. Our results require no assumptions on the distribution of the self-selected data, only verifiable conditions on the algorithms. We prove results for both convergent and finite-iteration solutions. The latter are anytime valid, thereby giving rise to stopping rules for a practitioner seeking to guarantee the out-of-sample performance of their reciprocal learning algorithm. Finally, we illustrate our bounds and stopping rules for reciprocal learning's special case of semi-supervised learning.
Topics in privacy, data privacy and differential privacy
JB. PhD Thesis, Harvard University, 2025. In an era of unprecedented data availability and analytic capacity, the protection of individuals’ privacy in statistical data releases is becoming an increasingly difficult problem. This dissertation contributes to the theoretical and methodological foundations of statistical data privacy, largely focusing on differential privacy (DP). We begin with a multifaceted investigation into privacy from legal, economic, social, and philosophical standpoints, before turning to a formal system of DP specifications built around five core building blocks found throughout the literature: the domain, multiverse, input premetric, output premetric, and protection loss budget. This system is applied to statistical disclosure control (SDC) mechanisms used in the US Decennial Census, analyzing both the traditional method of data swapping and the contemporary TopDown Algorithm. Beyond these case studies, this dissertation explores the inferential limitations posed by DP and Pufferfish privacy in both frequentist and Bayesian settings, establishing general bounds under mild assumptions. It further addresses the challenges of applying DP to complex survey pipelines, incorporating issues such as sampling, weighting, and imputation. Finally, it contextualizes DP within broader frameworks of data privacy, namely the Five Safes and contextual integrity, advocating for a more integrated approach to privacy that respects statistical utility, transparency, and societal norms.
A refreshment stirred, not shaken (III): Can swapping be differentially private?
JB, R Gong, XL Meng. To appear in Data Privacy Protection and the Conduct of Applied Research: Methods, Approaches and Their Consequences, 2025. The quest for a precise and contextually grounded answer to the question in the present paper's title resulted in this stirred-not-shaken triptych, a phrase that reflects our desire to deepen the theoretical basis, broaden the practical applicability, and reduce the misperception of differential privacy (DP)—all without shaking its core foundations. Indeed, given the existence of more than 200 formulations of DP (and counting), before even attempting to answer the titular question one must first precisely specify what it actually means to be DP. Motivated by this observation, a theoretical investigation into DP's fundamental essence resulted in Part I of this trio, which introduces a five-building-block system explicating the who, where, what, how and how much aspects of DP. Instantiating this system in the context of the United States Decennial Census, Part II then demonstrates the broader applicability and relevance of DP by comparing a swapping strategy like that used in 2010 with the TopDown Algorithm—a DP method adopted in the 2020 Census. This paper provides nontechnical summaries of the preceding two parts as well as new discussion—for example, on how greater awareness of the five building blocks can thwart privacy theatrics; how our results bridging traditional SDC and DP allow a data custodian to reap the benefits of both these fields; how invariants impact disclosure risk; and how removing the implicit reliance on aleatoric uncertainty could lead to new generalizations of DP.
JB, R Gong, XL Meng. Preprint, 2025. Through the lens of the system of differential privacy specifications developed in Part I of a trio of articles, this second paper examines two statistical disclosure control (SDC) methods for the United States Decennial Census: the Permutation Swapping Algorithm (PSA), which is similar to the 2010 Census's disclosure avoidance system (DAS), and the TopDown Algorithm (TDA), which was used in the 2020 DAS. To varying degrees, both methods leave unaltered some statistics of the confidential data – which are called the method's invariants – and hence neither can be readily reconciled with differential privacy (DP), at least as it was originally conceived. Nevertheless, we establish that the PSA satisfies \(\varepsilon\)-DP subject to the invariants it necessarily induces, thereby showing that this traditional SDC method can in fact still be understood within our more-general system of DP specifications. By a similar modification to \(\rho\)-zero concentrated DP, we also provide a DP specification for the TDA. Finally, as a point of comparison, we consider the counterfactual scenario in which the PSA was adopted for the 2020 Census, resulting in a reduction in the nominal privacy loss, but at the cost of releasing many more invariants. Therefore, while our results explicate the mathematical guarantees of SDC provided by the PSA, the TDA and the 2020 DAS in general, care must be taken in their translation to actual privacy protection – just as is the case for any DP deployment.
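As a rough schematic of what '\(\varepsilon\)-DP subject to invariants' asserts (the notation is illustrative and elides the paper's precise specification): for a mechanism \(M\), any output set \(S\), and any pair of neighbouring datasets \(x, x'\) sharing the same invariants, \(\mathrm{inv}(x) = \mathrm{inv}(x')\),
\[
\Pr\big[M(x) \in S\big] \;\le\; e^{\varepsilon}\, \Pr\big[M(x') \in S\big].
\]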
A refreshment stirred, not shaken (I): Five building blocks of differential privacy
JB, R Gong, XL Meng. In preparation, 2025.
M Kakooei et al. Preprint, 2024. Accurate Land Use and Land Cover (LULC) maps are essential for understanding the drivers of sustainable development, in terms of the complex interrelationships between human activities and natural resources. However, existing LULC maps often lack precise urban and rural classifications, particularly in diverse regions like Africa. This study presents a novel construction of a high-resolution rural-urban map using deep learning techniques and satellite imagery. We developed a deep learning model based on the DeepLabV3 architecture, which was trained on satellite imagery from Landsat-8 and the ESRI LULC dataset, augmented with human settlement data from the GHS-SMOD. The model utilizes semantic segmentation to classify land into detailed categories, including urban and rural areas, at a 10-meter resolution. Our findings demonstrate that incorporating LULC along with urban and rural classifications significantly enhances the model's ability to accurately distinguish between urban, rural, and non-human-settlement areas. Therefore, our maps can support more informed decision-making for policymakers, researchers, and stakeholders. We release a continent-wide urban-rural map covering the years 2016 and 2022.
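As a hedged illustration of the kind of model described, and not the authors' actual pipeline (the class count, input bands, and library choices below are assumptions), a DeepLabV3 segmentation head can be instantiated with torchvision and adapted to multispectral imagery and LULC labels:

import torch
from torch import nn
from torchvision.models.segmentation import deeplabv3_resnet50

NUM_CLASSES = 10   # assumed number of LULC classes, including urban/rural settlement labels
IN_CHANNELS = 7    # assumed number of Landsat-8 bands used as input

model = deeplabv3_resnet50(weights=None, num_classes=NUM_CLASSES)
# Swap the first convolution so the ResNet backbone accepts multispectral input
model.backbone.conv1 = nn.Conv2d(IN_CHANNELS, 64, kernel_size=7, stride=2, padding=3, bias=False)

# One illustrative training step on a (batch, channels, H, W) tile with per-pixel labels
x = torch.randn(2, IN_CHANNELS, 256, 256)
y = torch.randint(0, NUM_CLASSES, (2, 256, 256))
logits = model(x)["out"]                      # shape (batch, NUM_CLASSES, H, W)
loss = nn.CrossEntropyLoss()(logits, y)
loss.backward()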
General inferential limits under differential and Pufferfish privacy
JB, R Gong. International Journal of Approximate Reasoning, 2024. Differential privacy (DP) is a class of mathematical standards for assessing the privacy provided by a data-release mechanism. This work concerns two important flavors of DP that are related yet conceptually distinct: pure ε-differential privacy (ε-DP) and Pufferfish privacy. We restate ε-DP and Pufferfish privacy as Lipschitz continuity conditions and provide their formulations in terms of an object from the imprecise probability literature: the interval of measures. We use these formulations to derive limits on key quantities in frequentist hypothesis testing and in Bayesian inference using data that are sanitised according to either of these two privacy standards. Under very mild conditions, the results in this work are valid for arbitrary parameters, priors and data generating models. These bounds are weaker than those attainable when analysing specific data generating models or data-release mechanisms. However, they provide generally applicable limits on the ability to learn from differentially private data – even when the analyst's knowledge of the model or mechanism is limited. They also shed light on the semantic interpretations of the two DP flavors under examination, a subject of contention in the current literature.
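To fix ideas, the constraint that drives such limits can be stated in its most familiar form (generic pure \(\varepsilon\)-DP, not the paper's more general interval-of-measures formulation): for any datasets \(x, x'\) differing in one record and any output set \(S\),
\[
e^{-\varepsilon} \;\le\; \frac{\Pr\big[M(x) \in S\big]}{\Pr\big[M(x') \in S\big]} \;\le\; e^{\varepsilon},
\]
so any likelihood ratio, and hence any Bayes factor, comparing \(x\) with \(x'\) on the basis of the released output is itself confined to \([e^{-\varepsilon}, e^{\varepsilon}]\).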
The complexities of differential privacy for survey data
J Drechsler, JB. To appear in Data Privacy Protection and the Conduct of Applied Research: Methods, Approaches and Their Consequences, 2024. The concept of differential privacy (DP) has gained substantial attention in recent years, most notably since the U.S. Census Bureau announced the adoption of the concept for its 2020 Decennial Census. However, despite its attractive theoretical properties, implementing DP in practice remains challenging, especially when it comes to survey data. In this paper we present some results from an ongoing project funded by the U.S. Census Bureau that is exploring the possibilities and limitations of DP for survey data. Specifically, we identify five aspects that need to be considered when adopting DP in the survey context: the multi-staged nature of data production; the limited privacy amplification from complex sampling designs; the implications of survey-weighted estimates; the weighting adjustments for nonresponse and other data deficiencies; and the imputation of missing values. We summarize the project's key findings with respect to each of these aspects and also discuss some of the challenges that still need to be addressed before DP could become the new data protection standard at statistical agencies.
Whose data is it anyway? Towards a formal treatment of differential privacy for surveys
JB, J Drechsler. Working Paper, 2024.
Differential privacy: General inferential limits via intervals of measures
JB, R Gong. Thirteenth International Symposium on Imprecise Probability: Theories and Applications, 2023. Differential privacy (DP) is a mathematical standard for assessing the privacy provided by a data-release mechanism. We provide formulations of pure \(\varepsilon\)-differential privacy first as a Lipschitz continuity condition and then using an object from the imprecise probability literature: the interval of measures. We utilise this second formulation to establish bounds on the appropriate likelihood function for \(\varepsilon\)-DP data – and in turn derive limits on key quantities in both frequentist hypothesis testing and Bayesian inference. Under very mild conditions, these results are valid for arbitrary parameters, priors and data generating models. These bounds are weaker than those attainable when analysing specific data generating models or data-release mechanisms. However, they provide generally applicable limits on the ability to learn from differentially private data – even when the analyst’s knowledge of the model or mechanism is limited. They also shed light on the semantic interpretation of differential privacy, a subject of contention in the current literature.
Can swapping be differentially private? A refreshment stirred, not shaken
JB, R Gong, XL Meng. Working Paper, 2023. This paper presents a formal privacy analysis of data swapping, a family of statistical disclosure control (SDC) methods which were used in the 1990, 2000 and 2010 US Decennial Census disclosure avoidance systems (DAS). Like all swapping algorithms, the method we examine has invariants – statistics calculated from the confidential database which remain unchanged. We prove that our swapping method satisfies the classic notion of pure differential privacy (\(\varepsilon\)-DP) when conditioning on these invariants. To support this privacy analysis, we provide a framework which unifies many different types of DP while simultaneously explicating the nuances that differentiate these types. This framework additionally supplies a DP definition for the TopDown algorithm (TDA) which also has invariants and was used as the SDC method for the 2020 Census Redistricting Data (P.L. 94-171) Summary and the Demographic and Housing Characteristics Files. To form a comparison with the privacy of the TDA, we compute the budget (along with the other DP components) in the counterfactual scenario that our swapping method was used for the 2020 Decennial Census. By examining swapping in the light of formal privacy, this paper aims to reap the benefits of DP – formal privacy guarantees and algorithmic transparency – without sacrificing the advantages of traditional SDC. This examination also reveals an array of subtleties and traps in using DP for theoretically benchmarking privacy protection methods in general. Using swapping as a demonstration, our optimistic hope is to inspire formal and rigorous framing and analysis of other SDC techniques in the future, as well as to promote nuanced assessments of DP implementations which go beyond discussion of the privacy loss budget \(\varepsilon\).
JB. Lecture notes, 2021.
Navigating spatio-temporal data with generalised additive models
JB. Unpublished expository paper, 2021.
JB. Lecture notes, 2021.
Big data, differential privacy and national statistical organisations
JB. Statistical Journal of the IAOS, 2020. Differential privacy (DP) has emerged in the computer science literature as a measure of the impact on an individual’s privacy resulting from the publication of a statistical output such as a frequency table. This paper provides an introduction to DP for official statisticians and discusses its relevance, benefits and challenges from a National Statistical Organisation (NSO) perspective. We motivate our study by examining how privacy is evolving in the era of big data and how this might prompt a shift from traditional statistical disclosure techniques used in official statistics – which are generally applied on a cell-by-cell or table-by-table basis – to formal privacy methods, like DP, which are applied from a perspective encompassing the totality of the outputs generated from a given dataset. We identify an important interplay between DP’s holistic privacy risk measure and the difficulty for NSOs in implementing DP, showing that DP’s major advantage is also DP’s major challenge. This paper provides new work addressing two key DP research areas for NSOs: DP’s application to survey data and its incorporation within the Five Safes framework.
ABS perturbation methodology through the lens of differential privacy
JB, C-H Chien. Work Session on Statistical Data Confidentiality, UN Economic Commission for Europe, 2019. The Australian Bureau of Statistics (ABS), like other national statistical offices, is considering the opportunities of differential privacy (DP). This research considers the ABS TableBuilder perturbation methodology in a DP framework. DP and the ABS perturbation methodology apply the same idea – infusing noise into the underlying microdata – to protect aggregate statistical outputs. This research describes some differences between these approaches. Our findings show that noise infusion protects against disclosure risks in the aggregate Census tables. We highlight areas of future ABS research on this topic.
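For intuition only, here is a generic noise-infusion sketch on aggregate cell counts; it is not the ABS TableBuilder algorithm, and the noise distribution and scale are placeholders:

import numpy as np

rng = np.random.default_rng(2019)

def perturb_counts(counts, scale=2.0):
    # Generic noise infusion (illustrative only): add symmetric noise to each
    # cell count, round, and clamp at zero so the output remains a valid count.
    noise = rng.laplace(loc=0.0, scale=scale, size=len(counts))
    return np.maximum(np.rint(counts + noise), 0).astype(int)

table = np.array([52, 3, 0, 117])   # hypothetical aggregate cell counts
print(perturb_counts(table))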
JB. Honours thesis, 2017.
JB. Unpublished expository paper, 2017.
Abelian categories and Mitchell’s embedding theorem
JB. Unpublished expository paper, 2017.
A Künneth formula for complex K theory
JB. Unpublished expository paper, 2017.
Stable homotopy theory and category of spectra
JB. Vacation Research Scholarship Report (AMSI), 2017.
Kolmogorov complexity and the symmetry of algorithmic information
JB. Unpublished expository paper, 2016.
Hausdorff and similarity dimensions
JB. Unpublished expository paper, 2016.
Talks
Whose data is it anyway? Towards a formal treatment of differential privacy for surveys
Keynote talk, Adelaide Data Privacy Workshop
Five building blocks of differential privacy
Introductory tutorial, Adelaide Data Privacy Workshop
Property elicitation on imprecise probabilities
14th International Symposium on Imprecise Probabilities: Theories and Applications
Abstract
Property elicitation studies which attributes of a probability distribution can be determined by minimising a risk. We investigate a generalisation of property elicitation to imprecise probabilities (IP). This investigation is motivated by multi-distribution learning, which takes the classical machine learning paradigm of minimising a single risk over a (precise) probability and replaces it with \(\Gamma\)-maximin risk minimisation over an IP. We provide necessary conditions for elicitability of an IP-property. Furthermore, we explain what an elicitable IP-property actually elicits through Bayes pairs – the elicited IP-property is the corresponding standard property of the maximum Bayes risk distribution.
Enhancing digital twins with privacy-aware EO-ML methods
Harvard Center for Geographic Analysis Conference: The Geography of Digital Twins
Abstract
Digital Twins, dynamic virtual models, require granular spatiotemporal data that are often absent or anonymized in low- and middle-income regions. We propose a privacy-aware Earth Observation–Machine Learning framework that treats privacy-protected survey locations as a missing data problem, integrating multiple imputation with multi-temporal satellite imagery and recurrent convolutional neural networks. Applied to continent-wide poverty mapping in Africa, the method quantifies uncertainty, significantly improves predictive accuracy, and reduces biases introduced by location perturbation. The resulting high-resolution economic indicators support more reliable socioeconomic and environmental Digital Twins for policy analysis. This approach reconciles data privacy and utility, benefiting urban planning, economic forecasting, and sustainability initiatives.
Differential privacy in statistical agencies—Challenges and opportunities
Invited workshop, 2nd Ocean Workshop on Privacy
Can swapping be differentially private? A refreshment stirred, not shaken
Privacy and Public Policy Conference
Navigating privacy and utility with multiple imputation, satellite imaging and deep learning
Joint Statistical Meetings
Abstract
Data science for complex societal problems, such as combating poverty on a global scale, typically involves understanding and addressing multiple challenges. Some examples are (1) integrating data of different types and quality; (2) reducing bias due to data defects; (3) trading data privacy for utility; and (4) assessing uncertainties in black-box algorithms. This article documents how we use the framework of multiple imputation to investigate and navigate such challenges in the context of studying poverty in Africa, where we integrate anonymized ground-level surveys with satellite images via deep learning. Advantages of the multiple imputation approach include its (a) statistical efficiency, by following a Bayesian approach to incorporate prior and auxiliary information; (b) implementation readiness, with black-box methods directly executed on the imputation replications; (c) conceptual simplicity, as a form of data augmentation; and (d) ability to assess uncertainty, via the joint replication of the imputation and the model fitting. However, it is computationally demanding because it requires repeated training over imputation replications, and it is sensitive to the imputation model.
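A minimal sketch of the workflow behind points (a)–(d), with the function names and the model entirely hypothetical: the black-box predictor is retrained once per imputation replicate, and predictions are pooled so the between-replicate spread feeds into the reported uncertainty.

import numpy as np

def multiple_imputation_predict(imputed_datasets, train_model, X_new):
    # imputed_datasets: list of (X, y) replicates produced by the imputation model
    # train_model: any black-box routine returning a fitted object with .predict
    preds = []
    for X_m, y_m in imputed_datasets:
        model = train_model(X_m, y_m)        # repeated training: the costly step
        preds.append(model.predict(X_new))
    preds = np.stack(preds)                  # shape (M, n_new)
    point = preds.mean(axis=0)               # pooled point prediction
    between_var = preds.var(axis=0, ddof=1)  # between-imputation variability
    return point, between_var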
Privacy, data privacy, and differential privacy
Department colloquium, LMU Munich Statistics Department
Abstract
This talk beckons inquisitive audiences to explore the intricacies of data privacy. We journey back to the late 19th century, when the concept of privacy crystallised as a legal right. This change was spurred by the vexations of a socialite’s husband, harried by tabloids during the emergence of yellow journalism and film photography. In today’s era, marked by the rise of digital technologies, data science, and generative AI, data privacy has surged to become a major concern for nearly every organisation. Differential privacy (DP), rooted in cryptography, epitomises a significant advancement in balancing data privacy with data utility. Yet, as DP garners attention, it unveils complex challenges and misconceptions that confound even seasoned experts. Through a statistical lens, we examine these nuances. Central to our discussion is DP’s commitment to curbing the relative risk of individual data disclosure, unperturbed by an adversary’s prior knowledge, via the premise that posterior-to-prior ratios are constrained by extreme likelihood ratios. A stumbling block surfaces when ‘individual privacy’ is delineated by counterfactually manipulating static individual data values, without considering their interdependencies. Alarmingly, this static viewpoint, flagged for its shortcomings for over a decade (Kifer and Machanavajjhala, 2011, ACM; Tschantz, Sen, and Datta, 2022, IEEE), continues to overshadow DP narratives, leading to the erroneous but widespread belief that DP is impervious to adversaries’ prior knowledge.
Turning to Warner’s (1965, JASA) randomised response mechanism—the first recorded instance of a DP mechanism—we show how DP’s mathematical assurances can crumble to an arbitrary degree when adversaries grasp the interplay among individuals. Drawing a parallel, it’s akin to the folly of solely quarantining symptomatic individuals to thwart an airborne disease’s spread. Thus, embracing a statistical perspective on data, seeing them as accidental manifestations of underlying essential information constructs, is as vital for bolstering data privacy as it is for rigorous data analysis.
Finally, unifying the many types of DP as different kinds of Lipschitz continuity on the data release mechanism (hence the ‘differential’ in differential privacy), we elicit from existing literature five necessary building blocks for a DP specification. They are, in order of mathematical prerequisite, the protection domain (data space), the scope of protection (data multiverse), the protection unit (unit for data perturbation), the standard of protection (measure for output variations), and the intensity of protection (privacy loss budget). In simple terms, these are respectively the “what”, “where”, “who”, “how”, and “how much” questions of DP. We answer these questions for data swapping—a traditional statistical disclosure control method used, for example, in the 1990, 2000 and 2010 US Decennial Censuses—drawing parallels with the recent implementation of DP in their 2020 Census and unveiling the nuances and potential pitfalls in employing DP as a theoretical yardstick for privacy methodologies.
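For concreteness, the Warner mechanism mentioned above reports the true answer with probability \(p > 1/2\) and the flipped answer otherwise; viewed record by record, and ignoring the dependencies between individuals that this talk highlights, it satisfies pure \(\varepsilon\)-DP with
\[
\varepsilon \;=\; \ln\frac{p}{1-p},
\qquad\text{since}\qquad
\frac{\Pr[\text{report} = r \mid \text{truth} = t]}{\Pr[\text{report} = r \mid \text{truth} = t']} \;\le\; \frac{p}{1-p}
\;\;\text{for all } r, t, t'.
\]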
Privacy, data privacy, and differential privacy
Department colloquium, Tübingen AI Center
Abstract
This talk beckons inquisitive audiences to explore the intricacies of data privacy. We journey back to the late 19th century, when the concept of privacy crystallised as a legal right. This change was spurred by the vexations of a socialite’s husband, harried by tabloids during the emergence of yellow journalism and film photography. In today’s era, marked by the rise of digital technologies, data science, and generative AI, data privacy has surged to become a major concern for nearly every organisation. Differential privacy (DP), rooted in cryptography, epitomises a significant advancement in balancing data privacy with data utility. Yet, as DP garners attention, it unveils complex challenges and misconceptions that confound even seasoned experts. Through a statistical lens, we examine these nuances. Central to our discussion is DP’s commitment to curbing the relative risk of individual data disclosure, unperturbed by an adversary’s prior knowledge, via the premise that posterior-to-prior ratios are constrained by extreme likelihood ratios. A stumbling block surfaces when ‘individual privacy’ is delineated by counterfactually manipulating static individual data values, without considering their interdependencies. Alarmingly, this static viewpoint, flagged for its shortcomings for over a decade (Kifer and Machanavajjhala, 2011, ACM; Tschantz, Sen, and Datta, 2022, IEEE), continues to overshadow DP narratives, leading to the erroneous but widespread belief that DP is impervious to adversaries’ prior knowledge.
Turning to Warner’s (1965, JASA) randomised response mechanism—the first recorded instance of a DP mechanism—we show how DP’s mathematical assurances can crumble to an arbitrary degree when adversaries grasp the interplay among individuals. Drawing a parallel, it’s akin to the folly of solely quarantining symptomatic individuals to thwart an airborne disease’s spread. Thus, embracing a statistical perspective on data, seeing them as accidental manifestations of underlying essential information constructs, is as vital for bolstering data privacy as it is for rigorous data analysis.
Finally, unifying the many types of DP as different kinds of Lipschitz continuity on the data release mechanism (hence the ‘differential’ in differential privacy), we elicit from existing literature five necessary building blocks for a DP specification. They are, in order of mathematical prerequisite, the protection domain (data space), the scope of protection (data multiverse), the protection unit (unit for data perturbation), the standard of protection (measure for output variations), and the intensity of protection (privacy loss budget). In simple terms, these are respectively the “what”, “where”, “who”, “how”, and “how much” questions of DP. We answer these questions for data swapping—a traditional statistical disclosure control method used, for example, in the 1990, 2000 and 2010 US Decennial Censuses—drawing parallels with the recent implementation of DP in their 2020 Census and unveiling the nuances and potential pitfalls in employing DP as a theoretical yardstick for privacy methodologies.
How does differential privacy limit disclosure risk? A precise prior-to-posterior analysis
Invited talk, ISBA World Meeting
Abstract
Differential privacy (DP) is an increasingly popular standard for quantifying privacy in the context of sharing statistical data. It has numerous advantages—especially its composition of privacy loss over multiple data releases and its facilitation of valid statistical inference via algorithmic transparency—over previous statistical privacy frameworks. Yet one difficulty of DP in practice is setting its privacy loss budget. Such a choice is complicated by a lack of understanding of what DP means in connection to traditional notions of statistical disclosure limitation (SDL). In this talk, we trace the rich literature on SDL back to the foundational 1986 paper by Duncan and Lambert, which defines disclosure in a relative sense as an increase – due to the published data – in one’s knowledge of an individual record. We prove that DP is exactly equivalent to limiting this type of ‘prior-to-posterior’ disclosure, but only when the records are completely independent. More generally, DP is equivalent to controlling conditional prior-to-posterior learning, when conditioning on all other records in the dataset. This connects DP to traditional SDL while also highlighting the danger of viewing data variations as mechanistic—as does DP—rather than as statistical—in which one would explicitly acknowledge the variational dependencies between records. Based on joint work with Ruobin Gong and Xiao-Li Meng.
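Schematically, and suppressing the independence and conditioning caveats that the talk makes precise, the prior-to-posterior form of the guarantee bounds how much a release \(M(x) = s\) can move an adversary's belief about any event \(A\) concerning a single record \(x_i\):
\[
e^{-\varepsilon} \;\le\; \frac{\Pr\big[x_i \in A \mid M(x) = s\big]}{\Pr\big[x_i \in A\big]} \;\le\; e^{\varepsilon},
\]
which holds under pure \(\varepsilon\)-DP when the records are independent under the adversary's prior.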
Whose data is it anyway? Towards a formal treatment of differential privacy for surveys
Data Privacy Protection and the Conduct of Applied Research: Methods, Approaches and their Consequences
Privacy, data privacy, and differential privacy
Keynote talk (joint with Xiao-Li Meng), Humanising Machine Intelligence Workshop
Can swapping be differentially private? A refreshment stirred not shaken
Statistics Canada Methodology Seminar
Abstract
To directly address the title’s query, an answer must necessarily presuppose a precise specification of differential privacy (DP). Indeed, as there are many different formulations of DP – which range, both qualitatively and quantitatively, from being practically and theoretically vacuous to providing gold-standard privacy protection – a straight answer to the question “is X differentially private?” is not particularly informative and is likely to lead to confusion or even dispute, especially when the presupposed DP specification is not clearly spelt out and fully comprehended.
A true answer to the title’s query must therefore be predicated upon an understanding of how formulations of DP differ, which is best explored, as is often the case, by starting with their unifying commonality. DP specifications are, in essence, Lipschitz conditions on the data-release mechanism. The core philosophy of DP is thus to manage relative privacy loss by limiting the rate of change of the variations in the noise-injected output statistics when the confidential input data are (counterfactually) suitably perturbed. Hence, DP conceives of privacy protection specifically as control over the Lipschitz constant – i.e. over this rate of change; and different DP specifications correspond to different choices of how to measure input perturbation and output variation, in addition to the choice of how much to control this rate of variations-to-perturbation. Following this line of thinking through existing DP literature leads to five necessary building blocks for a DP specification. They are, in order of mathematical prerequisite, the protection domain (data space), the scope of protection (data multiverse), the protection units (unit for data perturbation), the standard of protection (measure for output variations), and the intensity of protection (privacy loss budget). In simple terms, these are the “what”, “where”, “who”, “how”, and “how much” questions of DP.
Under this framework, we consider DP’s applicability in scenarios like the US Census, where the disclosure of certain aggregates is mandated by the US Constitution. We design and analyze a data swapping method, called the Permutation Swapping Algorithm (PSA), which is reminiscent of the statistical disclosure control (SDC) procedures employed in several US Decennial Censuses before 2020. For comparative purposes, we are also interested in the principal SDC method of the 2020 Census, the TopDown algorithm (TDA), which melds the DP specification of Bun and Steinke [2016a] (B&S) with Census policy and constitutional mandates.
We analyze the DP properties of both data swapping and TDA. Both B&S’s specification and the original ε-DP specification of Dwork et al. [2006b] demand that no data summary is disclosed without noise – which is impossible for swapping methods as they inherently preserve, and hence disclose, some margins; and is also impossible for TDA since it too keeps some counts invariant. Therefore, for the same reasons that TDA cannot satisfy the B&S specification, data swapping cannot satisfy the original ε-DP specification. On the other hand, we establish that PSA is ε-DP, subject to the invariants it necessarily induces and we show how the privacy-loss budget ε is determined by the swapping rate and the maximal size of the swapping classes. We also prove a DP specification for TDA, by subjecting B&S’s specification to TDA’s invariants. Drawing a parallel, we assess the privacy budget for the PSA in the hypothetical situation where it was adopted for the 2020 Census. Our overarching ambition is two-fold: firstly, to leverage the merits of DP, including its mathematical assurances and algorithmic transparency, without sidelining the advantages of classical SDC; and secondly, to unveil the nuances and potential pitfalls in employing DP as a theoretical yardstick for privacy methodologies. By spotlighting data swapping, we aspire to stimulate rigorous evaluations of other SDC techniques, emphasizing that the privacy-loss budget ε is merely one of five building blocks for the mathematical foundations of DP.
The Five Safes as a privacy context
5th Annual Symposium on Applications of Contextual Integrity
Abstract
The Five Safes is a framework used by national statistical offices (NSOs) for assessing and managing the disclosure risk of data sharing. This paper makes two points. Firstly, the Five Safes can be understood as a specialization of a broader concept – contextual integrity – to the situation of statistical dissemination by an NSO. We demonstrate this by mapping the five parameters of contextual integrity onto the five dimensions of the Five Safes. Secondly, the Five Safes contextualizes narrow, technical notions of privacy within a holistic risk assessment. We demonstrate this with the example of differential privacy (DP). This contextualization allows NSOs to place DP within their Five Safes toolkit while also guiding the design of DP implementations within the broader privacy context, as delineated by both their regulation and the relevant social norms.
Differential privacy: General inferential limits via intervals of measures
13th International Symposium on Imprecise Probabilities: Theories and Applications
Abstract
Differential privacy (DP) is a class of mathematical standards for assessing the privacy provided by a data-release mechanism. This work concerns two important flavors of DP that are related yet conceptually distinct: pure ε-differential privacy (ε-DP) and Pufferfish privacy. We restate ε-DP and Pufferfish privacy as Lipschitz continuity conditions and provide their formulations in terms of an object from the imprecise probability literature: the interval of measures. We use these formulations to derive limits on key quantities in frequentist hypothesis testing and in Bayesian inference using data that are sanitised according to either of these two privacy standards. Under very mild conditions, the results in this work are valid for arbitrary parameters, priors and data generating models. These bounds are weaker than those attainable when analysing specific data generating models or data-release mechanisms. However, they provide generally applicable limits on the ability to learn from differentially private data – even when the analyst's knowledge of the model or mechanism is limited. They also shed light on the semantic interpretations of the two DP flavors under examination, a subject of contention in the current literature.
Privacy, data privacy, and differential privacy
Methodology Division Seminar, Australian Bureau of Statistics
Designing formally private mechanisms for the p% rule
Workshop on Advances in Statistical Disclosure Limitation
Abstract
The \(p\)% rule classifies an aggregate statistic as a disclosure risk if one contributor can use the statistic to determine another contributor’s value to within \(p\)%. This is often possible in economic data when there is a monopoly or a duopoly. Therefore, the \(p\)% rule is an important statistical disclosure control and is frequently used in national statistical organisations. However, the \(p\)% rule is only a method for assessing disclosure risk: While it can say whether a statistic is risky or not, it does not provide a mechanism to decrease that risk. To address this limitation, we encode the \(p\)% rule into a formal privacy definition using the Pufferfish framework and we develop a perturbation mechanism which is provably private under this framework. This mechanism provides official statisticians with a method for perturbing data which guarantees a Bayesian formulation of the \(p\)% rule is satisfied. We motivate this work with an example application to the Australian Bureau of Statistics (ABS).
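In symbols, with \(x_{(1)} \ge x_{(2)} \ge \dots\) denoting the ordered contributions to a cell total \(T\) (standard SDC notation, added here for concreteness): the cell fails the \(p\)% rule, and is treated as a disclosure risk, when the second-largest contributor could infer the largest contribution to within \(p\)% of its value, i.e. when
\[
T - x_{(1)} - x_{(2)} \;<\; \frac{p}{100}\, x_{(1)}.
\]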
Using admin data and machine learning to predict dwelling occupancy on Census Night
Statistical Society of Australia's Young Statisticians Conference
Abstract
The Australian Census of Population and Housing (the Census) aims to count every person in Australia on a particular night – called the Census night. Houses which do not complete a Census form and do not respond to the Australian Bureau of Statistics’ (ABS) follow-up campaign pose a complication to achieving this aim: Are these dwellings unoccupied, or are they occupied and the residents unresponsive? To achieve its aim, the Census should count these unresponsive residents, but how can the ABS accurately do this? To answer these questions, the ABS has developed a model which uses administrative data – collected by various government and non-government organisations – to predict the occupancy status of a dwelling. There are various challenges surrounding this new method, including the lack of ground truth and the presence of strongly unbalanced classes. However, the method will improve the accuracy of ABS Census population counts and has been adopted as part of the 2021 Australian Census imputation process.
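A minimal, hypothetical sketch of the class-imbalance handling such a setting calls for; this is not the ABS model, and the features, labels and estimator below are placeholders:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2021)
X = rng.normal(size=(5000, 6))               # placeholder admin-data features
y = (rng.random(5000) < 0.05).astype(int)    # strongly unbalanced occupancy labels

# class_weight="balanced" reweights the loss inversely to class frequencies,
# so the rare "occupied but unresponsive" class is not swamped by the majority.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
occupancy_scores = clf.predict_proba(X)[:, 1]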
A discrete calibration approach to improving data linkage
ABS Methodology Advisory Committee
Australian Mathematical Sciences Institute Connect Conference
Teaching
Deep statistics for more rigorous and efficient data science
Teaching fellow, 2023 ASC Workshop, 2023 Dec.
Variations, information and privacy
Teaching fellow, Harvard University, 2023F.
Data science: An artificial ecosystem
Teaching fellow, Harvard University, 2023 summer.
Teaching fellow, Harvard University, 2022F, 2023S.
Deep statistics: AI and earth observations for sustainable development
Teaching fellow, Harvard University, 2022S, 2023S.
Advanced mathematics and applications 2
Tutor, Australian National University, 2017 Sem 2.
Advanced mathematics and applications 1
Tutor, Australian National University, 2017 Sem 1.
