Selected talks

Upcoming talks


Whose data is it anyway? Towards a formal treatment of differential privacy for surveys

November 26, 2025

Keynote talk, Adelaide Data Privacy Workshop

Five building blocks of differential privacy

November 26, 2025

Introductory tutorial, Adelaide Data Privacy Workshop

Past talks


Property elicitation on imprecise probabilities

June 15, 2025

14th International Symposium on Imprecise Probabilities: Theories and Applications

Abstract

Property elicitation studies which attributes of a probability distribution can be determined by minimising a risk. We investigate a generalisation of property elicitation to imprecise probabilities (IP). This investigation is motivated by multi-distribution learning, which takes the classical machine learning paradigm of minimising a single risk over a (precise) probability and replaces it with \(\Gamma\)-maximin risk minimisation over an IP. We provide necessary conditions for the elicitability of an IP-property. Furthermore, we explain what an elicitable IP-property actually elicits through Bayes pairs – the elicited IP-property is the corresponding standard property of the maximum Bayes risk distribution.
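
For orientation, here is a schematic statement of the two objectives involved; the notation is illustrative and not taken from the talk. In the precise case a loss elicits a property when the property is the risk minimiser; the IP generalisation replaces the single risk with a \(\Gamma\)-maximin (worst-case) risk over a credal set.

```latex
% Precise case: the loss \ell elicits the property \gamma of P when
\[
  \gamma(P) \;=\; \operatorname*{arg\,min}_{r} \; \mathbb{E}_{Y \sim P}\bigl[\ell(r, Y)\bigr].
\]
% IP (\Gamma-maximin) case: minimise the worst-case risk over a credal set \mathcal{P},
% as in multi-distribution learning:
\[
  r^{\ast} \;\in\; \operatorname*{arg\,min}_{r} \; \sup_{P \in \mathcal{P}} \; \mathbb{E}_{Y \sim P}\bigl[\ell(r, Y)\bigr].
\]
```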

Enhancing digital twins with privacy-aware EO-ML methods

May 22, 2025

Harvard Center for Geographic Analysis Conference: The Geography of Digital Twins

Abstract

Digital Twins, dynamic virtual models, require granular spatiotemporal data that are often absent or anonymized in low- and middle-income regions. We propose a privacy-aware Earth Observation–Machine Learning framework that treats privacy-protected survey locations as a missing data problem, integrating multiple imputation with multi-temporal satellite imagery and recurrent convolutional neural networks. Applied to continent-wide poverty mapping in Africa, the method quantifies uncertainty, significantly improves predictive accuracy, and reduces biases introduced by location perturbation. The resulting high-resolution economic indicators support more reliable socioeconomic and environmental Digital Twins for policy analysis. This approach reconciles data privacy and utility, benefiting urban planning, economic forecasting, and sustainability initiatives.

Differential privacy in statistical agencies—Challenges and opportunities

February 5, 2025

Invited workshop, 2nd Ocean Workshop on Privacy

Can swapping be differentially private? A refreshment stirred, not shaken

September 14, 2024

Privacy and Public Policy Conference

Navigating privacy and utility with multiple imputation, satellite imaging and deep learning

August 7, 2024

Joint Statistical Meetings

Abstract

Data science for complex societal problems, such as combating poverty on a global scale, typically involves understanding and addressing multiple challenges. Some examples are (1) integrating data of different types and quality; (2) reducing bias due to data defects; (3) trading data privacy for utility; (4) assessing uncertainties in black-box algorithms. This article documents how we use the framework of multiple imputation to investigate and navigate such challenges in the context of studying poverty in Africa, where we integrate anonymized ground-level surveys with satellite images via deep learning. Advantages of the multiple imputation approach include its (a) statistical efficiency, by following a Bayesian approach to incorporate prior and auxiliary information; (b) implementation readiness, with black-box methods directly executed on the imputation replications; (c) conceptual simplicity, as a form of data augmentation; and (d) ability to assess uncertainty, via the joint replication of the imputation and the model fitting. However, it is computationally demanding because it requires repeated training over imputation replications, and it is sensitive to the imputation model.
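
As an illustration of point (d), the following minimal sketch shows how uncertainty can be assessed by replicating a downstream fit over imputation replications and pooling with Rubin's combining rules. The toy Gaussian imputation model, the stand-in analysis, and all names are hypothetical; they are not the imputation model or deep-learning pipeline used in the work.

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_once(y_obs, missing_mask, rng):
    """Draw one imputation of the missing entries (toy Gaussian model)."""
    y = y_obs.copy()
    mu, sigma = np.nanmean(y_obs), np.nanstd(y_obs)
    y[missing_mask] = rng.normal(mu, sigma, missing_mask.sum())
    return y

def fit_model(y):
    """Stand-in for the black-box analysis: estimate the mean and its variance."""
    return y.mean(), y.var(ddof=1) / len(y)

# Toy data: np.nan marks the missing (e.g. privacy-protected) entries.
y_obs = np.array([1.2, np.nan, 0.7, 2.1, np.nan, 1.5, 0.9, 1.8])
missing = np.isnan(y_obs)

M = 20  # number of imputation replications
estimates, variances = [], []
for _ in range(M):
    y_imp = impute_once(y_obs, missing, rng)
    est, var = fit_model(y_imp)
    estimates.append(est)
    variances.append(var)

# Rubin's rules: total variance = within-imputation + (1 + 1/M) * between-imputation.
q_bar = np.mean(estimates)        # pooled point estimate
u_bar = np.mean(variances)        # within-imputation variance
b = np.var(estimates, ddof=1)     # between-imputation variance
total_var = u_bar + (1 + 1 / M) * b

print(f"pooled estimate = {q_bar:.3f}, total variance = {total_var:.3f}")
```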

Privacy, data privacy, and differential privacy

July 16, 2024

Department colloquium, LMU Munich Statistics Department

Abstract

This talk beckons inquisitive audiences to explore the intricacies of data privacy. We journey back to the late 19th century, when the concept of privacy crystallised as a legal right. This change was spurred by the vexations of a socialite’s husband, harried by tabloids during the emergence of yellow journalism and film photography. In today’s era, marked by the rise of digital technologies, data science, and generative AI, data privacy has surged to become a major concern for nearly every organisation. Differential privacy (DP), rooted in cryptography, epitomises a significant advancement in balancing data privacy with data utility. Yet, as DP garners attention, it unveils complex challenges and misconceptions that confound even seasoned experts. Through a statistical lens, we examine these nuances. Central to our discussion is DP’s commitment to curbing the relative risk of individual data disclosure, unperturbed by an adversary’s prior knowledge, via the premise that posterior-to-prior ratios are constrained by extreme likelihood ratios. A stumbling block surfaces when ‘individual privacy’ is delineated by counterfactually manipulating static individual data values, without considering their interdependencies. Alarmingly, this static viewpoint, flagged for its shortcomings for over a decade (Kifer and Machanavajjhala, 2011, ACM; Tschantz, Sen, and Datta, 2022, IEEE), continues to overshadow DP narratives, leading to the erroneous but widespread belief that DP is impervious to adversaries’ prior knowledge.

Turning to Warner’s (1965, JASA) randomised response mechanism—the first recorded instance of a DP mechanism—we show how DP’s mathematical assurances can crumble to an arbitrary degree when adversaries grasp the interplay among individuals. Drawing a parallel, it’s akin to the folly of solely quarantining symptomatic individuals to thwart an airborne disease’s spread. Thus, embracing a statistical perspective on data, seeing them as accidental manifestations of underlying essential information constructs, is as vital for bolstering data privacy as it is for rigorous data analysis.
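
For a concrete reference point, here is a minimal sketch of a Warner-style randomised response mechanism and its per-respondent privacy-loss level when respondents are treated in isolation; the parameter values are illustrative only, and the talk's point is precisely that this isolated guarantee can degrade when respondents' answers are dependent.

```python
import numpy as np

rng = np.random.default_rng(1)

def randomized_response(truth: bool, p: float, rng) -> bool:
    """Warner-style randomised response: report the true answer with
    probability p, otherwise report its opposite."""
    return truth if rng.random() < p else not truth

# With p > 1/2, a single respondent treated in isolation enjoys epsilon-DP
# with epsilon = ln(p / (1 - p)), the extreme likelihood ratio of the report.
p = 0.75
epsilon = np.log(p / (1 - p))
print(f"p = {p}, per-respondent epsilon = {epsilon:.3f}")

# If an adversary knows, say, that five respondents always share the same true
# answer, the five reports jointly carry a much larger likelihood ratio than
# any single report, illustrating how the isolated guarantee can erode.
reports = [randomized_response(True, p, rng) for _ in range(5)]
print(reports)
```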

Finally, unifying the many types of DP as different kinds of Lipschitz continuity on the data release mechanism (hence the ‘differential’ in differential privacy), we elicit from existing literature five necessary building blocks for a DP specification. They are, in order of mathematical prerequisite, the protection domain (data space), the scope of protection (data multiverse), the protection unit (unit for data perturbation), the standard of protection (measure for output variations), and the intensity of protection (privacy loss budget). In simple terms, these are respectively the “what”, “where”, “who”, “how”, and “how much” questions of DP. We answer these questions for data swapping—a traditional statistical disclosure control method used, for example, in the 1990, 2000 and 2010 US Decennial Censuses—drawing parallels with the recent implementation of DP in their 2020 Census and unveiling the nuances and potential pitfalls in employing DP as a theoretical yardstick for privacy methodologies.

Privacy, data privacy, and differential privacy

July 11, 2024

Department colloquium, Tübingen AI Center

Abstract

This talk beckons inquisitive audiences to explore the intricacies of data privacy. We journey back to the late 19th century, when the concept of privacy crystallised as a legal right. This change was spurred by the vexations of a socialite’s husband, harried by tabloids during the emergence of yellow journalism and film photography. In today’s era, marked by the rise of digital technologies, data science, and generative AI, data privacy has surged to become a major concern for nearly every organisation. Differential privacy (DP), rooted in cryptography, epitomises a significant advancement in balancing data privacy with data utility. Yet, as DP garners attention, it unveils complex challenges and misconceptions that confound even seasoned experts. Through a statistical lens, we examine these nuances. Central to our discussion is DP’s commitment to curbing the relative risk of individual data disclosure, unperturbed by an adversary’s prior knowledge, via the premise that posterior-to-prior ratios are constrained by extreme likelihood ratios. A stumbling block surfaces when ‘individual privacy’ is delineated by counterfactually manipulating static individual data values, without considering their interdependencies. Alarmingly, this static viewpoint, flagged for its shortcomings for over a decade (Kifer and Machanavajjhala, 2011, ACM; Tschantz, Sen, and Datta, 2022, IEEE), continues to overshadow DP narratives, leading to the erroneous but widespread belief that DP is impervious to adversaries’ prior knowledge.

Turning to Warner’s (1965, JASA) randomised response mechanism—the first recorded instance of a DP mechanism—we show how DP’s mathematical assurances can crumble to an arbitrary degree when adversaries grasp the interplay among individuals. Drawing a parallel, it’s akin to the folly of solely quarantining symptomatic individuals to thwart an airborne disease’s spread. Thus, embracing a statistical perspective on data, seeing them as accidental manifestations of underlying essential information constructs, is as vital for bolstering data privacy as it is for rigorous data analysis.

Finally, unifying the many types of DP as different kinds of Lipschitz continuity on the data release mechanism (hence the ‘differential’ in differential privacy), we elicit from existing literature five necessary building blocks for a DP specification. They are, in order of mathematical prerequisite, the protection domain (data space), the scope of protection (data multiverse), the protection unit (unit for data perturbation), the standard of protection (measure for output variations), and the intensity of protection (privacy loss budget). In simple terms, these are respectively the “what”, “where”, “who”, “how”, and “how much” questions of DP. We answer these questions for data swapping—a traditional statistical disclosure control method used, for example, in the 1990, 2000 and 2010 US Decennial Censuses—drawing parallels with the recent implementation of DP in their 2020 Census and unveiling the nuances and potential pitfalls in employing DP as a theoretical yardstick for privacy methodologies.

How does differential privacy limit disclosure risk? A precise prior-to-posterior analysis

July 6, 2024

Invited talk, ISBA World Meeting

Abstract

Differential privacy (DP) is an increasingly popular standard for quantifying privacy in the context of sharing statistical data. It has numerous advantages—especially its composition of privacy loss over multiple data releases and its facilitation of valid statistical inference via algorithmic transparency—over previous statistical privacy frameworks. Yet one difficulty of DP in practice is setting its privacy loss budget. Such a choice is complicated by a lack of understanding of what DP means in connection to traditional notions of statistical disclosure limitation (SDL). In this talk, we trace the rich literature on SDL back to the foundational 1986 paper by Duncan and Lambert, which defines disclosure in a relative sense as an increase – due to the published data – in one’s knowledge of an individual record. We prove that DP is exactly equivalent to limiting this type of ‘prior-to-posterior’ disclosure, but only when the records are completely independent. More generally, DP is equivalent to controlling conditional prior-to-posterior learning, when conditioning on all other records in the dataset. This connects DP to traditional SDL while also highlighting the danger of viewing data variations as mechanistic—as does DP—rather than as statistical—in which one would explicitly acknowledge the variational dependencies between records. Based on joint work with Ruobin Gong and Xiao-Li Meng.
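
Schematically, and with illustrative notation rather than the paper's precise statements, the two equivalences can be written for an \(\varepsilon\)-DP mechanism \(M\), a record \(X_i\), any event \(A\), and any output \(y\):

```latex
% With completely independent records, DP limits unconditional prior-to-posterior learning:
\[
  e^{-\varepsilon}
  \;\le\;
  \frac{\Pr\bigl(X_i \in A \mid M(X) = y\bigr)}{\Pr\bigl(X_i \in A\bigr)}
  \;\le\;
  e^{\varepsilon}.
\]
% In general, the guarantee is conditional on all other records X_{-i}:
\[
  e^{-\varepsilon}
  \;\le\;
  \frac{\Pr\bigl(X_i \in A \mid M(X) = y,\; X_{-i} = x_{-i}\bigr)}{\Pr\bigl(X_i \in A \mid X_{-i} = x_{-i}\bigr)}
  \;\le\;
  e^{\varepsilon}.
\]
```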

Whose data is it anyway? Towards a formal treatment of differential privacy for surveys

May 16, 2024

Data Privacy Protection and the Conduct of Applied Research: Methods, Approaches and their Consequences

Privacy, data privacy, and differential privacy

December 13, 2023

Keynote talk (joint with Xiao-Li Meng), Humanising Machine Intelligence Workshop

Can swapping be differentially private? A refreshment stirred, not shaken

October 31, 2023

Statistics Canada Methodology Seminar

Abstract

To directly address the title’s query, an answer must necessarily presuppose a precise specification of differential privacy (DP). Indeed, as there are many different formulations of DP – which range, both qualitatively and quantitatively, from being practically and theoretically vacuous to providing gold-standard privacy protection – a straight answer to the question “is X differentially private?” is not particularly informative and is likely to lead to confusion or even dispute, especially when the presupposed DP specification is not clearly spelt out and fully comprehended.

A true answer to the title’s query must therefore be predicated upon an understanding of how formulations of DP differ, which is best explored, as is often the case, by starting with their unifying commonality. DP specifications are, in essence, Lipschitz conditions on the data-release mechanism. The core philosophy of DP is thus to manage relative privacy loss by limiting the rate of change of the variations in the noise-injected output statistics when the confidential input data are (counterfactually) suitably perturbed. Hence, DP conceives of privacy protection specifically as control over the Lipschitz constant – i.e. over this rate of change; and different DP specifications correspond to different choices of how to measure input perturbation and output variation, in addition to the choice of how much to control this rate of variations-to-perturbation. Following this line of thinking through existing DP literature leads to five necessary building blocks for a DP specification. They are, in order of mathematical prerequisite, the protection domain (data space), the scope of protection (data multiverse), the protection units (unit for data perturbation), the standard of protection (measure for output variations), and the intensity of protection (privacy loss budget). In simple terms, these are the “what”, “where”, “who”, “how”, and “how much” questions of DP.
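
In symbols (illustrative notation only), the Lipschitz reading sketched above requires the release mechanism \(M\) to satisfy a bound of the following form, with different DP specifications corresponding to different choices of \(d_{\mathrm{in}}\), \(d_{\mathrm{out}}\), and \(\varepsilon\):

```latex
\[
  d_{\mathrm{out}}\bigl(M(x),\, M(x')\bigr)
  \;\le\;
  \varepsilon \, d_{\mathrm{in}}(x, x')
  \qquad \text{for all } x, x' \text{ in the data multiverse,}
\]
% e.g. pure epsilon-DP takes d_out to be the worst-case log-likelihood-ratio
% (multiplicative) distance between output distributions and d_in to be a
% Hamming-type count of perturbed protection units.
```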

Under this framework, we consider DP’s applicability in scenarios like the US Census, where the disclosure of certain aggregates is mandated by the US Constitution. We design and analyze a data swapping method, called the Permutation Swapping Algorithm (PSA), which is reminiscent of the statistical disclosure control (SDC) procedures employed in several US Decennial Censuses before 2020. For comparative purposes, we are also interested in the principal SDC method of the 2020 Census, the TopDown algorithm (TDA), which melds the DP specification of Bun and Steinke [2016a] (B&S) with Census policy and constitutional mandates.

We analyze the DP properties of both data swapping and TDA. Both B&S’s specification and the original ε-DP specification of Dwork et al. [2006b] demand that no data summary is disclosed without noise – which is impossible for swapping methods as they inherently preserve, and hence disclose, some margins; and is also impossible for TDA since it too keeps some counts invariant. Therefore, for the same reasons that TDA cannot satisfy the B&S specification, data swapping cannot satisfy the original ε-DP specification. On the other hand, we establish that the PSA is ε-DP, subject to the invariants it necessarily induces, and we show how the privacy-loss budget ε is determined by the swapping rate and the maximal size of the swapping classes. We also prove a DP specification for TDA, by subjecting B&S’s specification to TDA’s invariants. Drawing a parallel, we assess the privacy budget for the PSA in the hypothetical situation in which it had been adopted for the 2020 Census. Our overarching ambition is two-fold: firstly, to leverage the merits of DP, including its mathematical assurances and algorithmic transparency, without sidelining the advantages of classical SDC; and secondly, to unveil the nuances and potential pitfalls in employing DP as a theoretical yardstick for privacy methodologies. By spotlighting data swapping, we aspire to stimulate rigorous evaluations of other SDC techniques, emphasizing that the privacy-loss budget ε is merely one of five building blocks for the mathematical foundations of DP.
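
For intuition about the roles of the swapping rate and the swapping classes, here is a toy sketch of attribute swapping within classes. This is a generic illustration with hypothetical data, not the Permutation Swapping Algorithm analysed in the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

def swap_within_classes(values, classes, rate, rng):
    """Toy attribute swapping: within each swapping class, roughly a `rate`
    fraction of records have their sensitive values randomly permuted among
    themselves. Within-class margins of the attribute are preserved, which
    illustrates the invariants that swapping necessarily induces."""
    values = np.asarray(values).copy()
    for c in np.unique(classes):
        idx = np.flatnonzero(classes == c)
        k = int(round(rate * len(idx)))
        if k >= 2:
            chosen = rng.choice(idx, size=k, replace=False)
            values[chosen] = rng.permutation(values[chosen])
    return values

# Hypothetical microdata: a binary sensitive attribute and a swapping class per record.
attribute = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
swap_class = np.array(["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"])

released = swap_within_classes(attribute, swap_class, rate=0.5, rng=rng)
print(released)
```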

The Five Safes as a privacy context

September 22, 2023

5th Annual Symposium on Applications of Contextual Integrity

Abstract

The Five Safes is a framework used by national statistical offices (NSOs) for assessing and managing the disclosure risk of data sharing. This paper makes two points: Firstly, the Five Safes can be understood as a specialization of a broader concept – contextual integrity – to the situation of statistical dissemination by an NSO. We demonstrate this by mapping the five parameters of contextual integrity onto the five dimensions of the Five Safes. Secondly, the Five Safes contextualizes narrow, technical notions of privacy within a holistic risk assessment. We demonstrate this with the example of differential privacy (DP). This contextualization allows NSOs to place DP within their Five Safes toolkit while also guiding the design of DP implementations within the broader privacy context, as delineated by both their regulation and the relevant social norms.

Differential privacy: General inferential limits via intervals of measures

July 13, 2023

13th International Symposium on Imprecise Probabilities: Theories and Applications

Abstract

Differential privacy (DP) is a class of mathematical standards for assessing the privacy provided by a data-release mechanism. This work concerns two important flavors of DP that are related yet conceptually distinct: pure ε-differential privacy (ε-DP) and Pufferfish privacy. We restate ε-DP and Pufferfish privacy as Lipschitz continuity conditions and provide their formulations in terms of an object from the imprecise probability literature: the interval of measures. We use these formulations to derive limits on key quantities in frequentist hypothesis testing and in Bayesian inference using data that are sanitised according to either of these two privacy standards. Under very mild conditions, the results in this work are valid for arbitrary parameters, priors and data generating models. These bounds are weaker than those attainable when analysing specific data generating models or data-release mechanisms. However, they provide generally applicable limits on the ability to learn from differentially private data – even when the analyst's knowledge of the model or mechanism is limited. They also shed light on the semantic interpretations of the two DP flavors under examination, a subject of contention in the current literature.
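
Schematically (illustrative notation), the interval-of-measures reading of pure \(\varepsilon\)-DP: writing \(\mu_x\) for the distribution of the sanitised output on input \(x\), every pair of neighbouring datasets \(x, x'\) must satisfy

```latex
\[
  e^{-\varepsilon}\,\mu_{x'}(A)
  \;\le\;
  \mu_{x}(A)
  \;\le\;
  e^{\varepsilon}\,\mu_{x'}(A)
  \qquad \text{for all measurable } A,
\]
% i.e. \mu_x lies in the interval of measures bounded below by e^{-\varepsilon}\mu_{x'}
% and above by e^{\varepsilon}\mu_{x'}.
```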

Privacy, data privacy, and differential privacy

June 28, 2022

Methodology Division Seminar, Australian Bureau of Statistics

Designing formally private mechanisms for the p% rule

February 5, 2020

Workshop on Advances in Statistical Disclosure Limitation

Abstract

The \(p\)% rule classifies an aggregate statistic as a disclosure risk if one contributor can use the statistic to determine another contributor’s value to within \(p\)%. This is often possible in economic data when there is a monopoly or a duopoly. Therefore, the \(p\)% rule is an important statistical disclosure control and is frequently used by national statistical organisations. However, the \(p\)% rule is only a method for assessing disclosure risk: While it can say whether a statistic is risky or not, it does not provide a mechanism to decrease that risk. To address this limitation, we encode the \(p\)% rule into a formal privacy definition using the Pufferfish framework, and we develop a perturbation mechanism which is provably private under this framework. This mechanism provides official statisticians with a method for perturbing data which guarantees that a Bayesian formulation of the \(p\)% rule is satisfied. We motivate this work with an example application to the Australian Bureau of Statistics (ABS).
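
For concreteness, a commonly used arithmetic form of the \(p\)% rule for a cell total (a standard formulation, stated here for illustration rather than taken from the talk): with contributions ordered \(x_{(1)} \ge x_{(2)} \ge \cdots\) and cell total \(T\), the cell is flagged as sensitive when

```latex
\[
  T - x_{(1)} - x_{(2)} \;<\; \frac{p}{100}\, x_{(1)},
\]
% since the second-largest contributor can estimate the largest contribution as
% T - x_{(2)}, with relative error (T - x_{(1)} - x_{(2)}) / x_{(1)}.
% Worked example: a duopoly with x_{(1)} = 60 and x_{(2)} = 40 gives
% T - x_{(1)} - x_{(2)} = 0, so the second contributor recovers the first's
% value exactly; the cell is sensitive for any p > 0.
```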

Using admin data and machine learning to predict dwelling occupancy on Census Night

October 1, 2019

Statistical Society of Australia's Young Statisticians Conference

Abstract

The Australian Census of Population and Housing (the Census) aims to count every person in Australia on a particular night – called Census Night. Households that do not complete a Census form and do not respond to the Australian Bureau of Statistics’ (ABS) follow-up campaign pose a complication to achieving this aim: Are these dwellings unoccupied, or are they occupied and the residents unresponsive? To achieve its aim, the Census should count these unresponsive residents, but how can the ABS accurately do this? To answer these questions, the ABS has developed a model that uses administrative data – collected by various government and non-government organisations – to predict the occupancy status of a dwelling. There are various challenges surrounding this new method, including the lack of ground truth and the presence of strongly unbalanced classes. However, the method will improve the accuracy of ABS Census population counts and has been adopted as part of the 2021 Australian Census imputation process.

A discrete calibration approach to improving data linkage

March 20, 2019

ABS Methodology Advisory Committee

Stable homotopy theory

February 8, 2017

Australian Mathematical Sciences Institute Connect Conference