Federal Statistics: Managing the Tension Between Data Privacy and the Social Good
Researchers call for using benefit-cost analysis to protect privacy and enhance research and policymaking
A worker fills out information for the 2020 Census.
In 1997, computer scientist Latanya Sweeney was able to identify then-Massachusetts Governor William Weld from anonymized hospital records and voter registration data. The well-publicized incident has come to illustrate the continuing struggle of government statistical agencies to balance data privacy concerns against the use of data for the common good.
A July 25 analysis in the Proceedings of the National Academy of Sciences digs into this high-stakes issue.
“This is not a minor technical issue,” said IPR economist Charles F. Manski, who co-authored the investigation with IPR statistician Bruce Spencer and six others. “It’s an inescapable tension between enhancing privacy and enhancing data usability.”
The researchers, social scientists from economics, econometrics, and statistics, have long experience working with the nation’s premier statistical agencies, particularly the U.S. Census Bureau. They outline how researchers from different disciplines have dealt with “disclosure risk,” the risk that a person’s sensitive personal data could be identified from their supposedly de-identified survey responses. They describe how trends in privacy protection have the potential to “harm the U.S. federal statistical system and result in major losses to society’s research and knowledge base.”
The authors praise the United States for its impressive statistical system, composed of 13 agencies dedicated to collecting various statistics, as well as another dozen agencies compiling data on government programs and operations. Many of the statistics these agencies collect are used for vitally important decisions affecting all Americans’ lives, such as census data for reapportioning House seats and distributing federal funding to states.
But this system is under pressure from two opposing streams. One is pushing to safeguard individuals’ privacy and the information they contribute to datasets, a concern heightened by the growth of massive commercial datasets that can be combined with government data to unmask participants.
The other is pushing for federal agencies to release, or make more widely available, a trove of currently inaccessible data that could help improve evidence-based policymaking. For example, Manski points to how anonymized data from income tax returns would be useful for answering questions about income inequality or how income changes across generations, but hardly any researchers have access to them.
Manski says he and his colleagues approached the issue from the perspective of economics and statistics rather than that of computer science, which has focused primarily on enhancing privacy alone. They contend that these disciplines are better suited to weighing the tradeoffs between protecting privacy and using data.
The researchers examine two of the most commonly proposed tools that might be used by statistical agencies to limit and measure disclosure risk—differential privacy and synthetic data.
In their analysis, the authors argue that use of these tools raises “serious concerns about data usability and data quality that have been inadequately addressed to date. Both threaten to impose major limits on the way research and public policy can be conducted.”
According to Manski, one of the biggest issues revealed in their analysis of differential privacy (DP) lies in how the concept deals with risk. DP measures the risk of disclosure in relative terms, aiming to keep the relative increase below some threshold. However, people presumably care about the absolute risk that their identities will be disclosed, not relative risks. For example, individuals will likely care a lot more about scenarios where their risk of being identified jumps from 20% to 40%, but worry less about ones where it only nudges up from 1% to 2%.
“The relative risk is the same in both cases, but the absolute change in risk is much higher in the first case,” Manski pointed out.
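To make the distinction concrete, here is a minimal sketch in Python using the illustrative figures above; the probabilities are the ones from the example in this article, not estimates from the PNAS paper or any statistical agency:

```python
# Illustrative sketch only: the probabilities below are the figures from the
# example above, not estimates from the PNAS paper or any statistical agency.

def relative_risk(before: float, after: float) -> float:
    """Ratio of the disclosure risk after a data release to the risk before it."""
    return after / before

def absolute_change(before: float, after: float) -> float:
    """Change in disclosure risk, expressed as a fraction (percentage points / 100)."""
    return after - before

scenarios = {
    "high baseline": (0.20, 0.40),  # risk jumps from 20% to 40%
    "low baseline": (0.01, 0.02),   # risk nudges up from 1% to 2%
}

for name, (before, after) in scenarios.items():
    print(f"{name}: relative risk = {relative_risk(before, after):.1f}x, "
          f"absolute increase = {absolute_change(before, after):.0%}")

# Both scenarios double the risk (2.0x), but the absolute increase is
# 20 percentage points in the first case and only 1 point in the second.
```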
Nevertheless, the Census Bureau has adopted differential privacy for the 2020 Census. Equally troubling, Manski says, is a movement underway to incorporate differential privacy into major surveys like the American Community Survey and the Current Population Survey. Doing so risks rendering their longitudinal data “worthless,” he says, as the surveys would only be able to provide summary statistics.
In their examination, Manski and his colleagues argue that benefit-cost analysis, as recommended by Carnegie Mellon statisticians George Duncan and Diane Lambert in the 1980s and later, provides a better-grounded framework for assessing the tradeoffs between privacy risk and data use, as well as a means to evaluate the social benefits and costs of alternative policies.
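As a rough illustration of the kind of comparison a benefit-cost analysis involves, here is a minimal, hypothetical sketch; the policy names, dollar values, and disclosure counts are invented for illustration and are not drawn from the paper or any agency analysis:

```python
# Hypothetical illustration of benefit-cost logic for a data-release decision.
# All names and numbers are invented; none come from the paper or any agency.

def net_social_benefit(value_of_data_use: float,
                       expected_disclosures: float,
                       harm_per_disclosure: float) -> float:
    """Net benefit = value society gets from using the data
    minus the expected cost of privacy harms."""
    return value_of_data_use - expected_disclosures * harm_per_disclosure

# Compare two hypothetical release policies for the same dataset.
policies = {
    "heavily noised release": dict(value_of_data_use=1_000_000,
                                   expected_disclosures=1,
                                   harm_per_disclosure=50_000),
    "lightly noised release": dict(value_of_data_use=5_000_000,
                                   expected_disclosures=20,
                                   harm_per_disclosure=50_000),
}

for name, params in policies.items():
    print(f"{name}: net social benefit = ${net_social_benefit(**params):,.0f}")
```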
“Performing a serious benefit-cost analysis is not an easy undertaking; it can take years,” Manski acknowledged. But precedents exist: It was first used in the early 1900s, and it has been written into current regulations of the Office of Management and Budget (OMB) for use by agencies like the Federal Emergency Management Agency, or FEMA.
“The Census Bureau needs to do benefit-cost analyses,” Manski advised. “And it needs to be more like the OMB, which has them built into its regulatory system.”
What approach statistical agencies decide to use comes down to a “policy choice,” Manski says, one that has critical consequences for researchers and Americans.
“We can’t just ‘talk’ about these things—we have to use quantitative methods to assess needs by doing a serious benefit-cost analysis,” Manski said.
Charles F. Manski is the Board of Trustees Professor of Economics and an IPR fellow.
Photo credit: U.S. Census Bureau
Published: July 25, 2022.