New Methods to Protect Privacy When Using Patient Health Data to Compare Treatments

Structured Abstract

Background:

Sharing and reusing clinical data is key to enabling patient-centered outcomes research (PCOR). Data registries established for conducting PCOR must ensure appropriate privacy and confidentiality protections as stated by the PCORI Methodology Committee. There is rising concern that current deidentification or “anonymization” practices insufficiently protect against reidentification and disclosure of private patient data.

Objectives:

The objective of this project was to develop a framework, which we named patient-centered Statistical Health informAtion RElease (pSHARE), for building patient-centered and privacy-preserving statistical data registries for PCOR using the rigorous differential privacy (DP) framework, which gives a provable guarantee on the privacy of patients who provide data. The main goal was to optimize the trade-off between data utility (ie, minimal noise) and data privacy (ie, DP constraints satisfied) in the data registry. The project had 3 specific aims: (1) develop methods for establishing registries of private data, (2) develop methods for establishing registries that contain both private and consented data, and (3) develop methods for evaluating and tracking patient privacy risks and establishing data registries that take into account fine-grained patient privacy preferences.

Methods:

The main challenge in designing DP methods is how to minimize the amount of noise added to the data so that data utility is preserved but a given DP constraint is not compromised. Our approach was patient centered, data driven, and research driven. To preserve data utility, we addressed the high dimensionality and high correlation of the data used in typical PCOR studies by explicitly modeling the cross-dimensional and temporal correlations. In addition, we focused on developing methods for building data registries that respect the fine-grained personalized privacy preferences of patients rather than simple binary opt-in/opt-out preferences. We included patient and stakeholder engagement panels to ensure that the resulting pSHARE methodology was driven by patient perspectives. We also constructed data registries at Emory University and the University of California, San Diego (UCSD) using data extracted from clinical data warehouses to study the inherent trade-offs between privacy protection and utility of the data in PCOR studies.

Results:

Project outcomes included (1) a suite of novel algorithms and techniques and a software tool kit for building data registries that rigorously protect patient privacy preferences; and (2) an evaluation of pSHARE using both publicly available data and data extracted from Emory and UCSD clinical data warehouses with insights on the trade-offs between data utility and patient privacy.

Conclusions:

Our developed algorithms and software tool kit provide a new and more rigorous methodology that complements current deidentification and policy-based data registry practices. pSHARE can empower patients who contribute their data to PCOR by providing them with rigorous and transparent privacy controls.

Limitations:

The developed methodologies present inherent trade-offs between data privacy and data utility. Our customized pSHARE approach can be used to develop data registries optimized for specific PCOR studies that simultaneously guarantee patient privacy and empirical data utility (eg, by preserving longitudinal patterns), but this approach may not be versatile enough to support arbitrary types of PCOR studies (eg, those that require cross-sectional patterns). The project described here included a patient engagement plan that involved a series of stakeholder panels, from which we gained a preliminary understanding of patient attitudes toward sharing data with researchers as well as patient privacy preferences. Large-scale studies, such as patient surveys, are needed to provide a broader and deeper understanding of patient privacy preferences and attitudes toward adoption of the developed methodology.

Background

Methodologic Gaps

The accelerated adoption of electronic health records (EHRs) presents enormous opportunities to use health information for patient-centered outcomes research (PCOR). The increasing accessibility of EHRs has allowed institutions and researchers to efficiently integrate, retrieve, and filter longitudinal and high-dimensional patient data, leading to PCOR studies that provide an extensive retrospective analysis of patients' health care data. All data registries established for conducting PCOR must ensure appropriate privacy and confidentiality protections, as stated in the PCORI Methodology Committee report.1 However, achieving such privacy protections while still allowing the information to be used is a critical challenge that has been detailed in recent studies and advisory reports to the government.2-4 For example, the Institute of Medicine Committee on Health Research and the Privacy of Health Information concluded4 that the HIPAA Privacy Rule does not sufficiently protect privacy and that, as currently implemented, it impedes important health research. Two of the committee's 5 major findings were that the Privacy Rule (1) overstates the ability of informed consent to protect privacy in lieu of incorporating comprehensive privacy protections, and (2) creates barriers to research and leads to biased research samples, which generate invalid conclusions. The processes of obtaining individual consent and recruiting research volunteers make it challenging for health researchers to undertake important research activities while complying with all applicable regulations.4 Our goal was to address these issues by using a new approach, beyond informed consent, to protect patient privacy when health data are used for research.

Data registries contain clinical data (eg, patient demographic information, diagnoses, and medications) extracted or derived from patient EHRs from 1 or multiple institutions to support PCOR studies. These studies evaluate specified outcomes for populations with a particular disease, condition, or exposure. In current practice, data registries for PCOR are usually established using the original records of patients who have consented to the use of their data in research. Figure 1 illustrates the process for building data registries. Patients' fear and anxiety over the potential loss of privacy and confidentiality can affect their willingness to consent. In addition, the overhead required for IRB reviews and policy setup can significantly delay the process. As an alternative, institutions can build data registries from deidentified data, which are exempt from the HIPAA Privacy Rule. Deidentification, or “anonymization,” is intended to protect patient privacy while preserving the utility of the clinical data for large-scale analysis. However, the use of deidentified data has raised increasing concerns that current deidentification methodologies insufficiently protect against reidentification and disclosure: an adversary, that is, an entity who attempts to gain access to the data, may reidentify patients in the data registry and/or obtain sensitive information about them. These challenges may significantly hinder PCOR studies and may also introduce bias into cohort distributions, hence weakening the validity of PCOR studies. For example, classifying complex or rare patterns in clinical data requires access to a large data set that includes patients with a particular disease or condition; in this case, privacy is a major barrier, because patients with a rare condition may be reluctant to consent.

Figure 1. Existing Privacy Practices for Establishing Data Registries for PCOR.

Goals of the Proposed Research and Specific Aims

Our research was based on the rigorous differential privacy (DP)5 framework, which has been accepted by the data privacy community as the de facto privacy principle for statistical data release and was recently adopted for the 2020 US Census to avoid data disclosure.6 The DP framework aligns with the patient consent principle: patients consider opt-in data sharing—the principle that a health care provider should obtain an individual's affirmative consent before collecting or sharing data—to be one of their most important privacy rights. The DP framework states that a statistical aggregation or computation satisfies “ε-differential privacy” if the results of the computation performed with a particular record from the data set are “indistinguishable” from the results produced without the record, where ε is a privacy parameter that limits the maximum amount of influence a record can have on the result. In other words, an observer essentially cannot determine whether a patient contributed to the aggregated result, and hence cannot infer the personal characteristics of a patient from that result. This provides a rigorous method for using statistical aggregated data for a patient population while guaranteeing privacy. There are instances, such as clinical outcomes that occur infrequently, in which the result would not be indistinguishable depending on whether a patient opts in or out; this is the inherent conflict, or trade-off, between data utility and privacy, which we discuss later in this section. We summarize the key concepts and definitions used throughout the report in Table 1.
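
Formally (this is the standard definition from the DP literature, restated here for reference), a randomized computation M satisfies ε-DP if, for every pair of data sets D and D′ that differ in at most 1 record and for every possible set of outputs S,

    Pr[M(D) ∈ S] ≤ e^ε × Pr[M(D′) ∈ S].

A smaller ε forces the 2 output distributions to be closer together, which is the formal sense in which any single record's influence on the result is limited.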

Table 1. Key Concepts and Definitions.

A common mechanism used to achieve ε-DP for a given statistical computation on the data is the Laplace mechanism,7 which adds random noise drawn from a Laplace distribution to the original computation result. The scale of the Laplace distribution is determined by the privacy parameter ε and by the sensitivity of the statistical computation with respect to the inclusion or exclusion of any record in the data set. A more stringent (lower-value) privacy parameter ε requires more noise to be added to the computation result and hence provides a higher level of privacy. One important property of DP is its composability, which permits the quantification of cumulative privacy loss over multiple DP computations. A data custodian can specify an overall privacy parameter (ie, privacy budget8) for a sequence of statistical computations so that each computation in the series uses a portion of the total budget9 and their composition satisfies the overall DP requirement. The main challenge in designing DP methods is minimizing the amount of noise added to the data so that data utility is preserved while the given DP constraint is still satisfied. It is also worth noting that this approach works better when more data are used to build the registry; that is, the same amount of noise introduces less relative error and hence has a smaller impact on utility.
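
As a concrete illustration (a minimal sketch of the standard mechanism, not the project's released tool kit; the function names are ours), the Laplace mechanism for a count query can be written as follows. The sensitivity of a count is 1, because adding or removing 1 record changes the count by at most 1; the last lines show sequential composition, where an overall budget of 1.0 split evenly across 4 queries leaves ε = 0.25 for each.

    import numpy as np

    def laplace_count(true_count, epsilon, sensitivity=1.0):
        """Release a count under epsilon-DP via the Laplace mechanism.

        The noise scale is sensitivity/epsilon: a smaller (more stringent)
        epsilon yields a larger scale and hence more noise."""
        return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

    # Sequential composition: splitting an overall budget of 1.0 across
    # 4 count queries leaves epsilon = 0.25 for each query.
    per_query_budget = 1.0 / 4
    noisy_counts = [laplace_count(c, per_query_budget) for c in [120, 85, 40, 7]]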

Figure 2 shows an example of a perturbed histogram satisfying DP by adding noise to each of the original histogram counts. DP guarantees that changing or removing a patient record (eg, Alice) from the data set has negligible impact on the output-perturbed histogram. Consider an extreme case wherein a malicious party (ie, an adversary) knows that Alice is in the data set and knows her age but does not know her HIV status. The adversary also knows the HIV status of all other patients in the data set. Given the original histogram, the adversary can infer that Alice is infected with HIV by observing the count of HIV+ patients for the age group (40-45 years). In contrast, given the perturbed histogram, the adversary cannot infer whether Alice contributes to the aggregated result or is infected with HIV.
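
The scenario in Figure 2 can be mimicked in a few lines (our own toy example with made-up counts): because each patient falls into exactly 1 cell of the age-by-HIV-status histogram, the whole histogram has L1 sensitivity 1 and can be perturbed cell by cell under a single budget ε.

    import numpy as np

    def dp_histogram(counts, epsilon):
        """Perturb every histogram cell with Laplace noise of scale 1/epsilon.

        Rounding and clipping are post-processing and do not weaken DP."""
        counts = np.asarray(counts, dtype=float)
        noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
        return np.clip(np.round(noisy), 0, None)

    # Hypothetical counts: rows are age groups, columns are HIV-/HIV+.
    hist = [[52, 3],   # ages 40-45 y
            [47, 5]]   # ages 46-50 y
    print(dp_histogram(hist, epsilon=0.5))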

Figure 2. DP Example.

A growing number of works have addressed data release or specialized data-analysis tasks that use DP.5,7-25 However, applying DP mechanisms to data registry building to support PCOR studies presents technical challenges because of the high dimensionality and high correlation of data used in PCOR studies. The 2 most common types of PCOR studies are (1) cross-sectional studies that involve multiattribute observations of all or a representative subset of a population at 1 specific point in time and (2) longitudinal studies that involve repeated observations over time. In addition, most previous DP methods are not patient centered and instead take a one-size-fits-all approach by treating all patients the same.

To address these challenges, we proposed a framework for building patient-centered and privacy-preserving statistical data registries for PCOR that we named patient-centered Statistical Health informAtion RElease (pSHARE). Figure 3 shows an overview of the research. Using original patient records, pSHARE builds a set of precomputed summary statistics with a rigorous DP guarantee and an optional set of synthetic records that is consistent with the statistics. PCOR studies can be conducted using these summary data or synthetic data before access to the original data is required. We divided the original data into 2 categories: (1) private data (data without explicit patient consent to be used in the original form for research purposes) and (2) consented data (data with patient consent). Our goals were to use the private data with DP guarantees to complement the consented data to better support PCOR studies and to enable patients to make more-informed and fine-grained privacy choices (such as specifying the level of DP) beyond the binary option of opt in or opt out. The research had 3 specific aims: (1) develop methods for establishing registries of private data, (2) develop methods for establishing registries that contain both private and consented data, and (3) develop methods for evaluating and tracking patient privacy risks and establishing data registries that take into account fine-grained patient privacy preferences.

Figure 3. Research Overview.

Our main goal was to optimize the trade-off between data utility (ie, minimal noise) and data privacy (ie, DP constraints satisfied) in the data registry. We envision that the resulting data registries can be used to support cohort queries such as those supported by the commonly used i2b2 query tool26 (eg, size of a patient cohort with certain demographic characteristics and a certain disease or condition specified as predicates) and to train various machine learning models (eg, Cox survival models for outcome analysis). However, we do acknowledge that the methodology is not suitable for PCOR studies of rare diseases, as DP is achieved by obfuscating or statistically bounding the impact of each individual patient datum on the output model; hence, the accuracy of rare disease models would be significantly impacted given the small number of contributing patients.

Depending on the application of the data registry, its utility can be measured using 3 methods: (1) error or distance metrics that compare the released data and original data, (2) error metrics of answering predicate queries using the data, or (3) metrics that evaluate the accuracy of predictive models learned from the data. Here, we used both publicly available population data sets (for details, please see the Methods section, “Data Sources and Data Sets”) and EHR data from Emory University and the University of California, San Diego (UCSD) to evaluate the privacy and utility trade-offs of the proposed methods. The 3 aims with different types of data registries (described in the Methods section under “Specific Aim 1: Research Design: Building Data Registries Using Private Data,” “Specific Aim 2: Research Design: Building Data Registries Using Both Private Data and Consented Data,” and “Specific Aim 3: Research Design: Building Data Registries With Fine-Grained Patient Privacy Control”) are intended to support the same type of research questions as those discussed previously. However, they are applicable in different settings, depending on whether there are publicly consented data or more fine-grained patient privacy preferences available. Because privacy is a personal perception, there are natural differences from one person to another. Without any prior knowledge of the privacy preferences of individuals, we can only protect individuals equally and treat all data as private (see the Methods section, “Specific Aim 1”). If we have information on the privacy preferences of individuals, it is possible to improve the methods to allocate the privacy budget more wisely and enhance the utility of the resulting data. In the Methods section under “Specific Aim 2,” we discuss the use of consented data to generate a synthetic data registry for a private cohort and under “Specific Aim 3” to further generalize the model to accommodate different preference levels for each contributor.

Potential Impact of the Research

We believe that our research provides a new methodology that is complementary to and more rigorous than current deidentification and policy-based data registry practices. This methodology will allow institutions to build data registries containing derived summary statistics, such as histograms and/or synthetic records, that ensure rigorous and patient-centered privacy guarantees. It will also allow researchers to perform multiple exploratory PCOR analyses to determine which original data they need access to for a full study before requesting IRB approval and patient consent. We believe that pSHARE will provide a principled and practical methodology that significantly advances the current state-of-the-art methods with regard to (1) rigorous and patient-centered privacy and confidentiality protections that build public trust and encourage continued patient participation in PCOR studies; and (2) rigorous guarantees of data integrity that facilitate cohort discovery and exploratory analyses using statistical and synthetic data. Ultimately, pSHARE will facilitate the development of PCOR studies.

Patient and Stakeholder Engagement

We formed a stakeholder panel that included patient privacy advocates, patients, privacy compliance officers, and biomedical informaticians. The panel consisted of the following members:

  • Katherine Kim, PhD, MPH, MBA (assistant professor, Betty Irene Moore School of Nursing, University of California [UC], Davis), is a patient engagement expert. Her research focuses on information technology to improve community health, care coordination, and clinical research. She served as an advisory member on the panel.
  • Pam Dixon is the founder and executive director of the World Privacy Forum, a US-based public interest research group that is well known and respected for its consumer privacy research. She served as an advisory member on the panel.
  • Kristin West, JD, MS (associate vice president, research director, Office of Research Compliance, Emory University), serves as the privacy officer for Emory University. She works closely with researchers, information technology team members, and the IRB to determine how to facilitate clinical research using identifiable health information while safeguarding the privacy rights of individuals.
  • Michael W. Kalichman, PhD, is founding director of the UCSD Research Ethics Program (http://ethics.ucsd.edu) and the San Diego Research Ethics Consortium (https://ethics.ucsd.edu/related-programs/sdrec/) as well as the co-founding director of the Center for Ethics in Science and Technology (http://ethicscenter.net). He has extensive experience aligning research methods with public perceptions and preferences regarding privacy.
  • Robert El-Kareh, MD, MS, MPH, is a faculty member within the Divisions of Biomedical Informatics and Hospital Medicine at UCSD. He is a practicing hospitalist, and his main interests include clinical informatics, quality improvement, and the use of electronic data to identify and prevent diagnostic errors in health care. Dr El-Kareh represented the clinicians and informaticians who will benefit from using data registries.
  • Cynthia Burstein Waldman (patient and patient advocate) has served as chair of the board of directors of a national patient advocacy group for patients with a hereditary heart condition. Through her work with this organization, she became interested and involved in patient privacy and medical research and was often called upon to make policy decisions in these areas.
  • Gordon Fox is a patient with hypertrophic cardiomyopathy and a moderator of the discussion board for the Hypertrophic Cardiomyopathy Association.
  • Carly Medosch is a patient and an independent chronic illness advocate. She is particularly interested in health technology and believes in the importance of patient inclusion in all aspects of health care.
  • Barbara Saltzman is a patient, a registered nurse, and an attorney. Her last position was as a patients' rights advocate for the County of Los Angeles Department of Mental Health. She continues to provide volunteer patient advocacy for family and friends, and she reads various publications to keep current on medical research and privacy issues.

We convened 3 online panels (2 hours for each via audio/web conference) at different stages of the project. Detailed information about the panel members and all panel recordings are available at http://www.mathcs.emory.edu/aims/pcori/panel.html. The panel members provided feedback to help us formulate and solidify the study's research question design and evaluations, as well as to help us disseminate the project results. The panel provided 2 concrete outcomes. First, their input convinced us of the variability of patient privacy preferences and the need to develop flexible and customizable methodologies that take these preferences into account. This patient perspective drove our methodology, described in the Methods section under “Specific Aim 3,” which includes algorithms that consider personalized patient privacy preferences.

Second, the panel made clear that the concept of a privacy budget, which is the key parameter in DP (please see Table 1 and “Goals of the Proposed Research and Specific Aims” under the Background section), has no intrinsic intuitive meaning and is difficult to explain to patients, honest brokers, and regulators. The panel provided valuable insights on the importance of clearly communicating privacy budget settings and their privacy and utility trade-offs to users. However, large-scale studies, such as patient surveys, are needed to gain a broader and deeper understanding of patient privacy preferences, including how demographic factors impact large-scale adoption of pSHARE or related DP methodology. We had started designing such a large-scale patient survey but left it for future research because of limited resources. The draft survey is available on our project website at http://www.mathcs.emory.edu/aims/pcori/Survey/survey.html. The survey questions aim to understand patient attitudes toward both traditional deidentification approaches and DP approaches for privacy protection and how patients' health statuses and demographic backgrounds impact their attitudes. In addition to the survey questions, our project website includes many educational videos on key concepts such as EHR and DP.

Methods

We developed a pSHARE framework for building patient-centered and privacy-preserving statistical data registries for PCOR. Figure 3 provides an overview of our research and the challenges it addressed. Using original patient records, pSHARE builds a set of precomputed summary statistics that guarantee DP and statistical integrity, as well as an optional set of synthetic records that are consistent with the statistics and on which exploratory PCOR studies (eg, preassessment of hypotheses) can be conducted before access to the original data is required (eg, verification of hypothesis).

Our project had the following main outcomes:

  • A suite of algorithms and techniques for building data registries that preserve DP and patient privacy preferences while enabling a wide range of exploratory PCOR studies. Our methods considered 3 settings corresponding to the 3 specific aims: (1) using private data alone, (2) using both private and consented data, and (3) using data with fine-grained patient privacy preferences. pSHARE may be tailored to a specific health data domain, and it addresses the unique data characteristics common to PCOR studies, including high dimensionality and high correlation.
  • The results from our evaluations of the algorithms and methods using both publicly available population data and private patient data extracted from Emory and UCSD clinical data warehouses with insights on the inherent trade-offs among utility, privacy, and efficiency for applying DP to build PCOR data registries.
  • Open-source software (https://github.com/pshare-emory) that allows medical information service providers to build data registries ensuring rigorous DP guarantees more effectively and efficiently while enabling individualized patient risk and preference controls.

Here, we describe our proposed methods for each of the specific aims (with the corresponding publications that resulted from the work). We then summarize the results of our evaluations for each method in the Results section.

Specific Aim 1: Research Design: Building Data Registries Using Private Data

Building Data Registries for Multidimensional and Dynamic Data [Related Publications: Li et al, CIKM 2015]

The main approaches used in previous work to generate statistical summaries or synthetic data with DP can be classified into 2 categories: (1) parametric methods that fit the original data to a multivariate distribution and make inferences on the parameters of the distribution,13 and (2) nonparametric methods that learn empirical distributions from the data through histograms.19,27-29 Most of these approaches work well for single-dimensional or low-order data but become problematic when used with common types of PCOR data, which tend to be high dimensional (eg, demographic information and diagnoses) and have large attribute domains (eg, age and laboratory test results). To address these challenges, we proposed DPCopula,30 a semiparametric DP data synthesis method for multidimensional and large-domain data that uses copulas (ie, functions that enable modeling of the marginal distributions for each attribute or dimension and the dependency structure among the dimensions). We discuss how to model high-dimensional attributes, such as diagnoses and medications, as sequences later in this section. However, the copula-based methods and most existing histogram methods are designed to handle “one-time” releases of static data sets and do not adequately address the increasing need to handle dynamically changing data sets. A straightforward application of existing histogram methods to snapshots of dynamic data sets incurs high accumulated error, both because of the composability of DP (which quantifies the overall privacy guarantee over multiple DP computations) and because of the correlations or overlapping data subjects between snapshots.

To address the dynamics of evolving data sets under patient-level DP, we proposed a distance-based sampling approach. Our goal was to allow publication of a series of histograms that PCOR researchers can use directly (eg, for cohort queries) or for sampling synthetic data to train machine learning models. Instead of generating a DP histogram at each time stamp, we computed new histograms only when an update was significant (ie, when the difference between the current data set and the most recently released data set exceeded a threshold, measured by a distance metric such as the Euclidean distance between the 2 histograms). Both the distance computation and the threshold comparison were designed to guarantee DP. Data sets may be subject to periodic small updates; distance-based sampling releases a new histogram only when the data sets have significant updates, hence saving the privacy budget and reducing the overall error of the released histograms. The explicit threshold-based sampling provides 2 advantages: (1) we can predefine a threshold based on the expected update rate of the data if there is prior domain knowledge, and (2) we can dynamically adjust the threshold in a principled way based on data dynamics. Another important feature of our approach is that it is orthogonal to the histogram method used at each time point (ie, it can use any state-of-the-art, static, DP histogram release method as a black box), which makes it efficient and effective for generating the “one-time” histograms. We proposed 2 methods for defining the threshold. The first, distance-based sampling with fixed threshold (DSFT), uses a predefined threshold T. The second, improved method, distance-based sampling with adaptive threshold (DSAT), applies a feedback control mechanism to adaptively adjust the threshold. We presented a formal analysis of the DP guarantees, complexity, and utility of DSFT and DSAT. Our experimental results demonstrate that our methods significantly outperformed baseline approaches and existing state-of-the-art techniques.
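
A simplified sketch of this loop follows (our own paraphrase: it corresponds roughly to DSFT with a fixed threshold, the noisy comparison is schematic, and the published algorithms allocate and account for the privacy budget more carefully than shown here).

    import numpy as np

    def release_dynamic_histograms(snapshots, eps_cmp, eps_rel, threshold):
        """DSFT-style release: publish a new DP histogram only when the
        current snapshot has drifted far enough from the last release.

        snapshots: iterable of histograms, 1 per time stamp
        eps_cmp:   budget spent on each noisy distance comparison
        eps_rel:   budget spent on each actual histogram release"""
        releases, last = [], None
        for hist in snapshots:
            hist = np.asarray(hist, dtype=float)
            if last is None:
                last = hist + np.random.laplace(scale=1.0 / eps_rel, size=hist.shape)
            else:
                # Noisy threshold comparison (schematic): both the distance
                # and the threshold are perturbed so the test itself is private.
                dist = np.abs(hist - last).sum() + np.random.laplace(scale=1.0 / eps_cmp)
                if dist > threshold + np.random.laplace(scale=1.0 / eps_cmp):
                    last = hist + np.random.laplace(scale=1.0 / eps_rel, size=hist.shape)
            releases.append(last)  # a skipped update re-publishes the last release
        return releases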

Building Data Registries for Correlated Sequential Data [Related Publications: Xu et al, ICDE 2015; Xu et al, TKDE 2016]

Many PCOR studies require data that include repeated encounters for each patient and observations of high-dimensional attributes (such as different diagnostic codes, medication orders, and laboratory measurements) over periods of time. Direct application of the histogram methods at individual time points not only introduces highly compounded errors but also loses the self-correlation between time points, that is, the temporal patterns of the population. In our prior work,31 we used a prefix tree, or trie, to store and represent the temporal patterns of the longitudinal data. A prefix tree is an ordered tree structure that groups temporal patterns with the same prefix (beginning patterns) into the same branch of the tree. We implemented a basic DPTrie algorithm, which generates a DP prefix tree from the original data with noise added to each node count. Our preliminary study using Emory's EHR prescription data set31 showed that the basic DPTrie could support temporal query patterns for relatively short event sequences but did not work well for long and unsynchronized query patterns. Motivated by this, we proposed a new approach that extracts only the frequent (sub)patterns from the data set under DP; synthetic sequential data that preserve those frequent patterns can then be constructed. Our algorithm, differentially private frequent sequence mining via sampling-based candidate pruning (PFS2), is based on the a priori pattern-mining algorithm, which builds candidate sequences directly from frequent subsequences. It uses a small sample of the database to estimate the frequency of candidate sequences and to prune infrequent ones before counting their support in the original database. This reduces the amount of noise required by DP and improves the utility of the results.
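
The sketch below conveys the flavor of this sampling-based pruning (a heavy simplification of PFS2: the per-level budget split, the pruning rule with its 0.5 relaxation factor, and the sensitivity bound are our own schematic choices, not the paper's calibrated ones).

    import random
    import numpy as np

    def contains(seq, pat):
        """True if pat occurs in seq as a (not necessarily contiguous) subsequence."""
        it = iter(seq)
        return all(x in it for x in pat)

    def support(db, pattern):
        return sum(contains(seq, pattern) for seq in db)

    def private_frequent_sequences(db, alphabet, max_len, min_sup, eps,
                                   sample_frac=0.1):
        """Simplified PFS2-style miner: estimate candidate frequencies on a
        small sample, prune unpromising candidates, and add Laplace noise
        only to the supports of the survivors."""
        sample = random.sample(db, max(1, int(sample_frac * len(db))))
        eps_level = eps / max_len            # naive budget split across levels
        frequent, prev = {}, [()]
        for _ in range(max_len):
            candidates = [p + (a,) for p in prev for a in alphabet]
            # Sampling-based pruning: discard candidates that already look
            # infrequent on the sample before spending any budget on them.
            candidates = [c for c in candidates
                          if support(sample, c) >= 0.5 * min_sup * len(sample) / len(db)]
            if not candidates:
                break
            # One sequence can raise every candidate's support by 1, so the
            # noise scale grows with the candidate count -- pruning directly
            # reduces the noise and improves utility.
            scale = len(candidates) / eps_level
            prev = []
            for c in candidates:
                noisy = support(db, c) + np.random.laplace(scale=scale)
                if noisy >= min_sup:
                    frequent[c] = noisy
                    prev.append(c)
        return frequent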

Building Data Registries for Correlated Graph Data [Related Publications: Xu et al, ICDE 2016; Cheng et al, TKDE 2018]

Many kinds of correlated data can also be modeled as graphs representing the correlations or co-occurrences between events. For example, each node can represent a diagnosis or medication; when 2 events co-occur in 1 encounter, we draw an edge between them. In this way, we can represent each patient's encounters as a graph. We can then identify frequent subgraphs, that is, subgraphs that occur in the input graphs more frequently than a given threshold, which may represent patterns of co-occurring diseases/conditions or common phenotypes. We studied this problem of frequent subgraph mining (FGM) under the DP model. To make FGM satisfy DP, a potential approach is to use the Laplace mechanism to find frequent subgraphs in order of increasing size. Although this approach can identify the frequent subgraphs and obtain their noisy supports simultaneously, it does not perform well, primarily because the amount of noise it adds is proportional to the number of candidate subgraphs, and the mining process generates a large number of candidates, leading to a large amount of perturbation noise. To address this challenge, we introduced a new DP frequent subGraph mining (DFG) algorithm. DFG consists of 2 main steps: (1) private identification of frequent subgraphs from the input graphs and (2) computation of the noisy support of each identified subgraph.
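
The noise calibration makes this problem concrete: because adding or removing a single input graph can change every candidate's support by 1, the Laplace scale must grow with the number of candidates. A schematic comparison follows (our illustration, not the DFG implementation).

    import numpy as np

    def naive_noisy_supports(true_supports, epsilon):
        """Release all candidate-subgraph supports under one budget epsilon.

        One input graph can affect each candidate's support by 1, so the L1
        sensitivity -- and hence the per-candidate noise scale -- grows
        linearly with the number of candidates."""
        scale = len(true_supports) / epsilon
        return [s + np.random.laplace(scale=scale) for s in true_supports]

    # With 10 candidates the scale is 10/eps; with the tens of thousands of
    # candidates generated mid-mining, the noise swamps the true supports.
    # DFG's 2 steps (privately identify the frequent subgraphs first, then
    # compute noisy supports only for those) keep this candidate set small.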

Quantifying and Controlling the Privacy of Traditional DP Mechanisms for Correlated Data [Related Publications: Cao et al, ICDE 2017; Cao et al, PVLDB 2018; Cao et al, TKDE 2019]

Because PCOR data can be highly correlated, we also investigated traditional DP mechanisms for continuously releasing highly correlated data at each time point (ie, under event-level DP). Existing mechanisms typically assume that the data at different time points are independent or that adversaries do not possess knowledge of the correlations in the data. We investigated the potential privacy loss of a traditional DP mechanism under temporal correlations when such correlations have been acquired by adversaries.

We proposed a system to control temporal privacy leakage (TPL) in a traditional DP continuous-data release mechanism as shown in Figure 4. First, a data curator chooses a privacy budget and selects a DP mechanism to use for releasing private streaming data. Our system can quantify the TPL of the selected mechanism and then calibrate the appropriate privacy budgets, which the data curator then uses in the DP mechanism in a way that counters the TPL. The system consists of 2 major modules: (1) privacy budget calibration and (2) TPL quantification. The first module provides an appropriate privacy budget for the DP mechanism at each time point so that the released data are protected from TPL. The second module, TPL quantification, precisely quantifies TPL based on the use of previous and current privacy budgets. It visualizes the change of TPL at each time point and estimates data utility, which helps the data curator understand the trade-off between privacy and utility.

Figure 4. Quantifying and Controlling Privacy for Correlated Sequence Data.

Specific Aim 2: Research Design: Building Data Registries Using Both Private Data and Consented Data

General-purpose algorithms for privacy protection do not consider varying patient attitudes toward health privacy; therefore, most existing methods (including those we described earlier) treat all patient records as equally sensitive. As a result, these algorithms may introduce too much perturbation, rendering the resulting information suboptimal for PCOR analyses. In reality, different patients have different privacy concerns regarding their data, and some require less protection than others (see “Specific Aim 3: Research Design: Building Data Registries With Fine-Grained Patient Privacy Control”). Many people even make their data freely accessible (eg, by open consent in clinical trials) to promote research and scientific discovery. Other public data resources, such as the 1000 Genomes Project and the Personal Genome Project, also provide useful information. Because these consented data are samples of larger data sets that also include private records, it is possible to develop methods that improve on the utility of existing general-purpose algorithms, which protect every patient equally regardless of individual preferences.

Building Local Data Registries With Consented and Private Data [Related Publications: Wang et al, TKDE 2018]

Our prior study32 showed that even when consented/public data represent a small percentage of the total data, these data can be used to improve the discrimination and calibration of logistic regression models. We also built privacy-preserving support vector machine learning models to jointly consider private and consented data. These models demonstrated similar advantages over general-purpose privacy-preserving algorithms.33 These investigations focused on supervised learning models, but they were limited in their generalizability.

To promote patient-centered privacy research, we developed a general methodology in this project to build data registries that use private and consented data and can support broader data-analysis tasks. During their exploratory analyses, researchers might want to try different tools, including cross-sectional data-analysis methods like logistic regression, longitudinal data-analysis methods like Cox proportional hazards models, or even very simple correlation analysis. Therefore, customized designs that optimize a single type of analysis might not serve different types of downstream analysis well. Our strategy here was to maximally preserve the distribution of the private data by leveraging the availability of consented data to support a wide spectrum of potentially different exploratory analyses. To achieve this goal, we constructed a weighted private density from the hybrid data sets under DP. The main idea is to assign a weight to each public point that is proportional to the count of private points close to that point. We focused on protecting privacy in M-estimators, which are well studied in statistics. Under some regularity conditions, M-estimators are robust to outliers; this property makes it possible for M-estimators to maintain high utility under DP. Our proposed workflow is summarized in Figure 5.

Figure 5. Building Data Registries With Consented (Public) and Private Data.

Consider a situation in which a researcher (the data user) would like to conduct an exploratory data analysis to derive M-estimators (eg, the coefficients of the covariates in a logistic regression) from a private data set before making a formal IRB application. Such a scenario is similar to the i2b2 mechanism that researchers use to determine whether their patient sample size is large enough to support an envisioned study. Our proposed method can improve the ability of researchers to conduct exploratory analyses of synthetic data to gain a sense of the viability of their hypotheses (eg, whether a risk factor has a significant association with hypertension). The researcher requests a best model under DP from a trusted third party (TTP) that can access both the public and private data sets. One way to achieve this is to directly infer the M-estimators from the private data set under DP; however, this method might not work with high-dimensional data. Alternatively, the TTP could use public consented data to represent the private data set. In most cases, though, the open-consented population could contain bias, for example, if younger and more highly educated people are more willing to make their information accessible. Because of this possible bias in public data sets, a significant challenge is determining how to use public information to better understand the private data set without risking privacy. Our main contributions to answering this question are (1) theoretical support for the idea that an optimal subset of the public data set, rather than the whole data set, can be defined and used to provide highly accurate and privacy-protected M-estimators; and (2) a DP selection procedure to obtain this optimal subset. Given the private data of interest, our method generates a synthetic DP hybrid cohort by referencing the public data to pick the most representative samples to facilitate a study, which ensures valid statistical procedures. Theoretically, we found that the bias-variance trade-off in the performance of our M-estimators can be characterized in terms of the sample size of the released data set. In addition, we empirically verified that, because of the bias of a public data set, directly using public data sets to represent private data sets incurs a large prediction error compared with our optimal subset selection procedure.
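
A minimal sketch of the weighting idea follows (our own simplification: each private record is assigned to its single nearest public point so that the count vector has sensitivity 1; the published selection procedure and its DP analysis are more involved).

    import numpy as np

    def dp_public_weights(public_X, private_X, epsilon):
        """Weight each public point by a noisy count of nearby private points.

        Assigning each private record to its nearest public point bounds the
        sensitivity at 1, so a single Laplace draw per weight suffices."""
        public_X = np.asarray(public_X, dtype=float)
        private_X = np.asarray(private_X, dtype=float)
        d = np.linalg.norm(private_X[:, None, :] - public_X[None, :, :], axis=2)
        counts = np.bincount(d.argmin(axis=1), minlength=len(public_X)).astype(float)
        counts += np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
        weights = np.clip(counts, 0, None)
        return weights / weights.sum()

    # A weighted M-estimator (eg, weighted logistic regression) fitted on
    # public_X with these weights then stands in for a fit on the private data.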

Building Cross-Center Data Registries With Consented and Private Data [Related Publications: Lu et al, 2015; Li et al, TKDE 2018]

In this task, we investigated cross-center data registry construction in a privacy-preserving and patient-centered manner. Many health care providers have private data sets, as well as some consented/public data of the same type (eg, claims or EHR data with different levels of access). Because of the differences in cohort distributions in different locations, combining locally generated synthetic data from each provider may not provide the best results. Our general approach was to develop methods for distributed learning that take advantage of the distributed data sources.

We proposed WebDISCO (see Lu et al in the Related Publications section), a proof-of-concept web service, to provide federated Cox model learning without transmitting patient-level data over the network. WebDISCO has an interactive user interface that nonstatisticians can use without difficulty. The analysis employs data sets from multiple institutions to produce reliable results that are expected to be more generalizable than those produced by a single institution. Figure 6 shows the overall architecture of our method. Data sets from different institutions are locally aggregated into intermediate statistics, which are then combined on the server to estimate global model parameters for each iteration. The server sends the recalculated parameters to the clients at each iteration. All information exchanges are protected by HTTPS-encrypted communication. The learning process is terminated when the model parameters converge or after a predefined number of iterations are completed.
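
The server-side loop can be sketched as follows (a schematic of the pattern only: WebDISCO exchanges the Cox partial likelihood's first- and second-order aggregates, which we abstract here as per-site gradient and Hessian contributions of a generic convex loss that sum across sites).

    import numpy as np

    def federated_newton(sites, beta0, n_iter=20, tol=1e-6):
        """Server loop: combine per-site gradients and Hessians into a global
        Newton-Raphson update without moving patient-level data.

        sites: objects exposing local_stats(beta) -> (gradient, hessian)
               computed on local data; these aggregates are all that is sent."""
        beta = np.asarray(beta0, dtype=float)
        for _ in range(n_iter):
            grads, hessians = zip(*(s.local_stats(beta) for s in sites))
            g = np.sum(grads, axis=0)       # intermediate statistics add up
            H = np.sum(hessians, axis=0)
            step = np.linalg.solve(H, g)
            beta = beta - step              # server broadcasts the new beta
            if np.linalg.norm(step) < tol:  # converged: stop iterating
                break
        return beta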

Figure 6. Building Cross-Center Data Registries.

Although this framework does not share patient-level information among sites, the statistics sent from local sites to the global server may still be used to infer individual-level information at each local site. We therefore further developed a distributed online-learning framework with DP whereby the statistics shared among sites are ensured to satisfy DP (see Li et al, TKDE 2018: Related Publications). Similar to the previous method, each node (ie, data source) has the capacity to learn a model from its local data set. However, instead of submitting the intermediate parameters to a global server, each node exchanges intermediate parameters with a random subset of its neighboring (ie, logically connected) nodes; hence, the topology of the communications in our distributed computing framework is not fixed in practice. To tackle high-dimensional incoming data entries, we studied a sparse version of the distributed online-learning algorithm (DOLA) with novel DP techniques to save computing resources and improve utility. Furthermore, we presented 2 modified private DOLAs that meet the needs of practical applications: one converts the DOLA to distributed stochastic optimization in an offline setting, while the other uses a mini-batch approach to reduce the amount of perturbation noise and improve utility.
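
One communication round might look like the sketch below (our illustration of the general pattern rather than the paper's algorithm: each node perturbs its parameters before sending them to a random subset of neighbors, then averages what it holds and receives).

    import random
    import numpy as np

    def dp_gossip_round(params, neighbors, epsilon, sensitivity, fanout=2):
        """One round of a DP decentralized online learner (schematic).

        params:    dict node_id -> local parameter vector (numpy array)
        neighbors: dict node_id -> list of logically connected node_ids
        No exact local state leaves a node: only Laplace-perturbed copies."""
        inbox = {v: [w] for v, w in params.items()}
        for v, w in params.items():
            noisy = w + np.random.laplace(scale=sensitivity / epsilon, size=w.shape)
            k = min(fanout, len(neighbors[v]))   # random subset of neighbors
            for u in random.sample(neighbors[v], k):
                inbox[u].append(noisy)
        # Each node averages its own vector with the noisy vectors received;
        # a local (sub)gradient step on newly arriving data would follow.
        return {v: np.mean(msgs, axis=0) for v, msgs in inbox.items()}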

Specific Aim 3: Research Design: Building Data Registries With Fine-Grained Patient Privacy Control

Our specific aim 1 focused on building data registries using private (unconsented) data alone. Our specific aim 2 used consented (opted-in) or public data alongside the private data to build registries that maximize data utility while maintaining rigorous privacy guarantees. Specific aim 3 took a more general patient-centered approach: we proposed and developed methods that enable, and take into account, fine-grained control of individual patients' privacy preferences.

Building Data Registries With Fine-Grained Patient Privacy Control [Related Publications: Li et al, PAKDD 2017]

In addition to the typical opt-in or opt-out function, we envisioned providing fine-grained privacy choices for patients, such as the level of DP, as shown in Figure 7. This could take the form of additional privacy choices integrated into existing informed-consent systems (such as the informed CONsent for Clinical data Use for Research [iCONCUR] system under development at UCSD). Using fine-grained personalized privacy preferences, we designed a DP data-release method. All existing methods assume a single privacy parameter for a single data set. When standard DP mechanisms are applied with individualized privacy parameters, the most stringent privacy requirement chosen by any patient must be used as the privacy parameter for the entire data registry, given the parallel composability of DP. This inevitably introduces an unnecessarily large amount of noise into the resulting data registry.

Figure 7. Example of Personalized Privacy Preferences Where Each Patient Specifies a Privacy Parameter for DP (ie, Budget αi).

Our idea was to develop adaptive data-partitioning strategies based on both data distribution and the privacy requirements of patients. We grouped patients with similarly stringent privacy requirements and generated a partial registry for each subpopulation, resulting in higher data fidelity for each subgroup than for the group as a whole. We then combined the subgroup registries using a weighting mechanism. We designed an optimal partitioning strategy to achieve optimal data utility while still meeting the unique privacy requirements of each patient. We evaluated our methods using real data and simulated individual privacy preferences and confirmed the feasibility of generating DP data that can be used for training various machine learning models with high utility.
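
A minimal sketch of the partitioning step follows (our simplification: the subgroup boundaries are fixed in advance and the disjoint subgroup histograms are simply summed, whereas the published strategy chooses the partition adaptively and combines the subgroup registries with an optimized weighting mechanism).

    import numpy as np

    def personalized_dp_histogram(values, eps_choices, bins, boundaries):
        """Release a histogram under personalized privacy budgets.

        values:      1 numeric value per patient
        eps_choices: each patient's chosen privacy parameter epsilon
        boundaries:  epsilon cut points defining the subgroups, eg [0.1, 0.5]
        Patients with similar budgets are grouped; each subgroup is perturbed
        at the *strictest* epsilon within it, so no patient's requirement is
        violated, yet loose-budget groups are not dragged down to the global
        minimum epsilon."""
        values = np.asarray(values, dtype=float)
        eps_choices = np.asarray(eps_choices, dtype=float)
        edges = [0.0] + list(boundaries) + [np.inf]
        total = np.zeros(len(bins) - 1)
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (eps_choices >= lo) & (eps_choices < hi)
            if not mask.any():
                continue
            eps = eps_choices[mask].min()     # strictest requirement in group
            h, _ = np.histogram(values[mask], bins=bins)
            total += h + np.random.laplace(scale=1.0 / eps, size=h.shape)
        return total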

Data Sources and Data Sets

For each of the methods we developed that we described previously, we used a combination of evaluation methods that included analytical studies, simulations, and experimental evaluations. To achieve repeatability, we conducted our experimental evaluations on publicly available data sets, including those in the following list. Detailed results can be found in the corresponding publications listed in the Results section under each aim; these are also listed in Related Publications. We highlight selected results and summarize key findings in the next section.

  • Adult data set from the UC Irvine Machine Learning Repository, which was extracted from the 1994 US Census data (https://archive.ics.uci.edu/ml/datasets/adult). This is a benchmark data set used in statistical privacy research. It includes a class attribute indicating whether a person earns >$50 000 a year, as well as such features as age, work class, education, marital status, occupation, relationship, race, sex, and native country.
  • The Diabetes 130-US hospitals data set from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008), which contains 10 years (1999-2008) of clinical care data on diabetes patients from 130 US hospitals and integrated delivery networks. It includes >50 patient-representative features, such as patient number, race, gender, age, admission type, hemoglobin A1c test result, and diagnosis. Using this data set, we wanted to predict whether a patient would experience emergency department visits during the year before a hospitalization. If successful, medical help could be offered before an emergency department visit becomes necessary.
  • Cancer data set that included 32 557 structures of human tumor cell lines (http://cactus.nci.nih.gov/download/nci).
  • The Multiparameter Intelligent Monitoring in Intensive Care III (MIMIC III) research database.85

We also tested selected methods on the real health data extracted from the Emory and UCSD clinical data warehouses. UCSD has created a repository of clinical data, known as the Clinical Data Warehouse for Research (CDWR), to serve the needs of clinical and translational researchers in the university's Clinical and Translational Research Institute. CDWR contains data primarily from the Epic EHR system plus a few other sources at UCSD, covering >1200 providers across 39 clinics; >550 000 outpatient visits per year; and >180 000 hospital admissions, 17 million orders, and information on approximately 2 million patients. The Emory Analytic Information Warehouse was developed for similar purposes. It builds upon an existing Oracle-based clinical-data warehouse that contains approximately 80% of clinical data (structured and textual) from 3 hospitals and clinic visits to 800 physicians in >16 locations, comprising data on 47 500 inpatient admissions and >1.3 million clinic visits per year.

Most of our published work used the publicly available benchmark data sets described previously, which are preprocessed and have no missing data. For the MIMIC III data, we used the data in aggregated form and treated unobservable records as 0. The Emory and UCSD data did have missing data; we used simple imputation, replacing missing values with the mean (numeric) or median (categorical) under the assumption that data were missing at random. We used this simple method instead of more complicated ones because our goal was not to conduct PCOR studies using the data but to evaluate how well the DP algorithms maintain the accuracy of the resulting models or synthetic data relative to the original data. Hence, we do not expect the imputation method to bias the evaluation results.

Analytical and Evaluative Approach

For each of the methods we developed, we used a combination of evaluation methods that included analytical studies, simulations, and experimental evaluations. When applicable, we performed analytical studies to derive the quality bounds and complexities of the algorithms. The main research questions addressed in our evaluation studies were (1) What is the impact of the level of privacy constraint on the utility of the data for PCOR analyses? (2) Which algorithms work best given a particular class of data? (3) How do the proposed approaches compare to existing state-of-the-art methods and other potential solutions?

We used the following metrics to evaluate the effectiveness, efficiency, and feasibility of the methods.

  • Utility: We measured the integrity or utility of the data registries using error or distance metrics between the released data and the original data, error metrics to answer random predicate queries using the data, and accuracy metrics to evaluate the predictive models learned from the data. The accuracy metrics included various measures of discrimination (eg, area under the receiver operating characteristic curve) and calibration.
  • Privacy: We measured the level of privacy using quantitative measures, such as the accumulated cost of DP.
  • Efficiency and scalability: We carried out performance studies to measure runtime and scalability in terms of data size, such as the number of patients, number of dimensions, number of time points in the data, and number of participating data providers.

For each of the methods we proposed, we conducted comparative analyses of different techniques (including the baseline method and existing state-of-the-art methods, when available) with varied privacy budget values and compared the suitability of the techniques depending on the properties of the data, such as sparsity, skewness, and length of the time periods for longitudinal data.

Results

For each of the methods we developed, we used a combination of evaluation methods that included analytical studies, simulations, and experimental evaluations. Detailed results are reported in each corresponding publication. Here, we summarize our key results and findings.

Specific Aim 1: Building Data Registries Using Private Data

Building Data Registries for Dynamic Data [Related Publications: Li et al, CIKM 2015]

We evaluated the utility of our proposed algorithms in terms of accuracy in answering random range-count queries (such as for cohort discovery) using the adult data set. We compared the query accuracy of our proposed DSAT algorithm (see the Methods section, “Specific Aim 1: Research Design: Building Data Registries Using Private Data”) with (1) the baseline Laplace mechanism, which generates a DP histogram at each time point; and (2) the fixed-sampling method, which generates a DP histogram at a fixed sampling interval. Our proposed sampling framework can use any state-of-the-art static histogram method at each sampling point. The results shown in Figure 8 are from an example that used the standard Laplace method (left) or the private spatial decomposition method34 (right), which is a state-of-the-art static histogram method that uses spatial partitioning. We also included nonprivate methods (which do not add any perturbation) to compare the update errors of the DSAT and fixed-sampling algorithms. As shown in Figure 8, our adaptive framework DSAT algorithm significantly outperformed the baseline and the fixed-sampling approaches in terms of utility. Also, as expected, higher privacy budgets resulted in better utility. More detailed results evaluating the impacts of different algorithmic and workload parameters can be found in Li et al, CIKM 2015 (Related Publications). Our key finding was that the proposed methods significantly outperformed the baseline approaches and existing state-of-the-art techniques. In future work, we would like to study update models and incorporate them into our sampling framework to further enhance utility.

Figure 8. DP Histogram for Dynamic Data: Range-Count Query Error vs Privacy Level.

Building Data Registries for Correlated Sequential Data [Related Publications: Xu et al, ICDE 2015; Xu et al, TKDE 2016]

Our proposed algorithm, DP PFS2 (see the Methods section, “Specific Aim 1: Research Design: Building Data Registries Using Private Data”), is the first to support general frequent subsequence mining (FSM) from correlated sequential data under DP. We therefore compared the PFS2 algorithm with 2 DP sequence database publishing algorithms. The first algorithm, referred to as n-gram,35 uses variable-length n-grams, and the second algorithm, referred to as Prefix,36 uses a prefix tree structure. To privately find frequent sequences using the n-gram and Prefix algorithms, we first ran them on the original data sets to generate anonymized data sets and then ran the nonprivate FSM algorithm GSP37 over the anonymized data sets. We employed 2 widely used metrics to evaluate the performance of the algorithms. The first was F score, which was used to measure the utility of discovered frequent sequences. The second was relative error (RE), which was used to measure error with respect to the true support count of the frequent sequences.
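
For reference, the 2 metrics can be computed as follows (standard definitions; taking the median of the per-pattern relative errors is one common convention in this literature).

    import statistics

    def f_score(true_patterns, found_patterns):
        """Harmonic mean of precision and recall of the discovered patterns."""
        tp = len(set(true_patterns) & set(found_patterns))
        if tp == 0:
            return 0.0
        precision = tp / len(found_patterns)
        recall = tp / len(true_patterns)
        return 2 * precision * recall / (precision + recall)

    def relative_error(true_supports, noisy_supports):
        """Median relative error of released support counts, over the true
        frequent patterns that the private algorithm also discovered."""
        errs = [abs(noisy_supports[p] - s) / s
                for p, s in true_supports.items() if p in noisy_supports]
        return statistics.median(errs) if errs else float("nan")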

Figure 9 shows the performances of PFS2, n-gram, and Prefix under a varying privacy budget on the MSNBC data set (relative threshold u = 0.015). The MSNBC data set is available at the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/msnbc.com+anonymous+web+data) and contains page visits of website users who visited https://www.msnbc.com/ on September 28, 1999. PFS2 consistently achieved better performance than did the other 2 algorithms at the same level of privacy. We also observed the privacy and utility trade-off exhibited by the algorithms and found that the quality of the results improved as the privacy budget increased. We also observed that the quality of the results was more stable for the MSNBC data set than for the other data sets. This is because of the high supports (ie, density) of the sequences in the MSNBC data set, which are more resistant to noise. Detailed results showing the impacts of the various parameters and the characteristics of the different data sets are reported in Xu et al, ICDE 2015 (Related Publications) and Xu et al, TKDE 2016 (Related Publications).

Figure 9. Correlated Sequence Data: FSM Utility vs Privacy.

Building Data Registries for Correlated Sequential Data: Case Studies Using Emory and UCSD Data

We conducted case studies using data extracted from the Emory and UCSD clinical data warehouses. The data were completely deidentified through an honest broker process. We focused on the diagnostic codes and represented the data as a sequential data set, as shown in Figure 10, which captures all the diagnostic codes for the same patient in sequence over a period of time. Figure 11 shows the size and statistics of the data sets we constructed from Emory and UCSD (referred to as S-data sets). To tackle the performance issue that might be caused by the large number of International Classification of Diseases, Ninth Revision (ICD-9) codes (around 15 000 unique codes) and to reduce the dimensionality of the data, we also adopted the Clinical Classifications Software (CCS) categories and constructed corresponding data sets (referred to as C-data sets). Our goal was to learn frequent sequential patterns of the diagnostic codes (which suggest the temporal progression of diseases) while ensuring DP.
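
As a minimal sketch of this data preparation step, the fragment below builds per-patient code sequences (the S-data sets) from encounter-level records and, optionally, generalizes them to CCS categories (the C-data sets). The ICD9_TO_CCS mapping shown is a hypothetical stand-in for the AHRQ CCS lookup tables.

    from collections import defaultdict

    # Hypothetical ICD-9 -> single-level CCS lookup; the real mapping comes
    # from the AHRQ Clinical Classifications Software files.
    ICD9_TO_CCS = {"410.01": "100", "414.01": "101", "250.00": "49"}

    def build_sequences(encounters, use_ccs=False):
        # encounters: iterable of (patient_id, date, icd9_code) tuples.
        by_patient = defaultdict(list)
        for patient_id, date, code in encounters:
            by_patient[patient_id].append((date, code))
        sequences = []
        for visits in by_patient.values():
            visits.sort()  # chronological order (dates must be comparable)
            codes = [ICD9_TO_CCS.get(code, code) if use_ccs else code
                     for _, code in visits]
            sequences.append(codes)
        return sequences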

Figure 10. Sample Diagnostic Sequence Data Set.

Figure 11. Extracted Diagnostic Sequence Data Sets from Emory and UCSD.

Figure 12 shows the performance of the PFS2 algorithm under various privacy budgets on the Emory and UCSD diagnostic sequence data sets, where δ is the relative threshold used for discovering frequent sequences. As expected, the quality of the results improves as the privacy budget increases. Emory's F score is much higher than UCSD's because of its larger cohort size. Interestingly, the S-data sets outperform the C-data sets. Because the algorithm limits the length of the input sequences, the huge ICD-9 code domain does not increase the sensitivity for the S-data sets. Generalizing the ICD-9 codes into the much smaller set of single-level CCS categories yields many more frequent sequences, which leaves more room for errors induced by the DP noise and degrades overall performance. We are currently extending this work to use International Statistical Classification of Diseases, Tenth Revision (ICD-10) codes as well as to encode additional variables, such as medication orders and procedures.

Figure 12. FSM on Emory and UCSD Data: Utility vs Privacy.

Building Data Registries for Correlated Graph Data [Related Publications: Xu et al, ICDE 2016; Cheng et al, TKDE 2018]

In the Methods section for specific aim 1, we described methods for modeling data as graphs that can represent the correlations or co-occurrences between events (eg, diagnoses and medications can co-occur in 1 encounter). In these experiments, we evaluated the performance of our proposed DFG algorithm for mining frequent subgraphs from correlated graph data under DP. We compared it with 2 algorithms: a straightforward baseline using the Laplace mechanism, denoted naive, and the Markov chain Monte Carlo sampling-based algorithm,38 which we designated the differentially private frequent pattern mining (DFPM) algorithm. We implemented the algorithms in Java and, as with the correlated sequence data above, used 2 common metrics to evaluate performance: the F score, which measures the utility of the discovered frequent subgraphs, and RE, which measures error with respect to the true support count of the frequent subgraphs.

Figure 13 shows the performance of the 3 algorithms for mining the top 50 frequent subgraphs under various privacy budgets from the cancer data set (http://cactus.nci.nih.gov/download/nci). Our proposed DFG algorithm consistently achieved better performance at the same level of privacy than did the naive algorithm and the state-of-the-art DFPM algorithm. All the algorithms exhibited the same trend: utility improved as the privacy budget (ε) increased, because a larger ε means less added noise and a weaker privacy guarantee. Detailed results can be found in Xu et al, ICDE 2016 (Related Publications).

Figure 13. Correlated Graph Data: Frequent Subgraph Mining Utility vs Privacy.

Quantifying the DP of Traditional DP Mechanisms for Correlated Data [Related Publications: Cao et al, ICDE 2017; Cao et al, PVLDB 2018; Cao et al, TKDE 2019]

We analyzed the privacy leakage of a DP mechanism under temporal correlation by modeling the correlation as a Markov chain. Our analysis revealed that the event-level privacy loss of a DP mechanism can be higher than the intended privacy guarantee. We call this unexpected privacy loss temporal privacy leakage (TPL). We designed efficient algorithms for precisely quantifying such TPL. Finally, we proposed data release mechanisms that can convert any existing DP mechanism into one that protects against TPL. We published these results in Cao et al, ICDE 2017 and TKDE 2019, and the prototype system for visualizing and controlling TPL can be found in Cao et al, PVLDB 2018 (all in the Related Publications section).

Figure 14 illustrates the potential TPL of a traditional DP mechanism under various levels of temporal correlation in the data. According to our analysis, TPL can be divided into 2 parts: backward privacy leakage (BPL) and forward privacy leakage (FPL), which are caused by backward and forward temporal correlations, respectively. When releasing DP data at time t, all the BPLs from the previous time points remain the same, whereas all the FPLs from the previous time points are updated because of forward temporal correlations. Overall, as we expected, the stronger the temporal correlation, the higher the TPL.
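
The toy recurrence below conveys the effect without reproducing our exact quantification: it assumes a single stylized correlation parameter in [0, 1], whereas the published analysis derives BPL and FPL directly from the Markov chain model of the correlations.

    def temporal_privacy_loss(eps_per_release, correlation):
        # correlation = 0 recovers the independent case (leakage stays at eps);
        # correlation = 1 lets all prior leakage carry into each new release.
        bpl, history = 0.0, []
        for eps in eps_per_release:
            bpl = eps + correlation * bpl
            history.append(bpl)
        return history

    # Ten releases at eps = 0.1 each: leakage stays at 0.1 with no correlation
    # but approaches the 1.0 sequential-composition bound under full correlation.
    print(temporal_privacy_loss([0.1] * 10, 0.0)[-1])   # 0.1
    print(temporal_privacy_loss([0.1] * 10, 1.0)[-1])   # ~1.0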

Figure 14. Correlated Sequence Data: BPL and FPL, Caused by Backward and Forward Temporal Correlations, Respectively, and Overall TPL of Traditional DP Mechanisms.

Specific Aim 2: Building Data Registries Using Both Private Data and Consented Data

Building Local Data Registries With Consented and Private Data [Related Publications: Wang et al, TKDE 2018]

We tested our methods using both simulations and real data sets, which confirmed that the bias-variance trade-off can be characterized as a function of the sample size of the released data set. Below, we highlight one of the real-data studies; the complete and detailed results can be found in Wang et al, TKDE 2018 (Related Publications). In our experimental study, we used 2 clinical data sets for the diagnosis of acute myocardial infarction from different institutes: one from the Edinburgh Institute with 1253 patients and the other from the Sheffield Institute with 500 patients. Both the size and the distribution of a public data set affect the performance of the hybrid model evaluated here. With a fixed privacy budget, there is a cost to using an increasing number of samples from the public data set because the projection from private samples to these reference points consumes the privacy budget. To fairly evaluate the impact of cohort size, we developed methods to select an optimally sized public data set and evaluated them on a logistic regression use case.

The response variable in the logistic regression was a 0-1 disease variable, and the covariates measured at both institutes were pain in left arm, pain in right arm, nausea, hypoperfusion, ST elevation, new Q-waves, ST depression, T-wave inversion, and sweating. Because the data were collected from different sites, there is an intrinsic bias when using 1 data set to approximate the other. In our study, we simulated a situation in which 1 data set was public and the other was private. We found 153 distinct data points in the Sheffield Institute data set and 181 in the Edinburgh Institute data set. We compared the performance of 3 M-estimators (fit on the public data set, on the hybrid data set without adding noise to the weights, and on the hybrid data set under DP) using logistic regression, plotting the number of samples selected from the public data set on the x-axis and the corresponding prediction error on the y-axis.
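
A minimal sketch of the 3 estimators being compared, assuming scikit-learn and a placeholder noise scale (a real DP guarantee requires calibrating the noise to the sensitivity of the M-estimator, as developed in Wang et al, TKDE 2018):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def fit_weights(X, y):
        model = LogisticRegression(max_iter=1000).fit(X, y)
        return model.coef_.ravel(), model.intercept_[0]

    def prediction_error(w, b, X, y):
        return np.mean(((X @ w + b) > 0).astype(int) != y)

    def compare_estimators(X_pub, y_pub, X_priv, y_priv, X_test, y_test,
                           n_pub, eps, noise_scale):
        # Estimator 1: public subset only.
        idx = rng.choice(len(X_pub), size=n_pub, replace=False)
        w1, b1 = fit_weights(X_pub[idx], y_pub[idx])
        # Estimator 2: hybrid (public subset + private data), no noise.
        X_h = np.vstack([X_pub[idx], X_priv])
        y_h = np.concatenate([y_pub[idx], y_priv])
        w2, b2 = fit_weights(X_h, y_h)
        # Estimator 3: hybrid weights with Laplace noise as a stand-in for the
        # DP release step (intercept left unperturbed for simplicity).
        w3 = w2 + rng.laplace(scale=noise_scale / eps, size=w2.shape)
        return (prediction_error(w1, b1, X_test, y_test),
                prediction_error(w2, b2, X_test, y_test),
                prediction_error(w3, b2, X_test, y_test))

Sweeping n_pub and plotting the 3 errors yields curves of the kind shown in Figure 15.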

Figure 15 shows a clear trade-off in the performance of the DP M-estimator using hybrid data sets (red curve). When the privacy budget was sufficient, the use of the hybrid data sets outperformed the use of the public data set alone. The inflection point in the red curve suggests the existence of an optimal sample size for release.

Figure 15. Consented and Private Data: Prediction Error vs Subset Size of Consented Data.

Building Cross-Center Data Registries With Consented and Private Data [Related Publications: Li et al, TKDE 2018]

We evaluated our distributed online learning algorithm (DOLA) for building a cross-center data registry. We used the “regret” evaluation metric, which measures the cumulative loss of an online learner relative to the best fixed model in hindsight; a lower regret value corresponds to better model utility.

Figure 16 shows the average regret (normalized by the number of iterations) incurred by our DOLA for different privacy levels on the Diabetes 130-US hospitals data set. As expected, the regret obtained by the nonprivate algorithm was the lowest. More significantly, the regret of the private algorithms approached that of the nonprivate algorithm as the required privacy level decreased. Detailed results can be found in Li et al, TKDE 2018 (Related Publications).
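
The sketch below shows a generic differentially private online learner of the kind being evaluated: online gradient descent on a logistic-loss stream with clipped, Laplace-perturbed gradients. It is an illustration only, not the DOLA algorithm, and the noise calibration shown is schematic.

    import numpy as np

    def dp_online_learner(stream, dim, eps_per_round, lr=0.1, clip=1.0):
        # stream yields (x, y) pairs with y in {-1, +1}.
        rng = np.random.default_rng(0)
        w, total_loss, n = np.zeros(dim), 0.0, 0
        for x, y in stream:
            n += 1
            margin = y * (w @ x)
            total_loss += np.log1p(np.exp(-margin))        # logistic loss
            grad = -y * x / (1.0 + np.exp(margin))         # its gradient in w
            norm = np.linalg.norm(grad)
            if norm > clip:                                 # bound sensitivity
                grad *= clip / norm
            grad = grad + rng.laplace(scale=clip / eps_per_round, size=dim)
            w -= (lr / np.sqrt(n)) * grad                   # decaying step size
        # Average regret additionally subtracts the loss of the best fixed
        # model in hindsight; average loss is shown here for brevity.
        return total_loss / max(n, 1)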

Figure 16. Cross-Center Data: Utility (Regret) vs Privacy.

Specific Aim 3: Building Data Registries With Fine-Grained Patient Privacy Control

Building Data Registries With Fine-Grained Patient Privacy Control [Related Publications: Li et al, PAKDD 2017]

We used 2 data sets (from the United States and Brazil) from the Integrated Public Use Microdata Series, with 370 000 and 190 000 census records, respectively. Each data set contained 13 attributes: age, gender, marital status, education, disability, nativity, working hours per week, number of years residing in the current location, ownership of dwelling, family size, number of children, number of automobiles, and annual income. For personalized DP, we randomly generated the privacy budgets for all records from a uniform distribution and a normal distribution. We set the range of privacy budget values from 0.01 to 1, with 0.01 corresponding to patients with high privacy concern, and sampled independent and identically distributed privacy budgets from uniform(0.01, 0.1) and normal(0.1, 1) distributions. We evaluated the utility of our mechanisms using random range-count queries, support vector machines, and logistic regression and compared them with the sampling mechanism39 and the baseline minimum mechanism, which uses the minimum privacy level among all individuals. We show a representative result in Figure 17. The full results are reported in Li et al, PAKDD 2017 (Related Publications).
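
For concreteness, the sampling mechanism39 we compared against can be sketched as follows: each record whose personal budget eps_i falls below a global threshold t is retained with probability (e^eps_i − 1)/(e^t − 1), after which any standard t-DP mechanism is run on the sampled data.

    import numpy as np

    def personalized_dp_sample(records, budgets, threshold):
        # Jorgensen et al: subsampling amplifies privacy, letting records with
        # small personal budgets participate in a threshold-DP computation.
        rng = np.random.default_rng(0)
        keep = []
        for record, eps_i in zip(records, budgets):
            p = 1.0 if eps_i >= threshold else np.expm1(eps_i) / np.expm1(threshold)
            if rng.random() < p:
                keep.append(record)
        return keep  # now run any standard threshold-DP mechanism on `keep`

    def minimum_mechanism_budget(budgets):
        # Baseline: run one mechanism at the most conservative budget of all.
        return min(budgets)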

Figure 17. Personalized Privacy: Query Utility vs Privacy.

Figure 17 shows the relative frequency error of our proposed partitioning mechanisms (utility based and privacy aware) in comparison with the baseline and the sampling mechanism under normal and uniform distributions of privacy preferences. We varied the privacy budget threshold, an algorithmic parameter used by the sampling mechanism; the errors of the partitioning mechanisms are independent of this parameter and hence remain constant. The accuracy of the sampling mechanism reached its optimum when the budget threshold equaled the mean of all privacy budget values, which is consistent with the experimental conclusion in the original work.39 Its accuracy deteriorated sharply when the threshold value was smaller than the mean privacy budget because, when the number of records is sufficiently large, the privacy budget dominates performance. Our partitioning mechanisms remained stable and performed almost as well as the sampling mechanism did at its optimum. The utility-based partitioning performed slightly better than the privacy-aware mechanism in our experiments because it considers both the privacy and the utility of the target DP computation. The baseline minimum performed similarly to the sampling mechanism at the smallest privacy budget threshold because, at that value, the sampling mechanism reduces to the minimum mechanism.

To summarize, we developed 2 partitioning-based mechanisms that aimed to fully use the privacy budgets of different individuals and maximize the utility of the target DP computations. Privacy-aware partitioning minimizes privacy budget waste, and utility-based partitioning maximizes the utility function of the target mechanism. Both partitioning mechanisms outperformed the existing state-of-the-art sampling mechanism and the baseline mechanism when using the data registry for range-count queries, logistic regression, and support vector machines. In future work, it will be useful to evaluate the utility of the partitioning mechanisms for different aggregations or analytical tasks.
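
The sketch below conveys the partitioning idea for a simple count query: records are grouped by privacy budget, each partition is answered at its own minimum budget, and the noisy partial counts are summed. Choosing the partition boundaries, whether to minimize budget waste (privacy aware) or to maximize the utility of the target computation (utility based), is exactly what the 2 proposed mechanisms do; here the boundaries are simply taken as given.

    import numpy as np

    def partitioned_count(budgets, boundaries):
        # boundaries: e.g. [(0.01, 0.1), (0.1, 1.01)] covering the budget range.
        rng = np.random.default_rng(0)
        total = 0.0
        for lo, hi in boundaries:
            part = [e for e in budgets if lo <= e < hi]
            if part:
                # A count has sensitivity 1; each partition uses its own
                # weakest budget instead of the global minimum.
                total += len(part) + rng.laplace(scale=1.0 / min(part))
        return total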

Dissemination and Implementation of the Research

We produced 14 publications (all in premier data management and biomedical informatics conferences and journals) and 1 manuscript currently in submission during this project (see the complete list in the Related Publications section). Conference publications in the data management and data privacy fields undergo a rigorous review and selection process (a 15% acceptance rate is typical), and the publications are included in formally archived proceedings, with typical page lengths between 10 and 15 pages.

In addition, we have made the pSHARE software tool kit available on GitHub (https://github.com/pshare-emory). We expect that institutions can use different components of pSHARE in combination with their existing tool sets to build policy-based and deidentification-based data registries, with an optional data use agreement management system that can be used to disclose data at different privacy levels (original, deidentified, and DP) under different conditions. Going forward, we plan to develop additional software documentation and offer direct training via conferences, online tutorials, and webinars to train honest brokers and data custodians to use the software.

Discussion

The developed methodologies present an inherent trade-off between data privacy and utility: The quality of results improves as the privacy budget increases. Customized approaches with rigorous privacy guarantees can be used to develop data registries for specific PCOR studies with empirical utility guarantees (eg, preserving frequent subsequences, ensuring accurate M-estimators) but may not be versatile enough to support all types of PCOR studies. Indeed, there are still substantial engineering, statistical, legal, and other challenges to implementation of these methods, especially for synthetic data that fully preserve the statistical distributions or properties of the original data (not just 1 aspect, eg, frequent subsequences, as we described in the Methods section, “Specific Aim 1: Research Design: Building Data Registries Using Private Data”). One potential direction is to explore generative adversarial network–based approaches that can simultaneously preserve the structural and temporal correlations in the data while providing a rigorous privacy guarantee.

One possible barrier to adopting the methodology in practice is how the privacy budget or preferences will be ascertained in the real world. Our work so far is based on predefined privacy budgets or simulated privacy preferences; thus, the generalizability and applicability of the measure in the real world are still unknown. The privacy budget, which is the key parameter in DP, has no intrinsic intuitive meaning and is difficult to explain to patients, honest brokers, and regulators. More importantly, how individuals value their privacy, and how much privacy they are willing to trade in exchange for insights gleaned from data sharing, remain open questions. Moreover, these values may differ for research with commercial (as opposed to noncommercial) applications and for different types of health information. During our stakeholder panels with patients, patient advocates, patient privacy advocates, clinicians, and others, we gained valuable insights into ways of communicating privacy budget settings and privacy-utility trade-offs for practical use. Larger-scale studies, such as patient surveys, are needed to gain a broader and deeper understanding of patient privacy preferences, including the impact of demographic factors, for large-scale adoption of the developed data-sharing methodology. We had begun designing a large-scale patient survey but left it for future research because of limited resources. Additional research and more extensive user studies are needed to better understand patient attitudes and preferences and to expedite the adoption of the developed methods. If such a study is pursued in the future, a high level of rigor will be needed in its choice architecture, involving question types such as ranking (to generate probabilities), trade-offs, and discrete choice analysis to measure the outcomes scientifically. In addition to questionnaire-based surveys, another promising line of work comes from the theoretical privacy and economics community, which models the value of privacy to a player who participates in a DP mechanism.40,41 More work needs to be done to test whether human decision-making about privacy in practice is consistent with existing theoretical models of privacy valuations, for example, through behavioral experiments.

Another potential barrier is that institutions or stakeholders may unjustifiably dismiss the DP-based approach because the resulting data registries contain noise compared with the original source data. We recognize that this is a trade-off we must accept to gain rigorous privacy guarantees without the need for obtaining explicit consent. However, the analytical and empirical studies we have conducted on our proposed methods have demonstrated that the methods guarantee the integrity and utility of the data for a variety of exploratory analyses and cohort discovery queries. In addition, it is not our intention to replace the original data with the statistical or synthetic data registries. Rather, we envision that our tool kit and the resulting data registries can complement the original or deidentified data registries by providing much quicker and more rigorous access to a preview of the large-scale private data for exploratory PCOR studies (eg, determining whether a cohort size is sufficient for a study or preassessing hypotheses for study planning) before access to a subset of the source data becomes necessary. In addition, the synthetic data can be used as benchmark data for evaluating and testing various algorithmic and machine learning solutions.

In this report, we present methods for building registries with fine-grained privacy preferences expressed as the desired level of DP, but we acknowledge the difficulty of achieving acceptance and broad adoption of such methods given the lack of an intuitive understanding of the DP parameter. The DP parameter, or privacy budget, measures the probabilistic difference in a statistical outcome given the presence or absence of a given patient record, but this definition has no intrinsic intuitive meaning and may be difficult to explain to patients, honest brokers, or regulators. Potential future directions for addressing this issue include providing patients with suggested values or choices they can use to decide their privacy preferences, or expressing the practical risks to patients of reidentification and disclosure as prior and posterior probabilities for given DP parameter choices.
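
As one illustration of the latter direction, an eps-DP guarantee bounds how far any released output can shift an adversary's belief that a particular patient's record is present; a minimal sketch:

    import math

    def posterior_bound(prior, eps):
        # For an eps-DP mechanism, the adversary's posterior probability that a
        # given record is in the data can rise from `prior` to at most this value.
        return math.exp(eps) * prior / (math.exp(eps) * prior + (1 - prior))

    # A budget of 0.1 moves a 50% prior to at most ~52.5%; a budget of 1 to ~73%.
    print(posterior_bound(0.5, 0.1), posterior_bound(0.5, 1.0))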

Our framework for building cross-center data registries demonstrated the feasibility of using a federated survival analysis algorithm to facilitate collaboration across different institutions. However, the proposed framework is still limited, as it does not address the policy and engineering concerns related to federated use of institutional data. We envision that additional distributed models will continue to be added to the arsenal of distributed statistical methods and made available to investigators worldwide.

Conclusions

The algorithms and software tool kit developed and described here, collectively named pSHARE, provide a new and complementary methodology to current deidentification and policy-based data registry practices. pSHARE has the potential to empower patients with rigorous privacy controls while they contribute derivatives of their data to PCOR.

The developed methodologies present inherent trade-offs between data privacy and data utility. Our customized pSHARE approach can be used to develop data registries for specific PCOR studies that simultaneously guarantee patient privacy and empirical data utility, but this approach may not be versatile enough to support all types of PCOR studies.

We anticipate that additional research will be needed to investigate the communication mechanisms used to identify patient privacy choices, to develop methods that enhance the versatility of data registries, and to accelerate the dissemination and adoption of the developed methods using engineering approaches such as improved user interfaces.

References

1. PCORI Methodology Committee. The PCORI Methodology Report. November 2013. Accessed February 2, 2021. https://www.pcori.org/assets/2013/11/PCORI-Board-Meeting-Methodology-Report-for-Acceptance-1118131.pdf
2. President's Information Technology Advisory Committee (PITAC). Revolutionizing Health Care Through Information Technology: Report to the President. National Coordination Office for Information Technology Research and Development. June 2004. Accessed February 2, 2021. https://www.nitrd.gov/pubs/pitac/pitac_report_health-it_2004.pdf
3. National Research Council (US) Committee on Engaging the Computer Science Research Community in Health Care Informatics; Stead WW, Lin HS, eds. Computational Technology for Effective Health Care: Immediate Steps and Strategic Directions. National Academies Press; 2009. Accessed February 2, 2021. https://www.nap.edu/catalog/12572/computational-technology-for-effective-health-care-immediate-steps-and-strategic [PubMed: 20662117]
4. Institute of Medicine (US) Committee on Health Research and the Privacy of Health Information; Nass SJ, Levit LA, Gostin LO, eds. Beyond the HIPAA Privacy Rule: Enhancing Privacy, Improving Health Through Research. National Academies Press; 2009. Accessed February 2, 2021. https://www.nap.edu/catalog/12458/beyond-the-hipaa-privacy-rule-enhancing-privacy-improving-health-through [PubMed: 20662116]
5. Dwork C. Differential privacy. In: Bugliesi M, Preneel B, Sassone V, Wegener I, eds. Automata, Languages and Programming. ICALP 2006. Lecture Notes in Computer Science. Volume 4052. Springer; 2006:1-12.
6. US Census Bureau. Disclosure avoidance and the 2020 Census. Revised August 10, 2020. Accessed February 2, 2021. https://www.census.gov/about/policies/privacy/statistical_safeguards/disclosure-avoidance-2020-census.html
7. Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. In: Halevi S, Rabin T, eds. Theory of Cryptography. TCC 2006. Lecture Notes in Computer Science. Volume 3876. Springer; 2006:265-284.
8. Vinterbo SA, Sarwate AD, Boxwala AA. Protecting count queries in study design. J Am Med Inform Assoc. 2012;19(5):750-757. [PMC free article: PMC3422502] [PubMed: 22511018]
9. McSherry F. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data; June 2009; Providence, RI.
10. Yan H, Zhang X. Design of an extended privacy homomorphism algorithm. In: 2011 2nd IEEE International Conference on Emergency Management and Management Sciences; August 8-10, 2011; Beijing, China.
11. Xiao X, Tao Y. Output perturbation with query relaxation. Presented at: VLDB '08; August 24-30, 2008; Auckland, New Zealand.
12. Blum A, Ligett K, Roth A. A learning theory approach to non-interactive database privacy. In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing; May 17-20, 2008; Victoria, British Columbia, Canada.
13. Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L. Privacy: theory meets practice on the map. Presented at: 2008 IEEE 24th International Conference on Data Engineering; April 7-12, 2008; Cancun, Mexico.
14. Korolova A, Kenthapadi K, Mishra N, Ntoulas A. Releasing search queries and clicks privately. Presented at: 18th International Conference on World Wide Web; April 2009; New York, NY.
15. Gondree M, Mohassel P. Longest common subsequence as private search. In: Proceedings of the 2009 ACM Workshop on Privacy in the Electronic Society; November 9, 2009; Chicago, IL.
16. McSherry F, Mironov I. Differentially private recommender systems: building privacy into the Netflix Prize contenders. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; June 2009; Paris, France.
17. Inan A, Kantarcioglu M, Ghinita G, Bertino E. Private record matching using differential privacy. In: 13th International Conference on Extending Database Technology; March 22-26, 2010; Lausanne, Switzerland.
18. Xiao X, Wang G, Gehrke J. Differential privacy via wavelet transforms. IEEE Trans Knowl Data Eng. 2011;23(8):1200-1214.
19. Hay M, Rastogi V, Miklau G, Suciu D. Boosting the accuracy of differentially-private histograms through consistency. In: Proceedings of the Very Large Data Bases (PVLDB) Endowment; September 13-17, 2010; Singapore.
20. Li C, Hay M, Rastogi V, McGregor A. Optimizing linear counting queries under differential privacy. In: PODS '10: Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems; June 6-11, 2010; Indianapolis, IN.
21. Sramka M, Safavi-Naini R, Denzinger J, Askari M. A practice-oriented framework for measuring privacy and utility in data sanitization systems. In: Proceedings of the 2010 EDBT/ICDT Workshops; March 22-26, 2010; Lausanne, Switzerland.
22. McSherry F, Mahajan R. Differentially-private network trace analysis. In: SIGCOMM '10: Proceedings of the ACM SIGCOMM 2010 Conference; August 30-September 3, 2010; New Delhi, India.
23. Ding B, Winslett M, Han J, et al. Differentially private data cubes: optimizing noise sources and consistency. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data; June 12-16, 2011; Athens, Greece.
24. Ray S, Nizam MF, Das S, Fung BC. Verification of data pattern for interactive privacy preservation model. In: Proceedings of the 2011 ACM Symposium on Applied Computing (SAC); March 21-24, 2011; TaiChung, Taiwan.
25. Chen R, Mohammed N, Fung BC, Desai BC, Xiong L. Publishing set-valued data via differential privacy. Presented at: 37th International Conference on Very Large Data Bases; August 29-September 3, 2011; Seattle, WA.
26. Clinical & Translational Research Institute (CTRI). i2b2 query tool. Accessed February 2, 2021. http://sites.bu.edu/bu-i2b2/intro-to-i2b2/i2b2-query-tool/
27. Xu J, Zhang Z, Xiao X, et al. Differentially private histogram publication. Paper presented at: 28th International Conference on Data Engineering; April 1-5, 2012; Washington, DC.
28. Cormode G, Procopiuc M, Shen E, et al. Differentially private spatial decompositions. Paper presented at: 28th International Conference on Data Engineering; April 1-5, 2012; Washington, DC.
29. Cormode G, Procopiuc C, Srivastava D, et al. Differentially private summaries for sparse data. In: Proceedings of the 15th International Conference on Database Theory; March 26-30, 2012; Berlin, Germany.
30. Li H, Xiong L, Jiang X. Differentially private synthesization of multi-dimensional data using Copula functions. Adv Database Technol. 2014;2014:475-486. doi:10.5441/002/edbt.2014.43 [PMC free article: PMC4232968] [PubMed: 25405241] [CrossRef]
31. Gardner J, Xiong L, Xiao Y, et al. SHARE: system design and case studies for statistical health information release. J Am Med Inform Assoc. 2013;20(1):109-116. [PMC free article: PMC3555328] [PubMed: 23059729]
32. Ji Z, Jiang X, Wang S, Xiong L, Ohno-Machado L. Differentially private distributed logistic regression using public and private data. Paper presented at: 3rd Annual Translational Bioinformatics Conference (TBC/ISCB-Asia 2013); October 2-4, 2013; Seoul, Republic of Korea.
33. Li H, Xiong L, Ohno-Machado L, et al. Privacy preserving RBF kernel support vector machine. Biomed Res Int. 2014;2014:827371. doi:10.1155/2014/827371 [PMC free article: PMC4071990] [PubMed: 25013805] [CrossRef]
34. Cormode G, Procopiuc CM, Srivastava D, Shen E, Yu T. Differentially private spatial decompositions. In: Proceedings of the 2012 IEEE 28th International Conference on Data Engineering; April 1-5, 2012; Washington, DC.
35. Chen R, Acs G, Castelluccia C. Differentially private sequential data publication via variable-length n-grams. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security; October 2012; Raleigh, NC.
36. Chen R, Fung BC, Desai BC. Differentially private transit data publication: a case study on the Montreal transportation system. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 12-16, 2012; Beijing, China.
37. Srikant R, Agrawal R. Mining sequential patterns: generalizations and performance improvements. In: Apers P, Bouzeghoub M, Gardarin G, eds. Advances in Database Technology – EDBT '96. Lecture Notes in Computer Science. Volume 1057. Springer; 1996.
38. Shen E, Yu T. Mining frequent graph patterns with differential privacy. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 11-14, 2013; Chicago, IL.
39. Jorgensen Z, Yu T, Cormode G. Conservative or liberal? Personalized differential privacy. Presented at: 2015 IEEE 31st International Conference on Data Engineering; April 13-17, 2015; Seoul, South Korea.
40. Ghosh A, Roth A. Selling privacy at auction. Games Econ Behav. 2015;91:334-346.
41. Nissim K, Orlandi C, Smorodinsky R. Privacy-aware mechanism design. In: Proceedings of the 13th ACM Conference on Electronic Commerce; June 4-8, 2012; Valencia, Spain.

Related Publications

    All publications below resulted from the research supported by this PCORI award.

  1. Bonomi L, Xiong L. On differentially private longest increasing subsequence computation in data stream. Trans Data Priv. 2016;9(1):73-100.
  2. Bonomi L, Xiong L. Private computation of the longest increasing subsequence in data streams. In: Proceedings of the Workshops of the EDBT/ICDT 2015 Joint Conference (EDBT/ICDT), Brussels, Belgium, March 27, 2015.
  3. Cao Y, Yoshikawa M, Xiao Y, Xiong L. Quantifying differential privacy under temporal correlations. 2017 IEEE 33rd International Conference on Data Engineering (ICDE), San Diego, CA. 2017:821-832. [PMC free article: PMC5584619] [PubMed: 28883711]
  4. Cao Y, Xiong L, Yoshikawa M, Xiao Y, Zhang S. ConTPL: controlling temporal privacy leakage in differentially private continuous data release. Proceedings of the Very Large Data Bases Endowment (PVLDB). 2018;11(12):2090-2093. [PMC free article: PMC6697134] [PubMed: 31423349]
  5. Cao Y, Yoshikawa M, Xiao Y, Xiong L. Quantifying differential privacy in continuous data release under temporal correlations. IEEE Trans Knowl Data Eng (TKDE). 2019;31(7):1281-1295. [PMC free article: PMC6704013] [PubMed: 31435181]
  6. Cheng X, Su S, Xu S, Xiong L, Xiao K, Zhao M. A two-phase algorithm for differentially private frequent subgraph mining. IEEE Trans Knowl Data Eng (TKDE). 2018;30(8):1411-1425. [PMC free article: PMC7678507] [PubMed: 33223776]
  7. Li C, Zhou P, Xiong L, Wang Q, Wang T. Differentially private distributed online learning. IEEE Trans Knowl Data Eng (TKDE). 2018;30(8):1440-1453. [PMC free article: PMC6764830] [PubMed: 31564813]
  8. Li H, Xiong L, Jiang X. Partitioning-based mechanisms under personalized differential privacy. PAKDD 2017: Advances in Knowledge Discovery and Data Mining. 2017;615-627. [PMC free article: PMC5602579] [PubMed: 28932827]
  9. Li H, Xiong L, Jiang X, Liu J. Differentially private histogram publication for dynamic datasets: an adaptive sampling approach. CIKM ‘15: Proceedings of the 24th ACM International Conference on Information and Knowledge Management. 2015;1001-1010. [PMC free article: PMC4788513] [PubMed: 26973795]
  10. Lu C-L, Wang S, Ji Z, Wu Y, Xiong L, Jiang X, Ohno-Machado L. WebDISCO: a Web service for DIStributed COx model learning without patient-level data sharing. J Am Med Inform Assoc. 2015;22(6):1212-1219. [PMC free article: PMC5009917] [PubMed: 26159465]
  11. Wang M, Ji Z, Kim H-E, Wang S, Xiong L, Jiang X. Selecting optimal subset to release under differentially private M-estimators from hybrid datasets. IEEE Trans Knowl Data Eng (TKDE). 2018;30(3):573-584. [PMC free article: PMC6051552] [PubMed: 30034201]
  12. Xu S, Su S, Cheng X, Li Z, Xiong L. Differentially private frequent sequence mining. IEEE Trans Knowl Data Eng (TKDE). 2016;28(11):2910-2926. [PMC free article: PMC10237146] [PubMed: 37274928]
  13. Xu S, Su S, Xiong L, Cheng X, Xiao K. Differentially private frequent subgraph mining. 2016 IEEE 32nd International Conference on Data Engineering (ICDE), Helsinki, Finland. 2016;229-240. [PMC free article: PMC5015894] [PubMed: 27616876]
  14. Xu S, Su S, Cheng X, Li Z, Xiong L. Differentially private frequent sequence mining via sampling-based candidate pruning. 2015 IEEE 31st International Conference on Data Engineering (ICDE), Seoul, Korea. 2015;1035-1046. [PMC free article: PMC4788512] [PubMed: 26973430]

Acknowledgments

We would like to acknowledge the following members of our patient engagement panels:

  • Katherine Kim, PhD, MPH, MBA (assistant professor, Betty Irene Moore School of Nursing, UC Davis)
  • Pam Dixon (executive director, World Privacy Forum)
  • Kristin West, JD, MS (associate vice president, research director, Office of Research Compliance, Emory University)
  • Michael W. Kalichman, PhD (professor and director of the UCSD Research Ethics Program)
  • Robert El-Kareh, MD, MS, MPH (Department of Biomedical Informatics, UCSD)
  • Cynthia Burstein Waldman (patient and patient advocate)
  • Gordon Fox (patient)
  • Carly Medosch (patient)
  • Barbara Saltzman (patient)

Research reported in this report was funded through a Patient-Centered Outcomes Research Institute® (PCORI®) Award (#ME-1310-07058). Further information available at: https://www.pcori.org/research-results/2014/new-methods-protect-privacy-when-using-patient-health-data-compare-treatments

Institution Receiving Award: Emory University
Original Project Title: Building Data Registries with Privacy and Confidentiality for PCOR
PCORI ID: ME-1310-07058

Suggested citation:

Xiong L, Post A, Jiang X, Ohno-Machado L. (2021). New Methods to Protect Privacy When Using Patient Health Data to Compare Treatments. Patient-Centered Outcomes Research Institute (PCORI). https://doi.org/10.25302/02.2021.ME.131007058

Disclaimer

The views, statements, and opinions presented in this report are solely the responsibility of the author(s) and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute® (PCORI®), its Board of Governors or Methodology Committee.

Copyright © 2021. Emory University. All Rights Reserved.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits noncommercial use and distribution provided the original author(s) and source are credited (see https://creativecommons.org/licenses/by-nc-nd/4.0/).

Bookshelf ID: NBK599332 | PMID: 38232192 | DOI: 10.25302/02.2021.ME.131007058
