Decentralizing the Analysis of Health Data

The transition from paper to digital health care records promises a significantly enhanced ability to leverage claims and clinical data for secondary uses – uses beyond those for which the health data were originally collected, such as research, public health surveillance, or fraud prevention. Done properly, these secondary uses of data that were originally collected for treatment or payment can aid the creation of a more effective, information-driven health care system. For example, researchers are using digital claims data to provide the public with comparisons of the quality and cost effectiveness of treatment for particular conditions among plans or health care facilities in a given market.

Patient privacy and data security are among the first considerations of agencies establishing such programs, and many agencies have instituted strong technical controls (such as de-identifying the data) and policy frameworks to protect the confidentiality and integrity of the data. Although a strong policy framework is essential, the technical architecture of information exchange is another important factor. This week, the Center for Democracy & Technology (CDT) released a report challenging the prevailing centralized model of health data analysis and urging the Department of Health and Human Services (HHS) to explore distributed systems for secondary use programs. The report coincides with the Centers for Medicare & Medicaid Services (CMS) issuing a final rule for its risk adjustment program – mandated by the Affordable Care Act of 2010 – that would use a distributed system as a default, changing course from the proposed rule, which would have required a centralized model.

In recent years, federal and state agencies have set up numerous programs to analyze health care data for secondary uses. While the primary data source for many of these secondary use programs is health claims collected from plans, a long-term goal is to analyze clinical data collected from providers’ electronic medical records. So far, the majority of these systems have been built on a centralized database model under which data sources (such as plans) submit health data (such as claims) to federal or state agencies, and the agencies collect the data into one system for analysis. Examples of these centralized systems include the All-Payer Claims Databases (APCDs) operating in approximately 14 states, and the Office of Personnel Management’s Health Claims Data Warehouse. However, the centralized database model itself raises risks and is increasingly being called into question.

A fundamental problem with the centralized architecture used by many secondary use programs is that centralization does not minimize the copying of data. Instead, centralization typically results in retaining and sharing of multiple copies of patient data – the source copy, the copy with the agency operating the secondary use program, and copies provided to researchers or other third parties for their analytic functions. This pattern repeats itself each time a new research or policy need requires the creation of another centralized database. Yet continually building and copying huge repositories of medical data is risky, inefficient, and a poor long-term strategy.

  • Maintaining copies of sensitive information in various locations for long periods of time worsens the risk and severity of data breaches – a growing and extremely costly problem.
  • Unnecessarily sharing copies of patients’ data for purposes other than treatment or payment erodes both trust in the confidentiality of medical records and support for health care reform.
  • It is burdensome and costly for plans to set up and secure multiple large data submissions to different entities in various locations, especially if those entities require different data formats. This situation is particularly inefficient when the entities are performing substantially similar analyses.
  • If secondary use programs collect fully identifiable patient data or clinical data from electronic medical records in the future, these privacy and security issues will become considerably more urgent.

In our report, CDT urges policymakers to consider distributed alternatives to centralized databases. Centralized databases typically compile data into one system and manage it from that location. Decentralized systems instead leave the data housed with its original sources and perform analyses by querying the data those entities hold, giving researchers the results of their analyses rather than raw copies of the data. Distributed networks minimize data transfer and leverage existing infrastructure, and can often cost less and take less time to establish than centralized databases. Leaving the data with the original data sources can help ease the proprietary and liability concerns many data holders have with transferring data to government agencies and third parties. Using distributed networks to cut down on the number of copies of sensitive data about individuals can also reduce the risk and severity of data breaches.

There are multiple approaches to decentralized analytical systems, and policymakers should consider which model would best achieve the program’s goals given resource constraints and the types of analytics required. One decentralized approach – a distributed query system – is for researchers to send data sources detailed research questions, permitting data sources to write analytic code to answer those questions. The data sources use the code they have written to analyze their in-house data and then return structured responses – rather than copies of the data – to the researchers. An operational example of this type of system is the Food and Drug Administration’s (FDA) Federal Partners project. A second query-based approach to decentralized analytics is for researchers to write the analytic code and send the code to data sources. Data sources analyze their in-house data using the code (but do not modify the code), review the output, and provide the responses to the research questions with computer logs that reveal any manipulation of the code. This process does require the data sources to use a common data format. An operational example of this approach is the FDA’s Mini-Sentinel Initiative.
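To make the second query-based approach concrete, here is a minimal sketch in Python. It is an illustration only and does not describe how Mini-Sentinel is actually built. The claim fields, the DataSource class, and the function names are all hypothetical. The researcher distributes one piece of analytic code, each data source runs it against claims that never leave its own system, and only the aggregate result plus a fingerprint of the code that was run comes back. The fingerprint here is self-reported by the data source, so a real system would need stronger logging or attestation than this.

```python
import hashlib

# Researcher-written analytic code, sent to every data source as text.
# It assumes a hypothetical common data format: a list of dicts with
# "condition" and "cost" keys.
QUERY_SOURCE = '''
def run(claims):
    matching = [c for c in claims if c["condition"] == "diabetes"]
    return {"n_claims": len(matching),
            "total_cost": sum(c["cost"] for c in matching)}
'''


def fingerprint(source):
    """SHA-256 digest of the code text, used as a simple tamper log."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


class DataSource:
    """A plan that keeps its claims in-house (hypothetical)."""

    def __init__(self, name, claims):
        self.name = name
        self._claims = claims  # raw claims are never returned

    def execute(self, source):
        namespace = {}
        exec(source, namespace)  # run the researcher's code unmodified
        result = namespace["run"](self._claims)
        # Return only the aggregate result and a log of what code ran.
        return {"source": self.name,
                "code_sha256": fingerprint(source),
                "result": result}


def distributed_query(source, data_sources):
    """Send the same code to each source and combine the aggregate results."""
    expected = fingerprint(source)
    responses = [ds.execute(source) for ds in data_sources]
    # Keep only responses whose log matches the code the researcher sent.
    verified = [r for r in responses if r["code_sha256"] == expected]
    totals = {"n_claims": 0, "total_cost": 0.0}
    for r in verified:
        totals["n_claims"] += r["result"]["n_claims"]
        totals["total_cost"] += r["result"]["total_cost"]
    return totals, verified


plan_a = DataSource("plan_a", [{"condition": "diabetes", "cost": 120.0},
                               {"condition": "asthma", "cost": 80.0}])
plan_b = DataSource("plan_b", [{"condition": "diabetes", "cost": 200.0},
                               {"condition": "diabetes", "cost": 50.0}])

totals, responses = distributed_query(QUERY_SOURCE, [plan_a, plan_b])
print(totals)  # {'n_claims': 3, 'total_cost': 370.0}
```

The researcher ends up with three matching claims and a combined cost, but never sees an individual claim record. This is also why the approach depends on a common data format: the same code has to run unchanged at every source.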

Nonetheless, query-based approaches may not be appropriate for secondary uses that can lead to competitive advantages or disadvantages for health plans. In HHS’ 2011 proposed rule for its risk adjustment program, the agency expressed its concerns regarding distributed query systems. Although CMS acknowledged the potential privacy benefits of distributed systems, the agency was concerned that permitting plans to analyze their own data could lead to fraud and inaccuracy. CMS also questioned small insurers’ ability to respond to multiple queries. These issues drove CMS to propose regulations that would lock plans participating in the risk adjustment program into a centralized database model. However, CMS changed course in the final rule, which it released this week. Citing the comments it had received, CMS’ final rule established that the agency would use a distributed system to analyze health data for risk adjustment, and provided states with flexibility to choose the data collection system that worked best for them (and that met HHS requirements). The final rule gives little detail regarding what a distributed system for risk adjustment might look like or how the system would overcome the agency’s concerns regarding fraud, inaccuracy, and handling queries.

In our report, CDT recommends policymakers explore a “distributed access” model for secondary use programs that require person-level data and carry an unacceptable risk of fraud or inaccuracy. The distributed access model would give agencies direct access to the (de-identified) data and permit the agencies – not the plans – to perform the analyses. Under this theoretical model, each health plan participating in a secondary use program would be required to set aside a structured, de-identified copy of its claims and encounter data in a secure environment (such as on an edge server or in a cloud storage center). The plans would each be required to offer an interface over the Internet to the state and federal agencies responsible for operating the secondary use programs, providing secure access to the data set aside in the data sources’ respective systems. The agencies themselves would use this access to perform the analyses necessary to meet the goals of their secondary use programs, but the data would not be duplicated and sent to the government, and agencies should be prohibited from using the data for anything other than the specified uses of the programs. The agencies would retain the results of their analyses, but would not keep full copies of the data. Under a distributed access model, participating plans should be required to de-identify the data held “at rest” in the secure environments; plans could use a one-way hash algorithm to mask patient identifiers if agencies need longitudinal records of individual patients.
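The one-way hash idea can be sketched in a few lines of Python. This is an illustration under stated assumptions and does not come from the report. It uses a keyed hash (HMAC-SHA256) in place of a plain hash, because patient identifiers are often short or guessable and an unkeyed hash of such values can be reversed by hashing every candidate. The claim fields, the function names, and the choice to keep only the year of service are all hypothetical, and nothing here is claimed to satisfy any particular legal de-identification standard.

```python
import hashlib
import hmac


def mask_patient_id(patient_id, secret_key):
    """Replace an identifier with a stable token that cannot be read back.

    The same patient_id and key always give the same token, so an agency
    can follow one patient over time without learning who the patient is.
    """
    normalized = patient_id.strip().upper()
    return hmac.new(secret_key, normalized.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def deidentify_claim(claim, secret_key):
    """Build the set-aside copy of a claim (hypothetical field names)."""
    return {
        "patient_token": mask_patient_id(claim["patient_id"], secret_key),
        "service_year": claim["service_date"][:4],  # coarsen the date
        "diagnosis": claim["diagnosis"],
        "cost": claim["cost"],
        # name and raw patient_id are deliberately not copied
    }


# The key stays with the plan; it must be kept secret and out of the
# set-aside environment. This value is a placeholder for illustration.
plan_key = b"example-key-held-only-by-the-plan"

claims = [
    {"patient_id": "mbr-001", "name": "Jane Doe",
     "service_date": "2011-03-14", "diagnosis": "E11.9", "cost": 120.0},
    {"patient_id": "MBR-001", "name": "Jane Doe",
     "service_date": "2012-01-09", "diagnosis": "E11.9", "cost": 95.0},
    {"patient_id": "MBR-002", "name": "John Roe",
     "service_date": "2011-07-22", "diagnosis": "J45.909", "cost": 80.0},
]

set_aside = [deidentify_claim(c, plan_key) for c in claims]

# The first two records share a token, so they link to one patient.
print(set_aside[0]["patient_token"] == set_aside[1]["patient_token"])  # True
```

A per-plan key like this one only supports linking records within a single plan. Following a patient across plans would need a shared key or a trusted linkage service, which is a design decision this sketch does not address.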

CDT urges HHS to develop a strategy, in collaboration with technology vendors, state agencies, and consumer groups, to comprehensively explore models of analysis that do not require health data to be copied and stored in multiple databases. CDT is encouraged by HHS’ existing efforts to explore decentralized systems focused on discrete objectives – such as the Mini-Sentinel and Federal Partners projects – as well as CMS’ decision to support decentralized systems for its risk adjustment program. Ultimately, no distributed system should be formally deployed until its functionality is validated at a population scale. However, now is the time to consider the network architectures for secondary use programs and health information exchange, when the technical infrastructure is in a relatively nascent stage.

CDT urges federal and state agencies to ensure regulations that establish secondary use programs do not lock plans into a centralized database model, but instead leave open the possibility of using decentralized solutions in the future, subject to agency approval. The long-term goal is to establish a scalable, secure architecture that effectively supports valuable secondary use programs while minimizing unnecessary duplication and transmission of patients’ sensitive data.

Harley Geiger is Policy Counsel at the Center for Democracy & Technology.
