I have read with interest the ongoing conversation about the ProPublica Surgeon Scorecard in THCB and beyond, not because I believe this latest effort at measuring quality will have a significant effect on patient care, but because behind the latest public metric debate – in fact behind all healthcare metric debates – is a major systemic problem. This problem somehow always seems to remain unseen. We acknowledge that measuring healthcare quality is difficult and that using medical data is challenging, but I’m not convinced that people completely understand why or how measurement and data are so difficult in healthcare…nor am I certain that everyone understands the repercussions of those challenges.
As I wrote here, the most promising recent development in medicine is the emphasis on learning from our data. We are finally digitizing records of clinician and patient interactions via the adoption of EMRs. Data warehousing technologies are connecting healthcare’s disparate systems and making data accessible to decision makers. Data will be the foundation for healthcare improvement. However, it is dangerous to assume that accessing raw data is equivalent to accessing relevant information.
All of today’s widely adopted EMR systems were designed to fulfill three purposes: financial reimbursement, narrative communication among clinicians, and legal protection. Now, we aim to use that data for very different purposes: the improvement of health and the discovery of process efficiencies. The impact of the resulting inconsistencies between design and use cannot be overstated, and yet no one else seems to be stating those impacts at all.
I was introduced to this dilemma early, when I was searching for a dissertation topic for my PhD in informatics. A urologist colleague suggested that I help him discover differences in the way various surgeons execute one of the most commonly performed surgeries: the radical retropubic prostatectomy (RRP), or surgical removal of the prostate.
The prostate is in a tight space with sensitive structures on nearly all sides. Some urologists take as wide a margin as possible to decrease the likelihood of leaving any cancer behind. Toward this end, they may remove one or both of the nerves that cause erections, leading to impotence, and in some cases, incontinence. Other surgeons excise a smaller margin, leaving more structures intact but risking leaving a small amount of cancer in the body. In all RRPs, the surgeon also runs the risk of nicking the bladder or related structures.
We know little about a surgeon’s decision making. The characteristics of the patient and his disease may play a role in which approach the surgeon takes. Or the surgeon’s teacher and their beliefs might be the dominant factor. We’re not sure. If we could count what surgeons did, why they did it, and whether it worked, we could make suggestions to all surgeons about what worked best in different situations. At the very least, we would know which surgeons tended to leave more cancer behind and which surgeons were more likely to leave a patient impotent. In theory, it sounds straightforward enough. Unfortunately, actually answering these basic questions was complicated enough to be considered the topic of a dissertation…for an informatics graduate student.
First, to find which patients had prostate cancer, we relied on ICD-9 codes. These codes were of questionable accuracy: a large number of patients carrying the code for prostate cancer had merely been checked for the disease. They were so inaccurate, in fact, that we decided to build software to classify who actually had the disease. Second, despite the use of “standard” measures in pathology reports, pathologists entered important measures – the staging of the tumor and the intermediate outcome of whether or not any cancer was left behind after the surgery – as free text. Third, the surgeons described their surgical approaches (nerve-sparing or not) in unstructured free text in the surgical operative report. Fourth, we found that neither impotence nor incontinence was consistently documented in the record, making it impossible to count those outcomes at all. Fifth, we might be able to infer accidental injury to the bladder by reading between the lines of the operative report, but even that strategy was not 100% reliable.
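To make the first of those problems concrete, here is a minimal sketch of why a raw ICD-9 query over-counts a disease cohort. The patient records, note text, and keyword rules below are entirely invented for illustration – the real classifier we built was far more sophisticated – but the shape of the problem is the same: the code identifies everyone *evaluated* for the disease, so a second pass over the free text is needed to find who actually *had* it.

```python
# Hypothetical example: why selecting patients by ICD-9 code over-counts.
# All records below are invented for illustration.
records = [
    {"id": 1, "icd9": "185", "note": "pathology: adenocarcinoma of the prostate, Gleason 7"},
    {"id": 2, "icd9": "185", "note": "biopsy negative; no evidence of malignancy"},
    {"id": 3, "icd9": "185", "note": "PSA elevated; cancer ruled out on biopsy"},
    {"id": 4, "icd9": "600.00", "note": "benign prostatic hyperplasia"},
]

# Step 1: the naive cohort -- everyone carrying the prostate cancer code (185).
coded = [r for r in records if r["icd9"] == "185"]

# Step 2: a deliberately crude free-text check to separate confirmed disease
# from workups that merely *looked* for it.
def looks_confirmed(note):
    negations = ("negative", "no evidence", "ruled out")
    return "carcinoma" in note and not any(n in note for n in negations)

confirmed = [r for r in coded if looks_confirmed(r["note"])]

print(len(coded))      # 3 patients carry the code...
print(len(confirmed))  # ...but only 1 has confirmed disease
```

A keyword filter like this is nowhere near good enough for real use – negation, abbreviations, and narrative style vary by author – which is exactly why extracting reliable answers from clinical free text was dissertation-sized work.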
What we really wanted to know from our study was whether the patient went on to a long, healthy life, a recurrence of the cancer, or death. We soon found that we could only measure long life and recurrence for patients who chose to remain within the hospital system whose data we were studying. If a patient went elsewhere, the only way for the hospital system to learn these outcomes was to purchase claims data from a third party. Similarly, if the patient left its system, the hospital would only know the patient had died by purchasing the National Death Index from the Centers for Disease Control and Prevention. But the NDI is available only to those who intend to use it for research, and the information it contains is 12 to 24 months old (assuming you applied for it 2–3 months before it was available).
The good news: I had chosen a field with good job security prospects. The bad news: healthcare doesn’t have insight into the most basic and important information required to improve care.
I have since worked with data from 1500+ clinics, 265+ government, community, and academically affiliated hospitals, and 10+ health insurers on efforts to learn from medical data how to better deliver patient care across many sub-disciplines of medicine. In my experience, the story above is not exceptional in any way. It is a story that would not surprise clinical researchers, health economics outcomes researchers, epidemiologists, QI specialists, or anyone who has attempted to tease truth from clinical data. Outside of these highly specialized fields, however, few people are familiar with what, exactly, makes measuring healthcare quality so difficult.
I do not mean to imply that learning from healthcare data is not possible. As Mark Friedberg stated in his thoughtful THCB piece about the ProPublica Scorecard, “Scoring the Surgeon Scorecard,” there are proper, scientifically credible methods of working with claims and other types of medical data: validating measures, checking the accuracy of the source data, optimizing choices of statistical methods, and calculating the reliability and risk of measurement error. We need these methods to overcome the limitations of what data we currently can capture and how we can capture it. As an example of why this is necessary, he points to inconsistencies discovered in the assignment of surgeons to surgeries performed in Part A versus Part B claims – a rather important detail.
Such checks and balances have a cost, though. The Surgeon Scorecard and most epidemiology, economics, and health services research efforts have the advantages of budgets and time allocated to perform the type of validation Dr. Friedberg describes. Furthermore, the results of such projects are typically papers that can be debated in the larger medical community, where errors can be brought to light quickly and safely.
But what happens when this same data is used to inform thousands of healthcare decisions in organizations and institutions across the US? Data warehousing and business intelligence tools play an increasingly important role in facilitating organizational leadership’s decision making. This same clinical data – with all of its limitations – is being used by such systems to assess clinical performance, identify patients in need of escalated care, and direct resources. The signing of the Medicare Access and CHIP Reauthorization Act (MACRA), and the alignment of Meaningful Use 3 with it, will substantially increase the number of medical decisions made based on aggregated clinical data. Many of these data-driven inferences will appear as guidance in electronic medical records. Some already do.
The decisions made from these conclusions on a daily basis are far more impactful–both to individual patients and in aggregate–than a ProPublica report or even the results of most multi-million dollar randomized controlled trials (which result from extremely vigilant data collection and validation). And yet, thus far this question of how to use data not designed for quality improvement for quality improvement has been largely ignored.
And just what are the repercussions? Ultimately, we don’t know that, either. Each year, our healthcare system kills an estimated 400,000 people by mistake. However, that number might be closer to 200,000. We are fairly certain it’s at least 98,000, or that’s what the last study said, 16 years ago. We accidentally maim a lot more than that. Millions maybe.
Lost in our understandable disappointment with the size of those figures and our determination to improve them is the threat posed by our inability to quantify and understand something so critical. These deaths are a tragedy – the results of past healthcare failures. Our inability to even count the number of deaths (i.e., failures) reliably should be an outrage. And still, little attention is paid to the limitations our data infrastructure imposes on any legitimate attempt to understand, let alone improve, healthcare.
So what can be done to improve our current situation?
Innovations offering incremental improvement in data collection, management, and use will certainly help. We will improve technologies capable of translating a clinician’s words to billing codes. We will see increased use of machine learning and natural language processing to seek patterns in noisy, sparse data, helping us to understand what did happen in past healthcare encounters and what should happen in the future. Hospitals will consider hiring data professionals to work alongside doctors to capture evidence of what’s happening and why in a more useful manner. We will form new companies to track down actual patient outcomes across systems. Done right, these technologies and techniques could offer insights into not only better ways to take care of patients but also how confident we are in those decisions and what evidence exists to support them.
Additionally, we have the option – already mentioned – of validating our data. Should the same best practices of data validation employed by researchers be applied to the growing number of suggested care decisions delivered by clinical-data-dependent technologies? As a patient, I hope so. As an industry insider, I know that this will rarely be the case; in fact, I question whether it is even feasible. Keep in mind, in addition to paying for validation in time and money, we face variation between institutions and organizations in how medical conditions and care are documented. As a result, the validation of any one measure at any one facility is unlikely to transfer simply to the next. And this month’s long-awaited switch from the 13,000 diagnostic codes of ICD-9 to the 68,000 codes of ICD-10 is unlikely to improve the validity or reliability of code assignments.
Finally, we can acknowledge our limitations. A consistent theme in the ProPublica debate is this notion of “good enough.” In science and reporting we tend to accept “good enough” results if the researchers disclose the limitations of the conclusions. These insights are useful in helping potential users of information determine what to make of the results, how far to trust them, how to apply them to their own facilities, etc. Business intelligence and decision support tools that present the results of automated analyses based on clinical data could, similarly, disclose known data and methodological limitations, so clinicians could judge their utility for each patient.
We should consider all of these and probably more. None, however, will address the larger problem. One of my favorite quotes is written on the walls of the Institute for Healthcare Improvement, where our organization is currently housed: “Every system is perfectly designed to get the results it gets.” Today’s systems of data collection and analysis were designed to meet their goals of financial, legal, and narrative documentation, and they do so admirably. Nowhere in the design of today’s widely adopted systems did we insert the requirement that we learn from our data nor the requirement that we adapt care to those learnings. Until this requirement permeates the design of our information systems, we will continue to have to guess the answer to many important questions, including how many people we kill each year by accident.
Leonard D’Avolio, PhD, is the CEO and co-founder of Cyft, an assistant professor at Harvard Medical School, and an advisor to Ariadne Labs and the Helmsley Charitable Trust. He can be followed on Twitter @ldavolio, and his writings and bio appear at http://scholar.harvard.edu/len