Data Thinking In Health Care

Clinicians have been on the receiving end of some pretty terrible practices when it comes to information technology.  Instead of informed and shared decision making, clinicians experience an assault of mandates, metrics, buzzwords, and acronyms without clear explanation or expectations.  Not surprisingly, the pages of THCB and beyond contain frustrated denunciations of EMRs, dares for Dr. Watson to replace them, and dismissals of “big data.”  This whole “technologists are from Mars, clinicians are from Venus” vibe is understandable, but it isn’t productive.  

Data is the building block of measurement, and now that it’s finding its way into healthcare, systematic use of it to measure, improve, and provision care isn’t likely to be dropped off the formulary any time soon.  It would be helpful, then, to have a shared language that allows clinicians and technicians alike to cut through the fog of jargon and focus on using data productively.

Through trial and error (mostly error), I have developed a simple heuristic that I have found useful for establishing a shared understanding around using data in healthcare. I’ll call it Data Thinking, if only to keep with the tech tradition of stealing working names from other products (in this case, Design Thinking).

Data Thinking is a simple way of coming to consensus, explaining the jobs to be done, and mapping buzzwords to function.  Regardless of vendor, technology, or buzzword, making data useful falls into a few basic steps:

  1. Access – getting your hands on the data
  2. Structure – getting it to “apples to apples” so you can do the math
  3. Analysis – learning what matters
  4. Interaction – putting it to use: right place, time, people, presentation


[Figure: the Data Thinking pyramid (Access, Structure, Analysis, Interaction)]

I first described the idea to help explain how informatics can contribute to comparative effectiveness research.  I have since used this simple heuristic to design and scope everything from national infrastructure projects for the Dept. of Veterans Affairs to helping fellows and students plan their projects.  It served as a roadmap for the design of courses in the inaugural year of Northeastern University’s Masters in Health Informatics and for the entire undergraduate Informatics curriculum for the Mass College of Pharmacy and Health Sciences.  I submit it to this particularly active audience of thinkers and doers with the hope that you’ll improve, adapt, or adopt any part that you find helpful.

Here’s a deeper dive on each of its components.  I’ve also offered a few buzzwords falling under each step to make it more tangible with real world tech examples.

Context is king (judge, jury, and executioner).  It matters.  A lot.

Data is a funny raw material in that it gets its value from context.  For example, diagnosis data (e.g., ICD codes) captured for billing purposes can be considered of perfectly acceptable quality for that intended use.  This same piece of data might be worthless for research and even inappropriate for clinical care.  This inconsistency is at the heart of many of the complaints about EMR design as well as most debates surrounding new quality of care metrics and incentive programs.  

The importance of context means that at every stage (e.g., access, structure, analysis, interaction) your success will be determined, at least in part, by factors that have little to do with tech and everything to do with context.  To illustrate, consider which is more pressing for the success of a project: the selection of a particular database technology or answers to context-specific questions like:

  • Who owns the data and how can you gain access? (access)
  • How is the data currently formatted and what challenges will this create? (structure)
  • What methods are best suited to arrive at answers that users can trust? (analysis)
  • Who exactly will employ the results and how will they employ them? (interaction)

Ignoring context can (and does) lead to picking the wrong tools for the job.  Hospitals and health plans successfully utilize business intelligence tools to run queries on disease codes and procedures to understand “units sold” (i.e., utilization) in terms of procedures, encounters, and prescriptions.  This works because utilization data is captured in our systems with this very context in mind.  

We run into a buzzsaw when we try to use these same tools to understand why things were done, whether or not they worked, and what’s likely to happen next.  Why?  Because improving the quality of care was never considered in the original design of today’s data capture systems and therefore isn’t in ‘easy to query’ format.  

Things get interesting when the context of capture differs from the context of intended reuse. In healthcare, that’s almost always the case.

1) Access

All healthcare data-related projects begin with access for an obvious reason: if you can’t get access, you can’t make data more useful.  Unfortunately, most projects end here too. Unless you hold a position of influence in the organization, few of the levers are yours to control.  Most of today’s health information systems do not support access to aggregated clinical data.  Most seeking access must compete for the limited resources of IT teams. Additional barriers are imposed by policies of access designed to prioritize the risk of breaching patient privacy over the risk of not learning what we’re doing and whether it’s working.  

Be wary: the death of a project via access issues is rarely a quick and merciful one.  It usually sounds more like “That’s probably do-able” followed by months of dead ends.  To avoid death by slow roll, address data access by seeking answers to data access questions as early as possible:   

  • Who will need to sign off on data access?
  • Does what you’re doing qualify as research?
  • What resources are available to facilitate access?
  • Where will you put the data once you have it, and what barriers might that present?
  • Does the organization you’re working with have any experience in providing the type of access you’re looking for?  If so, who can you talk to that has accessed similar data?

Access Buzzwords

“The cloud” (external hosting of data and applications) is probably the most relevant.  High-profile issues like hospital data breaches, new analytics vendors, and an upcoming onslaught of genomic data all add to the cloud conversation.

Application Programming Interfaces (APIs).  These are basically hooks exposed by software (e.g., programmatic agreements) that allow other programs to interact with it.  APIs are getting attention as a possible light at the end of the tunnel of EMRs unable (or unwilling) to grant access to important data / interfaces.
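To make the idea concrete, here’s what consuming an API response might look like.  The payload below is invented for illustration, loosely modeled on a FHIR Patient resource; real APIs vary by vendor and version:

```python
import json

# A hypothetical, simplified FHIR-style Patient resource, as an API
# might return it. The field names follow the FHIR Patient resource,
# but this payload is illustrative, not from any real system.
response_body = """
{
  "resourceType": "Patient",
  "id": "example-123",
  "birthDate": "1987-04-12",
  "name": [{"family": "Doe", "given": ["Jane"]}]
}
"""

patient = json.loads(response_body)
full_name = " ".join(patient["name"][0]["given"]) + " " + patient["name"][0]["family"]
print(patient["resourceType"], patient["id"], full_name)
```

The point isn’t the three lines of parsing; it’s that a documented, programmatic agreement like this is what lets other software build on the EMR’s data without screen-scraping or custom interfaces.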

2) Structure

So you managed to gain access to data.  Now, to make data useful for analysis, you must prepare it for “apples to apples” comparisons.  The mess of disparate and disconnected information systems in healthcare means that there is tremendous opportunity for employment traversing data structures.  These positions are guarded by an ingenious system of acronyms designed to deflect unwelcome intruders.  A typical day in the life of a hospital data analyst is likely to feature the design of ETLs in SQL or SAS to map DICOM from one PACS to another or cleaning up LOINC inconsistencies in the LIMS.  You still there?

If so, then you should press on the context surrounding each of the structures in use and their appropriateness for whatever reuse is proposed.  For example, if you’re trying to automatically populate a cancer registry and the source of your data is your hospital’s pathology reports, start by asking how that data is structured.  Is it formatted as unstructured free text?  If so, is the data you’re interested in captured in accordance to a standard (e.g., AJCC)?  If it’s captured by a drop down list in the electronic medical record are the possible values comprehensive and granular enough for whatever purposes you have in mind for them?
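That last question can even be checked mechanically.  A minimal sketch, with both value sets invented for illustration (the “needed” set is loosely modeled on AJCC stage groups; real value sets come from your EMR configuration and registry requirements):

```python
# Does the EMR drop-down offer enough granularity for the registry?
# Both sets below are invented for illustration.
allowed_stages = {"I", "II", "III", "IV"}           # what the drop-down offers
needed_stages = {"I", "IIA", "IIB", "III", "IV"}    # what the registry requires

missing = needed_stages - allowed_stages
print(sorted(missing))  # substages the drop-down cannot capture
```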

Structure Buzzwords:

Data Warehouse: Brings together data from multiple sources (access), hopefully with some consistency (structure).

SQL: Structured Query Language, or SQL, is the primary language for storing and manipulating data in relational databases.  It plays an important role in formatting data for analysis.

ETL: Extract, transform, and load (ETL) tools use SQL to migrate data from one system to another. ETL tools allow the user to save jobs, assign permissions, etc., making it easier to repeat the same job in the future.
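A minimal ETL sketch, using Python’s built-in sqlite3 in place of a real warehouse; the table names, codes, and cleanup rule are all invented for illustration:

```python
import sqlite3

# Extract-transform-load in miniature: pull rows from a "raw" billing
# table, clean them up, and load them into a warehouse table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source: raw billing extract with inconsistent code formatting.
cur.execute("CREATE TABLE billing_raw (patient_id TEXT, dx_code TEXT)")
cur.executemany("INSERT INTO billing_raw VALUES (?, ?)",
                [("p1", " 250.00"), ("p2", "401.9 "), ("p3", "250.00")])

# Target: cleaned table in the warehouse.
cur.execute("CREATE TABLE dx_clean (patient_id TEXT, dx_code TEXT)")

# Extract + transform (trim stray whitespace) + load, in one statement.
cur.execute("""
    INSERT INTO dx_clean
    SELECT patient_id, TRIM(dx_code) FROM billing_raw
""")

cur.execute("SELECT COUNT(*) FROM dx_clean WHERE dx_code = '250.00'")
print(cur.fetchone()[0])  # patients whose cleaned code is 250.00
```

A real ETL tool adds scheduling, permissions, and logging around exactly this kind of job.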

Natural Language Processing (NLP): NLP adds structure to unstructured free text via software pipelines.  By some estimates up to 70% of clinically relevant data is stored as clinical notes, making this an increasingly important technical approach in healthcare.

ICD-9 or 10, CPT, SNOMED, UMLS, LOINC, DICOM:  The great thing about data structure standards such as these is that there are so many to choose from.  For example, the Unified Medical Language System (UMLS) is itself a standard containing 90+ source vocabularies used to map medical free text to standard concepts.

3) Analysis

With access and structure solved, it’s time to look for patterns of interest in the analysis stage.  It’s important not to simply assume which methods / tools are best based on familiarity with them. Many in healthcare have a tendency to equate analysis with SAS (a statistical software package).  That’s the equivalent of bringing only a screwdriver to the job site.  Sure, the screwdriver is a good tool to have, but it’s probably best to complement it with some options.

Are basic counts of a few data points good enough? Or do the numbers required prohibit manual counting?  If so, you’ll need to rely on automated methods of analyzing your data.  What metrics are trusted by your intended recipients of the results?  For example, recall, precision, and F-measure are perfectly acceptable ways to present back the accuracy of predictions in the computer science world.  However, most clinicians are more comfortable with sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve.
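The translation between those two vocabularies is mostly bookkeeping.  A quick sketch with an invented confusion matrix shows that recall and sensitivity are the same number:

```python
# One confusion matrix, two vocabularies. Counts are invented.
tp, fp, fn, tn = 80, 10, 20, 90

precision = tp / (tp + fp)
recall = tp / (tp + fn)          # computer science name
sensitivity = recall             # clinical name for the same quantity
specificity = tn / (tn + fp)     # true-negative rate
f_measure = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3),
      round(specificity, 3), round(f_measure, 3))
```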

Can you rely on the data you have to support the type of analysis you want to do?  For example, is an analysis of 2 year old claims data going to yield reliable insights into the effects of a behavior modification intervention?  Or might you have to budget for follow up of a random sub-sample via telephone calls?  Too many projects begin to consider these questions long after budgets are spent, forcing good statisticians to add ‘miracle worker’ to their resumés.

Analysis Buzzwords:

R, SAS, and SPSS are statistical software packages helpful for exploring data and testing hypotheses.  They offer a number of proven and even experimental statistical approaches / methods to choose from.

SQL: I mentioned SQL as a useful tool for structuring data.  It also comes in handy for certain types of analysis by supporting the application of Boolean logic to seek patterns or return specific values in datasets (e.g., ‘find all patients where date of birth > 1/1/1995’).
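That example query can be run end to end with Python’s built-in sqlite3; the patient rows below are invented, and ISO-formatted dates compare correctly as strings in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE patients (name TEXT, date_of_birth TEXT)")
cur.executemany("INSERT INTO patients VALUES (?, ?)",
                [("A", "1990-06-01"), ("B", "1996-03-15"), ("C", "2001-11-30")])

# 'find all patients where date of birth > 1/1/1995'
cur.execute("SELECT name FROM patients WHERE date_of_birth > '1995-01-01'")
print([row[0] for row in cur.fetchall()])
```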

Data mining, machine learning, “big data,” predictive analytics, cognitive computing:  These methods are useful for discovering patterns in noisy data, particularly when the evidence is spread across multiple sources and formats.  For example, at Cyft we use these methods to understand which patients actually have pneumonia or diabetes with complications because one can’t rely on disease codes alone to answer these questions.  Similarly, these methods are well-suited to predict what will happen, such as which patients are most likely to be re-admitted following CABG.
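As a purely illustrative sketch (not any vendor’s actual method), combining evidence from multiple sources can be as simple as a weighted score; every feature name, weight, and threshold below is invented:

```python
# Toy evidence-combination sketch: no single field (e.g., a billing
# code) decides the answer on its own. Nothing here resembles a
# production model.
weights = {
    "dx_code_pneumonia": 0.4,          # billing code present
    "note_mentions_infiltrate": 0.3,   # clinical/radiology note evidence
    "antibiotic_ordered": 0.3,         # pharmacy evidence
}

def pneumonia_score(evidence):
    """Weighted sum of boolean evidence flags."""
    return sum(w for key, w in weights.items() if evidence.get(key))

patient = {"dx_code_pneumonia": True, "note_mentions_infiltrate": True,
           "antibiotic_ordered": False}
print(pneumonia_score(patient) >= 0.5)  # flag for review above a threshold
```

Real machine learning methods learn the weights from labeled data rather than having a human assign them, but the intuition of pooling weak signals is the same.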

4) Data Interaction

The best analysis in the world using perfectly suited data won’t make a single patient’s life better if it isn’t presented to the right person at the right time and in the right format.  As implied by the term, “interaction” is a two way street involving the design of the best way to capture data as well as the best way to present it.

Design – something we in healthcare have traditionally ignored – comes in handy when it comes to interaction.  This implies *gasp* talking to the intended users of our systems, understanding their needs and finding the most efficient way to meet them.  Good interaction requires answers to questions like: who will use the results, when will they use them, what’s the optimal amount of information to present at different stages of their workflow?

Interaction Buzzwords

Interaction-related buzzwords exist.  Unfortunately, in today’s engineered world of healthcare, most clinicians don’t encounter them.  It’ll be a good sign when acronyms like UX (user experience), UI (user interface), and participatory design start making their way into clinical experience.

Healthcare needs much more than a simple way of thinking to bring the opportunity of technology in alignment with the task of improving care.  However, one shouldn’t underestimate the ability to cut through jargon to achieve informed decision making.  Good clinicians are always traversing the gap between clinical and lay language – or even cross speciality jargon – to arrive at better outcomes.  It is no different when the patient is healthcare itself and information technology is among the proposed interventions.

Next time you’re involved in using data or IT to improve care, give it a try.  Jot the pyramid on a whiteboard.  Walk the team from access through interaction.  Ask the tough context questions above at each stage.  Then let me know if it worked and how to improve it.  Post your feedback or questions below or, if you prefer, email me directly (ldavolio at cyft dot com).  With any luck we’ll make it just a bit easier to make sure data becomes one of healthcare’s most valuable resources.

Leonard D’Avolio, Ph.D., is the CEO and co-founder of Cyft, an assistant professor at Harvard Medical School, and an advisor to Ariadne Labs and the Helmsley Charitable Trust Foundation.  He can be followed on Twitter @ldavolio, and his writings and bio appear at http://scholar.harvard.edu/len
