Freeing the Data

I’m keynoting this year’s InterSystems Global Conference on the topic of “Freeing the Data” from the transactional systems we use today, such as Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), and Electronic Health Records (EHR).  As I’ve prepared my speech, I’ve given a lot of thought to the evolving data needs of our enterprises.

In healthcare and in many other industries, it’s increasingly common for users to ask IT for tools and resources to look beyond the data we enter during the course of our daily work.  For one patient, I know the diagnosis, but what treatments were given to the last 1000 similar patients?  I know the sales today, but how do they vary over the week, the month, and the year?  Can I predict future resource needs before they happen?

In the past, such analysis typically relied on structured data, exported from transactional systems into data marts using Extract/Transform/Load (ETL) utilities, followed by analysis with Online Analytical Processing (OLAP) or Business Intelligence (BI) tools.

In a world filled with highly scalable web search engines,  increasingly capable natural language processing technologies, and practical examples of artificial intelligence/pattern recognition (think of IBM’s Jeopardy-savvy Watson as a sophisticated data mining tool), there are novel approaches to freeing the data that go beyond a single database with pre-defined hypercube rollups.   Here are my top 10 trends to watch as we increasingly free data from transactional systems.

1.  Both structured and unstructured data will be important

In healthcare, the HITECH Act/Meaningful Use requires that clinicians document the smoking status of 50% of their patients.  In the past, many EHRs did not have structured data elements to support this activity.  Today’s certified EHRs provide structured vocabularies and specific pulldowns/checkboxes for data entry, but what do we do about past data?  Ideally, we’d use natural language processing, probability, and search to examine unstructured text in the patient record and determine smoking status, including the context around the word “smoking”, such as “former”, “active”, “heavy”, or “never”.
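To make this concrete, here is a minimal, hypothetical sketch of extracting smoking status from free text by looking for a qualifier word just before “smoker”/“smoking”.  Real clinical NLP pipelines handle negation, sentence boundaries, and probability, which this toy version omits; the qualifier list and function names are invented.

```python
import re

# Illustrative qualifier words drawn from the examples in the text above
QUALIFIERS = ["former", "never", "heavy", "active"]

def smoking_status(note: str) -> str:
    """Return the qualifier directly preceding 'smok...' in a note, if any."""
    text = note.lower()
    match = re.search(r"(\w+)\s+smok", text)  # word immediately before "smok..."
    if match and match.group(1) in QUALIFIERS:
        return match.group(1)
    if "smok" in text:
        return "unknown qualifier"
    return "not documented"

print(smoking_status("Patient is a former smoker, quit 2005."))  # former
print(smoking_status("Denies chest pain. Never smoked."))        # never
```

Even a crude pass like this can backfill a structured smoking-status field from legacy narrative notes, flagging ambiguous cases for human review.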

Businesses will always have a combination of structured and unstructured data.   Finding ways to leverage unstructured data will empower businesses to make the most of their information assets.

2.  Inference is possible by parsing natural language

Watson on Jeopardy provided an important illustration of how natural language processing can really work.  Watson does not understand the language, and it is not conscious/sentient.  Watson’s programming enables it to assign probabilities to expressions.  When asked “does he drink alcohol frequently?”, finding the word “alcohol” associated with the word “excess” is more likely to imply a drinking problem than finding “alcohol” associated with “to clean his skin before injecting his insulin”.  Next-generation natural language processing tools will provide the technology to assign probabilities and infer meaning from context.
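The idea of assigning probabilities from context can be sketched in a few lines.  This is purely illustrative: the word lists and weights are invented, not taken from any real model, but they show how nearby words shift the inferred probability for the two “alcohol” sentences above.

```python
# Invented context weights: words suggesting a drinking problem vs. a
# benign clinical use of alcohol. A real system would learn these.
RISK_WORDS = {"excess": 0.9, "daily": 0.7, "binge": 0.9}
BENIGN_WORDS = {"swab": 0.05, "clean": 0.05, "skin": 0.05}

def drinking_probability(sentence: str) -> float:
    """Score how likely 'alcohol' in a sentence implies drinking."""
    words = sentence.lower().split()
    if "alcohol" not in words:
        return 0.0
    scores = [RISK_WORDS.get(w, BENIGN_WORDS.get(w)) for w in words]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else 0.5  # no context: uninformative prior

print(drinking_probability("He uses alcohol in excess on weekends"))           # 0.9
print(drinking_probability("Uses alcohol to clean his skin before injecting insulin"))  # 0.05
```

The point is not the arithmetic but the architecture: meaning is never “understood,” only scored from surrounding evidence.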

3.  Data mining needs to go beyond single databases owned by a single organization.

If I want to ask questions about patient treatment and outcomes, I may need to query data from hundreds of hospitals to achieve statistical significance.  Each of those hospitals may have different IT systems with different data structures and vocabularies.  How can I query a collection of heterogeneous databases?  Federation will be possible by normalizing the queries through middleware.  For example, data might be mapped to a common Resource Description Framework (RDF) exchange language using standardized SPARQL query tools.  At Harvard, we’ve created a common web-based interface called SHRINE that queries all our hospital databases, providing aggregate de-identified answers to questions about diagnosis and treatment of millions of patients.
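The shape of that normalizing middleware can be sketched without any real RDF or SPARQL machinery.  In this hypothetical example (all site names, codes, and mappings are invented), two “hospitals” store diagnoses under different local vocabularies; a mapping layer translates each site’s codes into one common concept, and only de-identified aggregate counts cross site boundaries, as SHRINE does at far greater scale.

```python
# Two sites with heterogeneous local diagnosis codes (invented data)
HOSPITAL_A = {"pt1": "DM2", "pt2": "HTN"}       # in-house abbreviations
HOSPITAL_B = {"pt9": "E11.9", "pt8": "E11.9"}   # ICD-10-style codes

# Middleware: map each local vocabulary to a common concept
TO_COMMON = {"DM2": "type2-diabetes", "E11.9": "type2-diabetes",
             "HTN": "hypertension"}

def federated_count(concept: str) -> int:
    """Ask every site the same normalized question; return aggregate counts."""
    total = 0
    for site in (HOSPITAL_A, HOSPITAL_B):
        total += sum(1 for code in site.values()
                     if TO_COMMON.get(code) == concept)
    return total  # aggregate only; no patient identifiers leave a site

print(federated_count("type2-diabetes"))  # 3, across both sites
```

The design choice worth noting is that the mapping lives in the middleware, not in the hospitals: each site keeps its native vocabulary, and normalization happens at query time.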

4.  Non-obvious associations will be increasingly important

Sometimes, it is not enough to query multiple databases.  Data needs to be linked to external resources to produce novel information.  For example, at Harvard, we’ve taken the address of each faculty member, examined every publication they have ever written, geo-encoded the location of every co-author, and created visualizations of productivity, impact, and influence based on the proximity of colleagues.  We call this “social networking analysis”.

5.  The President’s Council of Advisors on Science and Technology (PCAST) Healthcare IT report will offer several important directional themes that will accelerate “freeing the data”.

The PCAST report suggests that we embrace the idea of universal exchange languages, metadata tagging with controlled vocabularies, privacy flagging, and search engine technology with probabilistic matching to transform transactional data sources into information, knowledge and wisdom.    For example, imagine if all immunization data were normalized as it left transactional systems and pushed into state registries that were united by a federated search that included privacy protections.  Suddenly every doctor could ensure that every person had up to date immunizations at every visit.

6.  Ontologies and data models will be important to support analytics

Creating middleware solutions that enable federation of data sources requires that we generally know what data is important in healthcare and how data elements relate to each other.  For example, it’s important to know that an allergy has a substance, a severity, a reaction, an observer, and an onset date.  Every EHR may implement allergies differently, but by using a common detailed clinical model for data exchange and querying, we can map heterogeneous data into comparable data.
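A detailed clinical model plus per-site adapters might look like the following sketch.  The allergy fields come from the paragraph above; the two site record formats and all function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Allergy:
    """Common detailed clinical model for an allergy (fields per the text)."""
    substance: str
    severity: str
    reaction: str
    observer: Optional[str]
    onset_date: Optional[str]

def from_site_a(rec: dict) -> Allergy:
    # Hypothetical Site A: terse key-value records
    return Allergy(rec["subst"], rec["sev"], rec["rxn"],
                   rec.get("obs"), rec.get("onset"))

def from_site_b(rec: dict) -> Allergy:
    # Hypothetical Site B: one pipe-delimited "substance|severity|reaction" field
    substance, severity, reaction = rec["allergy"].split("|")
    return Allergy(substance, severity, reaction, None, rec.get("date"))

a = from_site_a({"subst": "penicillin", "sev": "severe", "rxn": "hives"})
b = from_site_b({"allergy": "penicillin|severe|hives", "date": "2010-02-01"})
print(a.substance == b.substance)  # True: comparable after normalization
```

Once every site’s records are mapped into the same model, queries and analytics can treat them as one dataset.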

7.  Mapping free text to controlled vocabularies will be possible and should be done as close to the source of data as possible.

Every industry has its jargon.  Most clinicians do not wake up every morning thinking about SNOMED-CT concepts or ICD-10 codes.  One way to leverage unstructured data is to turn it into structured data as it is entered.  If a clinician types “Allergy to Penicillin”, it could become SNOMED-CT concept 294513009 for Penicillins.  As more controlled vocabularies are introduced in medicine and other industries, transforming text into controlled concepts for later searching will be increasingly important.  Ideally, this will be done as the data is entered, so it can be checked for accuracy.  If not at entry, then transformations should be done as close to the source systems as possible to ensure data integrity.  With every transformation and exchange of data from the original source, there is increasing risk of loss of meaning and context.
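The penicillin example can be sketched as a lookup at the point of entry.  Only the concept code 294513009 comes from the text above; the one-entry lookup table and function name are illustrative, and a real terminology service would handle synonyms, misspellings, and ambiguity.

```python
# Minimal illustrative free-text-to-SNOMED-CT map (only this code is from the text)
SNOMED = {"penicillin": "294513009"}

def to_snomed(free_text: str):
    """Map a free-text allergy entry to a SNOMED-CT concept, if recognized."""
    text = free_text.lower()
    for term, concept in SNOMED.items():
        if term in text:
            return concept
    return None  # unmapped text is kept as-is for human review

print(to_snomed("Allergy to Penicillin"))  # 294513009
```

Doing this at entry time lets the clinician confirm the mapping while the context is still fresh, which is exactly why coding close to the source preserves meaning.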

8.  Linking identity among heterogeneous databases will be required for healthcare reform and novel business applications.

If a patient is seen in multiple locations, how can we combine their history so they get the maximum benefit of alerts, reminders, and decision support?  Among the hospitals I oversee, we have persistent linkage of all medical record numbers between hospitals: a master patient index.  Surescripts/RxHub does a real-time probabilistic match on name/gender/date of birth for over 150 million people.  There are other creative techniques, such as those pioneered by Jeff Jonas, for creating a unique hash of data for every person, then linking data based on that hash.  For example, John, Jon, Jonathan, and Johnny are reduced to one common root name, John.  The combination of John + last name + date of birth is then hashed using SHA-1.  In this way, records about the person can be aggregated without ever disclosing who the person really is; only the hash is used to find common records.
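The name-root-plus-hash approach described above can be sketched in a few lines.  The variant table is illustrative (real systems use far richer name normalization), and while SHA-1 appears here because it is the algorithm named in the text, a production system today would prefer a keyed hash such as HMAC-SHA-256.

```python
import hashlib

# Illustrative name-variant table: reduce variants to a common root
NAME_ROOTS = {"jon": "john", "jonathan": "john", "johnny": "john"}

def identity_hash(first: str, last: str, dob: str) -> str:
    """Hash normalized name + last name + date of birth into a linkage key."""
    root = NAME_ROOTS.get(first.lower(), first.lower())
    key = f"{root}|{last.lower()}|{dob}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

# All four variants of the same person produce the same linkage key
h = identity_hash("John", "Smith", "1960-01-01")
print(all(identity_hash(n, "Smith", "1960-01-01") == h
          for n in ("Jon", "Jonathan", "Johnny")))  # True
```

Records tagged with this key can be aggregated across databases while the key itself never reveals the underlying name or birth date.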

9.  New tools will empower end users

All users, not just power users, want web-based or simple-to-use client-server tools that allow data queries and visualizations without requiring a lot of expertise.  The next generation of SQL Server and PowerPivot offers this kind of query power from the desktop.  At BIDMC, we’ve created web-based parameterized queries in our Meaningful Use tools, we’re implementing PowerPivot, and we’re creating a powerful hospital-based visual query tool using I2B2 technologies.

10.  Novel sources of data will be important

Today, patients and consumers are generating data from apps on smartphones, wearable devices, and social networking sites.  Novel approaches to creating knowledge and wisdom will source data from consumers as well as from traditional corporate transactional systems.

Thus, as we all move toward “freeing the data”, it will no longer be sufficient to use just structured transactional data entered by experts in a single organization, then mined by professional report writers.  The speed of business and the need for enhanced quality and efficiency are pushing us toward near real-time business intelligence and visualizations for all users.  In a sense, this mirrors the development of the web itself, evolving from expert HTML coders, to content management tools for non-technical designated editors, to social networking where everyone is an author, publisher, and consumer.

“Freeing the data” is going to require new thinking about the way we approach application design and requirements.   Just as security needs to be foundational, analytics need to be built in from the beginning.

I look forward to my keynote in a few weeks.  Once I’ve delivered it, I’ll post the presentation on my blog.

John Halamka, MD, MS, is the CIO at Beth Israel Deaconess Medical Center and the author of the popular Life as a Healthcare CIO blog, where he writes about technology, the business of healthcare, and the issues he faces as the leader of the IT department of a major hospital system.  He is a frequent contributor to THCB.
