First of all, I have to admit that I am a convert, not an original believer, in the Data Lake and late-binding approaches to data analytics. I do not think it is my fault, or at least I have a defense of sorts. I grew up in a world where my entrepreneurial heroes were people like Bill Gates, Larry Ellison, and Steve Jobs, and it seemed that structured systems, like operating systems that allowed many developers to work against a common standard, were the way to go.
I grew up in a world where UNIX was fragmented into a dozen variants, forcing us to compile our code on a number of different boxes just so people could use our scripts to provide the semblance of a dynamic web site. So when it came time to figure out a way to make health care better, I worked with our team to build a model of how to standardize and structure data into a schema that had a home for almost any data we could encounter. We built a data model and called it the Data Trust, because we wanted to be sure that the data in that healthcare data model was clean and trusted as a single source of truth. We had some success as well.
Last week I was watching “The Big Short.” I had read the book, so I felt smug enough to wait for the movie to come out on streaming media. As I watched the beginning of the movie, something became clear. The people who discovered that there was a bubble did something that others did not do: they looked back to the original information underlying all of the layers of cleaning and rating. They saw that the base loans in the bonds were not what they seemed to be, and when others visited places like Florida, they found people openly taking out loans that couldn’t possibly be paid back. The risk in having clean information with many layers of transformation and analytical filters is that you can get into the conundrum of “The Big Short”: people who could make sense of the underlying signals may lose the fidelity of those signals in what is eliminated during cleansing, or the data may be transposed as a matter of well-intentioned organizational policy. Essentially, it is very important that smart people with the right mindset for testing whether their hypotheses are right are able to get to the original data, look at it, and think through its meaning. What they need may be a Data Lake, but it may be something more.
What most people tell me when I am working on a data project, even in very advanced organizations, is “We don’t know where our data is,” along with “Even when we know where our data is, we don’t have any way to get to it.” This is the essence of the concept of dark data. The old me would have told them that they needed to grab hold of it all and map it into a standardized enterprise data warehouse.
But now I am wiser. In the mantra of Scrooge McDuck, I prefer to work smarter, not harder. What I think is really needed now is to make maps of where the data is, in place. For most of the data out there, the map will be the only way to get to it, and that is fine, because as long as data can be geocached (those of you with kids, who have the mosquito bites to prove you have wandered the wilderness looking for Tupperware™ boxes, know what I mean), the data consumers who are interested in finding it can slowly migrate it from dark data into enterprise data at the appropriate pace. The first thing we need, then, is not an enterprise data warehouse and not an enterprise data lake, but a Patient Data Positioning System: a GPS for data that can help us get to the underlying location of information of interest.
If we take that analogy a bit further, we can build the many layers of map views, including data in motion from one system to the next, on top of that initial Data Positioning System (DPS) view. This would align well with the maps drawn of the many layers of patient data for the precision medicine project, but instead of layering by type, the layers would be by process level, with the base layer of the map being simply ‘where the data is.’
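To make the idea a bit more concrete, here is a minimal sketch, in Python, of what a single DPS entry might capture: a pointer to where a data set lives, who stewards it, and which process-level layer of the map it belongs to. Every field and example name below is an illustrative assumption, not a reference to any specific product or standard.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DPSEntry:
    """One record in a hypothetical Data Positioning System (DPS):
    a pointer to where data lives, not a copy of the data itself."""
    name: str                  # human-readable name, e.g. "HL7 ADT feed"
    location: str              # where it physically lives (database, share, device)
    steward: str               # who to ask for access
    process_layer: str         # map layer: "source", "in-motion", "curated", ...
    formats: List[str] = field(default_factory=list)     # file or schema formats
    downstream: List[str] = field(default_factory=list)  # systems it feeds

# A toy "map": the base layer is simply where the data is.
data_map = [
    DPSEntry("Bedside monitor waveforms", "ICU monitor archive, unit 4B",
             "Clinical engineering", "source", ["proprietary binary"]),
    DPSEntry("HL7 ADT feed", "interface engine queue", "Integration team",
             "in-motion", ["HL7 v2"], downstream=["EDW", "registry"]),
]

def layer_view(entries, layer):
    """Return only the entries on one process-level layer of the map."""
    return [e for e in entries if e.process_layer == layer]

print([e.name for e in layer_view(data_map, "source")])
```

The point of the sketch is that nothing is moved or transformed; the map simply records positions and lets consumers filter by layer.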
In some situations we may recommend that a data source simply ‘go away’ because it is causing more trouble than it is worth, especially if it is an intermediate form that creates lots of maintenance work without generating more truth. In other cases it may be important to migrate a full copy of the data to the Data Lake so that it can be analyzed ‘hot’ as it is entered. In still other scenarios it may be best to join that data with the hundreds of other data sets of the same type but different formats into a unified format, or at least a unified identifier framework, so that it can be easily used without having to be cleansed every time someone tries to access it.
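As a sketch of how those three dispositions might be triaged against the map, the toy function below mirrors the options in the paragraph above; the inputs and category names are assumptions for illustration, not a prescribed policy.

```python
def recommend_disposition(is_intermediate, adds_truth, needs_hot_analysis):
    """Toy triage for a data source found on the map; the three outcomes mirror
    the options above: retire it, copy it to the lake, or unify it in place."""
    if is_intermediate and not adds_truth:
        return "retire"            # costs work to maintain without generating truth
    if needs_hot_analysis:
        return "copy-to-lake"      # full copy, analyzed 'hot' as it is entered
    return "unify-identifiers"     # stays in place, harmonized format/identifiers

print(recommend_disposition(is_intermediate=True, adds_truth=False,
                            needs_hot_analysis=False))   # -> "retire"
```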
Now there is another problem that this can help to solve: the issue of annotations. In many types of patient data, the underlying information stays the same but the annotations change over time, based on the changing body of knowledge about how to interpret the underlying data. For example, genomics struggles with this challenge in that annotation, even at the level of ‘what is a variant,’ changes as we discover new patterns of the genome that can vary. Additionally, which of those variants have biological or clinical significance is a constantly moving target, driven by research done locally, by what the literature has published, and by how the organization wants to interpret the literature. Annotations, like those in the Talmud, need to stay live even when the underlying information is static.
This is true of most complex, high-dimensional data sets, including imaging (photos, X-rays, MRIs, and slides), waveform analysis such as EEG and ECG, audio files with associated transcripts, free-text notes with NLP outputs, and mappings between structured concepts such as lab tests and LOINC codes or ICD-9 and ICD-10 codes. Among the things the DPS should solve for is how to continuously update the annotations based on acquired knowledge while keeping the data in place. So rather than looking to transform the data into a common format, we should be solving for the continuous ‘expansion of understanding’ about the meaning of the data, in the context of a world of information outside the organization. That is a different system from how the traditional data warehouse has been laid out, but the DPS can focus on using the position of the data to map the annotation on top of it, just like a layer on a map. The system may also need to maintain the annotations over time, in the way that a weather pattern can be watched developing from day 1 to day 10 of a storm.
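One way to picture ‘live annotations over static data’ is as versioned annotation layers keyed to a stable pointer from the DPS, rather than rewritten copies of the data. The sketch below assumes an invented pointer scheme and field names; the specific variant and interpretations are placeholders.

```python
from datetime import date

# The underlying data never moves or changes; only annotation layers accumulate.
annotations = {}   # data_pointer -> list of annotation versions

def annotate(data_pointer, source, interpretation, as_of=None):
    """Attach a new interpretation to data that stays in place."""
    annotations.setdefault(data_pointer, []).append({
        "as_of": as_of or date.today(),
        "source": source,               # local research, literature, org policy
        "interpretation": interpretation,
    })

def current_view(data_pointer):
    """Latest annotation wins, but every earlier layer is preserved,
    like replaying a storm developing from day 1 to day 10."""
    history = sorted(annotations.get(data_pointer, []), key=lambda a: a["as_of"])
    return history[-1] if history else None

annotate("genome:sample42:chr7:140453136", "literature",
         "variant of uncertain significance", date(2014, 3, 1))
annotate("genome:sample42:chr7:140453136", "local research",
         "likely pathogenic", date(2016, 1, 15))
print(current_view("genome:sample42:chr7:140453136")["interpretation"])
```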
So for those of you who are still frustrated that you have not even gotten to a data lake, and now I am asking for some new thing, I want to encourage you: building a data map and a DPS will likely be much easier than building all of the pipes to move the data around and transform it into final forms. This should offer an even easier way to get started on the path to ideal information, help groups understand that governance can move the ball forward, and enable the Michael Burry in your organization (the investor who first saw the underlying flaw in mortgage bonds and derivatives) to look into the data and find something that surprises us and maybe helps our patients.
Dan Housman is CTO for ConvergeHealth by Deloitte.
Thanks John,
Coming from the ‘decision support’ world: if I understand you correctly, you are stating a need to associate contextual information with each data element, so that beyond the standard source, date, and time attributes, other descriptors are useful. Example: the business-clinical process, a crosswalk to other locations, and a reference to the data collection ‘case’, e.g. this is a static blood pressure result and is inherited by the longitudinal patient health registry.
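A minimal sketch of the kind of contextual record this comment describes, with every field name invented purely for illustration:

```python
# Hypothetical contextual wrapper around a single data element, following the
# attributes listed above; all field names and values are illustrative only.
blood_pressure_reading = {
    "value": "128/82 mmHg",
    "source": "clinic intake form",
    "date": "2016-02-03",
    "time": "09:14",
    "business_clinical_process": "annual wellness visit",
    "crosswalk": ["EDW.vitals.blood_pressure", "registry.latest_bp"],
    "collection_case": "static reading, inherited by longitudinal registry",
}
```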
Probably not a good example on my part, but it illustrates difficulties I have witnessed when different analysts have a go at separate data repositories at the application level. In the clinical world, analysts typically do not understand the use case for how the information they generate is used. The mistakes show up as large variance in what should be tightly clustered information when the reports are assessed at the committee level. When this happens, the user of the information (in this case a clinical outcomes measurement group) will suspect poor reliability and lose trust in the provider of the information.
When I moved to a (state not to be mentioned) in 2001, I discovered four “data warehouses” that contained information sharing the same description for each Medicaid recipient. Yet when extracting and comparing the information, the variability was significant enough to cause the physicians to use the state’s reports as ‘door-stops’.
It took a long time to solve this one, more due to the politics required to bust up the silos than anything else.
I won’t speak to the store vs. do-not-store issue, but will only note the importance of considering all of the use cases for interpretation.
I shifted careers in 1998, from clinical program manager to subject matter expert for a start-up owned by Rob Merenyi, the former CTO of the Lotus Suite. As an object architect, he taught me to appreciate the importance of thinking through the attributes of a data element. His abstract thinking was challenging for me, but it gave rise to much faster and more intelligent applications without sacrificing the data repository needed, for different reasons, by so many payers, providers, regulators, and so on.
Even though I have no idea what a Data Lake is, much of what Dan is saying rings true. Data should stay in place as much as possible and various services should be mapping, providing directories, and managing authorization to access the maps, directories, and the data itself.
One place to start is with http://thedatamap.org/ – a project supported by Patient Privacy Rights that recently won a Knight grant for expansion. Another is to convert all of the hidden data brokers – institutions that a patient doesn’t know exist and can’t access – to be patient-accountable and patient-transparent. As we do this, we need to heed the core message of the JASONs and other independent expert panels and make sure that authorization for all this use of patient data is centered on the patient and not scattered among dozens of institutional portals. This is the goal of the HEART workgroup (http://openid.net/wg/heart/), and we could really use folks like Dan on our calls to advance this vision in relation to FHIR as the emerging standard for access to data in the lake.
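As one purely illustrative picture of patient-centered access, the sketch below issues a standard FHIR search for a patient’s Observation resources using an access token the patient has authorized. The base URL and token are placeholders, and the sketch is not the HEART profile itself, just a plain FHIR read gated by a patient-granted credential.

```python
import requests

FHIR_BASE = "https://fhir.example.org"   # placeholder FHIR server, not a real endpoint
PATIENT_AUTHORIZED_TOKEN = "<token issued under patient-controlled authorization>"

def fetch_observations(patient_id):
    """Standard FHIR search for a patient's Observation resources,
    presented with a bearer token the patient has authorized."""
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={"patient": patient_id},
        headers={
            "Authorization": f"Bearer {PATIENT_AUTHORIZED_TOKEN}",
            "Accept": "application/fhir+json",
        },
    )
    resp.raise_for_status()
    return resp.json()   # a FHIR Bundle of Observation resources
```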