I am happy to announce the release of the doctor “referral” social graph. This dataset, which I obtained using a Freedom of Information Act request against the Medicare claims database, details how most doctors, hospitals and other providers team together to deliver care in the United States. This graph is nothing less than a map of how healthcare is delivered in this country.
For the time being, the only way to get a copy of this data set is to support the Medstartr crowdfunding campaign for either $100 (for the viral “open source eventually” version of the data) or $1000 (for the proprietary-friendly version of the data, which any business can freely “merge” with other data). If you need consulting around this data, you can buy in at the $5k or $10k levels. Also, we are going to have really awesome t-shirts.
I will be writing a more in-depth technical article about this dataset over on the brand new O’Reilly Strata blog (which focuses specifically on Big Data) so I will gloss over most of the technical details here, with a few important exceptions.
First, when I say a “graph” I am not talking about a diagram. I am talking about a mathematical model that supports nodes and connections between those nodes. These are visualized as diagrams, but it is not possible to really analyze large graphs without a database. In this case, the nodes are doctors, hospitals and other providers and the connections between those nodes represent the degree to which they collaborate on specific patients.
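To make the idea of nodes and weighted connections concrete, here is a minimal sketch in Python. It is only an illustration: the three-column record layout, the ten-digit identifiers and the counts are all made up, and treating each connection as undirected is my assumption for the example, not a description of the released files.

```python
from collections import defaultdict

# Hypothetical edge records: (provider_a, provider_b, shared_patient_count).
# The identifiers and counts are invented for illustration only.
edges = [
    ("1000000001", "1000000002", 42),
    ("1000000001", "1000000003", 7),
    ("1000000002", "1000000003", 15),
]

def build_graph(edge_rows):
    """Build an undirected, weighted adjacency map from edge rows."""
    graph = defaultdict(dict)
    for a, b, weight in edge_rows:
        # Store the connection in both directions (assumed undirected).
        graph[a][b] = graph[a].get(b, 0) + weight
        graph[b][a] = graph[b].get(a, 0) + weight
    return graph

graph = build_graph(edges)

# Providers connected to the first node, strongest collaboration first.
neighbors = sorted(graph["1000000001"].items(), key=lambda kv: -kv[1])
```

An adjacency map like this is fine for a toy example. For a graph with millions of nodes you would load the same rows into a database, which is the point made above.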
Also, despite my branding to the contrary, this is not strictly a “referral” data set, although a fairly large portion of the data does represent referral relationships. Instead, it depicts the degree to which any healthcare provider “works” on a patient in the same time frame as some other provider. This means, for instance, that many primary care doctors are linked to emergency rooms. But this just means that a patient they were seeing was also seen by the emergency room in the same time period. Referral relationships can be inferred from this data, but not presumed.
The last issue is the question of patient data. There is no data specific to patients in this data set. This is “better” than deidentified patient data, since the patient is essentially entirely abstracted out of the equation. As a result this data release is very unlikely to have patient privacy implications.
So with that controversy avoided, let's talk about the “other” privacy drama: doctor privacy.
Some doctors will view the release of this dataset as a violation of their privacy. This is human nature: it is easy to view previously not “public” information as “private”. In reality this data was “opaque”, neither really open nor really closed. It was available to anyone who wanted to do enough digging, and very few people are willing to do that work. Doctors who bill Medicare (which is optional) are government contractors. They have an obligation in that role to the American public, and FOIA applies to them just as much as it does to Boeing or Lockheed Martin or any other government contractor. The FOIA law has always made this data available; all I am doing is making access to this data convenient. Using FOIA as a vehicle, I am taking the raw data that CMS has and putting it into a format that is easy to process and understand. It is a tribute to people like Todd Park and Aman Bhandari and, for that matter, Barack Obama, that this FOIA request was processed at all. I expect that previous administrations might have been unwilling to respond to a FOIA request that was “creative” in the way that this request was.
I think openness and accountability are good things, both from the government and from doctors. If we think accountability is good, then it is worth asking “Who should doctors be accountable to?” The insurance companies already know this data (or could if they wanted to). The pharmacy chains can all calculate this graph by watching prescription patterns, which is data that they frequently sell to the drug companies. The government obviously can see this data, which is how I was able to acquire it. Competent reporters can gumshoe this information about particular groups of doctors. While all of these groups will benefit from having simple access to this data, none of them were truly denied it before.
Which means that the only people who were blind to these patterns are patients and, ironically, doctors themselves. Healthcare is a strange industry where consumers have less information about their caregivers than most other parties. Normally an organization like the Consumers Union can step in to ensure that consumers are fully informed. There are good organizations, including Consumers Union and ProPublica (whose advice I have found invaluable), that attempt to get the data required to take this role. But all of them are crippled because they lack access. This data release is not enough to completely solve this problem, but it is certainly a step in the right direction.
This is the whole point of my releasing, rather than hoarding, this data set: to enable patients to engage with their healthcare in a completely new way. There will be lots of uses of this data, but my most important project for this data is simple: I want to create algorithms to rate doctors that patients find useful and that doctors find fair. This is my primary aim with the release of this data set. Of course, there are other clever uses for this data:
* Doctors will be able to use the dataset to determine where to set up shop. For instance, if everyone in a small town is referring to a nephrologist in a large town hundreds of miles away, and the volume of those referrals is high enough, then a nephrologist can see the business case for moving to that smaller town. Currently doctors pay tens of thousands of dollars for somewhat-less-than-completely-subjective information about the right place to start a new practice. Perhaps we could do better with this data set.
* Insurance companies can overlay this data with their own data to get a very accurate picture of how their own doctors operate. This is especially helpful for any kind of capitated care model. Insurance companies pay hundreds of thousands of dollars to massage their own data, and now they can essentially double their insights for trivial dollars.
* For Accountable Care Organizations, it is possible to use this as a map for which specialists to pursue for capitation contracts. Trying to force primary care doctors to work with people they are not familiar with is a waste of time, but getting them to merely consolidate on specialists that they are already comfortable with will be much easier. We are going to be tooling up for this specific use case and other ACO-specific open projects with our new ACO Wrangler project(s).
* For almost any Health IT startup, this dataset is a sales map. It will show which doctors in which cities are the most well-connected and central figures. This is especially true for startups with a network of doctors component, like Sermo, Doximity, or Tap Health.
* For hospitals, this data can be overlaid with critical patient safety data like hospital readmission rates, in order to see what referral patterns are connected with poor coordination of care among outpatient facilities. This data set should enable them to detect the difference between doctors who fail to coordinate care for no reason in particular and those who fail to coordinate care because they have minor relationships with ten different hospitals, rather than major relationships with two.
* For government officials, this data can be used to make a map of which healthcare organizations are actually delivering care in a community. This might help to break up monopolies that can be detected and validated in the data.
* Lawyers will find lots and lots of evidence of, well, whatever they are looking for evidence of. There will be lots of lawsuits that reference this data. Preferably not including me personally.
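Several of the uses above, the sales map in particular, come down to asking which providers are the most connected. One simple way to approximate that is to add up the weights of every connection a provider has. The sketch below does this with made-up provider labels and counts; weighted degree is just one possible measure of "central" that I picked for illustration, and the row layout is hypothetical.

```python
from collections import Counter

# Hypothetical rows: (provider_a, provider_b, shared_patient_count).
edges = [
    ("A", "B", 30),
    ("A", "C", 20),
    ("B", "C", 5),
    ("D", "A", 10),
]

def weighted_degree(edge_rows):
    """Sum the connection weights touching each provider."""
    totals = Counter()
    for a, b, weight in edge_rows:
        totals[a] += weight
        totals[b] += weight
    return totals

# Providers ordered from most to least connected.
ranking = weighted_degree(edges).most_common()
```

Here provider "A" comes out on top because it shares patients with every other provider. More refined measures from graph theory would likely give a better picture on the real data.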
Of course, I hope to make lots of money consulting around this data. But I also want to create an ecosystem around the data, which means no hoarding.
I want other data scientists, who are more familiar with graph theory, to start to analyze this data set. My goal is to get the best data possible into the hands of everyone who wants it, at rates they can afford. Those that cannot afford anything at all will simply have to wait… eventually this will become an entirely open dataset. This is the reason that the “Open Source Eventually” data is so cheap: by using a viral Creative Commons license, I want researchers to be forced to publish the merger of this graph data and whatever else they are “sitting on”.
This is the heart of the “dual licensing” model we are using. If you want to join the open community of researchers and data scientists who will be improving the public version of the data set, $100 buys you exclusive early access. If you want to leverage this data in your business, then you can get a copy without the viral open source license for $1000. This way, those who profit from closed-door analysis are helping fund those of us who are willing to innovate in the open.
This data set is very likely the largest named social graph, of any kind, that will be publicly available. Facebook, Twitter and LinkedIn all have much larger data sets, but they only let small slices of that social graph outside of their walls. I cannot think of another graph data set of this size that shows how real entities partner together with this level of granularity. The core NPI data set (the National Provider Identifier is the “key” for this data set) already has physical addresses for all of the several million entities that it encodes. This means that referral and teaming patterns can be studied along with any other data set that is also Geo-encoded. If a data scientist is interested in graphs of any kind, then this data should be attractive.
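As a rough sketch of what studying teaming patterns alongside location data could look like, the example below joins hypothetical graph rows to a hypothetical NPI-to-city lookup and totals the shared-patient volume between different cities. The identifiers, city names, counts and field layout are all invented for illustration; the real NPI file has its own format.

```python
from collections import Counter

# Hypothetical lookup from NPI to practice city (invented values).
npi_city = {
    "1000000001": "Smallville",
    "1000000002": "Metropolis",
    "1000000003": "Smallville",
}

# Hypothetical rows: (npi_a, npi_b, shared_patient_count).
edges = [
    ("1000000001", "1000000002", 40),
    ("1000000003", "1000000002", 25),
    ("1000000001", "1000000003", 10),
]

def cross_city_volume(edge_rows, city_of):
    """Total connection weight between each pair of distinct cities."""
    volume = Counter()
    for a, b, weight in edge_rows:
        city_a, city_b = city_of.get(a), city_of.get(b)
        # Skip rows with an unknown location or within a single city.
        if city_a is None or city_b is None or city_a == city_b:
            continue
        volume[tuple(sorted((city_a, city_b)))] += weight
    return volume

volume = cross_city_volume(edges, npi_city)
```

A high total between a small town and a distant city is the kind of signal behind the "where to set up shop" use case described earlier.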
But I want to make the data even better. Specifically, I want to make it better in a way that enables my favorite use case, the development of objective, fair and useful doctor rating systems. In order to do that, the most valuable “opaque” data sets are available from the various state-level medical boards. Each medical board sets its own price for this download (usually more than $50 and usually less than $500) and chooses its own data format for the download. This makes it extremely difficult for the average data scientist to get an accurate picture of a doctor's history. In order to get this data today, you either have to submit to a very painful process, or you have to pay someone a tremendous amount of money. This prevents an extensive ecosystem of data scientists from solving real-world problems with this information. This data is often called “credentialing” data, because it forms the basis of something like a “credit history” for doctors. Insurance companies want to be sure that they are not hiring flagrantly bad doctors, and so they require doctors to provide the details of their state-level practice history in order to prove that they are competent and safe. I want to make this credentialing data into something everyone can see. Interestingly, even doctors struggle to track their own data as they travel between states. Pretty much no one is satisfied with the current confused situation regarding doctor data.
I want to fix that. As a result, I am asking for 150 people to give me $100, for a total of $15k. With that money I can afford to buy all of the state medical board data (which includes information about medical schools, certifications and any board-imposed punishments) and merge it into a single massive doctor database. Then I will give that database back to all of the sponsors. It's a simple equation: pay a little money and get lots and lots of data that would cost you either months of life or hundreds of thousands of dollars or both. Please consider pitching in.
Of course, you could decide to wait. However, if you do, know that this data will never be cheaper than it is during this initial Medstartr. The participants in this Medstartr will be given permanent and significant discounts on all future releases of doctor data. We are planning on raising the price for the proprietary-friendly version of this data set from $1000 to $3000 immediately after the Medstartr ends.
For those of you who are not that into the data itself, but generally support the notions of open data in society, we have pretty awesome limited edition artwork and t-shirts available. This is an interesting experiment for us: is this kind of radical openness something that the community at large will financially support? Can you crowdfund open data sets? We are using a combination of “Open Source Eventually” and “Dual Licensing” in order to ensure that those that contribute get the most value out of the dataset. But we are surprised at how many people have decided to pay way too much money to buy a t-shirt in order to support this idea.
This is the first public effort of the Health IT Not Only For Profit Micro-Incubator NotOnly Development. If you would like to be kept informed about our future patient skunkworks projects, be sure to visit NotOnlyFor.com. We plan to make money at these projects but we also want to create a space where real change is just as important as the bottom line. There are lots of projects that are just too complex and small for non-profit foundations to take the time to wrap their heads around. Sometimes, these projects will make money, but not enough to be worth creating a normal for-profit company around. This development shop is being formed to address this “doughnut hole” for innovative Health IT projects. I may have to get a “real job” after a while, but until that happens, this is pretty much the best work ever.
If you have time, you might enjoy watching the keynote of this data release at Strata RX:
Lastly, here is an amateurish screencast that describes what the data is, specifically, and how it can be used.