So this is really interesting: the more data Medicare releases on provider payments, the better. But there are real concerns about patient privacy here. I remember when I was a wee undergraduate at CMU, Professor Fienberg was working on how to release raw data from the Census Bureau without revealing personally identifiable information. You can imagine the problem like this: in towns like the one in Vermont where I grew up, publishing what “the average black person” earns could well mean that you’ve just revealed John Smith’s income; there just aren’t that many black people in Vermont.

As I understood it at the time (and note that my understanding is many years out of date), the Census Bureau had a couple of ways of protecting the multidimensional contingency tables it released. First, it would publish a given cell only if the number of observations in that cell was above some threshold (that is, only if the cell didn’t uniquely identify John Smith). Second, I believe it also applied some scaling factor to every cell, deliberately obfuscating the values so that summary statistics computed from the table would still come out right even though the individual cell values were all incorrect.
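
To make the first of those ideas concrete, here is a minimal sketch of threshold-based cell suppression. It is only my illustration of the general idea, not the Census Bureau's actual procedure: the function name `suppress_small_cells`, the default threshold of 5, and the toy records are all made up for the example.

```python
from collections import Counter


def suppress_small_cells(records, keys, threshold=5):
    """Tabulate records by `keys`; hide any cell with fewer than `threshold` rows.

    Hypothetical helper for illustration only. A suppressed cell is
    reported as None rather than as its true count.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


# Toy data (invented): one person in town A belongs to group "y".
records = (
    [{"town": "A", "group": "x"}] * 12
    + [{"town": "A", "group": "y"}] * 1
    + [{"town": "B", "group": "x"}] * 7
)

table = suppress_small_cells(records, ["town", "group"], threshold=5)
for cell, n in sorted(table.items()):
    print(cell, "suppressed" if n is None else n)
```

Here the cell for town A, group "y" holds a single person, so it is withheld while the other two cells are published. One caveat I am fairly sure of: if row or column totals are published alongside the table, a suppressed cell can sometimes be recovered by subtraction, so suppression alone is not the whole story.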

These problems get harder if you’re able to combine, say, Census data with data that you get from credit-card companies or data from (as above) hospitals. The more data you can agglomerate, the less anonymous any one source is, no matter how hard you try. I’m sure there are lots of people, all around the country, working very hard to de-anonymize various databases for marketing and law-enforcement purposes.
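
A rough sketch of why combining sources is so corrosive: if two datasets share a few innocuous-looking fields, you can join on them and attach names to records that were supposedly anonymous. Everything below is hypothetical (the `link` function, the field names, the ZIP codes, the people), and real linkage work is surely messier than an exact-match join.

```python
def link(anon_rows, public_rows, quasi_ids):
    """Attach a name to each anonymous row that matches exactly one public row.

    Hypothetical illustration: matching is exact on the fields in `quasi_ids`.
    """
    index = {}
    for p in public_rows:
        index.setdefault(tuple(p[q] for q in quasi_ids), []).append(p)

    matches = []
    for a in anon_rows:
        candidates = index.get(tuple(a[q] for q in quasi_ids), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append({**a, "name": candidates[0]["name"]})
    return matches


# Invented "anonymized" hospital records: no names, but some demographics.
hospital = [
    {"zip": "00001", "birth_year": 1950, "sex": "M", "diagnosis": "condition-1"},
    {"zip": "00002", "birth_year": 1980, "sex": "F", "diagnosis": "condition-2"},
]

# Invented second source that carries names and the same demographics.
public = [
    {"name": "John Smith", "zip": "00001", "birth_year": 1950, "sex": "M"},
    {"name": "Resident B", "zip": "00002", "birth_year": 1980, "sex": "F"},
    {"name": "Resident C", "zip": "00002", "birth_year": 1980, "sex": "F"},
]

reidentified = link(hospital, public, ["zip", "birth_year", "sex"])
for row in reidentified:
    print(row["name"], row["diagnosis"])
```

In this toy case John Smith is the only person with his combination of ZIP code, birth year, and sex, so his hospital record is pinned to him. The second record stays ambiguous because two residents share its combination, which is the same small-cell problem as before in a different guise.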

Point being just that, while releasing raw Medicare data would be terrific (the AMA’s comment in that link that people wouldn’t know what to do with all that raw data, and would take it out of context, is thoroughly disingenuous), there are difficult problems to surmount first. I wish them luck. I should check to see where Professor Fienberg’s work has taken him; the last update I got was more than a decade ago.