Today, word came out that NY released taxi data that has been entirely reidentified.
The technique and concepts used to conduct the attack can be found here, and I also found the Slashdot discussion interesting.
The result is that the identities and paths of specific named taxi cabs are now public information. This is not entirely bad, since the data set will now be used extensively to detect specific bad actors. Still, it was more than the NY government intended and will probably result in a lawsuit.
That lawsuit will be mostly justified, since security professionals understand well how to do de-identification right, and the rules were not followed. If you are doing this with health data, I can recommend fellow O’Reilly author Khaled El Emam, who wrote both Anonymizing Health Data and Guide to the De-Identification of Personal Health Information. You can hire him through Privacy Analytics. He is the de-identification expert I know best and can endorse, but he is far from the only one.
Generally, hashing can be a reasonable approach as long as salts are used (and kept secret) in combination with a secure hash algorithm. I prefer to use a different salt for every id, which makes a rainbow table attack (like this one) pretty hard to do.
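To make the point about salts concrete, here is a toy sketch in Python. It is not the format of the released data or the exact attack: the four-digit id space and the names ID_SPACE, unsalted_hash, reidentify and SaltedPseudonymizer are all assumptions made up for illustration. It shows why an unsalted hash over a small id space can be inverted with a precomputed table, and how a private, per-id salt gets in the way.

```python
import hashlib
import secrets

# Assumption for illustration only: ids are four-digit strings "0000".."9999".
ID_SPACE = [f"{n:04d}" for n in range(10_000)]


def unsalted_hash(cab_id: str) -> str:
    """Hash an id with no salt, so anyone can recompute it."""
    return hashlib.sha256(cab_id.encode("utf-8")).hexdigest()


# The attack: hash every possible id once, then invert by dictionary lookup.
LOOKUP = {unsalted_hash(cab_id): cab_id for cab_id in ID_SPACE}


def reidentify(digest: str):
    """Return the original id if the digest is in the precomputed table."""
    return LOOKUP.get(digest)


class SaltedPseudonymizer:
    """Hash each id with its own random salt; the salts are never released."""

    def __init__(self):
        self._salts = {}  # cab_id -> secret salt, kept by the publisher

    def pseudonym(self, cab_id: str) -> str:
        if cab_id not in self._salts:
            self._salts[cab_id] = secrets.token_bytes(16)
        salt = self._salts[cab_id]
        return hashlib.sha256(salt + cab_id.encode("utf-8")).hexdigest()
```

With this sketch, reidentify(unsalted_hash("1234")) hands back "1234", while a pseudonym from SaltedPseudonymizer is not in the table at all. The same cab still maps to the same pseudonym within a release, so trips can be grouped by cab. The protection depends on the salts staying private: a published salt over an id space this small can be brute-forced just as easily.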
More importantly, it is also entirely appropriate to simply use a randomly generated number instead of a hash. Hashes are convenient when you need to rely on a dynamic and extensible process, rather than static data. Hashing also allows you to throw away the original data, and know that you can reliably repeat the process given new data. That is why hashing is used so frequently in password storage.
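The random-number alternative can be sketched in a few lines of Python. Again, this is only an illustration under assumed names: build_pseudonym_table, pseudonymize and the "cab_id" field are hypothetical, not anything from the actual release. Each original id gets an unrelated random token, and the mapping table is the static data that stays with the publisher.

```python
import secrets


def build_pseudonym_table(ids):
    """Assign each distinct id a random token that has no link to the id itself."""
    table = {}
    used = set()
    for original in ids:
        if original in table:
            continue
        token = secrets.token_hex(8)
        while token in used:  # extremely unlikely, but keep tokens unique
            token = secrets.token_hex(8)
        used.add(token)
        table[original] = token
    return table


def pseudonymize(rows, table, key="cab_id"):
    """Return copies of the rows with the id field swapped for its token."""
    return [{**row, key: table[row[key]]} for row in rows]


# Hypothetical sample rows, just to show the shape of the data.
rows = [
    {"cab_id": "A1", "fare": 10},
    {"cab_id": "B2", "fare": 5},
    {"cab_id": "A1", "fare": 7},
]
table = build_pseudonym_table(row["cab_id"] for row in rows)
released = pseudonymize(rows, table)
```

Because the tokens are random rather than computed from the id, there is nothing to brute-force. The tradeoff is the one described above: the table has to be kept around (and kept private) if later releases need to use the same pseudonyms.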
Unfortunately, this will have a chilling effect on open data releases, but I am glad it happened. This is a relatively unimportant data set, which is to say this could have been much worse. This could have happened with patient data. I work with stuff like HIV and TB infection data, as well as EHR notes containing infidelities etc. I hate to say it, but it's better for governments to learn on taxi cabs.
Lastly, I would encourage those who are considering doing data releases like this to reach out to organizations like ProPublica and/or DocGraph. If you cannot afford to hire Khaled, we can at least help to ensure that you avoid the basic mistakes. Believe it or not, data journalists like myself are not interested in violating legitimate privacy rights (although we can have a healthy debate around the word “legitimate”) and we would be more than happy to help ensure that a data release is free from reidentification drama.
Part of me wonders why they didn’t just release the taxi data with the taxi numbers intact. I strongly prefer real-name accountability in data sets like this. Perhaps it is because, by learning the identity of the taxi, you might be able to infer the identity of the passenger, who has a legitimate privacy concern.
Accidents like this will happen, and NY was right to make a release rather than hold back a release because there “might” be a way to reidentify a data set. My hat is off again to NY state/city… innovators in open data.
-Fred Trotter