Cisco Blogs

Entropy and Reidentification

September 17, 2009

Paul Ohm’s recent paper about the failures of anonymization brought to light some very compelling arguments against the practice. The goal of anonymization is to remove personally identifying details from a dataset without removing its usefulness. As an example, a company might strip names, social security numbers, day and month of birth, street addresses, and credit card information from its customer dataset, but leave purchase history. Such an anonymized dataset might help a marketing partner identify trends across generalized demographics and make more effective decisions when marketing products to future and returning customers.

Ohm’s paper highlighted some recent failures involving publicized datasets that had been anonymized to protect individual identity. Organizations have recognized that privacy is important and the current legislation has supported anonymization as an appropriate method to reduce the risk of identification while allowing collected data to be shared. However, Ohm concludes that both the results of anonymization and the protections afforded by leading legislative efforts have not lived up to expectations of effectiveness. If the research holds true, this could significantly alter privacy regulation as we know it.

One case Ohm describes involved the failure of Netflix’s anonymization efforts to prevent individual reidentification. Researchers were able to take the anonymized Netflix dataset and extract individual identities using movie ratings. By combining the Netflix dataset with publicly available data from IMDb, the researchers were able to match patterns in movie ratings from Netflix with movie reviews on IMDb. This led to significant certainty in correlating individual identities between the two datasets. The point is that while one dataset may seem anonymous (e.g., the movie rental histories of individuals), if another source exists for correlating this data (e.g., movie recommendations posted elsewhere), the initial set may be vulnerable to reidentification.
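To make the idea of correlating two datasets concrete, here is a minimal sketch in Python. It is not the method the researchers used; it simply counts identical ratings shared between an "anonymized" record and a public profile. Every identifier, title, rating, and the `min_overlap` threshold below is invented for illustration.

```python
# Toy illustration only: all IDs, profile names, and ratings are invented.
anonymized = {
    "user_0417": {"Brazil": 5, "Heat": 3, "Gattaca": 4, "Clue": 2},
    "user_0932": {"Heat": 4, "Alien": 5, "Clue": 1},
}
public_reviews = {
    "alice_reviews": {"Brazil": 5, "Gattaca": 4, "Clue": 2},
    "bob_reviews": {"Alien": 5, "Heat": 2},
}


def overlap(a, b):
    """Count the titles that both records rate identically."""
    return sum(1 for title, rating in a.items() if b.get(title) == rating)


def link_records(anonymous, public, min_overlap=3):
    """Pair each anonymous ID with the public profile that shares the most
    identical ratings, but only if the overlap reaches min_overlap."""
    links = {}
    for anon_id, ratings in anonymous.items():
        best = max(public, key=lambda name: overlap(ratings, public[name]))
        if overlap(ratings, public[best]) >= min_overlap:
            links[anon_id] = best
    return links


links = link_records(anonymized, public_reviews)
print(links)
```

In this toy data only one anonymous record shares enough ratings with a public profile to be linked. A real attack would have to tolerate small differences in ratings and dates, and would weigh obscure titles more heavily than popular ones, since a shared rating of a rarely watched movie narrows the field far more.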

This creates some possible legislative problems. The challenge of balancing the benefits of releasing information against the potential harm caused by identification has been handled by protecting those who anonymize. But if anonymization is not effective, then the laws established for protecting personally identifying information (PII) are significantly weakened. Under HIPAA (the Health Insurance Portability and Accountability Act), anonymized health records data is not regulated; under the European Union’s Data Protection Directive (DPD), organizations must follow careful handling, processing, and storage guidelines for any data that can directly or indirectly identify individuals.

The key to this reidentification is entropy. In information theory, entropy represents the amount of uncertainty about an unknown value; here, the unknown value is the identity of the person behind a record. The goal of anonymization is to raise the entropy of a dataset as high as possible without reducing the usefulness of the resulting anonymized set. But by correlating multiple datasets, it is possible to identify patterns that reduce the entropy, in many cases to the point where individuals can be identified with high mathematical probability. The more information that can be gathered to compare and combine, the more likely it is that identity can be extracted.
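A rough worked example shows how quickly that uncertainty can shrink. The sketch below measures uncertainty in bits: picking one person uniformly from a population of N carries log2(N) bits, and learning an attribute value shared by a fraction p of the population removes -log2(p) bits. The population size, the attribute names, and their fractions are all made-up assumptions, and the attributes are assumed to be statistically independent, which real data rarely is.

```python
import math


def bits_to_identify(population):
    """Entropy, in bits, of choosing one person uniformly from a population."""
    return math.log2(population)


def bits_revealed(fraction_sharing_value):
    """Bits of uncertainty removed by learning an attribute value that
    this fraction of the population shares."""
    return -math.log2(fraction_sharing_value)


population = 1_000_000  # hypothetical dataset size
remaining = bits_to_identify(population)

# Hypothetical attributes, assumed independent of one another.
clues = {"birth year": 1 / 64, "postal region": 1 / 256, "gender": 1 / 2}

for name, fraction in clues.items():
    remaining -= bits_revealed(fraction)
    print(f"after {name}: {remaining:.2f} bits left")

# Expected number of people still consistent with every clue.
candidates = 2 ** remaining
print(f"about {candidates:.1f} candidates remain")
```

Under these assumptions, three innocuous-looking attributes remove 15 of the roughly 20 bits needed to single out one person in a million, leaving about 30 candidates. One or two further clues from a second dataset would be enough to close the gap.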

For both HIPAA and the DPD, the threat of reidentification through entropy reduction makes a compelling case that anonymization could be subject to spectacular failures. Under HIPAA, no regulations apply to data that is considered effectively protected; yet while the HIPAA rules exempt anonymized data, this research suggests that anonymization is not a suitable safeguard. Under the DPD, all data that could result in the reidentification of an individual must be protected, including things like movie reviews that do not appear, at first glance, to be personally identifying.

Yet one could make the case that beyond some as-yet-undefined threshold, any data that reduces the entropy of a dataset could be personally identifying. The potential for reidentification through the presence of “sufficient” unique data would greatly increase the burden of data protection under the DPD. Defining that threshold, or the amount of entropy loss that counts as sufficient, could be very difficult. Since a given data owner cannot possibly know what all other data owners collect about an individual, it will be nearly impossible to judge which items could be leveraged when different datasets are combined for reidentification.

Ohm notes that anonymization is not the only method for legislatively balancing the benefits of information sharing with the need for privacy. However, the expectation with anonymization was that the burden on organizations would be relatively light compared to other methods. If privacy efforts and regulations adapt to these discoveries, the resulting controls for privacy protection may be much more difficult to implement than the practices that are generally accepted today.
