Revealed: Secret PIIs in your Unstructured Data!

By Andy Green, technical content specialist at Varonis.

Monday, 6th May 2013. Posted by Phil Alsop.

Personally identifiable information, or PII, is pretty intuitive. If you know someone’s phone number, social security number, or credit card number, you have a direct link to their identity. Hackers use these identifiers, along with a few more personal details, as keys to unlock data, steal identities, and ultimately take your money. But the lines between PII and non-PII are blurring. It’s been known for at least 10 years that there are specific pieces of data that may appear anonymous but, taken together, are just as effective at identifying a person as traditional PII.

The easiest to understand of these so-called quasi-PIIs is the trio of full birth date, post code, and gender. If a company published a dataset that had been “de-identified” by removing all the standard PIIs, but left those three data items alone, a smart hacker could very likely find the name and address of the person behind that data.

Why would this work? At a very basic level, the identity thief is doing the work of a detective: going through lists looking for matches. The lists in this case are voting records, which in the US are available from most towns and counties for a nominal fee, typically around $40. In the UK this information is free. Voting records contain name, address, and most importantly full birth date; post codes can be easily determined from the address. By looking for matching birth dates and post codes, identity thieves narrow down the search to a few names. Add gender information and, for most post codes, hackers can arrive at a unique name. Of course, the more additional information or clues they gather, especially from social media and other web sites, the easier it is to narrow down the names when there’s more than one candidate.
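To make the detective work concrete, here is a minimal Python sketch of that matching step. Everything in it is made up for illustration: the names, dates, post codes, the field names, and the `candidates` helper are all hypothetical, and it assumes the attacker has a gender value for each voter (from the roll itself or from some other source).

```python
# Hypothetical "de-identified" records: names removed, quasi-identifiers kept.
released = [
    {"birth_date": "1961-03-14", "post_code": "02138", "gender": "F", "diagnosis": "asthma"},
    {"birth_date": "1975-11-02", "post_code": "02139", "gender": "M", "diagnosis": "diabetes"},
]

# Hypothetical voter roll. The gender field is an assumption made for this sketch.
voter_roll = [
    {"name": "Alice Example", "birth_date": "1961-03-14", "post_code": "02138", "gender": "F"},
    {"name": "Bob Sample", "birth_date": "1975-11-02", "post_code": "02139", "gender": "M"},
    {"name": "Carla Sample", "birth_date": "1975-11-02", "post_code": "02139", "gender": "F"},
    {"name": "Dana Placeholder", "birth_date": "1980-07-30", "post_code": "02138", "gender": "F"},
]

QUASI_IDS = ("birth_date", "post_code", "gender")


def candidates(record, roll, fields):
    """Return the names on the roll that agree with the record on every given field."""
    return [voter["name"] for voter in roll
            if all(voter[f] == record[f] for f in fields)]


for record in released:
    # First pass: birth date and post code only.
    rough = candidates(record, voter_roll, ("birth_date", "post_code"))
    # Second pass: add gender to narrow the list further.
    narrowed = candidates(record, voter_roll, QUASI_IDS)
    print(record["diagnosis"], "| date + post code:", rough, "| plus gender:", narrowed)
```

In this toy data the second record matches two voters on birth date and post code, and adding gender leaves just one name attached to the "anonymous" diagnosis.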

Take the US, for example. A quick back-of-the-envelope calculation tells you why one might do very well with this approach. Taking 365 days (ignoring leap years) and multiplying by a lifespan of roughly 80 years, it works out that a complete birth date gives 29,200 “bins” in which to place a zip code’s worth of US citizens. If you have gender information, you double the number of slots, to a little over 58,000.

I can hear the nitpickers out there saying that voting rolls contain only the names of those aged 18 and over, so you would have to remove 6,570 birth-date bins (18 years’ worth). True enough, but researchers have shown it’s possible to exploit Facebook’s leaky handling of data on school age minors to partially address this gap.

In any case, based on the last US census, there are over 40,000 zip codes, with an average of only about 7,000 people per zip code. On a gut level, it seems there’s a good chance most of those 7,000 people will find themselves alone in one of those 58,000 slots. In other words, the odds are that most of them won’t share the same date of birth, zip code, and gender.
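The gut feeling can be checked with a few lines of Python. This is only a rough sketch: it assumes every person lands in a slot uniformly at random, which real populations do not do (ages and birthdays are not evenly spread), and the constant and function names are my own.

```python
DAYS_PER_YEAR = 365      # leap years ignored, as in the text
LIFESPAN_YEARS = 80      # the text's assumed span of ages
VOTING_AGE = 18
GENDERS = 2
PEOPLE_PER_ZIP = 7000    # the text's average zip code population

birth_date_bins = DAYS_PER_YEAR * LIFESPAN_YEARS   # 29,200
minor_bins = DAYS_PER_YEAR * VOTING_AGE            # 6,570
slots = birth_date_bins * GENDERS                  # 58,400


def prob_alone(people, slots):
    """Chance that one given person shares a slot with nobody else,
    assuming all people are placed uniformly and independently."""
    return (1 - 1 / slots) ** (people - 1)


p = prob_alone(PEOPLE_PER_ZIP, slots)
print(f"{slots:,} slots; chance of being alone in a slot: {p:.3f}")
```

Under that uniform assumption the chance of being alone in a slot comes out a little under 0.89, so the intuition that most people in an average zip code are unique on these three items holds up, at least to a first approximation.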

Carnegie Mellon computer science professor and data privacy expert Latanya Sweeney ran the numbers back in 2000: using then-current census data (broken down by zip codes and age groups), she was able to show that 87% of the people in the US could be identified using just those three non-PIIs.

Piecing the information together is even easier in the UK, as a post code will often cover little more than a single street.

Fortunately, Sweeney’s research and results from other experts have made their way to policy makers. For example, when medical research on US patients is published, HIPAA’s Safe Harbor de-identification rules say that no geographic unit smaller than a state can be included in the public data. Dates tied to an individual (e.g., admission, birth) must also be reduced to the year alone.

Because US regulations on PII vary from one piece of legislation to the next, this is by no means a universal rule. However, the Federal Trade Commission, an influential regulatory agency on privacy matters, has recently issued new best practices on data de-identification. They’ve called for all companies to achieve a “reasonable level of confidence” that their public data can’t be linked back to an individual. Clearly, the combination of birth date, zip code, and gender would fail that test.
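One simple way to get a feel for that kind of test is to count how many records in a dataset are the only one with their particular combination of values. The Python sketch below does this on a handful of made-up records. The `unique_fraction` helper, the field names, and the coarser fields used for comparison (birth year and state) are my own illustrative choices, not anything the FTC prescribes.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records that are the only one with their combination of the given fields."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


# Hypothetical records, each carrying both fine-grained and coarsened fields.
records = [
    {"birth_date": "1961-03-14", "birth_year": 1961, "zip": "02138", "state": "MA", "gender": "F"},
    {"birth_date": "1961-09-01", "birth_year": 1961, "zip": "02139", "state": "MA", "gender": "F"},
    {"birth_date": "1975-11-02", "birth_year": 1975, "zip": "02139", "state": "MA", "gender": "M"},
    {"birth_date": "1975-02-20", "birth_year": 1975, "zip": "01002", "state": "MA", "gender": "M"},
]

fine = unique_fraction(records, ("birth_date", "zip", "gender"))
coarse = unique_fraction(records, ("birth_year", "state", "gender"))
print(f"unique on birth date + zip + gender: {fine:.0%}")
print(f"unique on birth year + state + gender: {coarse:.0%}")
```

In this toy data every record is unique on the full trio, while none is unique once the fields are coarsened. A real review would run the same kind of count over the actual columns a company plans to publish.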

Are there other quasi-PIIs out there? Of course! The larger problem is that consumers are sharing all kinds of information about themselves on web sites and social forums. As one possible scenario, think of an online retailer collecting preference data about its customers (sports interests, hobbies, etc.) along with geographic data and perhaps income information.

These data items would not be considered traditional PII. If hackers pulled this “anonymous” data from a poorly permissioned file on a server, you could imagine them mining various special interest sites, looking for names that match up based on those interests and geo data. Once they have a match, the next step might be a phishing attack, with the hackers pretending to be the retailer.

For companies that want to stay ahead of the coming stricter de-identification rules, which are being considered in the US and will likely become law in the EU, it would be worth their while to start carefully reviewing their non-PII data, wherever that data might be on their file system.