Anonymized data plays a crucial role in making our world a better place, from advancements in healthcare to customer service enhancements. However, research published today reveals that it is all too simple to identify specific individuals by reverse engineering such data.
Data scientists from Imperial College London and UCLouvain demonstrated, in a study published today in the journal Nature Communications, that machine learning can defeat standard anonymization techniques and re-expose the sensitive personal data of almost any individual, even from incomplete datasets.
This means that companies that purchase anonymized personal information, sold for use in AI projects, market research, and other applications, can reverse engineer it to identify individuals without ever seeing the original, pre-anonymization data.
Businesses can also use anonymized data to build increasingly in-depth personal profiles of individuals without their knowledge.
How simple it is to reverse engineer anonymized data

This study stands out because it is the first to show just how simple it is to reverse engineer anonymized data.
Using the machine learning model they developed, the data scientists were able to correctly re-identify 99.98% of Americans in any anonymized dataset from just 15 characteristics, such as age, gender, and marital status.
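To get a feel for why so few attributes suffice, consider the toy sketch below (Python with pandas; the attribute names and uniform random distributions are invented for illustration and are not the study's data or its model). It measures how quickly records become unique as more quasi-identifiers are combined:

```python
# Toy sketch: how record uniqueness grows with the number of
# quasi-identifying attributes. Attributes and distributions are
# invented; this is NOT the study's data or generative model.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # synthetic "population"

df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "gender": rng.integers(0, 2, n),
    "zip3": rng.integers(0, 1000, n),      # first three ZIP-code digits
    "marital_status": rng.integers(0, 5, n),
    "birth_month": rng.integers(1, 13, n),
    "num_children": rng.integers(0, 5, n),
    "car_color": rng.integers(0, 12, n),
})

# For each prefix of k attributes, count the records whose attribute
# combination appears exactly once in the dataset.
for k in range(1, len(df.columns) + 1):
    cols = list(df.columns[:k])
    group_sizes = df.groupby(cols, sort=False).size()
    unique_records = int((group_sizes == 1).sum())
    print(f"{k} attributes -> {unique_records / n:.1%} of records are unique")
```

Even in this crude simulation, a handful of coarse attributes makes most records unique; real demographic attributes are far more unevenly distributed, which helps explain why 15 of them are enough to single out nearly everyone.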
Additionally, the researchers developed an online tool that enables users to assess how easily their own data could be exposed through this approach.
The study's first author, Dr. Luc Rocher, from UCLouvain, stated, "While there may be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on January 5, are driving a red sports car, and live with two kids (both girls) and one dog."
This means that businesses that already have some information about a person can use supposedly anonymous datasets to unlock additional details about them, building a far more complex and disturbingly detailed profile.
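The linkage step itself is trivial. In the minimal sketch below (the dataset, column names, and values are all hypothetical, not drawn from the study), an attacker who knows a few facts about a target simply filters an "anonymized" dataset down to the matching record and reads off anything else it contains:

```python
# Minimal sketch of the linkage idea described above. The dataset,
# columns, and values are hypothetical.
import pandas as pd

# Pretend this was purchased as an "anonymized" dataset (names removed,
# but quasi-identifiers and a sensitive field left in).
anonymized = pd.DataFrame({
    "age":       [34, 34, 57, 34],
    "gender":    ["M", "M", "F", "M"],
    "city":      ["New York", "New York", "Boston", "New York"],
    "birth_day": ["Jan 5", "Mar 12", "Jul 30", "Jan 5"],
    "car_color": ["red", "blue", "red", "green"],
    "diagnosis": ["diabetes", "none", "asthma", "none"],  # sensitive
})

# Attributes the attacker already knows about the target.
known = {"age": 34, "gender": "M", "city": "New York",
         "birth_day": "Jan 5", "car_color": "red"}

matches = anonymized
for col, val in known.items():
    matches = matches[matches[col] == val]

if len(matches) == 1:
    # A unique match: every remaining column is newly learned information.
    print("Unique match found; newly learned sensitive field:")
    print(matches["diagnosis"].iloc[0])
else:
    print(f"{len(matches)} candidate records remain; add more attributes")
```

The study's contribution goes further than this naive filtering: its model also estimates how likely such a match is to be correct even when the dataset is incomplete.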
Are higher standards for anonymity required?
According to the researchers, the findings demonstrate that current methods for anonymizing data are not fit for purpose, adding weight to growing concerns about the practice that have so far been largely ignored.
Senior author Dr. Yves-Alexandre de Montjoye, from Imperial's Department of Computing and Data Science Institute, stated, "Companies and governments have downplayed the risk of re-identification by arguing that the datasets they sell are always incomplete."
"Our findings demonstrate that an attacker could easily and accurately estimate the likelihood that the record they found belongs to the person they are looking for," the statement reads. "This is contrary to this."
Dr. Julien Hendrickx, a co-author from UCLouvain, added, "We are often assured that anonymisation will keep our personal information safe."
"Our paper demonstrates that de-identification is far from sufficient to safeguard the privacy of people's data."
The study's findings therefore underscore the need for stricter regulations governing the handling and anonymization of personal data.
Hendrickx stated, "It is essential for anonymization standards to be robust and account for new threats such as the one demonstrated in this paper."
De Montjoye added, "The goal of anonymization is so we can use data to benefit society."
"This is very important, but it shouldn't and shouldn't have to be done at the expense of people's privacy."