More Data, More Problems: Why Data Minimization Should Be a First Step Before De-identification, Anonymization, or Similar Methods
Data collection is expected to grow exponentially in the coming decade; current market estimates place the compound annual growth rate from 2021 to 2028 at 25.6 percent. Social media and the Internet of Things have expanded data collection at a rapid pace, and data processing becomes more valuable to businesses every year.
The personally identifiable information (PII) of average consumers is one of the most valuable forms of data in this booming market. While amassing large stores of personal information is lucrative, it comes with several challenges for both businesses and consumers. The legal landscape for data privacy is largely unsettled: state, federal, and transnational laws intersect with different standards that make compliance a hassle. Data collection also poses a risk to the individual privacy of consumers. In the event of a data breach, thousands if not millions of consumers could have their information stolen, leaving them susceptible to identity theft, fraud, and abuse.
Given these risks, it is easy to see how, when it comes to data, more is not necessarily better. Businesses have a moral obligation, and often a legal obligation, to effectively manage these risks. Data minimization, de-identification, and pseudonymization are popular strategies used to create more secure data. Effective data management requires companies to fully consider the benefits and risks of these strategies, incorporating the most appropriate administrative and technological controls to provide meaningful data privacy for processing activities. This article discusses these data management strategies in detail, examining how companies should employ data management tools to protect against the risks of unruly data.
What is Data Minimization?
Data minimization is a data management practice that limits the collection and storage of personal information. A company using data minimization retains only data that is directly relevant and necessary to accomplish a specific purpose. For instance, a business committed to data minimization can limit the scope of data collection so that it only keeps categories of data that are truly valuable to business operations. Even after collection, businesses can practice data minimization by eliminating data that no longer has any value. Useful data often has a short lifespan, being necessary for discrete or contemporaneous purposes that quickly become obsolete. Keeping such data on a company's servers poses a significant, unnecessary risk in the event of a data breach. Data minimization is a best practice companies can use to reduce this risk.
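As a minimal sketch of what this can look like in practice, the following snippet enforces two hypothetical minimization rules at the point of storage: an allowlist of fields tied to a documented purpose, and a retention window after which records are purged. The field names and the 365-day window are illustrative assumptions, not recommendations.

```python
from datetime import datetime, timedelta

# Hypothetical allowlist: only fields with a documented business purpose
# are ever stored. Field names are illustrative.
ALLOWED_FIELDS = {"order_id", "email", "order_total", "created_at"}

# Hypothetical retention window for this category of records.
RETENTION = timedelta(days=365)

def minimize_record(raw: dict) -> dict:
    """Keep only fields that serve a documented purpose."""
    return {k: v for k, v in raw.items() if k in ALLOWED_FIELDS}

def purge_stale(records: list, now: datetime) -> list:
    """Drop records older than the retention window."""
    return [r for r in records if now - r["created_at"] <= RETENTION]

raw = {
    "order_id": 1001,
    "email": "jane@example.com",
    "order_total": 59.90,
    "created_at": datetime(2021, 6, 1),
    "browser_fingerprint": "abc123",  # no documented purpose: never stored
}
stored = minimize_record(raw)                              # drops the fingerprint
current = purge_stale([stored], now=datetime(2022, 9, 1))  # past retention: []
```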
In a sense, de-identification may be considered a form of data minimization, since it removes unnecessary identifying information from the data set. However, systematic minimization goes beyond de-identification and requires companies to truly consider the actual purpose of retaining data. Lean, well-understood data inventories are generally preferable to unruly, maximalist data sets. Data minimization ultimately minimizes risk while maximizing the usefulness of valuable data. Data retained "just in case" is more likely to lead to legal liabilities than to add value to a business's overall operations. Effective data management necessitates minimization, particularly as data collection continues to grow exponentially.
Data minimization has become particularly important in recent years because several new data privacy laws require it. Personal data regulated under the EU's GDPR must be "adequate, relevant, and not excessive in relation to the purpose or purposes for which they are processed." This means collection and retention of personal data must be minimized to data that fulfills a specific business purpose. While earlier US data privacy laws like HIPAA and GLBA do not require data minimization, Virginia's new CDPA has taken a similar approach to the GDPR. §59.1-574 of the CDPA requires a controller to "limit the collection of personal data to what is adequate, relevant, and reasonably necessary in relation to the purposes for which such data is processed, as disclosed to the consumer." This requires a one-to-one parity between any data collected and the processing activities disclosed in a privacy policy. California's recent amendment to the CCPA, the CPRA, introduces a data minimization requirement in Section 1798.100, prohibiting the retention of personal information for longer than reasonably necessary. As consumers grow more interested in how their data is being used, states are bound to follow Virginia and Europe's example by requiring some sort of data minimization standard to reduce the risks inherent to maximalist data collection approaches. To comply with these laws and effectively manage a data inventory, companies must integrate data minimization as a key principle for data processing activities.
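The "one-to-one parity" idea can be made concrete with a small audit: every field a company collects should map to a processing purpose disclosed in its privacy policy. The sketch below assumes a hypothetical field-to-purpose mapping; all names are illustrative, not drawn from any statute.

```python
# Hypothetical mapping of collected fields to the processing purposes
# disclosed in the privacy policy. All names are illustrative.
DISCLOSED_PURPOSES = {
    "email": "order confirmation and account login",
    "shipping_address": "order fulfillment",
    "order_history": "customer support",
}

def audit_collection(collected_fields: set) -> set:
    """Return collected fields that have no disclosed processing purpose.

    Under a GDPR- or CDPA-style minimization rule, each flagged field
    should either gain a disclosed purpose or stop being collected.
    """
    return collected_fields - DISCLOSED_PURPOSES.keys()

# 'device_id' is collected but appears nowhere in the policy.
print(audit_collection({"email", "shipping_address", "device_id"}))
# -> {'device_id'}
```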
What is De-identification?
De-identification is a process that removes personal identifiers from stored data sets to protect individual privacy. This can take many forms, but generally de-identification is designed to effectively eliminate direct identifiers (name, address, contact numbers, account numbers, etc.) as well as indirect identifiers (age, health conditions, race, gender, uncommon characteristics, etc.) from existing data sets. This method can preserve useful data for research purposes while getting rid of sensitive information that would put individuals at risk in the event of a breach. De-identification is also used extensively because de-identified data is often exempt from data privacy regulations.
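In practice, de-identification combines dropping direct identifiers with coarsening indirect ones so that what remains is harder to tie to a single person. The sketch below is one minimal way to express this, with illustrative field names and generalization rules loosely in the spirit of the HIPAA Safe Harbor (exact age to a ten-year band, five-digit ZIP to its first three digits).

```python
# Hypothetical record and rules; field names are illustrative.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "account_number"}

def deidentify(record: dict) -> dict:
    """Remove direct identifiers and coarsen indirect ones."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in out:                      # exact age -> ten-year band
        out["age_band"] = f"{out.pop('age') // 10 * 10}s"
    if "zip" in out:                      # five-digit ZIP -> first three digits
        out["zip3"] = out.pop("zip")[:3]
    return out

patient = {"name": "Jane Doe", "phone": "555-0100",
           "age": 47, "zip": "21201", "diagnosis": "J45"}
print(deidentify(patient))
# -> {'diagnosis': 'J45', 'age_band': '40s', 'zip3': '212'}
```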
De-identification is particularly important for companies to understand because it may provide a safe harbor for processing activities under certain data privacy laws. Under the Health Insurance Portability and Accountability Act (HIPAA), de-identified data may be exempt from regulation where regulated entities have eliminated individually identifiable health information and the remaining data presents a low risk of re-identification. There are two accepted methods to demonstrate a low risk of re-identification. Under the first method, an expert must determine, using accepted statistical and scientific principles, that the risk of re-identification is low. Under the second method, known as the HIPAA Safe Harbor provision, entities must eliminate 18 categories of identifiers and have no actual knowledge that residual information could identify individuals.
Similar to HIPAA, the Gramm-Leach-Bliley Act (GLBA) privacy rule exempts financial information from regulation where such information cannot be used to identify a consumer. This can be achieved by aggregating consumer information or eliminating personal identifiers such as account numbers, names, or addresses. Under the Family Educational Rights and Privacy Act (FERPA), de-identified educational records that no longer contain direct and indirect identifiers may be disclosed for research purposes without prior consent from parents or students.
De-identification may have its benefits, but companies should also be aware of the potential risks. The data left over after de-identification may still be specific enough to an individual that it can be re-identified when cross-referenced with another data set, as the sketch below illustrates. The more variables eliminated from a company's data set, the lower the risk of re-identification, but this presents another problem: de-identification can reduce the value of a company's data by eliminating useful variables that contain identifying information. Additionally, de-identification can make compliance with data privacy laws more difficult in some instances. When a consumer asserts their right to have a company delete their data under the GDPR or CCPA, de-identification prior to the request can make it difficult to associate the stored data with the individual making the request.
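This cross-referencing risk is often called a linkage attack: quasi-identifiers that survive de-identification are matched against another data set that still carries names. The toy example below uses entirely fabricated data to show how an exact match on just three residual attributes can re-identify a record.

```python
# Toy illustration of a linkage attack; all data below is fabricated.
# A "de-identified" medical table still carries quasi-identifiers
# (zip3, birth_year, gender) that also appear in a public roster.
deidentified_rows = [
    {"zip3": "212", "birth_year": 1975, "gender": "F", "diagnosis": "J45"},
    {"zip3": "212", "birth_year": 1988, "gender": "M", "diagnosis": "E11"},
]
public_roster = [
    {"name": "Jane Doe", "zip3": "212", "birth_year": 1975, "gender": "F"},
    {"name": "John Roe", "zip3": "210", "birth_year": 1988, "gender": "M"},
]

QUASI_IDENTIFIERS = ("zip3", "birth_year", "gender")

def link(rows, roster):
    """Re-identify rows whose quasi-identifiers match exactly one person."""
    for row in rows:
        key = tuple(row[q] for q in QUASI_IDENTIFIERS)
        matches = [p for p in roster
                   if tuple(p[q] for q in QUASI_IDENTIFIERS) == key]
        if len(matches) == 1:
            yield matches[0]["name"], row["diagnosis"]

print(list(link(deidentified_rows, public_roster)))
# -> [('Jane Doe', 'J45')]
```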
Companies must balance the risks inherent in holding on to identifiers against the value those identifiers create in processing activities. This balance is key to every data privacy program. Companies should use de-identification in a way that best fits their business goals, accounting for the level of risk they are willing to take on when identifiers are valuable to data processing.
What is Pseudonymization?
Pseudonymization is a process in which direct identifiers are transformed or replaced by a unique pseudonym to protect the personal privacy of a data subject. Often this method requires the entity to keep additional information separately in order to re-identify the data subject. The most common method of pseudonymization is a form of encryption, where sensitive data is altered by an algorithm. To access the original sensitive information, a processor must provide a "key" that reverses the transformation. Unlike de-identification, pseudonymization generally does not affect indirect identifiers.
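As one minimal sketch of this idea (an assumption on our part, not a description of any particular product), the snippet below replaces a direct identifier with a stable keyed-hash pseudonym and treats the secret key and re-identification map as the "separately kept additional information" described above; a real deployment would store these apart from the data set, under much stricter access controls.

```python
import hmac
import hashlib
import secrets

# Hypothetical setup: the key and the re-identification map must live
# separately from the pseudonymized data set, with restricted access.
KEY = secrets.token_bytes(32)
reidentification_map = {}

def pseudonymize(direct_identifier: str) -> str:
    """Replace a direct identifier with a stable keyed pseudonym."""
    token = hmac.new(KEY, direct_identifier.encode(),
                     hashlib.sha256).hexdigest()[:16]
    reidentification_map[token] = direct_identifier  # kept separately
    return token

def reidentify(token: str) -> str:
    """Authorized lookup using the separately kept map."""
    return reidentification_map[token]

record = {"patient": pseudonymize("jane.doe@example.com"), "diagnosis": "J45"}
# The data set now holds only the pseudonym; staff with access to the
# separately stored map can reverse the substitution when authorized.
assert reidentify(record["patient"]) == "jane.doe@example.com"
```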
Pseudonymization can address some of the issues discussed above that companies and researchers face when de-identifying data. Re-identification with the key can be regulated through a mix of administrative and technical measures that ensure only those who require access to re-identified data have the tools to obtain it. This helps preserve valuable data while keeping unwanted eyes from seeing sensitive data. It can also assist compliance with data privacy regulations by allowing the controller to re-identify data needed to process a consumer request. However, just like de-identified data, pseudonymized data may be re-identified if data sets are cross-referenced. The key that undoes pseudonymization also presents another cybersecurity challenge: if an unwanted party gains access to the key, the pseudonymization has failed to protect personal data. Strong physical and administrative security measures are needed to ensure that keys do not fall into the wrong hands.
Pseudonymization can be an important tool for research and compliance purposes. Re-identification is sometimes needed to assist operations, and pseudonymization can provide a layer of security that allows processors to retain useful PII. Unlike de-identification, however, pseudonymization does not establish a safe harbor for data processing and will not shield a company from compliance with data privacy regulations. Under the GDPR, pseudonymized data is still subject to all requirements if it meets the definition of personal data. Pseudonymization can, however, be used by a company as a method to demonstrate appropriate technical and organizational security measures under Article 32.
Are De-identification and Pseudonymization Enough?
Generally, these strategies will not be enough to effectively protect consumer data. In a 2015 MIT study analyzing credit card metadata, researchers found that four random pieces of information on any given data subject were enough to re-identify 90 percent of shoppers as unique individuals, even after steps were taken to anonymize the data. A 2013 study from the same researchers found that only four spatio-temporal points of location data were needed to re-identify 95 percent of individuals. Neither de-identification nor pseudonymization can perfectly prevent re-identification, because an individual's data tends to be so unique that miscellaneous variables become identifiers.
These methods may lead some companies to believe they have done enough to protect data privacy, but that is a false sense of security: it assumes the methods work perfectly. Protecting data through de-identification is not always possible, and therefore it should not be used as a replacement for systematic data minimization. Companies that rely solely on forms of de-identification and pseudonymization will have too much data, more than they can meaningfully manage. Minimization prevents purposeless data from becoming a liability to the company. Even when companies employ de-identification and pseudonymization, they should also practice data minimization. Using these strategies together strengthens a company's data governance, allowing it to meet the moral and legal obligations that come with data collection and processing.
Can Privacy-Preserving or Privacy-Enhancing Technologies (PETs) Address These Issues?
Promising PETs such as homomorphic encryption and differential privacy are maturing faster than ever, and they can be greatly beneficial in protecting sensitive data. For example, homomorphic encryption allows a computer to perform computations on encrypted data without requiring decryption. Since the data never needs to be decrypted, this process protects sensitive information from exposure even when data sets are shared among different teams. There are compelling use cases in medicine, education, and other fields, but it is naïve to assume we will be able to apply these techniques to all data or to large data lakes. In the past, the complexity and high cost of implementing and maintaining security technologies have resulted in minimal adoption; the limited end-to-end adoption of public key cryptography is a good example. While innovation in PETs is crucial and progressing, following data minimization best practices should be the first step before implementing PETs.
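To make the homomorphic property concrete, here is a deliberately insecure toy: textbook RSA with tiny primes happens to be multiplicatively homomorphic, so multiplying two ciphertexts yields a ciphertext of the product of the plaintexts. Real homomorphic encryption schemes and libraries are far more sophisticated; this sketch only illustrates the core idea of computing on data without decrypting it.

```python
# Toy illustration only: textbook RSA with tiny primes is NOT secure.
# It shows the core idea of homomorphic encryption: computing on
# ciphertexts without ever decrypting them.
p, q = 61, 53
n = p * q       # modulus 3233
e = 17          # public exponent
d = 2753        # private exponent: (17 * 2753) % 3120 == 1

def encrypt(m: int) -> int:
    return pow(m, e, n)

def decrypt(c: int) -> int:
    return pow(c, d, n)

c1, c2 = encrypt(7), encrypt(6)

# Multiply the two ciphertexts; neither plaintext is ever exposed.
c_product = (c1 * c2) % n

# Decrypting the product of ciphertexts yields the product of plaintexts.
assert decrypt(c_product) == 7 * 6  # 42
```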
Conclusion
To reduce regulatory risk, a company's data privacy program must incorporate the privacy-by-design principle at the point of collection. While de-identification and pseudonymization are important data protection tools, they should not be used to compensate for the risks of unnecessary data collection and retention. Storing all the data your company collects "just in case" presents huge risks to your business and a great incentive for hackers to find weaknesses in your systems. Neither de-identification nor pseudonymization is a perfect strategy to prevent re-identification.
Data minimization can shrink a company's data footprint, which lessens the impact of a data breach on businesses and consumers alike. Data minimization is a practice businesses must put in place even if they are already using another strategy to prevent identification of stored personal data. Minimization is more than a best practice; it is a standard that allows companies to show due diligence and a commitment to protecting consumer data.
About Ardent Privacy
Ardent Privacy is an "Enterprise Data Privacy Technology" solutions provider based in the Maryland/DC region of the United States and Pune, India. Ardent harnesses the power of AI to provide companies with data discovery and automated compliance with DPB (India), RBI Security Guidelines, GDPR (EU), CCPA/CPRA (California), and other global regulations by taking a data-driven approach. Ardent Privacy's solution utilizes machine learning and artificial intelligence to identify, inventory, map, minimize, and securely delete data in enterprises to reduce legal and financial liability.
For more information, visit https://ardentprivacy.ai/.
Ardent Privacy articles should not be considered legal advice on data privacy regulations or any other specific facts or circumstances.