The number and complexity of data protection regulations have grown rapidly in recent years. The passage and enactment of laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have put new requirements in place for organizations processing the personal and protected data of their constituents.
These new regulations have driven a need for strong but usable data anonymization techniques. Under these laws, organizations must generally obtain explicit consent from data subjects for data processing activities. However, this requirement does not apply if the data has been properly anonymized or de-identified.
The challenge with this requirement is that it is often extremely difficult to completely anonymize data once it contains enough features. Research from Imperial College London demonstrated that 99.98% of Americans could be correctly re-identified in a dataset containing 15 demographic attributes, even when that dataset was anonymized with techniques in common use today.
Pros and Cons of Data Anonymization Techniques
Achieving and maintaining regulatory compliance requires an understanding of the abilities and limitations of different data anonymization techniques. Generally, applying a greater level of anonymization and de-identification requires sacrificing a certain level of usability of the data. Different techniques make different tradeoffs, making it essential to understand the pros and cons of each technique available in order to make the right choice for a given application or use case.
- Data Masking
Data masking is designed to completely protect certain values within a database. Any information considered sensitive (such as credit card numbers or addresses) is encrypted or replaced with a value that gives no clue to the real one (such as a string of asterisks).
Data masking is effective at protecting the privacy of masked data, but it can be challenging to determine which data should be masked. Aggregation of unmasked data values (like zip codes, ages, gender, etc.) can be used to uniquely identify an individual, removing the benefits of data masking.
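As a minimal sketch (the record fields and card format here are illustrative assumptions, not a prescribed schema), masking might preserve only the last four digits of a card number:

```python
import re

def mask_card_number(card_number: str) -> str:
    """Replace all but the last four digits of a card number with asterisks.

    The input format is an assumption; real systems would validate it first.
    """
    digits = re.sub(r"\D", "", card_number)  # strip spaces, dashes, etc.
    return "*" * (len(digits) - 4) + digits[-4:]

record = {"name": "James Miller", "card": "4111 1111 1111 1234"}
record["card"] = mask_card_number(record["card"])
print(record["card"])  # -> ************1234
```

Note that only the `card` field is transformed; the rest of the record, including potentially identifying quasi-identifiers, is left untouched, which is exactly the weakness described above.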
- Pseudonymization
Pseudonymization replaces a piece of sensitive or unique data, like a name, with a randomly selected alternative. For example, the name James Miller could be replaced with Kevin Smith, but the other data associated with the user’s record remains unchanged.
Pseudonymization is useful since it preserves the underlying statistics and distributions of the data, making it useful for analysis. However, it can be difficult to design a pseudonymization scheme that is compliant with data protection regulations since poorly anonymized (or pseudonymized) data can be de-anonymized given access to sufficient features.
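A sketch of a consistent pseudonym mapping, assuming a small illustrative pool of replacement names (a production scheme would need a much larger pool and careful key management for the mapping table):

```python
import random

class Pseudonymizer:
    """Assign each real name a stable, randomly chosen pseudonym.

    The replacement-name pool is purely illustrative; it will raise
    IndexError if more distinct names are requested than the pool holds.
    """

    def __init__(self, pool, seed=None):
        self._pool = list(pool)
        random.Random(seed).shuffle(self._pool)
        self._mapping = {}

    def pseudonym(self, real_name):
        # The same real name always maps to the same pseudonym.
        if real_name not in self._mapping:
            self._mapping[real_name] = self._pool.pop()
        return self._mapping[real_name]

p = Pseudonymizer(["Kevin Smith", "Dana Lopez", "Priya Shah"], seed=1)
print(p.pseudonym("James Miller"))  # e.g. Kevin Smith; all other fields stay unchanged
```

Because only the name changes, every other attribute keeps its original distribution, which is both the strength (analytical utility) and the weakness (re-identification via the remaining features) noted above.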
- Generalization
Generalization is designed to remove some of the more identifying features of data while leaving the overall data unchanged. For example, an exact age may be replaced with a range of ages, or the house number may be removed from an address. Generalization is useful for preserving the overall distributions of data, but it only makes deanonymization more difficult, not impossible. Access to sufficient generalized features could still enable someone to deanonymize data.
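Both examples above can be sketched as simple transformations (the bucket size and address format are illustrative assumptions):

```python
def generalize_age(age: int, bucket: int = 10) -> str:
    """Replace an exact age with a range; 10-year buckets are an arbitrary choice."""
    lo = (age // bucket) * bucket
    return f"{lo}-{lo + bucket - 1}"

def generalize_address(address: str) -> str:
    """Drop a leading house number, keeping only the street name.

    Assumes simple 'number street' formatting for illustration.
    """
    parts = address.split(" ", 1)
    return parts[1] if len(parts) == 2 and parts[0].isdigit() else address

print(generalize_age(37))                    # -> 30-39
print(generalize_address("42 Elm Street"))   # -> Elm Street
```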
- Data Swapping
Data swapping involves leaving data values unchanged but associating them with a different record than the original record. This raises the difficulty of deanonymizing a particular record since certain data values are deliberately misleading. Done properly, data swapping can provide a good balance between data anonymization and usability. The overall counts of certain values are still accurate (i.e. a swapped dataset contains the same set of birthdays as the original) but data is not associated with the true record. If analysis is designed to take this into account, useful insights can still be gleaned from the data.
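A minimal sketch of swapping a single column across records (the field names and values are illustrative):

```python
import random

def swap_column(records, field, seed=None):
    """Return copies of the records with one field's values randomly
    permuted across records.

    The overall multiset of values is preserved, but each value is
    detached from its original record.
    """
    rng = random.Random(seed)
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]

people = [
    {"name": "James Miller", "birthday": "1990-01-01"},
    {"name": "Ana Silva", "birthday": "1985-06-15"},
    {"name": "Chao Kim", "birthday": "2000-12-31"},
]
swapped = swap_column(people, "birthday", seed=7)
# The set of birthdays is unchanged, but their owners are shuffled.
```

Counting queries over the swapped column (e.g. birthdays per month) remain exact, which is the utility property described above.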
- Data Perturbation
Data perturbation applies small modifications to a dataset, such as rounding values or adding random noise, making deanonymization more difficult. Since the operation performed can depend on the (unknown) original value (e.g., whether a number was rounded up or down), undoing the effects of perturbation requires more work and access to external sources of data.
With well-chosen perturbation algorithms and parameters, this technique can leave data in a state where statistical analysis still yields usable results. Since most analysis works with age ranges anyway, recording a user's age as 25 rather than 30 may make little difference to the results while significantly improving anonymity and regulatory compliance.
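A sketch of bounded-noise perturbation on an age field (the noise bound of ±3 is an illustrative parameter choice, not a recommendation):

```python
import random

def perturb(value, max_noise=3, rng=None):
    """Add bounded uniform integer noise to a numeric value, clamped at zero.

    In practice the noise distribution and bounds would be chosen to
    balance anonymity against the statistics the analysis needs.
    """
    rng = rng or random
    return max(0, value + rng.randint(-max_noise, max_noise))

ages = [25, 42, 67]
noisy = [perturb(a) for a in ages]  # each value shifts by at most 3
```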
- Synthetic Data
Synthetic data is generated to be realistic but have no true basis in reality. Synthetic data can be designed to fit statistical trends but removes any unique details of customer records. In general, synthetic data has limited utility for statistical analysis since any trends that exist in the data were probably known and built in when the dataset was being created. However, for applications like software testing and quality assurance, where realism is more important than accuracy, synthetic data can be invaluable since it removes all potential security and privacy concerns.
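A toy generator along these lines (the fields and value pools are illustrative assumptions; real synthetic-data tools model much richer statistical structure):

```python
import random

# Illustrative value pools; a real generator would draw from larger,
# statistically representative sources.
FIRST = ["Ana", "Ben", "Chao", "Dina"]
LAST = ["Kim", "Okafor", "Rossi", "Silva"]

def synthetic_customer(rng):
    """Generate one realistic-but-fictional customer record.

    No record corresponds to a real person, so there is nothing to
    de-anonymize.
    """
    return {
        "name": f"{rng.choice(FIRST)} {rng.choice(LAST)}",
        "age": rng.randint(18, 80),
        "zip": f"{rng.randint(0, 99999):05d}",
    }

rng = random.Random(42)
test_customers = [synthetic_customer(rng) for _ in range(100)]
```

Records like these are well suited to seeding a test database, even though any aggregate statistics simply reflect the generator's own parameters.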
Choose Wisely
With the new regulatory landscape, data anonymization is essential to ensure that an organization can operate effectively without becoming non-compliant. However, it is often challenging to balance the needs for realistic data with the need to protect customer privacy and security.
In some cases, where data realism is essential, organizations will have no choice but to seek customer consent for processing. In others, fully synthetic data may be appropriate, removing security concerns while enabling normal processing. In all cases, organizations will need to carefully consider their choice of data anonymization algorithm to ensure that it fits their needs but also complies with regulatory requirements.