Anonymisation and Pseudonymisation
Obscuring or removing personal data from datasets allows information to be used more widely and may be particularly useful to the University in the areas of research and management information and reporting. Obscuring or hiding the personal data elements can be achieved in a number of different ways depending on the nature of the data and the need to use or share it. The common terms are anonymisation and pseudonymisation, which are described as:
- Anonymisation is the “process of rendering data into a form which does not identify individuals and where identification is not likely to take place”
- Pseudonymisation is the “processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person”.
Data protection legislation does not apply to data that has been rendered anonymous in such a way that the data subjects are no longer identifiable. However, complete anonymisation is not always possible, and the potential for that anonymity to be undone needs to be considered when using or disclosing the information. If identification is reasonably likely, the information should be regarded as personal data. The ICO uses the term ‘re-identification’ (sometimes known as ‘de-anonymisation’), which is described as the “process of analysing data or combining it with other data with the result that individuals become identifiable”.
Anonymisation is most effective when considered early in the data lifecycle as it supports data protection principles and the requirement for ‘Data Protection by Design and Default’. Anonymisation is particularly useful in research and when disclosing data outside the institution, but can also be used internally to protect personal data and reduce the risks of inappropriate use or disclosure.
- Anonymisation Techniques
- Identification and Re-identification
- Prior Knowledge
- The ‘Motivated Intruder’
- Creating Personal Data From Anonymised Data
- Spatial Information
- Publication vs Limited Access
- Freedom of Information
- Further Reading
Anonymisation Techniques
Several anonymisation techniques are in common use. Their differing attributes and risks will guide the choice of one over another. Common techniques include:
- Data masking, where personal data elements are removed to create a dataset where personal identifiers are not present. Partial data removal is a typical example of this but carries higher risks. Depending on the information removed, it may be possible for the original dataset to be recreated using other data, for example using the Electoral Register to determine individuals based on date of birth and postcode data in a dataset, or postcode data that relates to very low numbers of dwellings in a postcode area. Data quarantining is another variation, whereby data is supplied to a person who is unlikely or unable to have access to the additional data required to aid re-identification. This is harder to rely on in a connected world, as it is difficult to know with any certainty what information the recipient may be able to access.
- Aggregation is a common example of anonymisation used for data analysis where data is displayed as totals rather than individual values. Low value totals are often excluded completely or may be grouped together to produce a larger group, for example a survey that results in a group of fewer than five data subjects would not be reported. This is a common approach with staff surveys where other data from the survey responses could identify a single person within a department. Within aggregation, there are different ways of achieving anonymisation, partly depending on the size of dataset and the degree of accuracy required for the results. Typically this is not a useful approach for research at an individual level, but is good for large scale analysis, such as modelling people movements or social trends. It is generally low risk from a disclosure point of view (although the source data remains high-risk) as it is intentionally difficult or impossible to relate any results to a particular individual.
- Derived data uses values of a less granular nature, typically through ‘banding’, to hide the exact values, for example using a year of birth instead of date of birth. This is a lower-risk technique because data matching is more difficult. This means that the data can be relatively rich and still useful for individual-level research but with lower risks for re-identification.
- Pseudonymisation is the practice of extracting and replacing actual data with a coded reference (a ‘key’), where there is a means of using that key to re-identify an individual. This approach is typically used where the use of the data needs to relate to individual records, but also needs to retain security and privacy for that individual. Pseudonymised data carries a higher privacy risk and security of the key is essential. Because the data is not truly anonymised, personal data that has been pseudonymised can fall within the scope of data protection legislation depending on how difficult it is to attribute the pseudonym to a particular individual.
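Three of the techniques above (derived data, aggregation with small totals withheld, and pseudonymisation) can be illustrated with a minimal Python sketch. This is an illustration only, not a University-endorsed implementation: the function and class names, the sample identifier and the use of random tokens as coded references are all assumptions made for the example. The threshold of five follows the staff survey example above.

```python
import secrets
from collections import Counter
from datetime import date


def year_of_birth(dob: date) -> int:
    """Derived data: replace an exact date of birth with the year only."""
    return dob.year


def suppressed_counts(groups, threshold=5):
    """Aggregation: total per group, withholding (None) totals below the threshold."""
    counts = Counter(groups)
    return {group: (n if n >= threshold else None) for group, n in counts.items()}


class Pseudonymiser:
    """Pseudonymisation: swap identifiers for random coded references.

    The lookup tables held here are the 'additional information' that
    must be kept separately and securely from the pseudonymised dataset.
    """

    def __init__(self):
        self._code_for = {}  # identifier -> coded reference
        self._id_for = {}    # coded reference -> identifier

    def encode(self, identifier: str) -> str:
        """Return the coded reference for an identifier, creating one if needed."""
        if identifier not in self._code_for:
            code = secrets.token_hex(8)  # random, so it reveals nothing by itself
            self._code_for[identifier] = code
            self._id_for[code] = identifier
        return self._code_for[identifier]

    def reidentify(self, code: str) -> str:
        """Recover the identifier; only possible for whoever holds the lookup table."""
        return self._id_for[code]


# Hypothetical usage
pseudonymiser = Pseudonymiser()
record = {
    "ref": pseudonymiser.encode("s1234567"),        # made-up identifier
    "born": year_of_birth(date(1984, 3, 17)),       # 1984
}
departments = ["Physics"] * 6 + ["Music"] * 2
print(record["born"], suppressed_counts(departments))
```

Anyone holding only the output records sees a random reference and a year of birth. Re-identification requires the lookup table, which is why its security is essential.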
Identification and Re-identification
Identification is achieved in one of two basic ways:
- Direct identification, where only a single data source is necessary to identify an individual.
- Indirect identification, where two or more data sources need to be combined to allow an individual to be identified.
The difficulty in determining whether the anonymised data you hold or wish to publish is personal data lies in not knowing what other information is available to a third party that might allow re-identification to take place. This requires a case-by-case judgement of the data. Absolute certainty that no individual would ever be identifiable from the anonymised data cannot be guaranteed, but this does not necessarily mean that personal data would be disclosed. UK case law has determined that the risk of identification must be greater than remote, and reasonably likely, for the data to be personal data as defined in legislation.
Re-identification may be possible from information held by other organisations or that is publicly available, such as through an internet search. This can be achieved by trying to match a record from an anonymised dataset with other information, or by trying to match personal data already held with a record in an anonymised dataset. Whilst the former is often seen as the more likely scenario, both have a similar result. Re-identification risk can be mitigated by employing data minimisation principles and only disclosing the anonymised data necessary for the purpose.
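One rough way to gauge how exposed a dataset is to this kind of matching is to count how many records share each combination of potentially identifying fields. A record that is unique on those fields is the easiest to match against other sources. The sketch below is a simple uniqueness check of the kind used in k-anonymity assessments. That technique is not described in this guidance, and the field names and sample rows are hypothetical.

```python
from collections import Counter


def _keys(records, fields):
    """Build the tuple of field values that each record would be matched on."""
    return [tuple(record[f] for f in fields) for record in records]


def smallest_group_size(records, fields):
    """Size of the smallest group of records sharing the same field values.

    A result of 1 means at least one record is unique on these fields.
    Returns 0 for an empty dataset.
    """
    return min(Counter(_keys(records, fields)).values(), default=0)


def unique_records(records, fields):
    """Records whose combination of field values appears only once."""
    keys = _keys(records, fields)
    counts = Counter(keys)
    return [record for record, key in zip(records, keys) if counts[key] == 1]


# Hypothetical sample data
rows = [
    {"year_of_birth": 1980, "sector": "AB1 2"},
    {"year_of_birth": 1980, "sector": "AB1 2"},
    {"year_of_birth": 1975, "sector": "AB1 2"},
]
print(smallest_group_size(rows, ["year_of_birth", "sector"]))
print(unique_records(rows, ["year_of_birth", "sector"]))
```

Dropping or coarsening a field (data minimisation) and re-running the check shows whether the smallest group grows. A small group does not by itself prove that re-identification is reasonably likely, so the result informs the case-by-case judgement rather than replacing it.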
The risks of re-identification can change over time, particularly as more information becomes available online, computing power increases and data analysis tools become available to the consumer market. Data that is currently anonymous may not remain that way, so it is important that anonymisation policies, procedures and techniques are regularly reviewed. It is often difficult to totally remove publicly available information, so once the anonymised data is published, it may not be possible to recover it to prevent data re-identification.
Identification relies on more than making an educated guess that information is about someone in particular. Making an educated guess as to someone’s identity may present a privacy risk but not a data protection one where no personal data has actually been disclosed to the one making the guess. Even an accurate guess based on anonymised data does not mean that personal data has been disclosed. However, the impact of guesswork should be considered throughout the anonymisation and publication process. There have been many cases of mistaken identification through guesswork, with individuals blamed for things they did not do, for example. This is particularly likely when guesswork is combined with ‘prior knowledge’.
Prior Knowledge
If an individual knows a lot about another individual, re-identification is a good possibility although the same would not be possible for an ordinary member of the public. Consider family members, work colleagues and professional roles such as doctors. Such people might be able to learn something new about the data subject from the anonymised data, perhaps through confirmation of existing suspicions.
When considering prior knowledge, assumptions should not be made of what individuals may already know, even among family members. Professionals are likely to be covered by confidentiality and ethical conduct rules and are less likely to fit the role of ‘motivated intruder’ with some gain to be made from use of the knowledge.
When considering releasing anonymised data, assess:
- The likelihood of individuals having and using prior knowledge to aid re-identification. This may not be possible for individual data subjects in any dataset, so a more general assessment may be required. Consider even whether those individuals would see or seek out the published information.
- The impact of re-identification on the data subjects. Again this may not be possible at an individual level, but could be inferred from the data sensitivity.
The ‘Motivated Intruder’
The ‘motivated intruder’ forms the basis of a test used by the ICO and by the Tribunal that hears DPA and FOI appeals. The ‘motivated intruder’ is a person who:
- Starts without any prior knowledge but who wishes to identify an individual from an anonymised dataset.
- Is reasonably competent
- Has access to resources such as the internet, libraries and all public documents
- Employs investigative techniques, including questioning people who may have additional knowledge of the individual.
- Is not assumed to have any specialist knowledge such as computer hacking skills, or to have access to specialist equipment, or to resort to criminality such as burglary in order to gain access to data that is kept securely.
The ‘motivated intruder’ is likely to be more interested in types of information that would support their ‘cause’, whether that is financial gain, political advantage, newsworthiness, activism or ‘hacktivism’, causing embarrassment to individuals, or even just curiosity about local events. Data with the potential to have a high impact on individuals is likely to attract a ‘motivated intruder’.
The test therefore goes beyond considering whether an inexpert member of the public can achieve re-identification, but not as far as a knowledgeable determined attacker with specialist expertise, equipment and potentially prior knowledge.
It is possible to replicate a ‘motivated intruder’ attempt on your anonymised data to test its potential adequacy. Consider:
- Using the edited or full Electoral Register to try to link anonymised data to someone’s identity
- Using social media to try to link anonymised data to a user’s profile
- Conducting an internet search to use combinations of data, such as date of birth and postcode, to identify an individual.
Third party organisations exist that are able to do this (subject to the necessary contractual controls, of course). They may have knowledge of and access to data resources, techniques or vulnerabilities that you are not aware of.
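An in-house rehearsal of the linkage attempts described above can be sketched in a few lines of Python. The sketch assumes you have a lawfully held reference list (standing in here for a source such as the Electoral Register) with fields in common with the anonymised data. All names, field names and values below are invented for illustration.

```python
def link_records(anonymised, reference, fields):
    """Pair each anonymised record with the reference entries that share
    the same values for the given fields."""
    index = {}
    for person in reference:
        key = tuple(person[f] for f in fields)
        index.setdefault(key, []).append(person)
    return [
        (record, index.get(tuple(record[f] for f in fields), []))
        for record in anonymised
    ]


def at_risk(anonymised, reference, fields):
    """Anonymised records that match exactly one reference entry.

    A single candidate suggests the record could be tied to one person.
    """
    return [
        record
        for record, matches in link_records(anonymised, reference, fields)
        if len(matches) == 1
    ]


# Hypothetical reference list and anonymised extract
register = [
    {"name": "A. Example", "dob": "1980-05-01", "postcode": "AB1 2CD"},
    {"name": "B. Sample", "dob": "1975-11-30", "postcode": "AB1 2CE"},
    {"name": "C. Sample", "dob": "1975-11-30", "postcode": "AB1 2CE"},
]
extract = [
    {"dob": "1980-05-01", "postcode": "AB1 2CD", "response": "yes"},
    {"dob": "1975-11-30", "postcode": "AB1 2CE", "response": "no"},
]
print(at_risk(extract, register, ["dob", "postcode"]))
```

Here the first extract record matches a single register entry and is flagged, while the second matches two entries and is not. A record matching several entries is harder to attribute but is not necessarily safe, particularly where prior knowledge could narrow the candidates.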
Creating Personal Data From Anonymised Data
Where a range of research and other activities take place in similar or overlapping areas, personal data can be created through the combination, analysis or matching of information, or information can become linked to existing personal data within the University.
This would require the University to fulfil its responsibilities under data protection legislation for that personal data, potentially starting with informing the individuals of the data processing, which may then become problematic where that processing was not expected by the individuals – more so if they then object.
Where an organisation collects personal data through re-identification without individuals’ knowledge or consent, it will be obtaining personal data unlawfully and could be subject to enforcement action.
Spatial Information
Information about a place can easily constitute personal data as it can be associated with an individual. This relates to electronic devices as much as buildings. Mobile devices such as smartphones contain and generate large amounts of spatial information and have been used effectively in traffic and travel surveys to map origins and destinations as well as through times. Unique identifiers, such as IP addresses, are now considered personal data in data protection legislation.
If trying to anonymise datasets using UK postcode information, the following may be useful to determine potential data groups of anonymised data:
- Full postcode – approximately 15 households, although some postcodes could relate just to a single property
- Postcode minus the last character – approx. 120-200 households
- Postal sector (outward code plus the first character of the inward code) – approx. 2,600 households
- Postal district (outward code only) – approx. 8,600 households
- Postal area (the leading one or two letters of the outward code) – approx. 194,000 households
A digital equivalent is removing the final ‘octet’ on IP addresses to degrade the location data they contain.
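The postcode groupings and the IP address truncation described above can be sketched in Python as follows. This is a minimal illustration that assumes well-formed UK postcodes, in which the inward code is always the final three characters, and IPv4 addresses. The function names and level labels are invented for the example, and "removing" the final octet is implemented here as setting it to zero.

```python
import ipaddress


def coarsen_postcode(postcode: str, level: str) -> str:
    """Reduce a UK postcode to a less precise unit.

    Levels: 'minus_last', 'sector', 'district', 'area'.
    Assumes a valid postcode whose inward code is the last three characters.
    """
    compact = postcode.replace(" ", "").upper()
    outward, inward = compact[:-3], compact[-3:]
    if level == "minus_last":
        return f"{outward} {inward[:2]}"
    if level == "sector":
        return f"{outward} {inward[0]}"
    if level == "district":
        return outward
    if level == "area":
        # The area is the run of letters at the start of the outward code.
        letters = ""
        for ch in outward:
            if not ch.isalpha():
                break
            letters += ch
        return letters
    raise ValueError(f"unknown level: {level}")


def truncate_ipv4(address: str) -> str:
    """Zero the final octet of an IPv4 address to degrade its location data."""
    network = ipaddress.IPv4Network(f"{address}/24", strict=False)
    return str(network.network_address)


# Illustrative values
print(coarsen_postcode("SW1A 1AA", "sector"))
print(coarsen_postcode("SW1A 1AA", "area"))
print(truncate_ipv4("192.0.2.77"))
```

The household figures listed above are approximations, so the coarsened values should still be checked against the actual group sizes in the dataset before release.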
Publication vs Limited Access
The more that anonymised data is aggregated and non-linkable, the more possible it is to publish it. Of course, not everything is intended for public disclosure and access to a smaller group of people may be intended. Pseudonymised data is often valuable to researchers because of the granularity it affords, but carries a higher risk of re-identification. Release of this data to a closed community is possible where a finite number of researchers or institutions are intended to have access to the data. Typically this access is controlled by restrictions in contracts and non-disclosure agreements. This allows more data to be disclosed than is possible with wider or public disclosure. Information security controls still need to be in place and managed.
Freedom of Information
The Freedom of Information (FOI) Act, under Section 40, includes a test for determining whether disclosure of personal data to a member of the public would breach the data protection principles. The University would need to consider the additional information that a particular member of the public might have or be able to access in order to combine the data to produce personal data – something that relates to and identifies a particular individual. It can be difficult to assess what other information may be available, so this is where the ‘motivated intruder’ test is useful.
Where an FOI request is refused under Section 40, it may be possible to release some information in an anonymised form that would satisfy the requestor and avoid an appeal and review process. The Information Governance Unit can advise in this regard.
Further Reading
Should you wish to explore any of these topics in more detail, the ICO provides a comprehensive guide to understanding anonymisation. Although this was written around the Data Protection Act, as a guide it remains relevant and useful.