Does protecting sensitive data mean that you also need to compromise the performance of your machine learning model?

If you study machine learning in university, or take an online course, you will normally work with a set of publicly available datasets such as the Titanic Dataset, Fisher’s Iris Flower Dataset, or the Labelled Faces in the Wild Dataset. For example, you may train a face recognition model on a set of celebrity faces that are already in the public domain, rather than private or sensitive data such as CCTV images. These public datasets have often been around a very long time and serve as useful benchmarks that everyone agrees on.

However, in a commercial setting, we often have to train machine learning models from private or sensitive data. With the exception of new startups, large companies may have databases of extremely personal data, such as millions of people’s addresses, social security numbers, financial information, healthcare history, or more. Companies are extremely reluctant to allow access to this kind of data when it is not strictly necessary. This presents a problem for a data scientist who needs to train a model on sensitive information. Some machine learning projects are simple in a technical sense but complicated immensely by data protection obligations.

The challenges of sensitive data in AI

Let me present an example from our own portfolio. We recently developed an NLP model for a large client in the UK which receives thousands of customer emails per day. The client wanted an email triage system so that customers with some of the common queries (such as a change of home address) could be automatically directed towards a web form that would resolve their query. The plan was to reduce the workload of the customer service staff, and free up their capacity for non-routine queries.

An email triage machine learning AI system can only be trained with highly sensitive data.
How an AI email triage system works: incoming emails are classified according to customer intent, and those customers who can solve their problem with a simple web form are directed to the correct form. This is technically simple to build but tricky because of the sensitive data needed to train the model.

On the face of it, this problem is technically quite straightforward: you need to take a sample of emails, annotate them manually, and then train a text classifier and deploy it. However, due to data protection legislation, we needed to consider the following:

  • the users have not consented for their emails to be stored. If I manually annotate a training dataset (a time-consuming process), I cannot store it indefinitely. It must be deleted.
  • if under the GDPR’s right to be forgotten, a user can request that the organisation delete all of their personal data. If a user were to submit such a request, how would I track down all places that the personal data has permeated to in the machine learning pipeline? It must be possible to trace all copies of an email in all datasets.
  • the dataset cannot leave the client’s computer systems. I cannot download a file to my computer and experiment with different machine learning models. All model development must take places on servers controlled by the client.
  • can any sensitive data be reproduced from the model? For example, if a customer’s email address was stored in an NLP model as a word in its vocabulary, then some customer data has polluted the model. We must take care to ensure that nobody could reconstruct any sensitive information from a trained model.

The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay

Art. 17 GDPR

A machine learning model that remembers too much?

A risk that is sometimes overlooked is that a machine learning model could accidentally memorise sensitive parts of its training data. In 2017 a team at Cornell University/Cornell Tech trained a number of face recognition deep learning models on celebrity faces. They were then able to extract the original face images from the neural network, albeit with the quality slightly degraded.

With their technique, an adversary with access to a trained machine learning model that has learnt to classify sensitive data could potentially reconstruct a part of that data. For example, a machine learning model that has learnt to classify jobseekers’ CVs (résumés) may have stored tokens specific to individuals’ names, or react unusually to a particular character combination in an address, allowing a route in for a malicious attacker to deduce the address that appeared in the training data.

Possible strategies

There is a number of strategies that a data scientist can pursue in the face of extremely challenging data privacy requirements.

1. Delete the dataset

pexels pixabay 73909 min 1

For the duration of the project, annotated data can be used. Only one copy of the dataset can exist. However, once the machine learning model has been trained, the data scientist must delete the complete dataset.

Deleting all your annotations means you are “throwing away the mould”: if the project were to resume in future, you would need to re-annotate a new dataset. However, if all data is truly deleted, then there is no way that the data can leak, and the “right to be forgotten” is no longer an issue.

2. Anonymise (mask) data

We can attempt to make a non-sensitive dataset. For example, we process all emails using a data anonymisation algorithm to remove names, addresses or other sensitive information. This means that our dataset becomes a scrubbed set of emails, with no personally identifiable information left in.

Example of an anonymised email (job application) with sensitive data removed.
Once an email has been anonymised and all sensitive data removed, what remains may not be sufficient to train an accurate machine learning model.

There are a number of third-party products which can be used to anonymise text. For example, Anonymisation App, or Microsoft Azure Text Analytics.

The risk of this approach is that text anonymisation is difficult, time-consuming, and it is possible to accidentally leave a sensitive piece of information in. In addition, the anonymised dataset may be too far from the real thing to produce the most accurate model.

On the plus side, if no sensitive data goes anywhere near the machine learning model, it cannot remember anything it shouldn’t, and it will not be possible for an attacker to reconstruct the sensitive information by poking the model.

3. Store only IDs which can be used to reconstruct data

You may be able to annotate the data and then delete it, storing only a hash or ID of the original information, so that the training data can be easily reconstructed but it is not stored in your machine learning system. For example, you may choose to store the ID of each email and the label of each email, so that the training data can be re-built provided the emails have not been deleted from the email server. This means that the machine learning project does not rely on any extra copies of data.

If you store hashes of email addresses, you need to be cautious, because if a hacker got hold of your hashed database as well as a database of email addresses from another company, they could hash all those email addresses and cross check them against your database, and reconstruct the original email addresses.

hashing min
How emails can be hashed. The email address [email protected] is converted to the string c379bd7a05f1af4522d5ad28af10d623 by a hashing algorithm. This can’t be easily reversed, but a hacker who got hold of another list of emails (such as the Ashley Madison data breach) could apply the same hashing function to that list, and identify the original email in the dataset.

4. Update privacy policy and obtain full client consent

privacy policy blue small min

Another approach is to obtain permission to hold customer data for longer and to use it for training an AI. This may not always be an option, but if enough customers consent, we may be able to build a training dataset only from the consenting customers. We must be careful, however, as consenting customers may not be representative of the entire customer base (they may spend more, spell better, belong to a different demographic group, etc), and this could introduce a bias into our trained model. In addition, this strategy would normally only work going forward with new customers, but a company may want to use the entire existing customer base for training their machine learning model.

5. Encrypt or transform the data and work on it in encrypted space (homomorphic encryption)

Thomas Wood left index fingerprint showing some minutiae
My left index fingerprint showing some minutiae (feature points such as junctions or culs-de-sac). The minutiae coordinates could then be encrypted or transformed according to a one way process, allowing machine learning on obfuscated data.

It is sometimes possible to obfuscate a sensitive dataset in such a way, that the sensitive data can’t be reconstructed, but machine learning can still learn from it. This is called homomorphic encryption. For example, a biometric fingerprint dataset could be converted into a set of minutiae (feature points), which could then be transformed by a non-reversible operation. The operation would have the property that fingerprints which are similar in real life, remain similar after encryption, but the encryption still cannot be reversed.

Homomorphic encryption is often very hard to do. A simple way of achieving the same result is to transform numeric fields using Principal Component Analysis. For example, a transformed value could be 2 * age + 1.5* salary + 0.9 * latitude, which would be very hard to map back to an individual due to the many-to-one nature of the transformation.

5. Strengthen security measures in communication

In addition to ensuring that no data is copied unnecessarily, or checked into repositories, there are other routine security measures which need to be taken in the case of sensitive training data. For example, any API endpoints must be secured with SSL and HTTPS, and you should not share data over third party services such as Github or Gmail.

6. Remove sensitive data that is unimportant to the model

You may find that a particular field, such as date of birth, is highly sensitive but contributes little to the model accuracy. In these cases there is a tradeoff between security, and machine learning model performance, and there may be a business decision to sacrifice accuracy in order to maintain good data governance.

7. Coarsen sensitive data

Data can be altered, or coarsened, so that it is still of use for machine learning, but the potential for identifying sensitive data is reduced. For example, latitude/longitude locations can be rounded or jittered, postcodes (W9 3JP) can be reduced to the first half (W9), and numeric quantities can be binned. The USA’s HIPAA regulation specifies that in certain circumstances ages should be stored only as years rather than dates, unless if the age is over 90 in which case the year should also be hidden.

no jitter minjitter min

Establishing a policy for sensitive data

If a data scientist is working on sensitive data of any kind, it would be wise to take legal advice, or contact the organisation’s Data Protection Officer (in the UK/EU) for guidance and to establish a governance policy and best practices documentation. This would include establishing a secure location, documenting all sources of sensitive data and all copies taken, and establish a process to scan for sensitive data. Under the GDPR, all copies of a data instance must be tracked so that the individual concerned can request complete erasure. Processes must also be established to allow employees to request access to data, and for tracking the roles of who has access at a given time.

Conclusion

Dealing with extremely valuable but sensitive data can be a minefield. Data scientists must be very careful not to dismiss clients’ concerns about data protection, and aim to find the appropriate middle ground between compromising privacy and compromising model performance. Often, some of the most commercially successful models are trained on highly sensitive data.

References

Google, Considerations for Sensitive Data within Machine Learning Datasets (2020)

Quintanilla et al, What is responsible machine learning?, Microsoft (2021)

GDPR, Right to be Forgotten, EU law (2016)

Song et al, Machine Learning Models that Remember Too Much, Cornell University (2017)

Leave a Reply