Part 3: Identify Personally Identifiable Information in text

Abhinay Mehta
Oct 8, 2021

Click here for Part 2

Finding personally identifiable information (PII) in text documents can be useful for several reasons, but one use case I’ve come across multiple times is to help anonymize text in order to:

  • Share the data with third parties
  • Comply with regulatory requirements such as GDPR
  • Use it as training data for machine learning and other exploratory analysis, with PII replaced by simulated data

In this series of articles, I’ll try to automate the process of finding PII by exploring some popular open source tools and techniques for identifying different types of PII in our own data.

So far we’ve found ways of finding names of people, email addresses, phone numbers, and credit card numbers. Let’s see what other types of PII we can find.

Introducing Hugging Face

Hugging Face’s Transformers is a popular Python library of pre-trained AI models that are useful for a variety of natural language processing (NLP) tasks, including Named Entity Recognition (NER). NER, as we’ve discussed in the previous articles, is an incredibly useful technique for detecting PII in text.

Example With Python

Let’s see how we would use Hugging Face.

Prerequisite:

At least one of TensorFlow 2.0 or PyTorch should be installed. Then type this into a terminal of your choice:

pip install transformers
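Before going further, it can help to confirm that one of those backends is actually importable. This is a minimal sketch rather than anything from the original setup: the helper name `available_backends` is made up for illustration, and it only checks that the packages can be found, not that their versions are compatible.

```python
from importlib.util import find_spec


def available_backends():
    """Return the deep learning backends that can be imported here."""
    # "torch" is the import name for PyTorch, "tensorflow" for TensorFlow.
    candidates = ["torch", "tensorflow"]
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    found = available_backends()
    if found:
        print("Found backend(s):", ", ".join(found))
    else:
        print("No backend found; install PyTorch or TensorFlow first.")
```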

We’re going to see how well Hugging Face does at identifying locations in some sample text.
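A minimal sketch of what such a script could look like is below. Treat it as an illustration under stated assumptions: the sample sentence is a hypothetical stand-in (so character offsets will differ from any other text), the helper names `find_entities`, `only_locations`, and `main` are made up, and `aggregation_strategy="simple"` assumes a version of transformers that supports that argument.

```python
def find_entities(text):
    """Run the default NER pipeline over text and return grouped entities."""
    # Imported lazily so only_locations() can be used without loading a model.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entities,
    # which is what produces the 'entity_group' key in each result.
    ner = pipeline("ner", aggregation_strategy="simple")
    return ner(text)


def only_locations(entities):
    """Keep only the entities that the model labelled as locations."""
    return [e for e in entities if e["entity_group"] == "LOC"]


def main():
    # Hypothetical sample text, used purely for illustration.
    sample = "I grew up in Philadelphia and later moved to Bel Air."
    for entity in only_locations(find_entities(sample)):
        print(entity)
```

Calling `main()` runs the model on the sample text. The first run downloads the default model weights, so it can take a while.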

Let’s see what this code prints:

[
{'entity_group': 'LOC', 'score': 0.99.., 'word': 'Philadelphia', 'start': 8, 'end': 20},
{'entity_group': 'LOC', 'score': 0.99.., 'word': 'Bel Air', 'start': 194, 'end': 201}
]

It detected both location words pretty nicely using just the default English language model and settings. Impressive.
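Because each result carries `start` and `end` character offsets, anonymizing the text is mostly string slicing. Here is a minimal sketch, assuming entity dictionaries shaped like the output above. The function name `redact` and the `[LOC]`-style placeholder are illustrative choices, and the sample text and entities are hand-written rather than real model output.

```python
def redact(text, entities):
    """Replace each detected entity span with a placeholder like [LOC]."""
    # Work from the end of the text backwards so that earlier offsets
    # stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = "[" + entity["entity_group"] + "]"
        text = text[:entity["start"]] + placeholder + text[entity["end"]:]
    return text


if __name__ == "__main__":
    # Hypothetical text with hand-written entities, shaped like pipeline output.
    sample = "Born in Philadelphia, now in Bel Air."
    entities = [
        {"entity_group": "LOC", "word": "Philadelphia", "start": 8, "end": 20},
        {"entity_group": "LOC", "word": "Bel Air", "start": 29, "end": 36},
    ]
    print(redact(sample, entities))  # Born in [LOC], now in [LOC].
```

Replacing from the last span to the first avoids having to recompute offsets after each substitution changes the length of the text.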

In my experience Hugging Face is better than the other tools we’ve discussed so far at detecting locations, so if finding locations is important to you then definitely give it a try. It’s very flexible, has a big community around it, good documentation, and is widely used.

Conclusion

Not only are we continuing to add to the list of PII types we’re able to find but we’re slowly improving the quality too, giving ourselves enough knowledge to choose the best tool for the job.

We’ll continue on this journey in subsequent articles so watch this space for more thrills (well .. some of us enjoy this stuff!).
