Part 1: Identify Personally Identifiable Information in text

Abhinay Mehta
2 min read · May 26, 2021

Finding personally identifiable information (PII) in text documents is useful for several reasons. One use case I’ve come across multiple times is anonymizing text data in order to:

  • Share the data with third parties
  • Comply with regulatory requirements such as GDPR
  • Use it as training data for machine learning or for exploratory analysis
  • You’re Facebook and you finally want to do the right thing (/S)

In this series of articles I’ll try to automate the process of finding PII, and we’ll explore some popular open source tools and techniques for identifying different types of PII in our own data.

Introducing spaCy

Named Entity Recognition (NER) tries to identify words or phrases in text that refer to real-world things, like names of people, locations, dates, etc. Several open source tools employ NER to pick out these entities. This article focuses on a very popular one called spaCy.

spaCy is a free, open-source Python library for Natural Language Processing. Its NER feature can help us identify names of people, locations, and other potentially useful bits of information.

Example With Python

Prerequisite:

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm

Now let’s pretend we have some unstructured text and want to see if it contains any PII:
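A minimal sketch of such a check, assuming the `en_core_web_sm` model installed above. The sample sentence and the helper names `find_entities` and `format_entities` are made up for illustration, and the exact entities returned can vary between model versions.

```python
# Hypothetical sample text, invented for this sketch.
SAMPLE_TEXT = "My name is John McClane and I live in Boston."


def find_entities(text, model_name="en_core_web_sm"):
    """Return a (label, text) pair for every entity spaCy finds in `text`."""
    import spacy  # imported lazily so format_entities works without spaCy

    nlp = spacy.load(model_name)  # load the small English pipeline
    doc = nlp(text)               # run the pipeline, including NER
    return [(ent.label_, ent.text) for ent in doc.ents]


def format_entities(entities):
    """Render entities one per line as 'LABEL: text'."""
    return "\n".join(f"{label}: {text}" for label, text in entities)


def main():
    # Print every entity detected in the sample text.
    print(format_entities(find_entities(SAMPLE_TEXT)))
```

Calling `main()` prints one `LABEL: text` line per entity found in the sample text.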

Let’s see what this code prints:

PERSON: John McClane
GPE: Boston

Hooray! We’ve detected the name of a person and a location (GPE = geopolitical entity, i.e. countries, cities, states). Now we can choose to do something with that information: maybe scrub it from the text before storing it, or use industrial-strength encryption to secure the lives of our secret agents.
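Scrubbing can be sketched with the character offsets spaCy attaches to each entity (`ent.start_char` and `ent.end_char`). The helper names, the `[LABEL]` placeholder format, and the choice of labels to treat as PII are assumptions for illustration, not something the library prescribes.

```python
PII_LABELS = {"PERSON", "GPE"}  # assumption: the labels we treat as PII


def spans_from_doc(doc, labels=PII_LABELS):
    """Collect (start, end, label) character spans from a spaCy Doc."""
    return [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
        if ent.label_ in labels
    ]


def redact(text, spans):
    """Replace each (start, end, label) span in `text` with '[LABEL]'."""
    # Work from the end of the string so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text
```

With a `doc` produced by `nlp(text)`, `redact(text, spans_from_doc(doc))` returns the text with each detected person or place replaced by a placeholder such as `[PERSON]`.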

spaCy isn’t always this accurate. Let’s take a look at another example, one in which spaCy doesn’t do as well:
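A sketch of such a check, again assuming `en_core_web_sm`. The sentence is a made-up stand-in, and `label_mismatches` is a hypothetical helper that compares spaCy’s labels with the labels we expected, which is one simple way to surface mislabelled names.

```python
# Hypothetical sample text and the label we expect for the name in it.
SAMPLE_TEXT = "I have been reading a lot of Ludwig von Mises lately."
EXPECTED = {"Ludwig von Mises": "PERSON"}


def find_entities(text, model_name="en_core_web_sm"):
    """Return (label, text) pairs for the entities spaCy finds in `text`."""
    import spacy  # imported lazily so label_mismatches works without spaCy

    nlp = spacy.load(model_name)
    return [(ent.label_, ent.text) for ent in nlp(text).ents]


def label_mismatches(entities, expected):
    """Return the (label, text) pairs whose label differs from `expected`."""
    return [
        (label, text)
        for label, text in entities
        if text in expected and expected[text] != label
    ]


def main():
    # Print only the entities that were given an unexpected label.
    for label, text in label_mismatches(find_entities(SAMPLE_TEXT), EXPECTED):
        print(f"{label}: {text}")
```

Calling `main()` prints a line only for names whose label differs from the expected one, so a correctly labelled name produces no output.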

Now the result of the above is not what we had hoped it would be:

ORG: Ludwig von Mises

Oh no! It has labelled my favorite Austrian economist as an organization. To be fair, there is an organization called the Ludwig von Mises Institute for Austrian Economics (or Mises Institute), so maybe spaCy got confused. You will see such mistakes in real-world data.

Conclusion

So we now have a way of finding names of people in text. It’s not perfect, but spaCy will do a better job than most people trying to do this by themselves from scratch. It’s actually pretty good at finding names of people, countries, cities, states, companies, etc.

It’s a good starting point and you can (and should) build on top of it to make your process more accurate for your own data and domain.
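One way to build on top of it, sketched here assuming spaCy v3, is to put a rule-based `entity_ruler` component in front of the statistical NER so that names you already know get the label you want. The helper names `person_patterns` and `add_known_people` are hypothetical.

```python
def person_patterns(names):
    """Build entity_ruler patterns that label each exact name as PERSON."""
    return [{"label": "PERSON", "pattern": name} for name in names]


def add_known_people(nlp, names):
    """Add an entity_ruler to `nlp` that tags the given names as PERSON."""
    # Run the ruler before the statistical NER when the pipeline has one,
    # so the NER component works around the rule-based matches.
    position = {"before": "ner"} if "ner" in nlp.pipe_names else {}
    ruler = nlp.add_pipe("entity_ruler", **position)
    ruler.add_patterns(person_patterns(names))
    return ruler
```

For example, after `nlp = spacy.load("en_core_web_sm")`, calling `add_known_people(nlp, ["Ludwig von Mises"])` should make that exact name come back as PERSON. Exact-match rules only cover the names you list, so this complements the model rather than replacing it.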

In subsequent articles we’ll see how other tools perform and what other types of PII they can help us find.

Click here for Part 2
