Part 2: Identify Personally Identifiable Information in text
Finding personally identifiable information (PII) in text documents can be useful for several reasons, but one use case I’ve come across multiple times is to help anonymize text in order to:
- Share the data with third parties
- Comply with regulatory requirements such as GDPR
- Use as training data for Machine Learning and other exploratory analysis by replacing PII with simulated data
In this series of articles I’ll try to automate the process of finding PII. Along the way we’re going to explore some popular open source tools and techniques for identifying different types of PII in our own data.
In the first part we found a way to detect people’s names in text. Let’s see what other types of PII we can find.
Introducing Duckling
Duckling is a Haskell library, open sourced by Facebook, that parses text into structured data. Duckling can help us find different types of information inside text including credit card numbers, email addresses, and phone numbers.
Now don’t worry if you’re not one of the three people who know Haskell. Duckling can run as an HTTP service, so we can use it from any programming language.
Example With Python
Let’s see how we would use Duckling with a language that doesn’t require a lecture on the evils of side-effects.
Prerequisite:
Install Git, Docker, and docker-compose
Step 1:
git clone git@github.com:facebook/duckling.git
Step 2:
Make a docker-compose file inside the cloned Duckling repo.
docker-compose.yml:
version: '3'
services:
  duckling:
    build:
      context: .
    ports:
      - 8000:8000
Step 3:
Start Duckling as a docker service:
docker-compose up duckling
Now the Duckling service is available through an HTTP API on port 8000 of our localhost. Let’s start making some calls to the API and see what we get back:
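A minimal sketch of such a call, assuming the service from Step 3 is reachable at http://localhost:8000/parse and using only the Python standard library. The sample sentence and the helper names (build_request, find_entities, format_entities, main) are illustrative assumptions rather than anything Duckling defines. The request restricts the 'dims' parameter to the two dimensions we care about here.

```python
import json
from urllib import parse, request

# Assumption: the Duckling container started in Step 3 is listening here.
DUCKLING_URL = "http://localhost:8000/parse"


def build_request(text, dims=("email", "phone-number"), locale="en_US",
                  url=DUCKLING_URL):
    """Build the form-encoded POST request for Duckling's /parse endpoint."""
    form = {
        "locale": locale,
        "text": text,
        # Duckling expects the dimensions as a JSON array encoded as a string.
        "dims": json.dumps(list(dims)),
    }
    payload = parse.urlencode(form).encode("utf-8")
    return request.Request(url, data=payload, method="POST")


def find_entities(text, **kwargs):
    """Send the text to Duckling and return the decoded list of entities."""
    with request.urlopen(build_request(text, **kwargs), timeout=10) as response:
        return json.loads(response.read())


def format_entities(entities):
    """Turn each entity into a 'dimension: matched text' line."""
    return [f"{entity['dim']}: {entity['body']}" for entity in entities]


def main():
    # Hypothetical sample sentence; any text containing PII will do.
    text = "Contact me at spy@ninja.com or call +1 (650) 123-4567"
    for line in format_entities(find_entities(text)):
        print(line)
```

With the service running, calling main() prints one line per entity Duckling returns, made up of its dimension name and the matched text.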
This prints the following:
email: spy@ninja.com
phone-number: +1 (650) 123-4567
Nice! Duckling found the email address and phone number inside our text and has confirmed this text contains PII. Now let’s see how it does with credit card numbers:
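The same kind of call for a credit card number could look like the sketch below. It again assumes the local service URL, and the function names detect and report are hypothetical. Both the credit-card-number and phone-number dimensions are requested so we can see which ones claim the same text.

```python
import json
from urllib import parse, request


def detect(text, url="http://localhost:8000/parse"):
    """POST text to Duckling (assumed local URL) and return (dim, body) pairs."""
    form = {
        "locale": "en_US",
        "text": text,
        # Ask only for the two dimensions being compared.
        "dims": json.dumps(["credit-card-number", "phone-number"]),
    }
    req = request.Request(url, data=parse.urlencode(form).encode("utf-8"))
    with request.urlopen(req, timeout=10) as response:
        return [(e["dim"], e["body"]) for e in json.loads(response.read())]


def report(pairs):
    """Format (dim, body) pairs as 'dim: body' lines."""
    return [f"{dim}: {body}" for dim, body in pairs]
```

For example, with the service running, print("\n".join(report(detect("My card number is 4111-1111-1111-1111")))) lists every dimension that matched the sample sentence.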
Can’t wait to see that sweet sweet credit card number being printed. Let’s see what it prints:
credit-card-number: 4111-1111-1111-1111
phone-number: 4111-1111-1111-1111
Err … it detected our number as a phone number and a credit card number. Better safe than sorry I suppose.
There are other types of data, or in Duckling language ‘dimensions’, that Duckling can help us find, so feel free to explore the project’s GitHub page to see what else is available.
Conclusion
We can now add to the list of PII types we’re able to find: names of people, email addresses, phone numbers, and credit card numbers. We’ve already seen there’s still room for improvement. For example, we can use the Luhn algorithm to check whether a number is a plausible credit card number and not just a phone number. That’s outside the scope of this series, though, as everyone will need to build on top of the topics discussed here for their own use cases.
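For readers who want a starting point, here is a minimal sketch of the Luhn checksum in Python. The function name luhn_valid is an assumption for illustration, and the check is a filter rather than proof: roughly one in ten arbitrary digit strings passes by chance, so some phone numbers will pass too.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore separators such as dashes and spaces.
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # True
print(luhn_valid("4111-1111-1111-1112"))  # False
```

A candidate that Duckling tags as both dimensions could then be kept as a credit card number only when it passes this check.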
In subsequent articles we’ll see how other tools perform and what other types of PII they can help us find.