Basic PII Detection and Masking in Snowflake Using Java

5 minute read

(This was ported from my original Medium post.)

Hi Folks,

For my first foray into Medium, I wanted to share some code that I’ve used previously in demos. The examples here do basic detection and masking of personally-identifying information (PII) using Java’s built-in regular expression support.

Now, I make no assertion that these routines are good: if you really want to do robust PII detection, you probably want something more sophisticated than a few regexes. Snowflake is even working on data classification as a built-in feature.

But I like these examples because they do a good job of illustrating the basic pattern of Snowflake’s Java functions. And they’re pretty malleable: you should be able to modify these examples to work for any situation where you need to detect or mask based on a set of regexes.

Let’s start with the code and then tear it apart. If you’re running on Snowflake and have Java functions enabled – any AWS account, for now – then you can define them right inline using this create function command:

create function haspii(s string)
returns boolean
language java
returns null on null input
handler = 'PIIDetector.hasPII'
as
$$
import java.util.regex.*;
import java.util.*;public class PIIDetector {
    
    static final String[] TARGETS = {
        "\\d{3}-\\d{2}-\\d{4}",                 // SSN
        "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
        "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
    };    
    
    ArrayList<Pattern> patterns;    
    
    public PIIDetector() {
        patterns = new ArrayList<Pattern>();
        for(String s : TARGETS) {
            patterns.add(Pattern.compile(s));
        }
    }    
    
    public boolean hasPII(String s) {
        for(Pattern p : patterns) {
            if (p.matcher(s).find()) {
                return true;
            }
        }
        return false;
    }
}
$$

With this in hand, anyone with permissions on the function can issue queries that use it without any knowledge of Java:

select id, haspii(body)
from emails

So let’s take the definition apart. The first section defines how the function will show up in SQL:

create function haspii(s string)
returns boolean
language java
returns null on null input
handler = 'PIIDetector.hasPII'

Most of this is pretty self explanatory: it’s a function that takes a string and returns a Boolean, and the language is Java. The null on null input bit lets me skip any null handling in my routine: nulls inputs will be handled without calling into Java at all.

The handler directive is new, and specifies where in the Java code to actually make a call. You may have many potential entry points, but in this case, Snowflake is going to call the hasPII method defined on the PIIDetector class.

The actual Java code is contained between the pairs of dollar signs. After a little boilerplate, we see a few regular expressions:

static final String[] TARGETS = {
    "\\d{3}-\\d{2}-\\d{4}",                 // SSN
    "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
    "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
};

These (highly USA-centric) expressions match the basic forms of Social Security numbers, email addresses, and phone numbers. You can very easily augment this list with more patterns to match your definition of PII.

Next, we see some initialization code:

ArrayList<Pattern> patterns;public PIIDetector() {
    patterns = new ArrayList<Pattern>();
    for(String s : TARGETS) {
        patterns.add(Pattern.compile(s));
    }
}

Our handler points to an instance method in the PIIDetector class. When Snowflake runs a query that requires an instance of this class, Snowflake will will look for a default constructor to use to generate this instance. This provides a really easy way to do one-time initialization: in this case we compile up the regular expressions so they’re ready to go once per query, rather than doing so on each invocation – it should be much faster.

Finally, we have the actual method we’re binding to:

public boolean hasPII(String s) {
    for(Pattern p : patterns) {
        if (p.matcher(s).find()) {
            return true;
        }
    }
    return false;
}

This just loops over the patterns and fires if any match. Easy peasy!

And there we have it: a simple PII detection routine that you can customize to your requirements (and local phone-number formats). But really, this is good for any situation where you have a number of regular expressions to match.

And with a little tweaking, you can mask out these matches instead. Here’s the code; I’ll let you dig into the details.

create function maskpii(s string)
returns string
language java
returns null on null input
handler = 'PIIDetector.maskPII'
as
$$
import java.util.regex.*;
import java.util.*;

public class PIIDetector {
    
    static final String[] TARGETS = {
        "\\d{3}-\\d{2}-\\d{4}",                 // SSN
        "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
        "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
    };    
    
    static final String MASK = "###";    

    ArrayList<Pattern> patterns;    
    
    public PIIDetector() {
        patterns = new ArrayList<Pattern>();
        for(String s : TARGETS) {
            patterns.add(Pattern.compile(s));
        }
    }    
    
    public String maskPII(String s) {
        for(Pattern p : patterns) {
            s = p.matcher(s).replaceAll(MASK);
        }
        return s;
    }
}
$$

Happy hacking!