Iterating Over Metadata With Snowpark
(This was ported from my original Medium post.)
Hi Folks,
Last time we saw how to create simple Java functions to detect and mask personally identifying information (PII). For example, we could take a table containing some email messages and mask out the PII in the bodies with a simple query:
But let’s say we wanted to mask out all of the PII. And let’s say that we had many more fields like you might find in something like survey results.
In this case, masking out the PII would be easy, but tedious: we’d have to apply the function manually to each column. And if the schema of our table were to change – or if we wanted to run this masking routine on a different table – we’d have to rewrite the query.
What we’ve run into is a pretty fundamental limitation in SQL: the query is very tied to the underlying schema. There’s no way to pass a type parameter to the query or iterate over metadata. Snowpark doesn’t have this limitation: we can write code to inspect metadata and dynamically generate queries based on what we find.
To get started with Snowpark, you can follow the instructions on how to get it set up in your existing Scala development environment. Or you can follow the nice directions Zohar Nissare-Houssen has outlined here to get going using Docker.
Now using Snowpark for Scala, we can write a fully generic PII masking function:
val maskAllPii = (df: DataFrame) => {
val toMask = df.schema
.filter(_.dataType.typeName == "String")
.map(_.name)
df.withColumns(toMask,
toMask.map(c => callUDF("maskpii", df.col(c))))
}
This function takes in a DataFrame, inspects the schema, and applies the PII masking function we already have registered in Snowflake to each string column it finds, leaving non-string columns untouched. The result is just another DataFrame.
Now we can very easily run this on our email data…
val df = maskAllPii(sess.table("emails"))
…and fetch the results:
df.show(3,100) // get the first three lines, format wide
As you can see, the maskAllPii()
call has touched all of the String columns. Under the covers, Snowpark has dynamically generated a plan that corresponds a SQL query:
SELECT "ID",
maskpii("SENDER") AS "SENDER",
maskpii("SUBJECT") AS "SUBJECT",
maskpii("BODY") AS "BODY"
FROM ( SELECT * FROM (emails))
When show()
runs, it generates and issues the SQL,
wrapping this in an outer LIMIT
clause and pretty-printing the result – that’s what show()
does.
Of course, this query isn’t a hard one to write, though doing so does start to get a bit tedious as the column count goes up. And you have to do it again for each table or query you want to mask. Moreover, writing this yourself means more chances to make a mistake and miss a column.
In contrast, the Snowpark alternative is simple, robust, and reusable. And as a simple exercise, you can retool the example above to take a different function — or better yet, take an arbitrary function as a parameter.
Happy hacking!