stdin, by Isaac Kunen (https://blog.stdin.org/feed.xml, updated 2023-05-29)

Two Unequal Products (2022-10-03, https://blog.stdin.org/2022/10/two-unequal-products)

I’ve been watching some of Timothy Gowers’ videos in which he documents his attempts to solve various mathematics problems. Gowers’ goal is to provide some examples of the mathematical thought process for others to study. I don’t have any deep insights on this to share, but watching the mental process of a serious mathematician as he tackles a problem is certainly interesting. And the problems are interesting themselves.

The second problem Gowers tackles is the topic of this post. He solves it, but the solution doesn’t feel particularly satisfying. It doesn’t feel satisfying to him, either, so he tries another path towards a simpler solution that doesn’t pan out. Here, I take a pass.

## The Problem

Here’s Gowers’ statement of the problem:

Prove that for every positive integer $n$, there do not exist positive integers $a$, $b$, $c$, $d$ with $ad=bc$ and $n^2 < a < b < c < d < (n+1)^2$.

I suggest that you take some time to think this through and go watch Gowers’ videos before reading on. Below is my solution. I took a lot longer to get to this than Gowers, but the result seems reasonably elegant.

## Some Intuition

Before jumping into it, I want to say a few words about my intuition for the problem. Clearly, if the numbers $a$, $b$, $c$, and $d$ were arbitrary reals or rationals, then it would be easy to come up with values that make this work. So for this to fail, we’re going to have to make use of properties that are special to the integers.

In particular, I want to use the inequality to generate some extra space that I can use to show that the gap between $n^2$ and $(n+1)^2$ isn’t large enough to hold our numbers. My initial attempts were to observe that over the integers, $a>n^2$ means that $a\geq n^2+1$, that $b\geq n^2+2$, etc. But I wasn’t able to use this by itself to generate a large enough gap for the proposition to fail.

The other property of integers is that they factor. And putting this together with the observation above does generate enough space. Let’s see how this works.

## My Solution

Assume that the statement were false, so that such $a$, $b$, $c$, and $d$ exist for some $n$; we’ll derive a contradiction. Given that $ad=bc$, we can write

$\tag{1} {ad \over b} = c$

Since these are all positive integers, we can expand out $a$ and $d$ as products of (non-distinct) primes: $a = p_1 p_2 \ldots p_m$ and $d = q_1 q_2 \ldots q_r$. And since the result of the division is an integer, unique factorization tells us that $b$ must be the product of a subset of these $p$ and $q$ values, with $c$ being the product of the remaining factors. Explicitly, we can rewrite equation (1) as:

${ { p_1 p_2 \ldots p_m q_1 q_2 \ldots q_r } \over { p_{\alpha_1}\ldots p_{\alpha_k} q_{\beta_1}\ldots q_{\beta_l} } } = { p_{\gamma_1}\ldots p_{\gamma_i} q_{\delta_1}\ldots q_{\delta_j} }$

Here the $p_\alpha$s and $p_\gamma$s account for all of the $p_1,\ldots,p_m$, and the $q_\beta$s and $q_\delta$s account for all of the $q_1,\ldots,q_r$. If we collect up all of the $p$ terms used to create $b$ as $a_1$, and the leftover ones as $a_2$, and do likewise for the $q$ terms to create $d_1$ and $d_2$, we can rewrite the whole thing as:

$\tag{2} { {a_1 a_2 d_1 d_2} \over {a_1 d_1} } = a_2 d_2 \quad\text{where}\quad \begin{cases} a = a_1 a_2\\ b = a_1 d_1\\ c = a_2 d_2\\ d = d_1 d_2 \end{cases}$

All of these terms are still positive integers (possibly 1), but we now have:

$n^2 < \overbrace{a_1 a_2 < \underbrace{a_1 d_1} } < a_2 d_2 < \underbrace{d_1 d_2} < (n+1)^2$

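Comparing $a$ with $b$ and with $c$ (that is, $a_1 a_2 < a_1 d_1$ and $a_1 a_2 < a_2 d_2$) and cancelling the common factor in each, we can extract: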

\begin{align}\tag{3} d_1 > a_2 &\implies d_1 \geq a_2 +1\\ d_2 > a_1 &\implies d_2 \geq a_1 +1 \end{align}

These implications make use of the fact that the terms are all integers. Now we can see that:

\begin{aligned} \boxed{n^2 + 2n + 1} = (n+1)^2 &> d \\ &= d_1 d_2 \\ &\geq (a_2 + 1)(a_1 + 1) \\ &= a_1 a_2 + a_1 + a_2 + 1 \\ &> \boxed{n^2 + a_1 + a_2 + 1} \end{aligned}

Has this forced enough space to generate a contradiction? Together, the boxed terms tell us that:

\begin{aligned} 2n &> a_1 + a_2 \\ 4n^2 &> a_1^2 + 2a_1 a_2 + a_2^2 \\ 4a_1 a_2 &> a_1^2 + 2a_1 a_2 + a_2^2 \\ 0 &> a_1^2 - 2a_1 a_2 + a_2^2 \\ 0 &> (a_1 - a_2)^2 \end{aligned}

And this last statement cannot hold for positive integers $a_1$ and $a_2$, so our assumption that $ad = bc$ must fail.

$\blacksquare$
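
As a sanity check on the argument (not a substitute for it), the claim can be brute-forced for small $n$. The sketch below is my own illustration; the class and method names are hypothetical and nothing here comes from Gowers’ videos. It counts quadruples in an open interval and confirms that none exist between consecutive squares, while a wider interval does contain some.

```java
// Brute-force sanity check of the "two unequal products" claim for small n.
final class UnequalProducts {

    /** Counts quadruples lo < a < b < c < d < hi (bounds exclusive) with a*d == b*c. */
    static long countSolutions(long lo, long hi) {
        long count = 0;
        for (long a = lo + 1; a < hi; a++)
            for (long b = a + 1; b < hi; b++)
                for (long c = b + 1; c < hi; c++)
                    for (long d = c + 1; d < hi; d++)
                        if (a * d == b * c) count++;
        return count;
    }

    /** Counts solutions strictly between n^2 and (n+1)^2. */
    static long countForN(int n) {
        return countSolutions((long) n * n, (long) (n + 1) * (n + 1));
    }

    public static void main(String[] args) {
        // A wider interval does admit solutions, e.g. 1*6 == 2*3 and 2*6 == 3*4.
        System.out.println("solutions in 1..6: " + countSolutions(0, 7));
        // Between consecutive squares there should be none.
        for (int n = 1; n <= 12; n++) {
            System.out.println("n=" + n + " solutions=" + countForN(n));
        }
    }
}
```

The search is quartic in the interval width, which is only $2n$ here, so small values of $n$ run instantly.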

## Discussion

Making use of a few properties of the integers – factorization and discreteness – pays off. By cleanly factoring them in step (2), and developing an inequality on the factors in step (3), we’re able to then amplify the difference of the product enough to generate a contradiction.

Au Revoir, Snowflake! (2022-09-06, https://blog.stdin.org/2022/09/au-revoir-snowflake)

Just reading this blog, you might guess that all I do is leave jobs. First leaving Tableau, and now, four years later, departing Snowflake. I’m incredibly proud of what we accomplished at Snowflake, particularly with Snowpark. Snowpark not only expands what customers and partners can do with the platform, but also provides a lot of flexibility for Snowflake itself. I expect this to pay dividends for a long time.

Moreover, the Snowpark team – and Snowflake engineering in general – was absolutely top notch and a joy to work with.

So why leave?

Certainly not because of the people or for lack of interesting work. Nor for doubts in the company: Snowflake is absolutely crushing it. (And as a stockholder, I look forward to them continuing to crush it.)

This was a much more personal decision. I’ve had a longstanding ambivalence towards the software industry. Software has provided me with a lot of interesting, worthwhile problems to solve, and smart, engaging people to solve them with. And it has paid the bills quite handsomely.

On the other hand, I’ve always found myself drawn to the less practical side of computing, mathematics, and the sciences – maybe it runs in the family. I was in academia once: a graduate student for all the wrong reasons, and a poor one as a result. Now I’m in a position to explore again, this time with a bit more perspective.

Exactly how will this exploration play out? I have some ideas, but the truth is that I’m not yet entirely sure.

In the short term, my plans are to take a little time off, get a little more involved in my kids’ schools, and start thinking about the future. I’ll also try to write a bit more about non-employment topics here, as well as get some pictures posted on our new family blog.

Stay tuned!

Iterating Over Metadata With Snowpark (2021-08-17, https://blog.stdin.org/2021/08/iterating-over-metadata-with-snowpark)

(This was ported from my original Medium post.)

Hi Folks,

Last time we saw how to create simple Java functions to detect and mask personally identifying information (PII). For example, we could take a table containing some email messages and mask out the PII in the bodies with a simple query that applies the masking function to the body column.

But let’s say we wanted to mask out all of the PII. And let’s say that we had many more fields like you might find in something like survey results.

In this case, masking out the PII would be easy, but tedious: we’d have to apply the function manually to each column. And if the schema of our table were to change – or if we wanted to run this masking routine on a different table – we’d have to rewrite the query.

What we’ve run into is a pretty fundamental limitation in SQL: the query is very tied to the underlying schema. There’s no way to pass a type parameter to the query or iterate over metadata. Snowpark doesn’t have this limitation: we can write code to inspect metadata and dynamically generate queries based on what we find.

To get started with Snowpark, you can follow the instructions on how to get it set up in your existing Scala development environment. Or you can follow the nice directions Zohar Nissare-Houssen has outlined here to get going using Docker.

Now using Snowpark for Scala, we can write a fully generic PII masking function:

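val maskAllPii = (df: DataFrame) => {
  // Names of all string-typed columns, read from the DataFrame's schema.
  val stringCols = df.schema.fields
    .filter(_.dataType.typeName == "String")
    .map(_.name)
  // Assumed reconstruction of the missing line: replace each string column with maskpii(column).
  df.withColumns(stringCols, stringCols.map(c => callUDF("maskpii", col(c))))
}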


This function takes in a DataFrame, inspects the schema, and applies the PII masking function we already have registered in Snowflake to each string column it finds, leaving non-string columns untouched. The result is just another DataFrame.

Now we can very easily run this on our email data…

val df = maskAllPii(sess.table("emails"))


…and fetch the results:

df.show(3,100)  // get the first three lines, format wide

As you can see, the maskAllPii() call has touched all of the String columns. Under the covers, Snowpark has dynamically generated a plan that corresponds to a SQL query:

SELECT "ID",
FROM ( SELECT  *  FROM (emails))


When show() runs, it generates and issues the SQL, wrapping this in an outer LIMIT clause and pretty-printing the result – that’s what show() does.
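
The idea of generating a query from metadata can be imitated in a few lines of plain Java. This is only a rough sketch of the concept, not how Snowpark is implemented: the `MaskQueryBuilder` class, the `buildMaskQuery` method, and the `SENDER` and `BODY` columns are all hypothetical, and the schema is just a map from column name to type name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: build a masking query from a table's schema.
final class MaskQueryBuilder {

    /**
     * Wraps every string column in maskpii(...) and passes other columns
     * through unchanged. The schema maps column name to type name, in column order.
     */
    static String buildMaskQuery(String table, Map<String, String> schema) {
        List<String> exprs = new ArrayList<>();
        for (Map.Entry<String, String> col : schema.entrySet()) {
            String quoted = "\"" + col.getKey() + "\"";
            if ("String".equals(col.getValue())) {
                exprs.add("maskpii(" + quoted + ") AS " + quoted);
            } else {
                exprs.add(quoted);
            }
        }
        return "SELECT " + String.join(", ", exprs) + " FROM " + table;
    }

    /** A made-up schema for the emails table. */
    static Map<String, String> sampleSchema() {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("ID", "Long");
        schema.put("SENDER", "String");
        schema.put("BODY", "String");
        return schema;
    }

    public static void main(String[] args) {
        // prints: SELECT "ID", maskpii("SENDER") AS "SENDER", maskpii("BODY") AS "BODY" FROM emails
        System.out.println(buildMaskQuery("emails", sampleSchema()));
    }
}
```

Because the column list is computed rather than typed out, the same call keeps working when the table gains or loses columns.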

Of course, this query isn’t a hard one to write, though doing so does start to get a bit tedious as the column count goes up. And you have to do it again for each table or query you want to mask. Moreover, writing this yourself means more chances to make a mistake and miss a column.

In contrast, the Snowpark alternative is simple, robust, and reusable. And as a simple exercise, you can retool the example above to take a different function — or better yet, take an arbitrary function as a parameter.

Happy hacking!

Basic PII Detection and Masking in Snowflake Using Java (2021-07-28, https://blog.stdin.org/2021/07/basic-pii-detection-using-java)

(This was ported from my original Medium post.)

Hi Folks,

For my first foray into Medium, I wanted to share some code that I’ve used previously in demos. The examples here do basic detection and masking of personally-identifying information (PII) using Java’s built-in regular expression support.

Now, I make no assertion that these routines are good: if you really want to do robust PII detection, you probably want something more sophisticated than a few regexes. Snowflake is even working on data classification as a built-in feature.

But I like these examples because they do a good job of illustrating the basic pattern of Snowflake’s Java functions. And they’re pretty malleable: you should be able to modify these examples to work for any situation where you need to detect or mask based on a set of regexes.

Let’s start with the code and then tear it apart. If you’re running on Snowflake and have Java functions enabled – any AWS account, for now – then you can define them right inline using this create function command:

create function haspii(s string)
returns boolean
language java
returns null on null input
handler = 'PIIDetector.hasPII'
as
$$
import java.util.regex.*;
import java.util.*;

public class PIIDetector {
  static final String[] TARGETS = {
    "\\d{3}-\\d{2}-\\d{4}",                 // SSN
    "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
    "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
  };

  ArrayList<Pattern> patterns;

  public PIIDetector() {
    patterns = new ArrayList<Pattern>();
    for(String s : TARGETS) {
      patterns.add(Pattern.compile(s));
    }
  }

  public boolean hasPII(String s) {
    for(Pattern p : patterns) {
      if (p.matcher(s).find()) {
        return true;
      }
    }
    return false;
  }
}
$$


With this in hand, anyone with permissions on the function can issue queries that use it without any knowledge of Java:

select id, haspii(body)
from emails


So let’s take the definition apart. The first section defines how the function will show up in SQL:

create function haspii(s string)
returns boolean
language java
returns null on null input
handler = 'PIIDetector.hasPII'


Most of this is pretty self-explanatory: it’s a function that takes a string and returns a Boolean, and the language is Java. The null on null input bit lets me skip any null handling in my routine: null inputs will be handled without calling into Java at all.

The handler directive is new, and specifies where in the Java code to actually make a call. You may have many potential entry points, but in this case, Snowflake is going to call the hasPII method defined on the PIIDetector class.

The actual Java code is contained between the pairs of dollar signs. After a little boilerplate, we see a few regular expressions:

static final String[] TARGETS = {
"\\d{3}-\\d{2}-\\d{4}",                 // SSN
"[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
"[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
};


These (highly USA-centric) expressions match the basic forms of Social Security numbers, email addresses, and phone numbers. You can very easily augment this list with more patterns to match your definition of PII.

Next, we see some initialization code:

ArrayList<Pattern> patterns;

public PIIDetector() {
  patterns = new ArrayList<Pattern>();
  for(String s : TARGETS) {
    patterns.add(Pattern.compile(s));
  }
}


Our handler points to an instance method in the PIIDetector class. When Snowflake runs a query that requires an instance of this class, Snowflake will look for a default constructor to use to generate this instance. This provides a really easy way to do one-time initialization: in this case we compile the regular expressions once per query, so they’re ready to go, rather than doing so on each invocation – it should be much faster.

Finally, we have the actual method we’re binding to:

public boolean hasPII(String s) {
  for(Pattern p : patterns) {
    if (p.matcher(s).find()) {
      return true;
    }
  }
  return false;
}


This just loops over the patterns and fires if any match. Easy peasy!
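
To try the detection logic outside Snowflake, a small standalone harness along these lines should do. It is a sketch only: `PiiDetectorDemo` and the sample strings are made up for illustration, while the three patterns are copied from the function above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Standalone harness for the detection logic; class name and samples are hypothetical.
final class PiiDetectorDemo {
    // The same three patterns used in the Snowflake function.
    static final String[] TARGETS = {
        "\\d{3}-\\d{2}-\\d{4}",                 // SSN
        "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
        "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
    };

    private final List<Pattern> patterns = new ArrayList<>();

    // Compile each pattern once, when the instance is created.
    PiiDetectorDemo() {
        for (String s : TARGETS) {
            patterns.add(Pattern.compile(s));
        }
    }

    // True if any pattern matches anywhere in the input.
    boolean hasPII(String s) {
        for (Pattern p : patterns) {
            if (p.matcher(s).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PiiDetectorDemo detector = new PiiDetectorDemo();
        String[] samples = {
            "My SSN is 123-45-6789",
            "Write to jane@example.com",
            "Call 425-555-0100",
            "Nothing to see here"
        };
        for (String s : samples) {
            System.out.println(detector.hasPII(s) + "  <- " + s);
        }
    }
}
```

The first three samples should be flagged and the last one should not. Adding a pattern to `TARGETS` is all it takes to extend the check.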

And there we have it: a simple PII detection routine that you can customize to your requirements (and local phone-number formats). But really, this is good for any situation where you have a number of regular expressions to match.

And with a little tweaking, you can mask out these matches instead. Here’s the code; I’ll let you dig into the details.

create function maskpii(s string)
returns string
language java
returns null on null input
handler = 'PIIDetector.maskPII'
as
$$
import java.util.regex.*;
import java.util.*;

public class PIIDetector {
  static final String[] TARGETS = {
    "\\d{3}-\\d{2}-\\d{4}",                 // SSN
    "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
    "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
  };

  static final String MASK = "###";

  ArrayList<Pattern> patterns;

  public PIIDetector() {
    patterns = new ArrayList<Pattern>();
    for(String s : TARGETS) {
      patterns.add(Pattern.compile(s));
    }
  }

  public String maskPII(String s) {
    for(Pattern p : patterns) {
      s = p.matcher(s).replaceAll(MASK);
    }
    return s;
  }
}
$$
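
If you want to see what the masking does before registering it, here is a minimal local sketch of the same loop. The `PiiMaskerDemo` class and the sample strings are hypothetical; the patterns and the `###` mask are the ones used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Standalone harness for the masking logic; class name and samples are hypothetical.
final class PiiMaskerDemo {
    static final String[] TARGETS = {
        "\\d{3}-\\d{2}-\\d{4}",                 // SSN
        "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
        "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
    };
    static final String MASK = "###";

    private final List<Pattern> patterns = new ArrayList<>();

    PiiMaskerDemo() {
        for (String s : TARGETS) {
            patterns.add(Pattern.compile(s));
        }
    }

    // Apply each pattern in turn, replacing every match with the mask.
    String maskPII(String s) {
        for (Pattern p : patterns) {
            s = p.matcher(s).replaceAll(MASK);
        }
        return s;
    }

    public static void main(String[] args) {
        PiiMaskerDemo masker = new PiiMaskerDemo();
        // prints: SSN ###, mail ###
        System.out.println(masker.maskPII("SSN 123-45-6789, mail jane@example.com"));
        // prints: Call ### now
        System.out.println(masker.maskPII("Call 425-555-0100 now"));
    }
}
```

Each pattern runs over the output of the previous one, so the order of `TARGETS` could matter if two patterns were ever able to match overlapping text.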


Happy hacking!

A leopard can’t change his spots. (But he may change jobs.) (2018-07-15, https://blog.stdin.org/2018/07/15/a-leopard-cant-change-his-spots-but-he-may-change-jobs)

I won't bury the lede: My last day at Tableau was July 6th, and tomorrow I start a new gig at Snowflake. I joined Tableau in June of 2015, and spent most of my three years there starting, building, and ultimately shipping Tableau Prep. I'm incredibly proud of the Prep team, the product we put together, and the awesome functionality yet to come.

As I move on, I've been thinking a bit about the past projects that really excited me. In addition to Prep, my favorites were probably StreamInsight, which was a system for dealing with time-aware queries and streaming data, and the spatial types in SQL Server. (Those types are still going strong and motivating new integrations ten years later.)

A common theme through all of these projects has been making it easy to do complex things with data. And Snowflake is most certainly out to do that with data warehousing. It feels like a wonderful match.

I'm going to miss Tableau — it's a wonderful company — and I'm going to miss Prep. But I'm incredibly excited to be starting at Snowflake. (And a special thanks to those Preppies that slipped Snowflake support into the latest Prep release. That should save me some awkward moments.)

I'll try to keep writing here — maybe with a broader set of topics, and hopefully with a bit more regularity. So do please check in and drop me a note.

Tableau Prep: The Power of Composability (2018-05-09, https://blog.stdin.org/2018/05/09/tableau-prep-the-power-of-composability)

When we built Tableau Prep, we put a premium on ensuring composability of operations: you can take the operations Prep supports and string them together in any combination you need. There are no restrictions based on where the data came from, or what operations came before.

This means that you never need to think about whether a particular operation is supported in your particular situation: if Prep supports it ever, Prep supports it always. Moreover, this gives you a lot of power to do what you need to with your data. In the rest of this post, we'll walk through a Superstore example that highlights this power.

# The Problem

Let's start with the sample Superstore data from Tableau Desktop. This data set is a list of order details: each row represents one item from an order, with multiple line items accruing to each order.

Given these data, let's try to fulfill what seems like a simple request:

Get the order details for customers with fewer than the median number of orders.

This seems relatively straightforward... or is it? In cases like this, I often find it helpful to think backwards to come up with a solution:

Step 4
If we had a list of customers with fewer than the median number of orders, we could cull the order details down to just those from customers on the list. But we don't have a list of these sub-median customers.

Step 3
If we knew the median number of orders, we could prune the list of customers down to those with fewer than the median. But we don't have the median number of orders.

Step 2
If we knew the count of orders for each customer, we could aggregate it to find the median number of orders over all customers. But we don't have the number of orders for each customer.

Step 1
If we had the list of orders for each customer, we could aggregate to get the count for each customer. And we do have the order list!

Now we have a plan: we'll start with the order details we have, and climb the ladder outlined above to get to the solution.
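
The four-step plan above can also be sketched in code, which makes the ladder of intermediate results explicit. This is an illustration under my own assumptions, not what Prep runs: the `SubMedianOrders` class, its `Line` record type, and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch of the plan: aggregate, aggregate again, then filter twice.
final class SubMedianOrders {

    /** One order-detail line: the customer and the order it belongs to. */
    static final class Line {
        final String customer;
        final String orderId;
        Line(String customer, String orderId) {
            this.customer = customer;
            this.orderId = orderId;
        }
    }

    /** Median of a non-empty list; averages the two middle values when the size is even. */
    static double median(List<Integer> values) {
        List<Integer> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        if (n % 2 == 1) return sorted.get(n / 2);
        return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    static List<Line> subMedianLines(List<Line> lines) {
        if (lines.isEmpty()) return new ArrayList<>();

        // Step 1: collect the distinct order IDs for each customer.
        Map<String, Set<String>> orders = new TreeMap<>();
        for (Line l : lines) {
            orders.computeIfAbsent(l.customer, k -> new HashSet<>()).add(l.orderId);
        }

        // Step 2: the median order count over all customers (a single value).
        List<Integer> counts = orders.values().stream().map(Set::size).collect(Collectors.toList());
        double med = median(counts);

        // Step 3: keep only the customers whose count is below the median.
        Set<String> keep = orders.entrySet().stream()
            .filter(e -> e.getValue().size() < med)
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());

        // Step 4: use that customer list as a filter on the original lines.
        return lines.stream().filter(l -> keep.contains(l.customer)).collect(Collectors.toList());
    }

    /** Made-up data: A has 2 orders (one with two lines), B has 1, C has 3. */
    static List<Line> sample() {
        return Arrays.asList(
            new Line("A", "o1"), new Line("A", "o1"), new Line("A", "o2"),
            new Line("B", "o3"),
            new Line("C", "o4"), new Line("C", "o5"), new Line("C", "o6"));
    }

    public static void main(String[] args) {
        for (Line l : subMedianLines(sample())) {
            System.out.println(l.customer + " " + l.orderId);
        }
    }
}
```

With the sample data the order counts are 2, 1, and 3, the median is 2, and only customer B's single line survives.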

# The Solution

We start by loading the Superstore data.

As we've already observed, these are order details. Each order has a distinct Order ID, but may have more than one line.

## Step 1

Following our plan, the first thing we need to do is get the count of orders for each customer. To do this we introduce an aggregate: we group by customer and count the distinct number of Order IDs.

The distinct makes it so repeated Order IDs (which come from having more than one order detail line per order) are only counted once.

So we don't confuse ourselves later, we'll rename Order ID to Number of Orders.

## Step 2

Now that we have the number of orders for each customer, we can aggregate again to find the median number of orders per customer.

This aggregate is a little funny: there's no grouping field, so we don't partition the table at all. The result is an odd little table with one row and one column, but this record represents the median over all customers we were looking for.

We'll rename this once again.

## Step 3

With the median number of orders in hand, we can join it with our list of customers and order counts to filter down that list. I.e., we'll join it with the result of our first aggregate.

Note the join clause here: we're doing an inner join, but matching when the median is greater than the customer's order count. We also have an error: the types don't match because the result of the median is a floating-point number, not an integer.

If we correct the type, we get our list of customers with fewer than the median number of orders.

## Step 4

Now that we have our customer list, we're ready to cull the line items. We'll again use a join as a filter, but this time we're joining our latest table with the original input.

You can see that there are a bunch of records dropping out from the right: those were the customers with the median number of orders or more. What remain are the line items we care about.

# Wrapping Up

At this point, you might want to clean up a few of the columns we created along the way, but our data are ready to output to Tableau or anywhere else you want to take them.

This may seem a little complex, and it's clearly stretching our flow layout algorithm, but it makes a perfectly fine flow. There was no operator that solved our problem out of the box, but composability made it possible to mix-and-match the operations present to build a computational machine for our task.

We certainly aren't done adding operations to Prep, but there's a rich set already present. And with a little composition, you can make them do some pretty cool tricks.

Tableau Prep: The Flow (2018-05-07, https://blog.stdin.org/2018/05/07/tableau-prep-the-flow)

I've been a bit quiet lately, but Tableau Prep is out the door and it's time to make a little noise.

Clark recently wrote an excellent post on the basic UX architecture of Prep. Here I'd like to cover a key concept underlying Prep that may be a bit foreign to people coming from Tableau: the flow. This isn't the most glamorous part of Prep, but it is one of the most fundamental concepts in the tool, so it seems worth spending some quality time on.

## Data In; Data Out

To understand flows, we start with steps, which are the conceptual unit of work in Tableau Prep. Every time you take an action on your data in Prep, you're adding a step. For example, if we take the world consumer price index data included with the product and add a filter, we find that a new step is added to the flow.

Each item in the flow pane represents a step, and each step works in the same way: data come in from the left, are modified by the step, and leave to the right.

Some steps (cleaning steps) may have multiple sub-steps, or changes. These are just like steps in the flow, but are smaller increments of work. They flow top to bottom.

We group these together to help conceptually simplify the flow, but each change acts just like any other step: rows come in, they're modified, and they go out.

Some steps, such as joins, have multiple inputs, but they work the same way: two sets of data come in from the left, they're put together, and the result leaves to the right.

And where do they go? On to the next step! Some steps may even have multiple outputs, with the data going to multiple targets.

Step-by-step we build up a flow: an ordered sequence of steps that does what we want.

## Clarity and Control

That ordering is a key aspect of flows. If you're coming from Tableau, you may be aware that it performs operations in a particular order, but the system doesn't advertise this, and generally you don't need to think about it.

But order sometimes matters, and we designed Prep with those times in mind. The CPI data contain both a food index and a general index. Let's say that we've pivoted the data, and now want to compare each country's CPI to the global average for each year — except we only care about the food index.

To do this, we'll first filter to keep only the food index, and then we'll aggregate by year. Order matters: if we did the aggregate first, we would have folded in the general CPI as well.
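
A tiny example makes the difference concrete. The sketch below is my own illustration with invented numbers, not the CPI data shipped with Prep, and the `FilterThenAggregate` class is hypothetical. It averages by year once after filtering to the food index and once without filtering.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration of why filter-then-aggregate differs from aggregating everything.
final class FilterThenAggregate {

    /** One pivoted CPI row: which index, which year, and the value. */
    static final class Row {
        final String index;
        final int year;
        final double cpi;
        Row(String index, int year, double cpi) {
            this.index = index;
            this.year = year;
            this.cpi = cpi;
        }
    }

    /** The filter step: keep only rows for the food index. */
    static List<Row> keepFood(List<Row> rows) {
        return rows.stream().filter(r -> r.index.equals("Food")).collect(Collectors.toList());
    }

    /** The aggregate step: average CPI per year over whatever rows come in. */
    static Map<Integer, Double> averageByYear(List<Row> rows) {
        return rows.stream().collect(
            Collectors.groupingBy((Row r) -> r.year, TreeMap::new,
                                  Collectors.averagingDouble((Row r) -> r.cpi)));
    }

    /** Invented values for two countries in a single year. */
    static List<Row> sample() {
        return Arrays.asList(
            new Row("Food", 2016, 100.0), new Row("Food", 2016, 110.0),
            new Row("General", 2016, 200.0), new Row("General", 2016, 210.0));
    }

    public static void main(String[] args) {
        // Filter first: only food rows reach the aggregate.
        System.out.println("filter, then aggregate: " + averageByYear(keepFood(sample())));
        // Aggregate everything: the general index is folded into the average.
        System.out.println("aggregate only:         " + averageByYear(sample()));
    }
}
```

With these numbers the filtered average for 2016 is 105, while aggregating without the filter gives 155. Once the rows have been averaged together there is no food index left to filter on.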

This kind of ordering is explicit in Prep. You don't have to guess, and you don't need to coax the system into doing what you want: you just build your flow in the order that fits your problem.

And with Prep, you can always go back and see your data at any point along the flow. Just click back and look. This way you can see and control what the flow is doing to your data every step along the way.

## Prep is a Competent Cook

We can add another metaphor: think of a flow as a recipe, and let's take a moment to bake some cookies. We've already mixed the wet ingredients — the eggs, the vanilla, the butter — when we get to this part of the recipe:

1. ...
2. Measure 1.5 cups flour
4. Add 1/2 teaspoon baking powder
5. Mix thoroughly
6. Add dry ingredients to wet ingredients

A competent cook would mix these dry ingredients before adding them to the wet, but they would take the liberty of combining them in any convenient order: they know it's irrelevant.

Tableau Prep is a competent cook. It can figure out many cases where the order won't matter, and can rearrange them to make your flow run more efficiently. But it will only do this when the reordering won't affect the results that you intended.

So while the flow gives a conceptual order to the operations, they may not actually be executed in that order at all. The result is that you can ignore order when it doesn’t matter, but rely on it when it does.

## More than Just Flows

The notion of a flow is not unique to Tableau Prep, and it isn't Prep's most distinguishing feature. The way that Prep uses samples to give you immediate feedback, the way we use analytics to help you see what needs to be done, and the direct manipulation all more directly contribute to what makes Prep special.

But understanding flows is central to understanding how to make Prep do exactly what you want, and it can be a bit of a leap for folks coming from Tableau Desktop. I hope this helps make that leap a little easier.

Happy hacking!

When Live Beats an Extract (2018-03-14, https://blog.stdin.org/2018/03/14/when-live-beats-an-extract)

When using Tableau, taking an extract is always better than using a live query, right?

Well, no.

Of course. Obviously, when your data are changing and you want to get all of the latest updates in your viz, you'll want to use a live query. But if that's not the case, then an extract is clearly better, especially with Hyper in 10.5, right?

Well, no!

Shoot! This is complicated? When will live beat an extract? Let's take a look at a few cases.

## A Few Basics

To understand what's going on, you should have a basic understanding of how live and extracted data sources are used by the system. If you feel a bit shaky here, I'd recommend my previous post on live vs extracts. But in a nutshell:

• When you're using an extract, the query defined by the data source is run and the whole resulting table is persisted in either a TDE (in Tableau 10.4 or before) or a Hyper database (in 10.5 and later). The queries produced by your workbook are then run against this table.
• When you're running live, the queries from your workbook are composed with the data source query. In simple cases, at least, this will result in a single query that is pushed down to the target database system, and only the results needed for the viz are returned.

We're going to look at a few cases where live can do better than an extract. As we look at them, pay particular attention to:

• The time to run the remote query,
• The time to transfer the data, and
• The time to run the local query.

These aren't rigorous perf numbers, but to give you a sense of scale, here's my setup:

• Tableau 10.5 (with Hyper) running on an i5-2500 with 8GB of RAM.
• SQL Server 2017 Express Edition running on an i7-3770 with 16GB of RAM.
• All wired together over gigabit Ethernet.

So nothing too grand. In any case, the lessons here should carry over to other hardware.

The data set is a stock history set from Kaggle that records daily stats for a large number of stocks and ETFs. The schema looks like:

history(ticker, type, date, open, high, low, close, volume, openInt)

Loaded into SQL Server and indexed on (ticker, date), this results in 17.4M rows and about 1.5GB of storage. (I have no idea what the provenance or accuracy of these data are, but for this work only the size is relevant.)

Let's try to beat an extract!

## Nail The Index

Let's start with an easy case: let's find the yearly average close for Tableau's stock. I'll drag the ticker into filters, years into columns, and Avg(Close) into rows. It's an award-worthy viz.

This is also an almost ideal query for our SQL Server database: it makes excellent use of the index, so the query is exceptionally fast to run; and because the aggregation happens remotely, there are almost no results to send over the wire. By looking in the log, I find that it takes a whole 0.006 seconds to run this query and fetch the results. How can we possibly beat that?

Indeed, if we recreate the same viz with an extract, Hyper takes more like 0.2 seconds to compute the viz. So SQL Server is faster than Hyper? Well, in this case it is, but we've almost cheated by practically tuning it to answer this query quickly. Hyper, on the other hand, doesn't require (and doesn't allow) us to tune its setup. So we're comparing the best case for SQL Server to an untuned case for Hyper.

But the lesson is still sound: if your query (a) lines up well with the setup of your remote database, and (b) transfers very little data, then we can actually beat a Hyper extract.

Let's try to avoid pandering to SQL Server quite so much and just ask for the number of records my data set has each year.

Now SQL Server takes a bit longer: 5.53 seconds. Trying this against the extract shows what Hyper can do: 0.193 seconds. In this case, both engines have to do roughly the same amount of work, but with its column-based, in-memory execution, Hyper is the clear winner!

Except that we haven't taken into account the cost of generating the extract. When we refresh it, we find that it takes us 67.8 seconds to generate a 435MB extract. If we add that in, SQL Server starts looking pretty good.

Applying a little algebra, the extract only comes out ahead after k runs once 67.8 + 0.193k < 5.53k, which means k > 67.8 / 5.337, or about 12.7. So to recoup the cost of our extract, we'd need to run our viz query about 13 times. Oftentimes this will be worth it, but if the query is truly one-off, I'd rather spend 5.53 seconds than 68.

## Blow Up the Extract

Let's try something more horrible. Let's say that in addition to the historical stock prices, we have a table of customer holdings. We'll keep it simple; our customers have static holdings that look like:

customerholdings(customer, ticker, amount)

(I don't actually have any customers, so I randomly generated 20 holdings for each of 20,000 imaginary customers.)

We want to do things like look at the total value of all customers' holdings over time, so we join the holdings to the price history. We then create a calc to compute the value of each customer's holdings and make a viz.

In case you're interested, that giant spike is caused by a few odd stocks like DryShips Inc. (DRYS), which somehow peaked at $1,442,048,636.45 in 2007. I don't comprehend. The graph looks funny, but again, this doesn't matter for our analysis.

What we care about is that this query takes 133 seconds to run; it's a fair bit of work for SQL Server to do. How about the extract?

Well, let's do a little back of the envelope computation. If we execute the full join in SQL Server and don't aggregate anything down, instead of the 17 million records in our history table, the result set will have about 441 million records. And these records are larger than the history rows because they have customer information as well. Optimistically, this will end up being something like 10 gigabytes of data that I have to move over the wire, and store in a local extract. And that's all before I even get to ask my query.

So unless I'm doing this a lot, I'm simply not going to bother.

## Wrapping Up

So we've seen a few cases where live queries may be preferable to extracts, leaving aside the obvious cases where you simply want the most current data.

One thing we didn't talk about is federated queries: queries that span multiple data sources. As a general rule, federation makes extracts look better relative to live, because live starts to look worse. Live works best when the engine can push operations that reduce data volumes (operations like aggregations and filters) off to the remote system, and federation tends to interfere with that pushdown.

But that's another ball of wax. I'll write more on federation soon.
]]> Isaac Kunen A Visual Guide to Telescope Eyepieces2018-03-05T00:00:00-08:002018-03-05T00:00:00-08:00https://blog.stdin.org/2018/03/05/a-visual-guide-to-telescope-eyepieces
Not too long ago, I bought a telescope. I guess that makes me an amateur astronomer.

> If you buy a camera, you are a photographer. If you buy a flute, you own a flute.
> - Bob Kolbrener

So maybe I just own a telescope. I certainly need some help when it comes to choosing things like eyepieces, so I was thrilled to come upon a very thorough list of eyepieces assembled by Starman1 (Don) over at Cloudy Nights. But a spreadsheet is one thing; a viz is better.

My take at an explorative viz is online over at Tableau Public. (I wish I could figure out how to embed something here, but all I can manage is a screenshot.)

Go ahead and tweak the parameters to find the eyepiece you're looking for. Some details on how I built it are below the fold.

## Build Details

Don's data are pretty clean, but I did have to massage them a bit:

• A lot of missing values were null, but some were listed as "?". I normalized them all to null so I could treat numeric values as numeric.
• There are some eyepieces available in multiple formats that I had to split up and pivot.
• There were some zoom and multi-focal-length eyepieces that I had to figure out how to handle.

All of this was straightforward to see and handle using Project Maestro. I'd recommend it even if it wasn't my baby.

I did have to make a few choices in the viz:

• My two primary axes represent the things that seem most important to me: focal length and the amount of the sky you can see. As focal length increases, the magnification decreases, and the resulting trend towards larger FoV is clear. Still, there is a wide variation depending on eyepiece construction.
• I used mark size to indicate eye relief. This is super important to me because I wear glasses; anything less than about 20mm makes it hard.
• Color indicates price; everyone cares about price.
• I had to decide how to handle zoom eyepieces. I show them as their average focal length.

Full details are in the tooltip:

Suggestions for improvement are welcome. Enjoy!

]]> Isaac Kunen The Fourier Series via Linear Algebra2018-02-27T00:00:00-08:002018-02-27T00:00:00-08:00https://blog.stdin.org/2018/02/27/the-fourier-series-via-linear-algebra
I didn't post last week because I was on vacation. But on vacation I decided to write about something a little out of my comfort zone: Fourier series. (Yeah. Try being my wife.)

Fourier series (and the related Fourier transform) made some sense to me, but I never really learned how to derive them, so they always seemed a bit magical. As I was going through Arthur Mattuck's excellent differential equations course at MIT's Open Courseware, the Fourier series clicked for me, so I thought I'd distill this out.

I'm tempted to categorize this post under "things Isaac should have learned in school (had he been paying attention)," but I don't think I ever had a course that taught this. If you're still paying attention, I will assume that you recall your basic linear algebra and have some idea of (and interest in) what a Fourier series is. But given that, this isn't hard.

## It's All Linear Algebra

Recall that the Fourier series represents periodic functions as a sum of sines and cosines. In this post, we'll deal with functions of period $2\pi$, which we consider on the interval $[-\pi, \pi]$. The generalization to arbitrary periods is straightforward, but this interval illustrates the scheme well.

The fundamental "click" for me was that this was all linear algebra. The Fourier series:

1. Looks at functions over an interval as a vector space with an inner product;
2. Picks an orthonormal basis for the space; and
3. Represents an arbitrary function in this basis by projecting it out on the basis.

Given that we're working over the interval $[-\pi, \pi]$, we'll define a vector space $V$ where the scalars are taken from $\mathbb{R}$ and the vectors are functions over the interval $[-\pi,\pi]$. In addition, we'll define an inner product by:

$\langle f,g\rangle = \frac{1}{2\pi}\displaystyle\int_{-\pi}^\pi f(x)g(x) dx$

I'll leave it to you to check that these meet the requirements of a vector space and an inner product. This is worth doing, particularly if using functions as vectors seems odd to you. In any case, that's all for step 1.

## Step 2: Choose an Orthonormal Basis

We're going to choose an orthonormal basis $S$ for our vector space. The vectors of $V$ are functions, so our basis will be a set of functions. The basis will be infinite, meaning that $V$ is infinite dimensional.

To build our basis, we're going to use the constant function $1(x)$ as well as sines and cosines of positive integral multiples of $x$. In particular, our basis is:

$S = \begin{Bmatrix}\vphantom{\displaystyle\int}\sqrt{2}\sin(x), \sqrt{2}\sin(2x), \sqrt{2}\sin(3x), \ldots,
\\ \vphantom{\displaystyle\int}\sqrt{2}\cos(x), \sqrt{2}\cos(2x), \sqrt{2}\cos(3x), \ldots,
\\ \vphantom{\displaystyle\int}1(x)\end{Bmatrix}$

For these to be an orthonormal basis, we first have to show that any two of these are orthogonal. I.e., that for all $f,g \in S$ with $f\ne g$, $\int_{-\pi}^\pi f(x)g(x) dx = 0$. There are a few cases to check:

• If one of $f(x)$ and $g(x)$ is $1(x)$, then the inner product is just an integral of a sine or cosine function over a whole number of periods, which is zero.
• If $f = \sin(kx)$ and $g = \cos(nx)$, then $f(x)g(x)$ is an odd function, so its integral over the symmetric interval $[-\pi,\pi]$ is zero.
• If $f(x)$ and $g(x)$ are both sines (or both cosines) with different coefficients, then the integral is also zero. As Mattuck suggests, you can work this out via complex exponentials or trig identities; he does it in his lecture via differential equations and a nice symmetry argument.

We don't need to show linear independence separately because it's implied by orthogonality. But we do need to check that all of the vectors have unit length:

• For $1(x)$, we find that $|1(x)|^2 = \langle 1(x),1(x)\rangle = \frac{1}{2\pi}\int_{-\pi}^\pi 1 dx = 1$.
• For $\sqrt{2}\sin(kx)$, we find that $|\sqrt{2}\sin(kx)|^2 = \frac{1}{2\pi}\int_{-\pi}^\pi 2\sin^2(kx) dx = 1$. So our sine vectors are normalized.
• This same argument holds for the cosines.

So $S$ is an orthonormal basis. We haven't shown that $S$ actually spans the space $V$ of functions. I'll mention this again, but for now, it's sufficient to know that it spans some space.

## Step 3: Represent a Function in this Basis

Now that we have a basis, we can take an arbitrary vector $F(x)$ and write it as a linear combination of the basis vectors:

$F(x) = \displaystyle\sum_{s\in S} t_s s$

All we need to do is figure out the constants $t_s$. But since $V$ is an inner product space and $S$ is orthonormal, we can find the coefficient $t_s$ for each $s\in S$ as:

$t_s = \langle F(x),s\rangle = \frac{1}{2\pi}\displaystyle\int_{-\pi}^{\pi}F(x)s(x)dx$

Instead of using $t_s$, we'll use $a_k$ as the coefficient of the $\cos(kx)$ term, and $b_k$ for the $\sin(kx)$ term.
For now, we'll use $c$ as the constant for the $1(x)$ term. Now:

$\begin{array}{rcl}
F(x) &=& \displaystyle\sum_{s\in S} t_s s = \displaystyle\sum_{s\in S} \langle F(x),s\rangle s\\
&&\\
&=& c\cdot 1(x) + \displaystyle\sum_{k=1}^\infty \left(a_k \cos(kx) + b_k \sin(kx)\right)
\end{array}$

Where:

$\begin{array}{rcl}\vphantom{\displaystyle\int}a_k & = & \frac{1}{\pi}\int_{-\pi}^{\pi}F(x)\cos(kx)dx \\
\vphantom{\displaystyle\int}b_k & = & \frac{1}{\pi}\int_{-\pi}^{\pi}F(x)\sin(kx)dx \\
\vphantom{\displaystyle\int}c & = & \frac{1}{2\pi}\int_{-\pi}^{\pi}F(x)dx \end{array}$

And that's the Fourier series. Note that we've folded two $\sqrt{2}$ factors into each of the $a$ and $b$ terms (one from the basis vector inside the inner product, one from the basis vector it multiplies), which is why their leading factor is $\frac{1}{\pi}$ rather than $\frac{1}{2\pi}$.

This isn't quite the form you usually see. We can tweak this slightly by noticing that $1(x) = \cos(0x)$, so we can replace that constant term and its associated constant with a cosine term. We just have to watch out for that $\frac{1}{2}$: the formula for $a_k$ at $k=0$ gives $a_0 = 2c$. Doing so, we get the usual formulation:

$F(x) = \frac{1}{2}a_0 + \displaystyle\sum_{k=1}^\infty \left(a_k \cos(kx) + b_k \sin(kx)\right)$

Where the $a$ and $b$ terms are as above.

## A Few Loose Ends

One thing I glossed over is the space spanned by our basis. A quick counting argument shows that there are $|V| = \mathfrak{c}^{\mathfrak{c}} = 2^{2^{\aleph_0}}$ functions in our vector space, but only $|\text{span}(S)| = \mathfrak{c}^{\aleph_0} = (2^{\aleph_0})^{\aleph_0} = 2^{\aleph_0}$ functions in the span of our "basis". Since $2^{\aleph_0} < 2^{2^{\aleph_0}}$, our basis doesn't actually span the space of functions. We're missing a lot of functions, but which ones? It turns out that our basis actually spans the set $L^2$ of square integrable functions, but I'm afraid that showing this is still beyond me.

Also still beyond me is the extension of this to the full real line: the Fourier transform. Hopefully I'll be back sometime to explain that one, too.
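
As a numerical sanity check on the three steps, here is a minimal Python sketch. It approximates the inner product with a midpoint rule, confirms that a few basis vectors are orthonormal, and computes the coefficients for the test function $F(x) = x$, whose series is known to be $2\sum_{k\ge 1} (-1)^{k+1}\sin(kx)/k$. The helper names (`inner`, `sin_basis`, `cos_basis`, `fourier_coefficients`), the grid size, and the choice of test function are all assumptions made for illustration.

```python
import math


def inner(f, g, n=20_000):
    """Approximate <f, g> = (1/2pi) * integral of f*g over [-pi, pi] (midpoint rule)."""
    h = 2 * math.pi / n
    total = 0.0
    for i in range(n):
        x = -math.pi + (i + 0.5) * h  # midpoint of the i-th subinterval
        total += f(x) * g(x)
    return total * h / (2 * math.pi)


def one(x):
    """The constant basis vector 1(x)."""
    return 1.0


def sin_basis(k):
    """The basis vector sqrt(2) * sin(kx)."""
    return lambda x: math.sqrt(2) * math.sin(k * x)


def cos_basis(k):
    """The basis vector sqrt(2) * cos(kx)."""
    return lambda x: math.sqrt(2) * math.cos(k * x)


def fourier_coefficients(F, terms):
    """Return (c, [a_1..a_terms], [b_1..b_terms]) by projecting F onto the basis."""
    c = inner(F, one)
    # The extra sqrt(2) folds the basis normalization into a_k and b_k,
    # turning the 1/(2pi) of the inner product into the 1/pi of the formulas.
    a = [math.sqrt(2) * inner(F, cos_basis(k)) for k in range(1, terms + 1)]
    b = [math.sqrt(2) * inner(F, sin_basis(k)) for k in range(1, terms + 1)]
    return c, a, b


# Step 2 check: unit length and orthogonality for a few basis vectors.
norm_sin2 = inner(sin_basis(2), sin_basis(2))   # should be close to 1
sin1_cos3 = inner(sin_basis(1), cos_basis(3))   # should be close to 0

# Step 3 check: coefficients of F(x) = x.
c, a, b = fourier_coefficients(lambda x: x, terms=5)
```

For $F(x) = x$, the computed `c` and every `a[k-1]` should be essentially zero (the integrands are odd), and `b[k-1]` should be close to $2(-1)^{k+1}/k$.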

]]>
Isaac Kunen