Continuing last-week’s trend, we’ll again take a look at an aspect of Tableau that people often find confusing: the difference between live and extracted data sources. And again, we’re going to take a bit of a database perspective to clarify the situation.
The impetus for this post is a number of statements I’ve seen along the lines of:
A live data source is just a real-time extract of your data.
This is my favorite kind of wrong: subtly wrong.
We’ll come back to extracts shortly, but first I want to take a digression through views. That will be our database perspective for the day.
The Database Perspective
For the purpose of this example, we’re going to imagine that we have two super-creative tables named products and sales. For this post, we’re not going to care much about the data, but let’s assume a schema for each of these tables:
products(pid, description, price) sales(pid, customer, count)
Here, pid is the product identifier—and the join key between the two tables.
We might want to ask questions of these data like, “how much did each customer spend?” If we were going to jump right in and use these in Tableau, we’d start by creating a new data source that joins these tables together. After joining the tables, we’d also create a new column with the total spend for each sale.
But we’re going to stick to database-land, so we’ll write this data source as a SQL query:
SELECT products.pid, description, price, customer, count, count * price AS total_sale FROM products JOIN sales ON products.pid = sales.pid
To use this in a viz, we need to be able to run other queries against the results of this query. To do that, we need to create a view, which caches the query and gives it a name that we can write other queries against:
CREATE VIEW full_sales SELECT products.pid, description, price, customer, count, count * price AS total_sale FROM products JOIN sales ON products.pid = sales.pid
Going forward, we’ll call this our “data-source query”.
Now we could use this in Tableau to create a viz that looked at total sales per customer. Using what we learned last time, we can also write the query that our viz will generate as:
SELECT customer, SUM(total_sale) FROM full_sales GROUP BY customer
We’ll call this our “viz query”.
So far this has all been a long preface for the real meat of this post. We want to understand what exactly the database engine does with this last query. And the answer depends a lot on what we actually mean mean by “caching a query”.
The Meat of the Post
When we said we’d cache a query, we could have meant two very different things. The simplest is that we could cache the result of the query. I.e., when we create the view, we could effectively create a new table called full_sales, run the data-source query, and populate full_sales with the result.
This is sometimes called a materialized view, and makes it pretty simple to understand what our viz query does: it just runs against the data in the table we’ve cached.
But there’s another possibility: we could cache the query itself. I.e., when we create the view, we won’t actually run the query at all; we’ll just stash the SQL and use it to produce results on demand later.
This is generally called a virtual view, and it would be pretty uninteresting if we just reran the data-source query every time we used it. But that’s not what databases do.
Instead, when we run our viz query against a virtual view, the database engine will compose the two queries together: it will effectively do an internal rewrite of the query so that it looks something like:
SELECT customer, SUM(total_sale) FROM ( SELECT products.pid, description, price, customer, count, count * price AS total_sale FROM products JOIN sales ON products.pid = sales.pid ) AS temp GROUP BY customer
The full_sales reference has just been rewritten with the subquery that defined it.
There are a few relatively obvious trade offs between these approaches:
- A materialized view can take a lot of space, depending on how much data the query produces.
- A virtual view takes almost no space at all, since all it stores is the query.
- A materialized view may take a while to build, but can ultimately save time if it precomputes data that speed up the queries that are run against it.
- A virtual view takes almost no time to create, but also doesn’t precompute anything.
But this misses the main event: in many cases, the fully-composed query will be much faster to execute than each half individually.
There are two big reasons for this. First, because the database system gets the whole, composed query at once, it can optimize the whole thing globally. To give a simple example, assume that the viz query filtered out a lot of the underlying data—maybe the viz only shows information on a single customer. With separate view and viz queries, the data-source query has to produce data for all the customers, just to have them filtered out by the view query. By considering them together, that filter can be pushed down and executed early, short-cutting a lot of the work.
Second, assume that the data-source query was executed on one system—say a SQL Server database—and the viz query was executed some where else—like Tableau. If the data-source query produces a lot of data, all of it needs to be sent over the network. If the queries can be composed and the whole thing can be executed by the source database, then only the final viz query results need to be sent. And through aggregation, Tableau viz queries usually reduce the data substantially.
For these reasons, running the viz query on top of the virtual view may perform significantly better than taking an extract and then running the viz query separately.
Bringing This Back to Tableau
Let’s connect this back to Tableau. But with the heavy lifting complete, the connection is straightforward:
- An extracted data source is a materialized view: the data-source query is run to pull data from your database, and the data are materialized into a TDE or (with Tableau 10.5 or higher) a Hyper database. View queries then run against this materialized extract.
- A live data source is a virtual view: the data source just stores the query, and Tableau composes it on the fly with whatever query your viz generates. This composed query is issued in one shot to your database.
Done. But now you can see how the explanation at the top is subtly wrong.
A live data source is just a real-time extract of your data.
If this were true, then every time you modified or refreshed your workbook, the system would first run the data-source query and suck down the results, and then run the viz query to aggregate it all down. But this striation doesn’t happen: with a live data source, the whole view+viz query is composed, and this whole query is then executed. Every time.
In the right circumstances, this can mean the query that’s run is more optimized and flows much less data down from the database. And that can perform very well.
Other times, the cost to materialize the extract is worth it because the construction cost is amortized over a whole lot of viz queries, all of which benefit from the extract. But that would be for naught if the extract were re-created for each viz query.
After all of this, it’s worth popping up to recognize the obvious benefit to live data sources: liveliness. Since live data sources go directly against your database, they see the changes to your data as those changes occur. This can be valuable unto itself, regardless of the performance trade-offs.