
Reuse Calculations in the Same Query with Lateral Joins

Reusing Calculations

Reusing parts of computations has long been a wart in SQL. Let’s say you want to compute confidence intervals for your signup rate. You’ll have to use the standard error calculation twice: once for the upper bound, and once for the lower bound.

Your options are to just do the math twice, or to put the calculation in an ugly — and possibly inefficient — subquery.

This is where lateral joins shine. Postgres’s latest join type solves this problem by letting you make the calculation in one join, and then reference the results of that calculation in subsequent joins.

Computed Columns with Duplication or Subqueries

Let’s start with our input data, a roll-up table of signups by month:
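The table itself isn't reproduced here, but based on the column names used in the queries below, a minimal schema might look like this (the types are assumptions; if `signups` and `visitors` were integer columns, the divisions below would need a cast to `numeric` to avoid integer division):

```sql
create table signups_by_month (
  month    date,     -- first day of the month
  visitors numeric,  -- total visitors that month
  signups  numeric   -- visitors who signed up
);
```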

Now let’s compute the confidence intervals around our signup rate each month. The math is:

n = number of users
x = number of signups
p = probability of signup = (x / n)
se = standard error of p = sqrt((p * (1 - p)) / n)
confidence interval = p ± (1.96 * se)
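Plugging in some made-up numbers (say 1,000 visitors and 100 signups; both figures are hypothetical) is a useful sanity check of the formula:

```sql
-- p  = 100 / 1000 = 0.1
-- se = sqrt(0.1 * 0.9 / 1000) ≈ 0.0095
-- bounds: 0.1 ± 1.96 * 0.0095, i.e. roughly 0.081 to 0.119
select
  p as avg,
  p - 1.96 * sqrt(p * (1 - p) / 1000) as low,
  p + 1.96 * sqrt(p * (1 - p) / 1000) as high
from (select 100.0 / 1000 as p) signup_rate;
```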

If we just recalculate the signup rate and standard error each time, the naive SQL is:

select
  (signups / visitors) as avg,
  (signups / visitors) 
    - (1.96 * sqrt((signups / visitors) 
       * (1 - (signups / visitors)) / visitors)) as low,
  (signups / visitors) 
    + (1.96 * sqrt((signups / visitors) 
       * (1 - (signups / visitors)) / visitors)) as high
from signups_by_month

What a mess! This is way too hard to read, and we’re obviously duplicating a bunch of work in the low and high calculations.

Nonetheless, the results are handy:

In Periscope, we can see those same results visually:

Let’s pick apart the issues in our naive SQL. We calculate the signup probability seven times and the standard error twice. That’s way too much duplication for a simple calculation.

We can remove needless duplication by adding subqueries and having each layer build upon previous calculations:

select
  p as avg,
  p - 1.96 * se as low, 
  p + 1.96 * se as high
from (
  select
    p,
    sqrt(p * (1 - p) / visitors) as se
  from (
    select
      signups / visitors as p,
      visitors
    from signups_by_month
  ) probability
) standard_error

This query is probably more efficient because it only does each stage of the calculation once. But despite removing all the duplication, it’s arguably even harder to read.

Until now, these were our only two options for a calculation like this. But as of Postgres 9.3, there’s a better way!

Computed Columns with Lateral Joins

A lateral join is a join that allows subqueries on the right-hand side of the join to reference columns defined on the left-hand side of the join.
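As a minimal sketch of that idea (the `orders` table, its `total` column, and the 8% tax rate are all hypothetical), the subquery on the right can read a column from the relation on its left:

```sql
-- Each row of orders is visible inside the lateral subquery,
-- so gross can be derived from that row's total.
select orders.total, with_tax.gross
from orders,
lateral (select orders.total * 1.08 as gross) with_tax;
```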

For our confidence interval calculation, we’ll define probability first, then reference probability when defining standard error, and then reference standard error when defining lower and upper bounds.

Here’s the new query:

select
  p as avg,
  low,
  high
from signups_by_month,
lateral (select signups / visitors as p) probability,
lateral (select sqrt(p * (1 - p) / visitors) as se) std_error,
lateral (select p - 1.96 * se as low) lower_bound,
lateral (select p + 1.96 * se as high) upper_bound

Careful readers will notice we’re using a cartesian join. This is a join of the form from a, b, c, which is shorthand for from a join b on true join c on true. It creates a row for every possible combination of rows in the joined tables. In our case, where each lateral subquery produces exactly one row with a single value, this simply appends each calculation as a column and makes it available to the next one.
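Written out with the explicit join syntax just described, the comma form expands like this (only the shape matters; 1.96 is the usual z-value for a 95% interval):

```sql
select p as avg, low, high
from signups_by_month
join lateral (select signups / visitors as p) probability on true
join lateral (select sqrt(p * (1 - p) / visitors) as se) std_error on true
join lateral (select p - 1.96 * se as low) lower_bound on true
join lateral (select p + 1.96 * se as high) upper_bound on true;
```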

The final version is almost as easy to read as a mathematical formula! We simply define a variable, use it in the next calculation, rinse and repeat.

All in all, a much cleaner, easier query. Thanks Postgres!


Periscope Data
The world’s fastest visualization platform for data analysts.