No items found.

How to Calculate Confidence Intervals in SQL

Statistical overconfidence: Dangerous and easy

Imagine you have a small online business. This month 200 users signed up on your website, and 10 of them bought your \$800 service. Great! You’ve made \$8k of income. How much should you expect to make this year?

The straightforward answer is \$8k * 12 = \$96k. But how confident should you be? Will your conversion rate always be so close to 5%? You could pad the estimate ±20% for safety, guessing at \$77k to \$115k. If \$77k would cover all your expenses, should you feel secure?

This is a question of binomial probability. Using our favorite binomial confidence interval calculator, the 95% confidence interval for your conversion rate is about 2.5% to 9%.

With a confidence interval that wide, you should expect to make somewhere between \$48k and \$172k. Yikes! You could end up with half of your simple guess, and that’s if your business doesn’t change.

Automating statistics: Calculating confidence intervals in SQL

These confidence intervals are very informative, but turning to a calculator for every metric is tedious. If you’ve got hundreds of metrics across dozens of dashboards, it’s downright unsustainable.

Fortunately, the math for calculating confidence interval is simple to implement:

The Normal Approximation Interval formula for binomial confidence intervals

n = number of users
x = number of conversions
p = probability of conversion = (x / n)
se = standard error of p = sqrt((p * (1 - p)) / n)
confidence interval = p ± (1.96 * se)

See Normal approximation interval on wikipedia. Note the 1.96 constant specifies a 95% interval on a two-tailed normal distribution.

Implementing the formula in SQL

Let’s start with a table of the total number of users, and how many converted. Any data that represents a rate — conversions per user, server errors per request, etc. — will also work.

select
count
(1) as n
,
sum(case when converted then 1 else 0 end) as x from users
groupby
date_trunc('month', created_at);

With our basic data in hand, we want to implement the above formula in SQL. To keep things clear, we wrap each step of the calculation separately:

1. Calculate the conversation rate, p.
2. Using p, calculate the standard error, se.
3. Compute the low and high confidence intervals.
4. Include the original p conversion rate as our mid estimate.

select
rates.n as users
,  rates.x as conversions
,  p - se * 1.96 as low
,  intervals.p as mid
,  p + se * 1.96 as high from (
select
rates.*
,    sqrt(p * (1 - p) / n) as se -- calculate se
from
(select
conversions.*
,      x / n::float as p -- calculate p
from

(
-- Our conversion rate table from above
select
count(1) as n
,        sum(case when converted then 1 else 0 end) as x
from
users
group by
date_trunc('month', created_at);
) conversions
) rates
) intervals

You might be wondering why we’re seeing 8% on the high end, rather than the 9% mentioned in the introduction. We used the Adjusted Wald method in the introduction, which produces more accurate estimates for small amounts of data.

A refinement for little data: The Adjusted Wald method

The math explained above, though quite accurate with hundreds of users and a healthy conversion rate, becomes increasely biased with less data or extremely high or low rates. A rule of thumb is to avoid using it with fewer than 5 conversions or 100 users.

One way to adjust for these shortcomings is to use a more robust binomial proportion confidence interval technique like the Adjusted Wald method. In short, it adds a bit of fuzziness to the estimated probability to smooth out the extremely high or low rates which are more common with few datapoints.

Given the z-score needed to reach a certain confidence level (1.96 for 95% confidence), add 0.5 * z^2 to the number of conversions, and z^2 to the number of users. This is roughly +2 and +4 for the 1.96 z-score for 95%. You can read the original journal paper for a deeper explanation.

select
rates.n as users
, rates.x as conversions
, p - se * 1.96 as low
, intervals.p as mid
, p + se * 1.96 as high
from (
select
rates.*
, sqrt(p * (1 - p) / n) as se -- calculate se
from (
select
conversions.*
, (x + 1.92) / (n + 3.84)::float as p -- calculate p
from (
-- Our conversion rate table from above
select
count(1) as n
, sum(case when converted then 1 else 0 end) as x
from users
group by date_trunc('month', created_at);
) conversions
) rates
) intervals

The important adjustment is here, where we add the constants to the numerator and denominator when calculating p:

(x + 1.92) / (n + 3.84)::float as p -- calculate p

This isn’t a magical solution to not enough data: If you have an expected 1% conversion rate and only 100 users, this adjustment will triple the estimated conversion rate, giving you a confidence interval of 0-6%. More data is the answer. At 10 conversions and 1,000 users, the interval shrinks to 0.5% to 1.9%.

In general, the more data you have, the more statistical approaches like these will be helpful to you.

Who are we?

We’re Periscope, and we make a tool that makes data analysis on large SQL databases fast and easy. In Periscope, you could use our Snippets feature to implement this logic once, and apply it to any dataset.

If you have a database with many millions or billions of rows, and running hundreds of analyses is getting slow and cumbersome, we think you’ll really love it. Sign up on our homepage for a free demo. If you like it, we’ll set you up with a free 7-day trial the same day!

Want to discuss this article? Join the Periscope Data Community! Periscope Data
The world’s fastest visualization platform for data analysts.