Aggregating data in SQL, or how to make counting even harder
They say that counting stuff is hard, especially when you want to count very different things at the same time. In this post I’ll review a few methods for extracting aggregations simultaneously by combining simple counts sums and counts to more distinct, order-related aggregations. Practically every aggregation discussed below can be achieved, on it’s own, with much simpler queries (and I will show some examples below). If all you need is to extract a single number at a time you should definitely opt for simpler syntax. But the challenge I set in this post is to achieve all these sums & counts with in a single query - this is often more efficient and definitely simpler to write & maintain.
Raw data
To demonstrate how these methods work, let’s generate some data: we are looking at the orders
table in e-commerce our system, where each line contains user ID, order ID, time of payment, time of delivery, device used for making the order (we distinguish mobile from desktop) and the total amount paid in USD.
SELECT *
FROM (VALUES
(1001, 95, CAST('2021-01-01 10:00:00' AS TIMESTAMP), CAST('2021-01-10 10:00:00' AS TIMESTAMP), 'mobile', 10.45),
(1001, 96, CAST('2021-01-01 15:00:00' AS TIMESTAMP), CAST('2021-01-10 10:00:00' AS TIMESTAMP), 'mobile', 9.64),
(1002, 106, CAST('2021-01-03 05:00:00' AS TIMESTAMP), CAST('2021-01-07 15:00:00' AS TIMESTAMP), 'desktop', 10.45),
(1002, 109, CAST('2021-01-03 15:00:00' AS TIMESTAMP), CAST('2021-01-11 15:00:00' AS TIMESTAMP), 'mobile', 10.27),
(1001, 145, CAST('2021-01-05 10:00:00' AS TIMESTAMP), CAST('2021-01-06 10:00:00' AS TIMESTAMP), 'desktop', 22.55)
) AS t(user_id, order_id, paid_at, delivered_at, device_type, amount)
user id | order id | paid at | delivered at | device type | amount |
---|---|---|---|---|---|
1001 | 95 | 2021-01-01 10:00:00 | 2021-01-10 10:00:00 | mobile | 10.45 |
1001 | 96 | 2021-01-01 15:00:00 | 2021-01-10 10:00:00 | mobile | 9.64 |
1002 | 106 | 2021-01-03 05:00:00 | 2021-01-07 15:00:00 | desktop | 10.45 |
1002 | 109 | 2021-01-03 15:00:00 | 2021-01-11 15:00:00 | mobile | 10.27 |
1001 | 145 | 2021-01-05 10:00:00 | 2021-01-06 10:00:00 | desktop | 22.55 |
Simple sums & counts
We’ll start with simple sums and counts. We want to know, per user, what was the total amount, how many line items and some stats about the delivery time (in days):
SELECT
user_id
,SUM(amount) AS amount_total
,COUNT(1) AS order_count
,MIN(delivered_at - paid_at) AS shortest_delivery
,MAX(delivered_at - paid_at) AS longest_delivery
FROM orders
GROUP BY user_id
user id | amount total | order count | shortest delivery | longest delivery |
---|---|---|---|---|
1001 | 42.64 | 3 | 1 00:00:00 | 9 00:00:00 |
1002 | 20.72 | 2 | 4 10:00:00 | 8 00:00:00 |
COUNT DISTINCT
The next thing we want to find out is how many orders did each user make? In standard SQL we simply use COUNT DISTINCT
, but this solution does not scale nicely. Depending on the size of the data and how accurate we have to be, we may want to use a faster approximation. Practically every implementation of SQL has it’s own implementation of a faster-but-less-accurate approx_distinct
, usually using the HyperLogLog approximation. Just make sure to check your documentation so you understand the trade-offs (see this example)
SELECT
user_id
,COUNT(DISTINCT CAST(paid_at AS DATE)) AS active_day_count
,approx_distinct(CAST(paid_at AS DATE)) AS active_day_count_approx
FROM orders
GROUP BY user_id
user id | active day count | active day count approx |
---|---|---|
1001 | 2 | 2 |
1002 | 1 | 1 |
Conditional aggregations & pivoting around
So far we used every row in our table for aggregation. But what if we only want to count some rows and not others? A straight forward approach would be to simply filter out the irrelevant rows, which is exactly what you should do if there is only one condition to filter by. For example, let’s look at mobile orders:
SELECT
user_id
,SUM(amount) AS total_amount_mobile
FROM orders
WHERE device_type = 'mobile'
GROUP BY user_id
user id | total amount mobile |
---|---|
1001 | 20.09 |
1002 | 10.27 |
But what if there’s more than one condition we want to use? We can create separate filtered queries for every single condition and JOIN the tables together on user_id
, but this may not be very efficient, and definitely not fun to write and maintain. We can achieve the same result with a single query: for each filter we use a CASE WHEN
statement that replaces the values we don’t want to aggregate with either 0’s or NULL values. This works not just for simple aggregations but also for distinct counts (accurate or approximate).
SELECT
user_id
,COUNT(1) AS order_count
,SUM(
CASE
WHEN device_type = 'mobile'
THEN 1
ELSE 0
END
) AS mobile_order_count
,SUM(
CASE
WHEN device_type = 'mobile'
THEN amount
ELSE 0
END
) AS mobile_amount
,APPROX_DISTINCT(
CASE
WHEN device_type = 'mobile'
THEN CAST(paid_at AS DATE)
ELSE NULL
END
) AS mobile_active_day_count
,SUM(
CASE
WHEN delivered_at - paid_at <= INTERVAL '1' day
THEN 1
ELSE 0
END
) AS single_day_delivery
FROM orders
GROUP BY user_id
user id | order count | mobile order count | mobile amount | mobile active day count | single day delivery |
---|---|---|---|---|---|
1001 | 3 | 2 | 20.09 | 1 | 1 |
1002 | 2 | 1 | 10.27 | 1 | 0 |
Most SQL dialects will have some convenience functions to make the syntax shorter, but typically they will not be very flexible (for example many dialects will have COUNT_IF(<filter>)
, which is equivalent to SUM(CASE WHEN <filter> THEN 1 ELSE 0 END)
, but do not have SUM_IF
or provisions for distinct counts.
SELECT
user_id
,COUNT(1) AS order_count
,COUNT_IF(device_type = 'mobile') AS mobile_order_count
FROM orders
GROUP BY user_id
user id | order count | mobile order count |
---|---|---|
1001 | 3 | 2 |
1002 | 2 | 1 |
By the way, we can use the same mechanism to pivot the data by hard-coding specific values (especially if our SQL dialect does not have the PIVOT
functionality). Pretty ugly, but works:
SELECT
user_id
,COUNT_IF(device_type = 'mobile') AS device_mobile
,COUNT_IF(device_type = 'desktop') AS device_desktop
FROM orders
GROUP BY user_id
user id | device mobile | device desktop |
---|---|---|
1001 | 2 | 1 |
1002 | 1 | 1 |
Order aggregations, with a single sorting order
Now let’s make counting even harder: What if we are not interested just “overall” aggregations (like sum/count/min/max/…) but in aggregations that depend on the order of the data? A good example for our case would be “first order amount” or “last delivery date from a mobile device”.
Finding first / last values with pre-sorting
As long as we order data using a single “index”, we can simplify things by pre-sorting the data. This will allow us to use aggregation functions like FIRST
or LAST
or NTH
aggregation functions which exist in some dialects (although not in Presto, which is what I’m using to test these queries). Typically, these functions do not ensure order on their own but rely on either explicit pre-sorting of the data or have some kind of ORDER BY
clause to guarantee “tie-braking” - they will return a single line even if multiple lines have the same sorting index (no guarantee which line though…)
WITH sorted_orders AS (
SELECT *
FROM orders
ORDER BY user_id, paid_at, order_id
)
SELECT
user_id
,FIRST(amount) AS first_order_amount
,LAST(device_type) AS last_order_device
FROM sorted_orders
GROUP BY user_id
user id | first order amount | last order device |
---|---|---|
1001 | 10.45 | desktop |
1002 | 10.45 | mobile |
Finding first / last values with convenience functions
Some SQL dialects will have convenience functions that will do the pre-sorting for you. For example the presto syntax makes things short and clear in my opinion. These functions will return a single line, but typically only accept a single column for sorting. If we need to sort by multiple columns we either have to generate a sorting column or use a concatenated expression. In our example, orders can by paid at the same time, so we add order_id
to the sorting expression (lower ID’s are created earlier)
SELECT
user_id
,MIN_BY(
amount,
CAST(paid_at AS VARCHAR) || lpad(CAST(order_id AS VARCHAR), 10, '0')
) AS first_order_amount
,MAX_BY(
device_type,
CAST(paid_at AS VARCHAR) || lpad(CAST(order_id AS VARCHAR), 10, '0')
) AS last_delieverd_device
FROM orders
GROUP BY user_id
Order aggregations, with completely different sorting
Convenience functions
If you’re very lucky, your SQL dialect will allow you to use different sorting expressions in a single query inside the convenience function.
SELECT
user_id
,MIN_BY(
amount,
CAST(delivered_at AS VARCHAR) || lpad(CAST(order_id AS VARCHAR), 10, '0')
) AS first_delivered_amount
,MAX_BY(
device_type,
CAST(paid_at AS VARCHAR) || lpad(CAST(order_id AS VARCHAR), 10, '0')
) AS last_payment_device
FROM orders
GROUP BY user_id
user id | first delivered amount | last payment device |
---|---|---|
1001 | 22.55 | desktop |
1002 | 10.45 | mobile |
Array syntax
If you’re slightly less lucky (but still lucky enough), your SQL dialect will allow sorting inside arrays. You’ll have to use separate array aggregations, but the sorting expression syntax is usually nicer. They “trick” is to sort in a way that will always place the number you need at the top of the array (so you don’t need to know the array’s length in advance).
SELECT
user_id
,array_agg(amount ORDER BY delivered_at, order_id)[1] AS first_delivered_amount
,array_agg(device_type ORDER BY paid_at DESC, order_id DESC)[1] AS last_payment_device
FROM orders
GROUP BY user_id
There are some problems with array syntax:
- The value of the first cell in arrays changes from dialect to dialect. For example, in SparkSQL arrays start at
[0]
while in Presto they start at[1]
- Relying on absolute positions is a source for a lot of confusion and bugs…
- It’s definitely less clear and readable that the convenience function syntax
Order aggregations using window functions
And if you’re very unlucky… You’ll have to use a different strategy. Practically all SQL dialects support window functions, and most of them allow different window expressions in the same query (which allows sorting the data by more than one order).
We break down the process to 3 stages to make things a bit more readable (isn’t aliasing great?)
- Add sorting expressions
sort_by_paid
andsort_by_delivered
. This is just to make the code a bit more readable. - For each user, replicate the first value of
sort_by_paid
and the last value ofsort_by_delivered
. - When we do the aggregation we use a
CASE WHEN
statement. This way we keep only one row per user, and therefore we can use theARBITRARY
aggregation function (althoughMIN
orMAX
would work here too).
WITH orders_with_sorting AS (
-- Stage 1
SELECT
*
,CAST(delivered_at AS VARCHAR) || lpad(CAST(order_id AS VARCHAR), 10, '0') AS sort_by_delivered
,CAST(paid_at AS VARCHAR) || lpad(CAST(order_id AS VARCHAR), 10, '0') AS sort_by_paid
FROM orders
)
,orders_with_pointers AS (
-- Stage 2
SELECT
*
,MIN(sort_by_delivered) OVER (PARTITION BY user_id) AS first_delivered_pointer
,MAX(sort_by_paid) OVER (PARTITION BY user_id) AS last_paid_pointer
FROM orders_with_sorting
)
-- Stage 3
SELECT
user_id
,ARBITRARY(
CASE
WHEN sort_by_delivered = first_delivered_pointer
THEN amount
ELSE NULL
END
) AS first_delivered_amount
,ARBITRARY(
CASE
WHEN sort_by_paid = last_paid_pointer
THEN device_type
ELSE NULL
END
) AS last_payment_device
FROM orders_with_pointers
GROUP BY user_id
Combining multiple sorting orders and filters
For example finding the very first order amount by pay date, vs. finding the first amount of an order made from a mobile device:
SELECT
user_id
,MIN_BY(device_type, paid_at) AS first_payment_device
,MIN_BY(
amount,
CASE WHEN device_type = 'desktop' THEN delivered_at ELSE NULL END
) AS first_desktop_delivered_amount
FROM orders
GROUP BY user_id
ORDER BY user_id
user id | first payment device | first desktop delivered amount |
---|---|---|
1001 | mobile | 22.55 |
1002 | desktop | 10.45 |
Getting more than one value at a time (shortcuts)
TBD
Full SQL code
You can find the full code in this file