Skip to the content.

Back to my projects page

Back to Coding in SQL

Defining user-sessions from events & pageViews in SQL

What’s in a session

Good news! Your engineers finally got around to implementing pageView tracking and maybe some custom events (buttons clicked, backend events like checkout confirmation, etc). It’s time to start figuring out what your users are actually doing with your product.

Counting tokens & clicks is great place to start (and quite often, all you really need), but you want to dive a bit deeper. KPIs like [total clicks] / [total product paveViews] are good overall metrics, but will not tell you that 90% of your users always click while 10% of your users just browse around “aimlessly” (hello crawler bots!). Even at the user level (GROUP BY user ID, right?) you are still missing part of the picture - for example which of your “click & buy” campaigns lead users to making a purchase? Are there landing pages that work better than others? etc…

Enters “the session”

Oh, but there’s more than one definition

A “session” is a is a much (ab)used term. This post looks at sessions from a behavioural perspective: A user performing a task or going through a flow in your product (in our example - users buying products). Intuitively I think we all get it, but the definition becomes harder when you mix in the other definitions of a “session”:

Session fencing

So what we’re dealing here is a mixed stream of events from different sources, combined with some time limits we have to apply. In my experience, it’s usually easier to define when a session ends than when a session starts under these circumstances. Daily logins are no longer a thing for most of us (unless you’re working on a banking app, which will kick you out in the name of security) and multi-tab browsing makes it event more complicated to figure out when users begin a process. On the other hand, some events in our systems are “absolute” indicators of session end (“checkout confirmation”, “flow completed” etc) or come from the backend side (your shopping cart is cleared after X days).

Think about it this way: aggregating event / ledger data into meaningful sessions is an exercise in fencing - not the sword kind but the kind that makes good neighbours. We recognise the fact that we are looking at a “messy” stream of data - forgotten tabs, users wandering about, different systems sending different signals etc. What we want is to find strong indication of sessions ending (a “fence”), and in between those endings we have the “behavioural session”.

Example

In this example I will demonstrate a mix of two of these “fences”:

We observe the following behaviours in our data:

We generate the data trail to reflect these behaviours:

SELECT *
FROM (VALUES
    (CAST('2021-01-01 09:12:12' AS TIMESTAMP), 'uid001', 'A1', 'pageView A'),
    (CAST('2021-01-01 09:12:15' AS TIMESTAMP), 'uid001', 'A1', 'pageView B'),
    (CAST('2021-01-01 09:13:13' AS TIMESTAMP), 'uid001', 'A1', 'Add product A'),
    (CAST('2021-01-01 09:15:01' AS TIMESTAMP), 'uid001', 'A1', 'Add product B'),
    (CAST('2021-01-01 16:35:14' AS TIMESTAMP), 'uid002', 'A2', 'pageView B'),
    (CAST('2021-01-01 17:15:01' AS TIMESTAMP), 'uid002', 'A2', 'Add product B'),
    (CAST('2021-01-01 20:23:15' AS TIMESTAMP), 'uid001', 'B1', 'Checkout'),
    (CAST('2021-01-01 20:23:16' AS TIMESTAMP), 'uid001', NULL, 'Approved'),
    (CAST('2021-01-01 20:25:44' AS TIMESTAMP), 'uid001', 'B1', 'Add product C'),
    (CAST('2021-01-05 23:55:19' AS TIMESTAMP), 'uid002', 'B2', 'Add product B'),
    (CAST('2021-01-06 00:01:21' AS TIMESTAMP), 'uid002', 'B2', 'Checkout'),
    (CAST('2021-01-06 00:01:21' AS TIMESTAMP), 'uid002', NULL, 'Approved'),
    (CAST('2021-01-06 00:05:19' AS TIMESTAMP), 'uid002', 'B2', 'Add product C'),
    (CAST('2021-01-06 21:35:47' AS TIMESTAMP), 'uid002', 'A2', 'Checkout'),
    (CAST('2021-01-06 21:25:37' AS TIMESTAMP), 'uid001', 'A3', 'Add product C'),
    (CAST('2021-01-06 21:35:47' AS TIMESTAMP), 'uid001', 'A3', 'Checkout'),
    (CAST('2021-01-06 21:33:48' AS TIMESTAMP), 'uid001', NULL, 'Approved'),
    (CAST('2021-01-06 21:34:08' AS TIMESTAMP), 'uid001', 'A3', 'pageView D')
  ) AS t(timestamp_utc, user_id, token, event_type)

The event ledger (rows come in by timestamp) looks like this:

SELECT * FROM events_table
timestamp utc user id token event type
2021-01-01 09:12:12 uuid001 A1 pageView A
2021-01-01 09:12:15 uuid001 A1 pageView B
2021-01-01 09:13:13 uuid001 A1 Add product A
2021-01-01 09:15:01 uuid001 A1 Add product B
2021-01-01 16:35:14 uuid002 A2 pageView B
2021-01-01 17:15:01 uuid002 A2 Add product B
2021-01-01 20:23:15 uuid001 B1 Checkout
2021-01-01 20:23:16 uuid001   Approved
2021-01-01 20:25:44 uuid001 B1 Add product C
2021-01-05 23:55:19 uuid002 B2 Add product B
2021-01-06 00:01:21 uuid002 B2 Checkout
2021-01-06 00:01:21 uuid002   Approved
2021-01-06 00:05:19 uuid002 B2 Add product C
2021-01-06 21:35:47 uuid002 A2 Checkout
2021-01-06 21:25:37 uuid001 A3 Add product C
2021-01-06 21:35:47 uuid001 A3 Checkout
2021-01-06 21:33:48 uuid001   Approved
2021-01-06 21:34:08 uuid001 A3 PageView D

There’s no need to actually sort the data by user & date, but it would help our visualisation:

SELECT * FROM events_table ORDER BY user_id, timestamp_utc
timestamp utc user id token event type
2021-01-01 09:12:12 uuid001 A1 pageView A
2021-01-01 09:12:15 uuid001 A1 pageView B
2021-01-01 09:13:13 uuid001 A1 Add product A
2021-01-01 09:15:01 uuid001 A1 Add product B
2021-01-01 20:23:15 uuid001 B1 Checkout
2021-01-01 20:23:16 uuid001   Approved
2021-01-01 20:25:44 uuid001 B1 Add product C
2021-01-06 21:25:37 uuid001 A3 Add product C
2021-01-06 21:35:47 uuid001 A3 Checkout
2021-01-06 21:33:48 uuid001   Approved
2021-01-06 21:34:08 uuid001 A3 PageView D
2021-01-01 16:35:14 uuid002 A2 pageView B
2021-01-01 17:15:01 uuid002 A2 Add product B
2021-01-05 23:55:19 uuid002 B2 Add product B
2021-01-06 00:01:21 uuid002 B2 Checkout
2021-01-06 00:01:21 uuid002   Approved
2021-01-06 00:05:19 uuid002 B2 Add product C
2021-01-06 21:35:47 uuid002 A2 Checkout

Let’s flag our “fences” - when a checkout is complete or when the cart is older than 4 days:

SELECT
  timestamp_utc,
  user_id,
  event_type,
  token,
  CASE
    WHEN
      event_type = 'Approved'
      OR date_diff(
        'day',
        timestamp_utc,
        COALESCE(LEAD(timestamp_utc, 1) OVER (PARTITION BY user_id ORDER BY timestamp_utc), CURRENT_TIMESTAMP)
      ) >= 4
    THEN 1
    ELSE 0
  END AS session_end_indicator
FROM events_table
timestamp_utc user_id token event_type session_end_indicator
2021-01-01 09:12:12 uuid001 A1 pageView A 0
2021-01-01 09:12:15 uuid001 A1 pageView B 0
2021-01-01 09:13:13 uuid001 A1 Add product A 0
2021-01-01 09:15:01 uuid001 A1 Add product B 0
2021-01-01 20:23:15 uuid001 B1 Checkout 0
2021-01-01 20:23:16 uuid001   Approved 1
2021-01-01 20:25:44 uuid001 B1 Add product C 1
2021-01-06 21:25:37 uuid001 A3 Add product C 0
2021-01-06 21:35:47 uuid001 A3 Checkout 0
2021-01-06 21:33:48 uuid001   Approved 1
2021-01-06 21:34:08 uuid001 A3 PageView D 0
2021-01-01 16:35:14 uuid002 A2 pageView B 0
2021-01-01 17:15:01 uuid002 A2 Add product B 1
2021-01-05 23:55:19 uuid002 B2 Add product B 0
2021-01-06 00:01:21 uuid002 B2 Checkout 0
2021-01-06 00:01:21 uuid002   Approved 1
2021-01-06 00:05:19 uuid002 B2 Add product C 0
2021-01-06 21:35:47 uuid002 A2 Checkout 0

We then move the “end of session” market a step forward so we get a “start of next session” indicator

WITH end_of_sessions AS (
  SELECT
    timestamp_utc,
    user_id,
    event_type,
    token,
    CASE
      WHEN
        event_type = 'Approved'
        OR date_diff(
          'day',
          timestamp_utc,
          COALESCE(LEAD(timestamp_utc, 1) OVER (PARTITION BY user_id ORDER BY timestamp_utc), CURRENT_TIMESTAMP)
        ) >= 4
      THEN 1
      ELSE 0
    END AS session_end_indicator
  FROM events_table
)

SELECT
  timestamp_utc,
  user_id,
  event_type,
  token,
  COALESCE(
    LAG(session_end_indicator, 1)
      OVER (PARTITION BY user_id ORDER BY timestamp_utc),
    0
  ) AS session_start_indicator
FROM end_of_sessions
timestamp utc user id token event type session end indicator
2021-01-01 09:12:12 uuid001 A1 pageView A 0
2021-01-01 09:12:15 uuid001 A1 pageView B 0
2021-01-01 09:13:13 uuid001 A1 Add product A 0
2021-01-01 09:15:01 uuid001 A1 Add product B 0
2021-01-01 20:23:15 uuid001 B1 Checkout 0
2021-01-01 20:23:16 uuid001   Approved 0
2021-01-01 20:25:44 uuid001 B1 Add product C 1
2021-01-06 21:25:37 uuid001 A3 Add product C 1
2021-01-06 21:35:47 uuid001 A3 Checkout 0
2021-01-06 21:33:48 uuid001   Approved 0
2021-01-06 21:34:08 uuid001 A3 PageView D 1
2021-01-01 16:35:14 uuid002 A2 pageView B 0
2021-01-01 17:15:01 uuid002 A2 Add product B 0
2021-01-05 23:55:19 uuid002 B2 Add product B 1
2021-01-06 00:01:21 uuid002 B2 Checkout 0
2021-01-06 00:01:21 uuid002   Approved 0
2021-01-06 00:05:19 uuid002 B2 Add product C 1
2021-01-06 21:35:47 uuid002 A2 Checkout 0

And finally, we run a cumulative sum over the indicator, which will give us a per-user session ID, starting with 0

end_of_sessions AS (
  SELECT
    timestamp_utc,
    user_id,
    event_type,
    token,
    CASE
      WHEN
        event_type = 'Approved'
        OR date_diff(
          'day',
          timestamp_utc,
          COALESCE(LEAD(timestamp_utc, 1) OVER (PARTITION BY user_id ORDER BY timestamp_utc), CURRENT_TIMESTAMP)
        ) >= 4
      THEN 1
      ELSE 0
    END AS session_end_indicator
  FROM events_table
),

start_of_sessions AS (
  SELECT
    timestamp_utc,
    user_id,
    event_type,
    token,
    COALESCE(
      LAG(session_end_indicator, 1)
        OVER (PARTITION BY user_id ORDER BY timestamp_utc),
      0
    ) AS session_start_indicator
  FROM end_of_sessions
),

SELECT
  timestamp_utc,
  user_id,
  event_type,
  token,
  SUM(session_start_indicator)
    OVER (PARTITION BY user_id ORDER BY timestamp_utc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
  AS user_session_id
FROM start_of_sessions
timestamp utc user id token event type user session id
2021-01-01 09:12:12 uuid001 A1 pageView A 0
2021-01-01 09:12:15 uuid001 A1 pageView B 0
2021-01-01 09:13:13 uuid001 A1 Add product A 0
2021-01-01 09:15:01 uuid001 A1 Add product B 0
2021-01-01 20:23:15 uuid001 B1 Checkout 0
2021-01-01 20:23:16 uuid001   Approved 0
2021-01-01 20:25:44 uuid001 B1 Add product C 1
2021-01-06 21:25:37 uuid001 A3 Add product C 2
2021-01-06 21:35:47 uuid001 A3 Checkout 2
2021-01-06 21:33:48 uuid001   Approved 2
2021-01-06 21:34:08 uuid001 A3 PageView D 3
2021-01-01 16:35:14 uuid002 A2 pageView B 0
2021-01-01 17:15:01 uuid002 A2 Add product B 0
2021-01-05 23:55:19 uuid002 B2 Add product B 1
2021-01-06 00:01:21 uuid002 B2 Checkout 1
2021-01-06 00:01:21 uuid002   Approved 1
2021-01-06 00:05:19 uuid002 B2 Add product C 2
2021-01-06 21:35:47 uuid002 A2 Checkout 2

And we can continue our analysis by session

SELECT
  user_id,
  user_session_id
  SUM(CASE WHEN event_type = 'Approved' THEN 1 ELSE 0 END) AS purchase_count,
  SUM(CASE WHEN event_type = 'Checkout' THEN 1 ELSE 0 END) AS checkout_attempts,
  COUNT( DISTINCT CASE WHEN event_type LIKE 'pageView%' THEN event_type ELSE NULL END) AS distinct_pageviews_count
  ...
FROM user_sessions
GROUP BY user_id, user_session_id
ORDER BY user_id, user_session_id

And get:

user id user session id purchase count checkout attempts distinct pageviews count
uid001 0 1 1 2
uid001 1 0 0 0
uid001 2 1 0 0
uid001 3 0 1 1
uid002 0 0 0 1
uid002 1 1 1 0
uid002 2 0 1 0

Full SQL code

You can find the full code in this file