Postgres Window Functions

Simon Taranto

We recently ran into a case where a join was getting out of hand and we were struggling with how to rein in the size of it. Then we found window functions. Window functions (Oracle calls them analytic functions) are a part of the SQL standard and this post will explore how to use them in Postgres. Let’s see how they work and what kind of problems they can help us solve.

Our problem

Let’s say we have a blogging platform where we have posts that have many comments. Here’s our setup:

CREATE TABLE posts (
  id integer PRIMARY KEY,
  body varchar,
  created_at timestamp DEFAULT current_timestamp
);

CREATE TABLE comments (
 id INTEGER PRIMARY KEY,
 post_id integer NOT NULL,
 body varchar,
 created_at timestamp DEFAULT current_timestamp
);

/* make two posts */
INSERT INTO posts VALUES (1, 'foo');
INSERT INTO posts VALUES (2, 'bar');

/* make 4 comments for the first post */
INSERT INTO comments VALUES (1, 1, 'foo old');
INSERT INTO comments VALUES (2, 1, 'foo new');
INSERT INTO comments VALUES (3, 1, 'foo newer');
INSERT INTO comments VALUES (4, 1, 'foo newest');

/* make 4 comments for the second post */
INSERT INTO comments VALUES (5, 2, 'bar old');
INSERT INTO comments VALUES (6, 2, 'bar new');
INSERT INTO comments VALUES (7, 2, 'bar newer');
INSERT INTO comments VALUES (8, 2, 'bar newest');

Here’s a SQL Fiddle with our example if you’d like to query the data.

Our Goal

We want to get each post‘s three most recent comments. We could start by writing a query such as the one below to pull all the posts and comments. We could then filter this result in whatever application language we’re using (Ruby, PHP, Go, whatever) on top of our database:

SELECT posts.id AS post_id, comments.id AS comment_ids, comments.body AS body
FROM posts LEFT OUTER JOIN comments ON posts.id = comments.post_id;

resulting in:

| POST_ID | COMMENT_IDS |       BODY |
|---------|-------------|------------|
|       1 |           1 |    foo old |
|       1 |           2 |    foo new |
|       1 |           3 |  foo newer |
|       1 |           4 | foo newest |
|       2 |           5 |    bar old |
|       2 |           6 |    bar new |
|       2 |           7 |  bar newer |
|       2 |           8 | bar newest |

This solution may work for awhile but what happens when we have 10’s of thousands of posts that each have 10’s of thousands of comments (it’s a really popular blog)? That resulting join table just got really big.

We want to see just these results without requiring any further filtering:

| POST_ID | COMMENT_ID |       BODY |
|---------|------------|------------|
|       1 |          4 | foo newest |
|       1 |          3 |  foo newer |
|       1 |          2 |    foo new |
|       2 |          8 | bar newest |
|       2 |          7 |  bar newer |
|       2 |          6 |    bar new |

To get this subset we need to some how rank or order the comments for each post and then limit the set. How do we go about this more targeted task in SQL? Enter window functions.

What are window functions

Window functions are a tool to perform advanced sorting and limiting on a subset of a joined table of data (hence, the name window). We’ll look at one particular function, dense_rank(), but all built-in (sum, for example) and user-defined aggregate functions can act as window functions by calling the OVER keyword. Some other popular functions include row_number(), rank(), and percent_rank(). A complete list of available window functions can be found here.

Back to our problem

How can we use this new knowledge to write our query? We want the three most recent comments so we know we’ll need to do some sort of sorting by created_at and then limit the number of results.

Here’s how we can get the ranking information we need using dense_rank():

SELECT posts.id AS post_id, comments.id AS comment_id, comments.body AS body,
  dense_rank() OVER (
    PARTITION BY post_id
    ORDER BY comments.created_at DESC
  ) AS comment_rank
FROM posts LEFT OUTER JOIN comments ON posts.id = comments.post_id;

Resulting in:

| POST_ID | COMMENT_ID |       BODY | COMMENT_RANK |
|---------|------------|------------|--------------|
|       1 |          4 | foo newest |            1 |
|       1 |          3 |  foo newer |            2 |
|       1 |          2 |    foo new |            3 |
|       1 |          1 |    foo old |            4 |
|       2 |          8 | bar newest |            1 |
|       2 |          7 |  bar newer |            2 |
|       2 |          6 |    bar new |            3 |
|       2 |          5 |    bar old |            4 |

We have the same query from before but we’ve added an additional field in our SELECT statement:

dense_rank() OVER (
  PARTITION BY post_id
  ORDER BY comments.created_at DESC
) AS comment_rank

OVER is the keyword that triggers the use of the window function. Inside the OVER statement we’re saying:

  • rank the comments by created_at (ORDER BY comments.created_at) and
  • scope the ranking to each post (PARTITION BY post_id)

This result is still returning the full set but now we have the added comment_rank field showing us the comments for each post in order of age. Given this ranking, now we need to fetch just those comments that have a comment_rank less than 4 (since we want only the three most recent comments). We can achieve this goal by using our new ranking query as a sub-select statement like so:

SELECT comment_id, post_id, body FROM (
  SELECT posts.id AS post_id, comments.id AS comment_id, comments.body AS body,
    dense_rank() OVER (
      PARTITION BY post_id
      ORDER BY comments.created_at DESC
    ) AS comment_rank
  FROM posts LEFT OUTER JOIN comments ON posts.id = comments.post_id
) AS ranked_comments
WHERE comment_rank < 4;

resulting in:

| POST_ID | COMMENT_ID |       BODY |
|---------|------------|------------|
|       1 |          4 | foo newest |
|       1 |          3 |  foo newer |
|       1 |          2 |    foo new |
|       2 |          8 | bar newest |
|       2 |          7 |  bar newer |
|       2 |          6 |    bar new |

And that’s our answer. Our query has become pretty hard to read though.

Refactoring using Common Table Expressions

We can gain back some of the readability by using a Common Table Expression (or CTE). Our query would become:

WITH ranked_comments AS (
  SELECT posts.id AS post_id, comments.id AS comment_id, comments.body AS body,
    dense_rank() OVER (
      PARTITION BY post_id
      ORDER BY comments.created_at DESC
    ) AS comment_rank
  FROM posts LEFT OUTER JOIN comments ON posts.id = comments.post_id
)

SELECT
  post_id,
  comment_id,
  body
FROM ranked_comments
WHERE comment_rank < 4;

Window functions introduce some additional syntax but are a really powerful tool when we need more complex sorting or limiting behavior.

Want to learn more

For some more help on the syntax and options for using these functions check out these docs or this guide. There are some gotchas around dealing with NULL values and how ordering works so be sure to read through this section and test your code. Happy querying.