Finding the Oldest/Youngest Records Within a Group

A common query that crops up in web apps is finding the oldest or the most recent record in a single table. This is straightforward in SQL. You can even write the relevant part of the query without knowing anything about the table (other than the fact that it has a timestamp column called created_at):

ORDER BY created_at LIMIT 1

Oldest or Most Recent records relative to another field

Let’s start with the following simple DB schema:

  • Users: [ id, username ]
  • Orders: [ id, user_id, quantity, price_in_cents, created_at ]

And let’s say you need to display a list of users next to the date of their most recent order. GROUP BY exists to solve this problem:

SELECT
  user_id, MAX(created_at)
FROM
  orders
GROUP BY
  user_id

What if you wanted to display the user’s name instead of the id? This adds a bit of complexity, but with a JOIN and a sub-select the query remains pretty straightforward:

SELECT
  users.username, latest_orders.created_at
FROM
  (SELECT
     user_id, MAX(created_at) AS created_at
   FROM
     orders
   GROUP BY
     user_id) AS latest_orders
INNER JOIN
  users
ON
  users.id = latest_orders.user_id

Now let’s say you actually need the entire order record. Maybe you need to display the price and quantity or maybe you need to perform a calculation on them. This can be achieved by joining the result of the sub-select back to the orders table itself.

SELECT
  orders.*
FROM
  (SELECT
     user_id, MAX(created_at) AS created_at
   FROM
     orders
   GROUP BY
     user_id) AS latest_orders
INNER JOIN
  orders
ON
  orders.user_id = latest_orders.user_id AND
  orders.created_at = latest_orders.created_at

Alright, this is slightly less intuitive. The key idea is that you select only the orders that match the user_id and the created_at timestamp of the latest_orders sub-selected table. If you accept that the sub-select will return only the subset of user_ids paired with their most recent order this becomes relatively clear.

One caveat: there’s a slight issue with using a datetime as an identifier here. What if two orders have the same datetime (specifically the most recent)? In the case of users making orders it’s pretty unlikely short of a double-submit bug, but depending on your use-case for the table in question and the scale of your application it is possible. The upshot is that this query would include 2 entries for that user. This is likely a problem. A small one if you are displaying the latest orders to the admins of your application (they see two orders for one user). A larger one if you are doing something user-facing. In this situation you’d really just want to pick one of the two and adding a DISTINCT clause on user_id in the above example would do just that.

This strategy can be expanded for any number of similar situations, even those that require grouping by several fields. Let’s look at another, slightly more complicated example:

  • Users: [ id, username … ]
  • Companies: [ id, company_name … ]
  • StockHoldings: [ id, user_id, company_id, quantity, purchase_price ]
  • BondHoldings: [ id, user_id, company_id, purchase_price, par_value, interest_rate ]
  • HoldingValueStamps: [ user_id, holding_id, holding_type, value, created_at ]

Here StockHolding and BondHolding represent ownership of stock and bonds. If you bought 10 shares of AAPL we would create a StockHolding for you with company_id: 'AAPL', quantity: 10. The value of those AAPL shares would fluctuate over time. HoldingValueStamp will be how we track their value. Every minute we’ll create a HoldingValueStamp with the current value of your 10 shares of AAPL. These stamps are created to keep an historical record of the value of our user’s holdings. Having these stamps would also allow us to show our users a trend-line of their portfolio’s value overtime.

We also need to be able to tell the user the current value of their portfolio. To do this we need the most recent HoldingValueStamps for each holding (at this point only StockHoldings and BondHoldings) that belongs to the user.

Sub-select Method

SELECT
  holding_value_stamps.*
FROM
  (SELECT
     holding_id,
     holding_type,
     MAX(created_at) as created_at
   FROM
     holding_value_stamps
   GROUP BY
     holding_id, holding_type) AS most_recent_stamps
INNER JOIN
  holding_value_stamps
ON
  holding_value_stamps.created_at = most_recent_stamps.created_at
  holding_value_stamps.holding_id = most_recent_stamps.holding_id AND
  holding_value_stamps.holding_type = most_recent_stamps.holding_type
WHERE
  holding_value_stamps.user_id = #{current_user.id}

This example is very similar to the sub-select above but specific to a certain user.

Here the sub-select finds the most recent created_at timestamp per holding (holding_id and holding_type together ensure a unique StockHolding or BondHolding). Unfortunately, we can’t select the value field from this query because we don’t want to group by it or aggregate it.

We can, however, match the created_at timestamps from most_recent_timestamps to the full row of holding_value_stamps. It’s very important that we include all of the fields we used in the GROUP BY as well as the selected created_at timestamp in the ON clause. If we didn’t include those fields we may wind up with the wrong data (due to some BondHoldings or StockHoldings having HoldingValueStamps with identical created_at timestamps).

Finally, the WHERE clause ensures that we don’t select any that don’t belong to the current user.

Outer Join Method

Another way to achieve the same result:

SELECT
  holding_value_stamps1.*
FROM
  holding_value_stamps AS holding_value_stamps1
LEFT OUTER JOIN
  holding_value_stamps AS holding_value_stamps2
ON
  holding_value_stamps1.holding_id = holding_value_stamps2.holding_id AND
  holding_value_stamps1.holding_type = holding_value_stamps2.holding_type AND
  holding_value_stamps1.created_at < holding_value_stamps2.created_at
WHERE
  holding_value_stamps1.user_id = #{current_user.id} AND
  holding_value_stamps2.id IS NULL

Here we ensure that we are not selecting any rows from holding_value_stamps1 when a row in holding_value_stamps2 (the same table) exists with a more recent created_at.

This solution has the same issue when multiple records have the same created_at that we discussed earlier, both of those records would find their way into the result. This is because there are no other records to satisfy the ON clause for the OUTER JOIN for either of the two records.. i.e. there is no record with a created_at that is greater than their own. Therefore, they would not join to any actual data from holding_value_stamps2 and holding_value_stamps2.id would be NULL (which in turn causes them to be included in the result set per the holding_value_stamps2.id IS NULL in the WHERE clause).

One way that this issue can be dealt with by favoring the stamp with the smaller id:

SELECT
  holding_value_stamps1.*
FROM
  holding_value_stamps AS holding_value_stamps1
LEFT OUTER JOIN
  holding_value_stamps AS holding_value_stamps2
ON
  holding_value_stamps1.holding_id = holding_value_stamps2.holding_id AND
  holding_value_stamps1.holding_type = holding_value_stamps2.holding_type AND
  (
    holding_value_stamps1.created_at < holding_value_stamps2.created_at OR
    (
      holding_valuation_stamps1.created_at = holding_valuation_stamps2.created_at AND
      holding_valuation_stamps.id < h2.id
    )
  )
WHERE
  holding_value_stamps1.user_id = #{current_user.id} AND
  holding_value_stamps2.id IS NULL

Other solutions

If performance isn’t a concern, there are other, possibly-clearer ways to handle this kind of filtering and ordering in Ruby:

HoldingValueStamp.where(user_id: current_user.id).
  group_by { |stamp| [stamp.holding_id, stamp.holding_type] }.
  map { |_holding, stamps| stamps.max(&:created_at) }

That said, the database level solutions in this post have been performant in real production environments.