GIANT ROBOTS SMASHING INTO OTHER GIANT ROBOTS

Written by thoughtbot

PostgreSQL performance considerations

There are a number of variables that allow a DBA to tune a PostgreSQL database server for specific loads, disk types and hardware. These are fondly called the GUCS (Global Unified Configuration Settings) and you can take a look via the pg_settings view. There are also a few of things that you can do in your application to get the most out of Postgres:

Know the postgres index types

By default CREATE INDEX will create B-tree indexes which will serve well for most cases where we use equality, inequality and range operators. However there are cases where you can build different indexing strategies with GiST (Generalized Search Tree) indexes. For example, Postgres ships with built in GiST operator classes for geometric operators — for dealing with the geometric types like point, box, polygon, circle, and others. There are more interesting GiST index examples in the contrib packages for things like textual search, tree structures, and more.

Consider multicolumn indexes, when it makes sense

When your query filters the data by more than one column, be it with the WHERE clause or JOINs, multicolumn indexes may prove useful. If you create an index on columns (a, b), the Postgres planner can use it for queries

WHERE a = 1
WHERE a = 1 AND b = 2

However, it will not use it for queries using:

WHERE a = 1 OR b = 2
WHERE b = 2.

But Postgres also has the ability to use multiple indexes in a single query. This may come in handy if you are using the OR operator, but will also make use of it for AND queries. So it boils down to what the most common case is according to your application’s read patterns and optimize for that, either with an an index on (a, b) and another on (b), or two separate single column indexes.

Partial indexes

Simply put, a partial index is an index with a WHERE clause. It will only index rows that match the supplied predicate. You can use them to exclude values from an index that you hardly query against.

For example, you have an orders table with a completed flag. The sales people want to know what orders over $100,000.00 haven’t been completed because they want to collect their bonuses, so you build a view in your app to show them just that (and negotiate a cut on the bonus). You could create the following index:

CREATE INDEX orders_incomplete_amount_index
   on orders (amount) WHERE complete is not true;

Which will be used by queries of the form:

SELECT * FROM orders
  where amount > 100000 AND complete is not true;

Don’t over index

Part of maintaining a healthy database is going back and making sure you don’t have any unused indexes. It’s common to add indexes to address a specific performance issue for a particular query, but in many cases indexes start to pile up becoming dead weight. Remember that the more indexes you have, the slower INSERTs will become because more writes will need to happen to keep the indexes updated.

Keep statistics updated

Make sure to run VACUUM ANALYZE to keep data statistics up to date — as well as recover disk space. In addition, Postgres ships with a built in auto-vacuum daemon whose purpose is to automate the execution of VACUUM ANALYZE. You should read up on considerations for setting the auto-vacuum daemon’s frequency according to your database size and usage characteristics.

Make sure you ANALYZE when creating a new index, otherwise Postgres will not have analyzed the data and determined that the new index may help for the query.

Use more joins

Postgres is perfectly capable of joining multiple tables in a single query. In a running app, queries with five joins are completely acceptable, and will help bring in the data required by your app, reducing the number of trips to the database. In most cases, joins are also a better solution than subqueries — Postgres will even internally “rewrite” a subquery, creating a join, whenever possible, but this of course increases the time it takes to come up with the query plan. So be a pal and use joins instead of subselects.

Prefer INNER JOINs

If the cardinality of both tables in a join is guaranteed to be equal for your result set, always prefer doing an INNER JOIN instead of a LEFT OUTER JOIN. A lot of research and code has gone into optimizing outer joins in Postgres over the years. But the reality is that especially as more joins are added to a query, left joins limit the planner’s ability optimize the join order.

Know how to understand the explain output

Paste the output of explain analyze [some query] into explain.depesz.com to help identify the most costly nodes in the query plans.

Understanding EXPLAIN output is a very extensive topic, but these are some general guidelines when reading plans:

  • Are the cost estimates vs. actuals close, or are there discrepancies? Typically a sign of not having ANALYZEd recently.
  • Is an index not being used? The planner may be choosing not to use it for good reason.
  • Is there query using the some_string LIKE pattern? If so, make sure the pattern is anchored at the beginning of the string. Postgres can use an index when doing some_string LIKE 'pattern%' but not for some_string LIKE '%pattern%'
  • Have you vacuumed recently? Have you indexed foreign keys?
  • Are there table scans that should use an index instead? Not all table scans are bad — there are cases where it will perform better than an index scan
  • Good database schema design yields better query plans. Read up on database normalization.

Never try to optimize queries on your development machine

The Postgres planner collects statistics about your data that help identify the best possible execution plan for your query. In fact, it will just use heuristics to determine the query plan if the table has little to no data in it. Not only do you need realistic production data in order to analyze reasonable query plans, but also the Postgres server’s configuration has a big effect. For this reason it’s required that you run your analysis on either the production box, or on a staging box that is configured just like production, and where you’ve restored production data.

Experimentation is key

There are no hard and fast rules to a perfectly optimized system. The best advice is to try out different configurations, use a tool like NewRelic to find out what the bottlenecks are, and liberally try out different combinations of indexes and queries that yield best results for your particular situation.